Hacker News Vector Search Dataset with ClickHouse: Building Scalable Semantic Search and Generative AI Applications

The Hacker News vector search dataset provided by ClickHouse contains 28.74 million postings with vector embeddings generated using the SentenceTransformers model all-MiniLM-L6-v2. Each embedding has a dimension of 384. The dataset is available as a single Parquet file in an S3 bucket and is designed to help users explore the design, sizing, and performance of large-scale vector search applications built on user-generated textual data.

To begin, create a table in ClickHouse to store the data using the MergeTree engine and order by the id field. The table includes fields such as id, doc_id, text, vector (an array of Float32 values), node_info (a tuple with start and end timestamps), metadata, type (an enum for story, comment, poll, pollopt, or job), by (a low-cardinality string for the author), time (a DateTime), title, post_score, and flags for dead and deleted items, as well as length. The vector field stores the semantic embeddings used for similarity search.

To perform a search, generate an embedding for a query using the same all-MiniLM-L6-v2 model. This query vector can then be passed to the cosineDistance function in a ClickHouse SELECT query to find the most similar documents.

A sample application demonstrates a generative AI use case: users input a topic, the system generates a query embedding, retrieves relevant posts via vector similarity search, and then uses the LangChain library with OpenAI’s gpt-3.5-turbo model to summarize the results. The retrieved text is passed as context to the language model, which produces a concise summary. The code example shows how to set up the environment, read user input, generate embeddings, query ClickHouse, split the text for token limits, and run a summarization chain. It automatically switches between 'stuff' and 'map_reduce' chain types depending on the token count to stay within the model’s limits.
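The search and chain-selection steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the sample application's code. The table name `hackernews`, the selected columns, the default limit of 20, and the `max_tokens` budget are all assumptions made for this example. The embedding step (SentenceTransformers) and the ClickHouse client call are left out so the sketch runs with the standard library alone. `cosine_distance` mirrors what ClickHouse's `cosineDistance` computes, namely one minus the cosine similarity.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def build_search_query(
    query_vector: Sequence[float],
    table: str = "hackernews",  # assumed table name, not given in the text
    limit: int = 20,            # assumed result count
) -> str:
    """Build a SELECT that ranks rows by cosineDistance to the query vector.

    The table name is interpolated directly, so it must come from trusted
    code, never from user input.
    """
    values = [float(x) for x in query_vector]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("query vector must contain only finite numbers")
    vector_literal = "[" + ", ".join(repr(v) for v in values) + "]"
    return (
        f"SELECT id, title, text, "
        f"cosineDistance(vector, {vector_literal}) AS distance "
        f"FROM {table} ORDER BY distance ASC LIMIT {int(limit)}"
    )


def choose_chain_type(token_count: int, max_tokens: int) -> str:
    """Pick 'stuff' when the text fits the token budget, else 'map_reduce'.

    Treating the boundary as inclusive (<=) is an assumption of this sketch.
    """
    return "stuff" if token_count <= max_tokens else "map_reduce"


if __name__ == "__main__":
    # A toy 3-dimensional vector stands in for a real 384-dimensional embedding.
    print(build_search_query([0.5, -0.25, 0.125], limit=5))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print(choose_chain_type(token_count=5000, max_tokens=4000))
```

In a real pipeline the query vector would come from encoding the user's topic with all-MiniLM-L6-v2, and the resulting SQL would be sent through a ClickHouse client, ideally with the vector bound as a query parameter and not inlined as a literal.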
This workflow illustrates a practical enterprise-grade application of vector search combined with generative AI, applicable to domains such as customer sentiment analysis, technical support automation, legal document review, medical records, meeting transcripts, financial reporting, and more. The example highlights how vector databases like ClickHouse can serve as a foundation for intelligent, context-aware AI systems.
