
How to Train and Finetune Sparse Embedding Models with Sentence Transformers v5 for Efficient Hybrid Search


Sentence Transformers, a Python library for training and using embedding and reranker models, has recently seen significant updates in version 5, with a focus on training and fine-tuning sparse embedding models. These models are central to applications such as retrieval-augmented generation, semantic search, and paraphrase mining. This summary explains the concepts, advantages, and practical steps involved in fine-tuning sparse embedding models with Sentence Transformers.

What are Sparse Embedding Models?

Sparse embedding models convert text into high-dimensional vectors in which most values are zero. Each non-zero dimension typically corresponds to a specific token in the model's vocabulary, which makes the embeddings interpretable. For instance, the model naver/splade-v3 produces 30,522-dimensional vectors, where the non-zero dimensions correspond to the tokens most relevant to the input.

Query and Document Expansion

A key feature of neural sparse embedding models is query/document expansion, where the model automatically adds semantically related terms. For example, "The weather is lovely today" might expand to include "beautiful," "cool," "pretty," and "nice." This improves the model's ability to match related content, handle misspellings, and resolve vocabulary mismatches, often outperforming traditional lexical methods like BM25.

Why Use Sparse Embedding Models?

Sparse embedding models bridge the gap between traditional lexical methods and dense models. They offer the interpretability and efficiency of sparse representations while leveraging the semantic understanding of neural models. This combination makes them well suited to tasks where query latency and scalability are critical, such as large-scale search applications.

Why Fine-Tune?

While out-of-the-box sparse models can recognize general synonyms, they may not perform well in specific domains or languages. For example, the term "cephalalgia" may not be expanded to "headache," but fine-tuning can teach the model to make such domain-specific associations, improving relevance and performance in specialized contexts.

Training Components

Model

The SparseEncoder class in Sentence Transformers can be used either to fine-tune a pre-trained sparse encoder or to train one from scratch. For example:

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.models import MLMTransformer, SpladePooling

# Fine-tune a pre-trained sparse encoder
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Or train a SPLADE-style model from scratch on top of a masked language model
mlm_transformer = MLMTransformer("distilbert-base-uncased")
splade_pooling = SpladePooling(pooling_strategy="max")
model = SparseEncoder(modules=[mlm_transformer, splade_pooling])
```

Dataset

Datasets can be loaded from the Hugging Face Datasets Hub or from local files. It is important that the dataset format matches the chosen loss function, in particular regarding the presence or absence of a "label" column (a short, illustrative example of loading and checking such a dataset follows the loss function section below).

Loss Function

Loss functions guide the model's optimization. For sparse models, SpladeLoss or CSRLoss is typically used as a wrapper, with the main loss function passed as a parameter. For example:

```python
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss

loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)
```
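To make the dataset requirements concrete, here is a minimal sketch of loading a pair dataset from the Hugging Face Hub and shaping it for SparseMultipleNegativesRankingLoss, which expects (anchor, positive)-style pairs and no "label" column. The natural-questions dataset and its query/answer columns are the same ones used in the full training script below; the column handling shown here is illustrative rather than prescriptive.

```python
from datasets import load_dataset

# Load a (query, answer) pair dataset from the Hugging Face Hub
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(dataset.column_names)  # ['query', 'answer']
print(dataset[0])            # {'query': '...', 'answer': '...'}

# SparseMultipleNegativesRankingLoss treats the first column as the anchor and
# the second as the positive, drawing negatives from the rest of the batch,
# so keep only the two text columns and avoid a stray "label" column.
dataset = dataset.select_columns(["query", "answer"])
```

If your data does carry similarity scores or class labels, choose a loss that expects a label column instead of a pair-only loss.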
Training Arguments

The SparseEncoderTrainingArguments class specifies parameters that influence training performance and debugging. Key arguments include output_dir, num_train_epochs, per_device_train_batch_size, and learning_rate. For instance:

```python
from sentence_transformers import SparseEncoderTrainingArguments

args = SparseEncoderTrainingArguments(
    output_dir="models/splade-uncased",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
```

Evaluator

Evaluators provide metrics to assess the model's performance during training. Several are available, such as SparseEmbeddingSimilarityEvaluator and SparseTripletEvaluator. For example:

```python
from datasets import load_dataset
from sentence_transformers.evaluation import SimilarityFunction
from sentence_transformers.sparse_encoder.evaluation import SparseEmbeddingSimilarityEvaluator

eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
dev_evaluator = SparseEmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["sentence1"],
    sentences2=eval_dataset["sentence2"],
    scores=eval_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
)
```

Trainer

The SparseEncoderTrainer class ties all of the components together. The example script below trains an "inference-free" SPLADE-style model: queries are encoded with a lightweight SparseStaticEmbedding module, while documents go through the full MLM transformer and SPLADE pooling stack, routed via Router.for_query_document:

```python
import logging

from datasets import load_dataset

from sentence_transformers import SparseEncoder, SparseEncoderTrainer, SparseEncoderTrainingArguments
from sentence_transformers.models import Router
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
from sentence_transformers.sparse_encoder.models import MLMTransformer, SparseStaticEmbedding, SpladePooling

logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)

# Load model components: static embeddings for queries, SPLADE for documents
mlm_transformer = MLMTransformer("distilbert-base-uncased")
splade_pooling = SpladePooling(pooling_strategy="max")
router = Router.for_query_document(
    query_modules=[SparseStaticEmbedding(tokenizer=mlm_transformer.tokenizer, frozen=False)],
    document_modules=[mlm_transformer, splade_pooling],
)
model = SparseEncoder(modules=[router])

# Load the dataset and split off a held-out evaluation set
dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(100_000))
dataset_dict = dataset.train_test_split(test_size=1_000)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]

# Define the loss function
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model),
    query_regularizer_weight=0,
    document_regularizer_weight=3e-3,
)

# Set up the training arguments
args = SparseEncoderTrainingArguments(
    output_dir="models/splade-uncased",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
)

# Initialize an evaluator on the NanoBEIR retrieval benchmarks
dev_evaluator = SparseNanoBEIREvaluator()

# Create the trainer and train the model
trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()

# Evaluate the trained model
dev_evaluator(model)

# Save the trained model
model.save_pretrained("models/splade-uncased/final")

# Push to the Hugging Face Hub (optional)
model.push_to_hub("splade-uncased-final")
```

Evaluation

Evaluating the trained model on datasets such as NanoMSMARCO shows that combining sparse and dense retrieval significantly improves search performance over either method alone. For instance, combining sparse and dense rankings on NanoMSMARCO resulted in a 12.3% increase in NDCG@10 and an 18.7% increase in MRR@10 over the dense baseline, and adding a reranker pushed performance further, to around 66.3 NDCG@10.
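The article does not spell out how the sparse and dense rankings were fused for those numbers; one common, model-agnostic option is reciprocal rank fusion (RRF), sketched below with made-up document ids purely for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: iterable of lists, each ordered from best to worst match.
    k: smoothing constant; 60 is the value customarily used for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-5 results for one query from a sparse and a dense retriever
sparse_ranking = ["doc3", "doc1", "doc7", "doc2", "doc9"]
dense_ranking = ["doc1", "doc4", "doc3", "doc8", "doc2"]

hybrid_ranking = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(hybrid_ranking[:5])  # documents favored by both retrievers rise to the top
```

Because RRF only looks at ranks, it needs no score normalization between the sparse and dense retrievers, which is one reason it is a popular default for hybrid search.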
Training Tips

- Evaluate sparsity: monitor how sparse the embeddings actually are, since that determines how cheap they are to store and retrieve (a quick check is sketched after the Qdrant example below).
- Distillation: consider distilling from a stronger teacher model to improve performance, as detailed in the SPLADE-v3 paper.

Vector Database Integration

Deploying trained models in production often means integrating them with vector databases such as Qdrant, OpenSearch, or Elasticsearch. Qdrant, for example, provides efficient storage and fast retrieval of sparse vectors. Here is an example of setting up Qdrant for sparse vector search:

```python
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_qdrant

# Load a corpus of documents and a couple of initial queries
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
queries = dataset["query"][:2]

# Load a pre-trained sparse encoder
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Encode the corpus as sparse tensors
corpus_embeddings = sparse_model.encode_document(corpus, convert_to_sparse_tensor=True, batch_size=16)

# Encode queries and search; the Qdrant index is built on the first iteration and reused afterwards
corpus_index = None
while True:
    query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
    results, search_time, corpus_index = semantic_search_qdrant(
        query_embeddings,
        corpus_index=corpus_index,
        corpus_embeddings=corpus_embeddings if corpus_index is None else None,
        top_k=5,
        output_index=True,
    )

    # Output results
    print(f"Search time: {search_time:.6f} seconds")
    for query, result in zip(queries, results):
        print(f"Query: {query}")
        for entry in result:
            print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")

    # Prompt for new queries
    queries = [input("Please enter a question: ")]
```
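Tying back to the "evaluate sparsity" tip above: before indexing a large corpus it is worth checking how many dimensions are actually active per document, since that drives storage and retrieval cost. Below is a minimal sketch, assuming that encode_document with convert_to_sparse_tensor=True returns a torch sparse COO tensor shaped (num_docs, vocab_size) as in the snippet above; adapt the indexing if your version returns a different layout.

```python
import torch

# corpus_embeddings comes from the Qdrant snippet above: one row per document,
# one column per vocabulary token, stored as a sparse tensor.
emb = corpus_embeddings.coalesce()  # merge any duplicate indices
num_docs, vocab_size = emb.shape

# Count the non-zero entries in each document row
active_per_doc = torch.bincount(emb.indices()[0], minlength=num_docs).float()
avg_active = active_per_doc.mean().item()

print(f"Average active dimensions per document: {avg_active:.1f} of {vocab_size}")
print(f"Average sparsity: {1 - avg_active / vocab_size:.2%}")
```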
Industry Insights

Sparse embedding models are becoming increasingly popular because they balance semantic understanding with computational efficiency. Meta's significant investment in Scale AI highlights the growing importance of high-quality training data in AI development; companies like Scale AI focus on producing domain-specific, high-precision data, which is exactly what fine-tuning models such as those used with Sentence Transformers depends on.

Additional Resources

For more detailed training examples and documentation, refer to the Sentence Transformers library's official resources:
- Training Examples
- Model Documentation
- Advanced Pages