HyperAI

Google Unveils EmbeddingGemma: A Compact, Multilingual Embedding Model for On-Device AI


Google has introduced EmbeddingGemma, a small, efficient, multilingual embedding model designed for on-device use. With just 308 million parameters and a 2,048-token context window, the model delivers state-of-the-art performance on the Massive Multilingual Text Embedding Benchmark (MMTEB), ranking as the top text-only multilingual embedding model under 500 million parameters. When quantized, it uses less than 200 MB of RAM, making it well suited to mobile and edge applications.

Built on the Gemma 3 transformer architecture, EmbeddingGemma uses bidirectional attention, turning the model into an encoder that captures richer semantic representations than traditional causal decoders. Input is processed through a mean pooling layer and two dense layers to produce 768-dimensional embeddings. The model was trained on a 320-billion-token multilingual corpus, carefully filtered to exclude harmful or low-quality content.

A key innovation is Matryoshka Representation Learning (MRL), which allows the 768-dimensional output to be truncated to 512, 256, or 128 dimensions without significant performance loss. This enables faster, more memory-efficient downstream tasks such as retrieval, clustering, and classification (a truncation sketch appears below).

EmbeddingGemma is open source and compatible with major AI frameworks. It integrates with Sentence Transformers, LangChain, LlamaIndex, Haystack, txtai, Transformers.js, Text Embeddings Inference (TEI), and ONNX Runtime. For optimal performance, inputs must carry task-specific prompts such as "task: search result | query: " for queries and "title: none | text: " for documents (see the usage sketch below).

The model can also be fine-tuned for domain-specific tasks. In a demonstration, researchers fine-tuned EmbeddingGemma on the Medical Instruction and Retrieval Dataset (MIRIAD), producing a version called sentence-transformers/embeddinggemma-300m-medical. This specialized model achieved an NDCG@10 score of 0.8862 on medical retrieval tasks, surpassing larger general-purpose models and even outperforming models twice its size. Training completed in about 5.5 hours on an RTX 3090 GPU, using Cached Multiple Negatives Ranking Loss and a custom evaluator (a fine-tuning sketch follows below). The fine-tuned model is available on Hugging Face and can be deployed via Docker, ONNX, or web-based frameworks such as Transformers.js.

With its compact size, multilingual support, high performance, and broad compatibility, EmbeddingGemma represents a major step forward in making high-quality embeddings accessible for real-world, on-device AI applications.
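
The retrieval prompts described above might be used roughly as in the sketch below with the Sentence Transformers library. The prompt strings come from the article; the model ID google/embeddinggemma-300m and the specific encode/similarity calls are assumptions based on the standard Sentence Transformers API.

```python
# Minimal retrieval sketch with Sentence Transformers (model ID assumed, not from the article).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed Hugging Face ID

# EmbeddingGemma expects task-specific prompts; these strings are quoted in the article.
query_prompt = "task: search result | query: "
doc_prompt = "title: none | text: "

query = "How is hypertension treated?"
docs = [
    "First-line treatment for hypertension includes lifestyle changes and ACE inhibitors.",
    "The Eiffel Tower is located in Paris, France.",
]

# `prompt=` prepends the given string to each input before encoding.
query_emb = model.encode([query], prompt=query_prompt)
doc_embs = model.encode(docs, prompt=doc_prompt)

# Similarity between the query and each document; the medical passage should score higher.
scores = model.similarity(query_emb, doc_embs)
print(scores)
```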
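A minimal sketch of Matryoshka-style truncation: the 768-dimensional embeddings are cut to their leading 256 dimensions and re-normalized so cosine similarity still behaves sensibly. The model ID is again an assumption; only the 768/512/256/128 dimension options come from the article.

```python
# Matryoshka (MRL) truncation sketch: shrink 768-dim embeddings to 256 dims.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed Hugging Face ID

sentences = ["on-device retrieval", "edge deployment of embedding models"]
full = model.encode(sentences, prompt="title: none | text: ")  # shape (2, 768)

# Keep only the first 256 dimensions, then re-normalize each vector to unit length.
truncated = full[:, :256]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

print(full.shape, truncated.shape)  # (2, 768) (2, 256)
```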
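The fine-tuning run described above could look roughly like the following sketch using the Sentence Transformers trainer with Cached Multiple Negatives Ranking Loss. The dataset ID, column names, and hyperparameters are illustrative assumptions, not details reported in the article.

```python
# Rough fine-tuning sketch (dataset ID, columns, and hyperparameters are assumptions).
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed base model ID

# Hypothetical (question, passage) pairs; the real MIRIAD setup may differ.
train_dataset = load_dataset("miriad/miriad-4.4M", split="train[:100000]")
train_dataset = train_dataset.select_columns(["question", "passage_text"])

# Cached MNRL treats other in-batch passages as negatives while keeping memory use low.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=8)

args = SentenceTransformerTrainingArguments(
    output_dir="embeddinggemma-300m-medical",
    num_train_epochs=1,
    per_device_train_batch_size=256,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    bf16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save_pretrained("embeddinggemma-300m-medical/final")
```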
