Google Unveils EmbeddingGemma: A Lightweight, High-Performance Embedding Model for On-Device AI
Built on the Gemma 3 architecture and trained on over 100 languages, it can run on less than 200MB of RAM with quantization.
Google has launched EmbeddingGemma, a compact open embedding model designed for on-device AI, delivering state-of-the-art performance despite its small size. With 308 million parameters, the model is tailored for applications like Retrieval-Augmented Generation (RAG) and semantic search, running efficiently on everyday hardware without requiring an internet connection.
According to Google, “EmbeddingGemma is the highest ranking open multilingual text embedding model under 500M on the Massive Text Embedding Benchmark (MTEB).”
Introducing EmbeddingGemma, our newest open model that can run completely on-device. It's the top model under 500M parameters on the MTEB benchmark and comparable to models nearly 2x its size – enabling state-of-the-art embeddings for search, retrieval + more.
— Sundar Pichai (@sundarpichai) September 4, 2025
Thanks to its Gemma 3 architecture and training on over 100 languages, the quantized model runs in less than 200MB of RAM, making it accessible on mobile phones, laptops, and desktops.
The model supports Matryoshka Representation Learning, offering flexible output dimensions from 768 down to 128 for speed and storage efficiency. With inference times under 15 milliseconds for short inputs on EdgeTPU, it enables real-time interactions for RAG pipelines, chatbots, and personalised search.
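With Matryoshka Representation Learning, the leading dimensions of an embedding carry the most information, so shrinking a vector is just a matter of truncating it and renormalizing. A minimal sketch of that post-processing step in NumPy (the random vector is a placeholder standing in for real model output):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and L2-renormalize,
    so cosine similarity still behaves on the shorter vector."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

# Placeholder for a real 768-dim EmbeddingGemma output vector.
full = np.random.default_rng(0).normal(size=768)
small = truncate_embedding(full, 128)  # 6x less storage per vector
print(small.shape)  # (128,)
```

Truncating to 128 dimensions cuts vector storage and search cost by roughly 6x, at a modest quality cost, which is the trade-off the flexible output dimensions are meant to expose.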
EmbeddingGemma integrates seamlessly with popular developer tools including sentence-transformers, llama.cpp, MLX, Ollama, LiteRT, transformers.js, LMStudio, Weaviate, Cloudflare, LlamaIndex, and LangChain. Developers can use it for multilingual search, document retrieval, classification, clustering, and offline chatbot applications.
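In a RAG or semantic-search pipeline built on any of these tools, the retrieval step reduces to a nearest-neighbor search over the vectors the model produces. A toy illustration of that step using cosine similarity, with small hand-written vectors standing in for real EmbeddingGemma embeddings:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy 4-dim vectors standing in for real EmbeddingGemma embeddings.
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # document about topic A
    [0.0, 0.0, 0.8, 0.2],   # document about topic B
    [0.7, 0.3, 0.1, 0.0],   # another document about topic A
])
query = np.array([1.0, 0.0, 0.0, 0.0])  # a topic-A query
idx, scores = top_k(query, docs)
print(idx)  # -> [0 2]: the two topic-A documents rank highest
```

In practice the vectors would come from the model via one of the listed integrations, and a vector store such as Weaviate would replace the brute-force matrix product for large corpora.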
By processing data directly on-device, the model ensures greater privacy while maintaining high-quality embeddings. It also shares the tokenizer with Gemma 3n, reducing memory requirements in AI workflows.
The model is now available for download on Hugging Face, Kaggle, and Vertex AI, with integration guides and quickstart examples provided in the Gemma Cookbook.