HyperAI

Extract Structured Data from Long Documents Using Google’s LangExtract and Gemma 3

Using Google’s LangExtract and Gemma 3 for structured data extraction from complex documents like insurance policies, medical records, and compliance reports is a powerful way to transform unstructured text into actionable insights. These documents are often lengthy and filled with dense language, making it difficult for users to quickly locate critical information such as coverage limits, exclusions, or obligations. Large language models have emerged as essential tools for tackling this challenge: they can identify key entities and relationships within text, converting messy, free-form content into clean, structured data. In this guide, we explore how Google’s open-source LangExtract framework and the Gemma 3 LLM work together to deliver accurate and efficient extraction, with a hands-on example using a motor insurance policy.

LangExtract is a Python library designed to simplify structured information extraction from unstructured text. It allows developers to define what data to extract using natural language instructions. The framework supports named entity recognition and relationship extraction, enabling the model to link clauses to their conditions. Its ease of use—requiring just a few lines of code—makes it accessible even to non-experts.

At the core of this setup is Gemma 3, a family of lightweight, high-performance open models from Google. The 4B-parameter version used here runs efficiently on a single GPU and supports inputs up to 128K tokens, making it suitable for processing long documents. It is deployed locally via Ollama, a tool that simplifies running LLMs offline without relying on cloud services.

Under the hood, LangExtract uses several techniques to maintain accuracy across long documents. First, its intelligent chunking strategy splits text into logical segments based on sentence and paragraph boundaries, avoiding awkward cuts that could disrupt context.
Second, it supports parallel processing, allowing multiple chunks to be analyzed simultaneously, which improves performance without sacrificing quality. Third, it employs multiple extraction passes—running the model several times with different random seeds—to increase recall. Since LLMs can miss certain entities in a single run, repeating the process helps surface more data. Results from each pass are merged, with earlier passes taking priority in case of overlap.

To demonstrate, we processed a 10-page motor insurance policy from MSIG Singapore. After installing LangExtract and setting up Ollama to run Gemma 3 locally, we used PyMuPDF to extract text from the PDF. The document was then parsed into manageable chunks and passed through LangExtract.

We crafted a system prompt that specified a JSON output format, which is essential because Gemma 3 does not enforce structured output by default. Without this, the model might return unstructured text or malformed JSON, breaking downstream processing. We also included few-shot examples using LangExtract’s ExampleData class to show the model how to map policy language to structured entries.

The extraction ran successfully in under 10 minutes on a GPU with 8 GB of VRAM. The output was saved and post-processed to improve readability. The final result included structured entries for each exclusion clause, with the original text, a plain-English explanation, and metadata such as source line numbers.

This approach turns complex legal language into clear, interpretable information. It enables faster decision-making, better compliance tracking, and easier data integration into downstream systems. In summary, LangExtract and Gemma 3 together offer a robust, efficient, and scalable solution for extracting structured data from long, unstructured documents.
By combining smart chunking, parallel processing, and iterative extraction, they deliver high accuracy and recall—making them ideal for real-world applications in insurance, healthcare, and regulatory compliance. For those interested in trying this out, the full code and example are available in the GitHub repository. Follow along, experiment with your own documents, and explore the power of structured extraction with open tools.
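As a concrete illustration of the post-processing step mentioned above, the helper below (a hypothetical sketch, not code from the repository) strips the Markdown fences that Gemma 3 often wraps around its answer and parses the remainder into structured entries, failing loudly on malformed JSON so that broken output never reaches downstream systems.

```python
import json

def parse_model_output(raw: str) -> list[dict]:
    """Turn raw model output into a list of structured clause entries.

    Handles the common case where the model wraps its JSON in
    ```json ... ``` fences despite being asked for bare JSON.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence (with its optional "json" language tag)
        # and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit("```", 1)[0]
    entries = json.loads(text)  # raises ValueError on malformed JSON
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of clause entries")
    return entries

# Example: a fenced answer of the kind Gemma 3 might return.
raw_output = """```json
[{"clause": "Racing or pacemaking", "plain_english": "No cover during races.", "source_line": 42}]
```"""
entries = parse_model_output(raw_output)
```

Each parsed entry can then be enriched with metadata such as the source line number before being written out for review.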

Related Links