GovScape Searches Millions of End of Term Web Archive PDFs

A University of Washington research team has introduced GovScape, an AI-driven search system for the End of Term Web Archive, a massive repository preserving federal government websites and documents from each presidential administration since 2008. Currently indexed to cover ten million PDFs from Donald Trump’s first term, the platform lets researchers, journalists, and the public locate specific government records across a sprawling digital landscape.

GovScape supports three search modes. Standard keyword queries return exact text matches, while semantic search identifies topically relevant documents even when the precise terms are absent. A multimodal mode lets users find pages by visual attributes, such as redactions, aerial photography, or infographics.

To process the files, the system runs an automated pipeline that converts each PDF page into an image, extracts the embedded text, and generates numerical embeddings using highly optimized AI models. These embeddings function like a dynamic classification system, grouping pages by combined textual and visual similarity to accelerate retrieval.

The project’s computational efficiency stands out as a key achievement. By leveraging streamlined AI architectures, the team processed the entire ten-million-PDF collection for under $1,500, or roughly 47,000 pages per dollar. Commercial alternatives typically charge around one dollar per hundred pages of AI-powered parsing, about 470 times as much per page.

Led by Benjamin Charles Germain Lee, an assistant professor in the University of Washington Information School, the research addresses the growing challenge of information retrieval within expanding digital archives.
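The difference between keyword matching and embedding-based retrieval described above can be illustrated with a small sketch. It is not GovScape's implementation: the `Page` record, the document IDs, the page texts, and the three-dimensional vectors are all made up for illustration, and a real system would obtain both page and query vectors from a trained embedding model rather than writing them by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Page:
    doc_id: str             # hypothetical identifier
    text: str               # text extracted from one PDF page
    embedding: list[float]  # hand-written toy vector, NOT real model output


def keyword_search(pages, term):
    """Return pages whose extracted text contains the exact term (case-insensitive)."""
    term = term.lower()
    return [p for p in pages if term in p.text.lower()]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_search(pages, query_vec, k=2):
    """Return the k pages whose embeddings are most similar to the query vector."""
    ranked = sorted(pages, key=lambda p: cosine(p.embedding, query_vec), reverse=True)
    return ranked[:k]


# Toy corpus: the vectors are invented so that the two air-pollution pages
# point in a similar direction even though they share no keywords.
PAGES = [
    Page("epa-001", "Annual report on wildfire smoke and air quality.", [0.9, 0.1, 0.0]),
    Page("noaa-014", "Hurricane season outlook with aerial photography.", [0.2, 0.9, 0.1]),
    Page("doj-203", "Memorandum with redacted sections.", [0.0, 0.1, 0.9]),
    Page("epa-007", "Particulate pollution trends in western states.", [0.8, 0.2, 0.1]),
]

keyword_hits = [p.doc_id for p in keyword_search(PAGES, "air quality")]

# Made-up query vector standing in for an embedded query such as "air quality".
semantic_hits = [p.doc_id for p in embedding_search(PAGES, [1.0, 0.0, 0.0], k=2)]

print(keyword_hits)   # only the page containing the literal phrase
print(semantic_hits)  # also surfaces the related page that lacks the phrase
```

In this toy corpus the keyword query finds only the page that contains the literal phrase, while the similarity ranking also returns the related pollution page. The same ranking step would apply to a visual query if page images and queries were embedded into a shared vector space, which is an assumption about how multimodal filtering could work rather than a description of GovScape's models.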
As platforms like the Internet Archive approach a trillion archived pages, the ability to systematically filter and locate critical data becomes essential for historical preservation and public accountability. The team plans to expand GovScape’s indexing to encompass the full seventy million PDFs archived from 2008 through 2024. Future iterations may also integrate non-PDF government files, including spreadsheets, raw images, and HTML records.

The complete research findings will be presented on July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego, and a preprint is available on arXiv. By turning hard-to-search government repositories into accessible knowledge bases, GovScape aims to bolster democratic transparency and streamline the work of professionals who rely on federal data.
