Learning Metadata-Agnostic Representations for Text-to-SQL In-Context Example Selection

In-context learning (ICL) is a powerful paradigm in which large language models (LLMs) benefit from task demonstrations added to the prompt. Yet selecting optimal demonstrations is not trivial, especially for complex or multi-modal tasks where input and output distributions differ. We hypothesize that forming task-specific representations of the input is key. In this paper, we propose a method to align representations of natural language questions and of SQL queries in a shared embedding space. Our technique, dubbed MARLO - Metadata-Agnostic Representation Learning for Text-tO-SQL - uses query structure to model querying intent without over-indexing on the underlying database metadata (i.e., the tables, columns, or domain-specific entities of a database referenced in the question or query). This allows MARLO to select examples that are structurally and semantically relevant to the task, rather than examples that are spuriously related to a particular domain or question phrasing. When used to retrieve examples based on question similarity, MARLO shows superior performance compared to generic embedding models (on average +2.9%pt. in execution accuracy) on the Spider benchmark. It also outperforms the next best method that masks metadata information by +0.8%pt. in execution accuracy on average, while imposing a significantly lower inference latency.
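The retrieval step described above - embedding the test question and selecting the most similar stored examples as demonstrations - can be sketched generically. This is a minimal illustration of embedding-similarity-based example selection, not MARLO itself; the embeddings here are hypothetical toy vectors standing in for the output of a trained encoder.

```python
import numpy as np

def select_examples(question_emb, example_embs, k=2):
    """Return indices of the k candidate demonstrations whose embeddings
    are most similar (by cosine similarity) to the question embedding.

    question_emb: shape (d,) embedding of the test question.
    example_embs: shape (n, d) embeddings of the candidate pool.
    """
    # Normalize so the dot product equals cosine similarity.
    q = question_emb / np.linalg.norm(question_emb)
    E = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = E @ q
    # Sort by descending similarity and keep the top k indices.
    return np.argsort(-sims)[:k].tolist()

# Toy pool: example 0 is collinear with the question, example 2 is close,
# example 1 is orthogonal (dissimilar).
question = np.array([1.0, 0.0])
pool = np.array([[2.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 0.2]])
print(select_examples(question, pool, k=2))  # -> [0, 2]
```

In a real pipeline, the selected examples (question-SQL pairs) would be prepended to the prompt as demonstrations; MARLO's contribution lies in how the encoder producing these embeddings is trained, so that similarity reflects query structure rather than shared metadata.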