How top Text-to-SQL performance was achieved
Source: Automatic Metadata Extraction for Text-to-SQL — arXiv:2505.19988 PDF [1](https://arxiv.org/pdf/2505.19988)
- Benchmark: BIRD benchmark used for evaluation [2](https://arxiv.org/pdf/2505.19988)
- Approach: Profiling, query log analysis, and SQL→text metadata [3](https://arxiv.org/pdf/2505.19988)
- Result: #1 on BIRD during multiple periods in 2024–2025 [4](https://arxiv.org/pdf/2505.19988)
// Core idea
// Make databases self-describing with automatically extracted metadata [5](https://arxiv.org/pdf/2505.19988)
// Then let an LLM compose SQL with reduced ambiguity [6](https://arxiv.org/pdf/2505.19988)
Problem context
Text-to-SQL systems struggle because real databases often lack accurate documentation and have evolving schemas, inconsistent formats, and complex join paths [7](https://arxiv.org/pdf/2505.19988).
- Documentation can be missing or outdated, and subject-matter experts (SMEs) may not fully know the current contents [8](https://arxiv.org/pdf/2505.19988).
- Fields can have multiple inconsistent formats, e.g., person-name layouts vary across rows [9](https://arxiv.org/pdf/2505.19988).
- Related date fields and conditional joins complicate correct query construction [10](https://arxiv.org/pdf/2505.19988).
Key idea: Automatic metadata extraction
The system aggregates three kinds of metadata: classic profiling, query log analysis, and SQL→text generation using an LLM [11](https://arxiv.org/pdf/2505.19988).
- Profiling: derive statistics, value patterns, keys, and relationships to clarify column semantics (a minimal sketch follows this list) [12](https://arxiv.org/pdf/2505.19988).
- Query log analysis: mine past joins, filters, and groupings to infer practical usage of schema elements [13](https://arxiv.org/pdf/2505.19988).
- SQL→text: convert queries to natural-language paraphrases to build question-level hints and candidate selectors [14](https://arxiv.org/pdf/2505.19988).
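To make the profiling step concrete, the following is a minimal Python sketch of per-column profiling over a SQLite database. The statistics, the value-pattern heuristic, and the table and file names are illustrative assumptions, not the paper's implementation.

```python
import sqlite3
from collections import Counter

def profile_column(conn, table, column, sample_limit=1000):
    """Collect simple statistics that hint at a column's semantics.

    Illustrative sketch: the exact statistics and thresholds are assumptions,
    not the profiler described in the paper.
    """
    cur = conn.cursor()
    cur.execute(
        f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}") FROM "{table}"'
    )
    row_count, non_null, distinct = cur.fetchone()

    cur.execute(
        f'SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL LIMIT ?',
        (sample_limit,),
    )
    samples = [r[0] for r in cur.fetchall()]

    # Crude value-pattern summary: digits become 9, letters become A,
    # punctuation is kept, so "95-123" and "90-456" share the pattern "99-999".
    def pattern(value):
        return "".join(
            "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in str(value)
        )

    top_patterns = Counter(pattern(v) for v in samples).most_common(3)

    return {
        "table": table,
        "column": column,
        "row_count": row_count,
        "null_fraction": 1 - non_null / row_count if row_count else 0.0,
        "distinct_count": distinct,
        "looks_like_key": row_count > 0 and distinct == row_count,
        "example_values": samples[:5],
        "top_value_patterns": top_patterns,
    }

if __name__ == "__main__":
    # Any SQLite database in the BIRD style works here; names are placeholders.
    conn = sqlite3.connect("california_schools.sqlite")
    columns = [row[1] for row in conn.execute('PRAGMA table_info("schools")')]
    for col in columns:
        print(profile_column(conn, "schools", col))
```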
For the BIRD test databases, only profiling was used because query logs are not provided, and no fine-tuned model was required; GPT-4 alone sufficed [15](https://arxiv.org/pdf/2505.19988).
Empirical performance
The approach held the highest score on BIRD during Sept 1–23, 2024 and Nov 11–23, 2024, regained #1 on Mar 11, 2025, and remained #1 as of May 2025 [16](https://arxiv.org/pdf/2505.19988).
The evaluation highlights that robust metadata alone can materially reduce LLM ambiguity in text-to-SQL [17](https://arxiv.org/pdf/2505.19988).
Why it works
Profiling captures ground truth about values, keys, and distributions, enabling precise column selection and filter formation from natural language prompts [18](https://arxiv.org/pdf/2505.19988).
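As a hedged illustration of how such profiles might be surfaced to the model, the sketch below renders the output of the profiler sketched above into annotated schema text for the prompt; the function name and formatting are assumptions, not the paper's prompt format.

```python
def render_schema_hint(profiles):
    """Format per-column profiles (as returned by profile_column above)
    into annotated schema lines an LLM can condition on."""
    lines = []
    for p in profiles:
        notes = []
        if p["looks_like_key"]:
            notes.append("unique, likely a key")
        notes.append(f"{p['distinct_count']} distinct values")
        if p["example_values"]:
            examples = ", ".join(map(str, p["example_values"][:3]))
            notes.append(f"examples: {examples}")
        lines.append(f'{p["table"]}.{p["column"]} ({"; ".join(notes)})')
    return "\n".join(lines)
```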
SQL→text paraphrases create a bridge between query structure and user intent, helping candidate generation and selection workflows [19](https://arxiv.org/pdf/2505.19988).
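As one way the SQL→text step could look, the sketch below asks an LLM to restate a logged query as the question it answers; the prompt wording and the call_llm helper are hypothetical stand-ins, not the paper's code or API.

```python
def sql_to_question(sql: str, schema_hint: str, call_llm) -> str:
    """Ask an LLM to restate a SQL query as an unambiguous natural-language question.

    call_llm is a placeholder for whatever chat-completion client is available;
    it takes a prompt string and returns the model's text.
    """
    prompt = (
        "Given this database schema summary:\n"
        f"{schema_hint}\n\n"
        "Restate the following SQL query as a single, unambiguous natural-language "
        "question. Mention every filter and join condition explicitly.\n\n"
        f"SQL:\n{sql}\n\nQuestion:"
    )
    return call_llm(prompt).strip()

# Example use with the comments/posts query shown later on this page:
example_sql = (
    "SELECT COUNT(T1.id) FROM comments AS T1 "
    "JOIN posts AS T2 ON T1.PostId = T2.Id "
    "WHERE T2.CommentCount = 1 AND T2.Score = 0"
)
# question = sql_to_question(example_sql, render_schema_hint(profiles), call_llm)
```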
When query logs exist, mined join paths and predicates encode production best practices for composing complex joins [20](https://arxiv.org/pdf/2505.19988).
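To show what mining a log could look like in practice, here is a regex-based Python sketch that counts recurring join conditions; a real implementation would use a proper SQL parser, and the log format, example queries, and regex here are illustrative assumptions.

```python
import re
from collections import Counter

# Only handles simple "JOIN t [AS alias] ON a.x = b.y" shapes; enough to show the idea.
JOIN_RE = re.compile(
    r"JOIN\s+(\w+)(?:\s+AS\s+\w+)?\s+ON\s+([\w.]+)\s*=\s*([\w.]+)",
    re.IGNORECASE,
)

def mine_join_paths(query_log):
    """Return the most common join conditions seen in a list of logged SQL strings."""
    joins = Counter()
    for _table, left, right in (m for sql in query_log for m in JOIN_RE.findall(sql)):
        # Normalize so "a.x = b.y" and "b.y = a.x" count as the same join edge.
        edge = tuple(sorted((left.lower(), right.lower())))
        joins[edge] += 1
    return joins.most_common()

# Tiny fake log (illustrative queries, not taken from the paper):
log = [
    "SELECT COUNT(*) FROM comments AS c JOIN posts AS p ON c.PostId = p.Id",
    "SELECT p.Title FROM posts AS p JOIN users AS u ON p.OwnerUserId = u.Id",
    "SELECT * FROM comments AS c JOIN posts AS p ON p.Id = c.PostId",
]
print(mine_join_paths(log))
# [(('c.postid', 'p.id'), 2), (('p.owneruserid', 'u.id'), 1)]
```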
// Example: improving question quality
// The paper shows cases where supplied questions mismatch SQL
// and metadata-guided paraphrases clarify intent [21](https://arxiv.org/pdf/2505.19988)
Illustrative examples
One example asks: "In posts with 1 comment, how many of the comments have 0 score?" The SQL counts comments joined to posts where CommentCount = 1 and Score = 0, and metadata-guided rewording of the question increases accuracy [22](https://arxiv.org/pdf/2505.19988).
```sql
SELECT COUNT(T1.id)
FROM comments AS T1
JOIN posts AS T2 ON T1.PostId = T2.Id
WHERE T2.CommentCount = 1 AND T2.Score = 0 -- matches reworded question
```
[23](https://arxiv.org/pdf/2505.19988)
Another example asks for the average SAT participation of schools opened in 1980 in Fresno County; the generated question is clearer than the vague original [24](https://arxiv.org/pdf/2505.19988).
Cases also include mismatches, such as a question about carcinogenic status when the SQL only returns a bond type, illustrating why metadata-grounded paraphrasing matters [25](https://arxiv.org/pdf/2505.19988).
Positioning vs. related work
- A similarity-based few-shot approach achieved the top ranking on Spider using a tuned LLM and vector-database retrieval [26](https://arxiv.org/pdf/2505.19988).
- CHESS used LLM-driven schema linking and candidate selection for BIRD, peaking at #1 [27](https://arxiv.org/pdf/2505.19988).
- Distillery argued that strong LLMs reduce the need for explicit schema linking and reached #1 on BIRD before dropping in rank [28](https://arxiv.org/pdf/2505.19988).
- IBM marketing materials cite extractive schema linking with a tuned Granite LLM for their BIRD results [29](https://arxiv.org/pdf/2505.19988).
- Chase used query-shape diversity and a tuned Gemini model for competitive results, trading the #1 spot on BIRD [30](https://arxiv.org/pdf/2505.19988).
- XiYan-SQL ensembles candidate SQL from multiple models and uses a selector model to pick outputs, leading the benchmark [31](https://arxiv.org/pdf/2505.19988).
Security and operations
This is a static site with no form inputs and no data collection, which minimizes the attack surface and protects visitor privacy.
If extended, apply input validation, authentication for any private endpoints, rate limiting, and content security policies appropriate for public hosting.