How top Text-to-SQL performance was achieved

Source: Automatic Metadata Extraction for Text-to-SQL — arXiv:2505.19988 PDF [1](https://arxiv.org/pdf/2505.19988)

Benchmark: BIRD benchmark used for evaluation [2](https://arxiv.org/pdf/2505.19988)

Approach: Profiling, query log analysis, and SQL→text metadata [3](https://arxiv.org/pdf/2505.19988)

Result: #1 on BIRD during multiple periods in 2024–2025 [4](https://arxiv.org/pdf/2505.19988)

// Core idea
// Make databases self-describing with automatically extracted metadata [5](https://arxiv.org/pdf/2505.19988)
// Then let an LLM compose SQL with reduced ambiguity [6](https://arxiv.org/pdf/2505.19988)

Problem context

Text-to-SQL systems struggle because real databases often lack accurate documentation and have evolving schemas, inconsistent formats, and complex join paths [7](https://arxiv.org/pdf/2505.19988).

Key idea: Automatic metadata extraction

The system aggregates three kinds of metadata: classic profiling, query log analysis, and SQL→text generation using an LLM [11](https://arxiv.org/pdf/2505.19988).
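As a rough sketch of how these three sources might be combined into context for the model, the Python below defines an illustrative per-column metadata record and renders it into a prompt block; the class, field names, and prompt layout are assumptions made for illustration, not the paper's implementation.

from dataclasses import dataclass, field

@dataclass
class ColumnMetadata:
    table: str
    column: str
    profile: dict = field(default_factory=dict)    # statistics from data profiling
    log_usage: list = field(default_factory=list)  # join/filter patterns mined from query logs
    description: str = ""                          # SQL→text paraphrase produced by an LLM

def build_prompt_context(metadata: list[ColumnMetadata], question: str) -> str:
    # Render the aggregated metadata into a text block the LLM can condition on.
    lines = []
    for m in metadata:
        lines.append(f"{m.table}.{m.column}: {m.description}")
        if m.profile:
            lines.append(f"  profile: {m.profile}")
        if m.log_usage:
            lines.append(f"  seen in logged queries as: {'; '.join(m.log_usage)}")
    return "Database metadata:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nSQL:"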

For the BIRD test databases, only profiling metadata was used because query logs are not provided, and the system relied on GPT-4 without requiring any fine-tuned model [15](https://arxiv.org/pdf/2505.19988).

Empirical performance

The approach held the highest score on BIRD during Sept 1–23 and Nov 11–23, 2024, regained #1 on Mar 11, 2025, and remained #1 as of May 2025 [16](https://arxiv.org/pdf/2505.19988).

The evaluation highlights that robust metadata alone can materially reduce the ambiguity an LLM faces in text-to-SQL [17](https://arxiv.org/pdf/2505.19988).

Why it works

Profiling captures ground truth about values, keys, and distributions, enabling precise column selection and filter formation from natural language prompts [18](https://arxiv.org/pdf/2505.19988).
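A minimal sketch of this kind of per-column profiling over SQLite, using only standard aggregates; the function name and the exact statistics collected are assumptions, not the paper's profiler.

import sqlite3

def profile_column(conn: sqlite3.Connection, table: str, column: str) -> dict:
    # Collect simple per-column statistics: row count, distinct count,
    # min/max, and a few sample values the LLM can use to form exact filters.
    cur = conn.cursor()
    cur.execute(
        f'SELECT COUNT(*), COUNT(DISTINCT "{column}"), MIN("{column}"), MAX("{column}") '
        f'FROM "{table}"'
    )
    total, distinct, min_v, max_v = cur.fetchone()
    cur.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL LIMIT 5'
    )
    samples = [row[0] for row in cur.fetchall()]
    return {
        "rows": total,
        "distinct": distinct,
        "min": min_v,
        "max": max_v,
        "sample_values": samples,
        "looks_like_key": total > 0 and distinct == total,  # rough uniqueness signal
    }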

SQL→text paraphrases create a bridge between query structure and user intent, helping candidate generation and selection workflows [19](https://arxiv.org/pdf/2505.19988).
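A hedged sketch of the SQL→text step using the OpenAI chat completions API (the paper reports relying on GPT-4 for BIRD); the prompt wording below is an assumption for illustration, not the paper's prompt.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sql_to_text(sql: str) -> str:
    # Ask the model to paraphrase a SQL query as the question it answers.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": (
                    "Rewrite the following SQL query as a single, precise "
                    "natural-language question a user might ask:\n\n" + sql
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip()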

When query logs exist, mined join paths and predicates encode production best practices for composing complex joins [20](https://arxiv.org/pdf/2505.19988).
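One simplified way to mine such patterns is a pattern match over equality join conditions in logged SQL, as sketched below; a production system would parse the queries properly and resolve aliases, so treat this as an illustrative assumption rather than the paper's method.

import re
from collections import Counter

# Matches equality join predicates such as "c.PostId = p.Id".
# Note: this captures aliases as written; resolving them to base tables is omitted.
JOIN_PREDICATE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")

def mine_join_paths(query_log: list[str]) -> Counter:
    # Count how often each column pair is joined across the logged queries.
    counts: Counter = Counter()
    for sql in query_log:
        for t1, c1, t2, c2 in JOIN_PREDICATE.findall(sql):
            key = tuple(sorted((f"{t1}.{c1}", f"{t2}.{c2}")))
            counts[key] += 1
    return counts

# Tiny fabricated log, purely for illustration:
log = [
    "SELECT COUNT(*) FROM comments c JOIN posts p ON c.PostId = p.Id WHERE p.Score > 10",
    "SELECT p.Id FROM posts p JOIN comments c ON c.PostId = p.Id",
]
print(mine_join_paths(log).most_common(1))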

// Example: improving question quality
// The paper shows cases where supplied questions mismatch SQL
// and metadata-guided paraphrases clarify intent [21](https://arxiv.org/pdf/2505.19988)

Illustrative examples

One example asks, "In posts with 1 comment, how many of the comments have 0 score?" The SQL counts comments joined to posts where CommentCount = 1 and Score = 0, and metadata-guided rewording of the question improves accuracy [22](https://arxiv.org/pdf/2505.19988).

SELECT COUNT(T1.id)
FROM comments AS T1
JOIN posts AS T2 ON T1.PostId = T2.Id
WHERE T2.CommentCount = 1 AND T2.Score = 0 -- matches reworded question [23](https://arxiv.org/pdf/2505.19988)

Another example shows average SAT participation for schools opened in 1980 in Fresno County, where the generated question improves clarity compared to a vague original [24](https://arxiv.org/pdf/2505.19988).

Cases also include mismatches such as a question asking about carcinogenic status while the SQL only returns a bond type, illustrating why metadata-grounded paraphrasing matters [25](https://arxiv.org/pdf/2505.19988).

Security and operations

This is a static site with no form inputs and no data collection, which minimizes attack surface and protects user privacy.

If the site is extended, apply input validation, authentication for any private endpoints, rate limiting, and a content security policy appropriate for public hosting.