Structured DB Search vs ChatGPT: How Close Are We Really?
Table of Contents
- Introduction: The Real Test for “ChatGPT-like” Database Search
- Why This Matters
- What’s Missing in Current Benchmarks
- How the Researchers Evaluated Tursio (and Fairly Compared It)
- Results: Relevance Is Comparable—But Not Everything Is Equal
- Failure Modes: The Bottleneck Isn’t the Model—It’s the Data
- What to Do Next: Better Evaluation + Better Data Coverage
- Key Takeaways
- Sources & Further Reading
Introduction: The Real Test for “ChatGPT-like” Database Search
Most people are already living in a world where asking questions in natural language is normal. If you want an answer about anything, you try a chatbot. If you want the latest information, you try an AI search engine. So it’s totally reasonable to ask: can we offer the same experience for enterprise databases—the stuff where the real business truth lives?
That’s the core question behind new research from the original paper: “Tursio Database Search: How far are we from ChatGPT?” The authors argue that a lot of existing evaluations don’t measure what business users actually care about. They focus on “is the SQL correct?” or “does the model answer a factual question?” but they don’t evaluate the end-to-end search experience: whether the response is relevant, safe, conversationally useful, and complete enough to act on.
To tackle this, the researchers build an evaluation framework designed specifically for structured database search. Then they use it to compare Tursio (a database search platform) against ChatGPT and Perplexity, using a realistic credit-union banking schema. The headline result? Tursio’s answer relevance is statistically comparable to both baselines, even though it’s answering from structured database records while the others use open web information.
Why This Matters
Right now, companies are racing to make “AI search” a product feature. But there’s a subtle risk: teams may optimize for the wrong metric. They might get impressive demo answers—until a real analyst asks something slightly abstract, multi-step, or tied to internal definitions that aren’t available on the open web.
This research matters now because the market is shifting from “AI can draft text” to “AI can do work.” In enterprises, “work” often means: find the right slice of operational data and summarize it in a way a business user can trust. That’s hard for two reasons:
- Enterprise schemas are messy and implicit. Column names can be cryptic, table relationships may not be obvious, and domain conventions might not be documented.
- Business questions are not “SQL-shaped.” Users express intent, not syntax. They talk like managers, analysts, or compliance officers—not like database engineers.
A scenario you could apply today
Imagine a credit union’s risk team asks a question like:
“What behavioral signals predict early warning when accounts move from 30 DPD to 90 DPD?”
If your system is basically a glorified NL-to-SQL converter, it may either:
- generate an incorrect query,
- or (worse) hallucinate an answer using patterns it “thinks” are plausible.
Tursio’s evaluation framework is trying to answer a more practical question: Does the system return an answer that’s relevant enough to the KPI and persona, and does it behave safely when data is missing? That’s exactly the kind of behavior you need before trusting AI outputs in regulated environments.
How this builds on earlier AI research
Classic work in NL-to-SQL and benchmark datasets like BIRD have pushed the field forward. But the paper highlights a key mismatch: benchmarks often assume near-literal translations from question to SQL. The researchers show that real production-like questions don’t behave that way—they can be 30–55× longer in SQL tokens than in question tokens (based on their analysis). That gap explains why “it works on benchmarks” doesn’t always survive contact with real enterprise workflows.
In other words: this isn’t just another model comparison. It’s an argument for evaluating the right thing: the user’s perceived search quality, not just intermediate correctness.
What’s Missing in Current Benchmarks
The paper makes a pretty sharp point: most benchmark categories fail to measure end-to-end enterprise search quality.
Open-domain QA benchmarks ignore schema reality
Open-domain QA datasets test whether a model can find or recall facts in a generic way. They typically don’t require:
- joins across normalized tables,
- aggregation logic,
- interpretation of domain-specific metrics.
So even if a model scores well, it may be solving a different problem entirely.
Text-to-SQL benchmarks measure SQL correctness—not user usefulness
Text-to-SQL benchmarks evaluate whether the generated SQL matches a reference (often by execution accuracy). But business users don’t live in the SQL world. They need:
- the right metric,
- the right filters,
- the right time windows,
- the right “meaning,” not just correct computation.
The paper’s token-length analysis alone is telling. They use the ratio of SQL token length to question token length as a proxy for how “literal” the task is.
- In BIRD, ratios are near 1× → questions are largely SQL-aware and literal.
- In BEAVER, ratios go up dramatically (up to 55× on one dataset split; 5× on another).
- In production Tursio logs, the long tail reaches 30×.
So real business queries require inferring implicit structure—tables, join paths, metric definitions—not simply translating words into SQL.
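The ratio itself is easy to compute. Here is a minimal sketch, assuming simple whitespace tokenization (the paper's exact tokenizer isn't specified):

```python
# Sketch of the token-length ratio used as a "literalness" proxy.
# Whitespace tokenization is an assumption; the paper's tokenizer may differ.

def token_ratio(question: str, sql: str) -> float:
    """Ratio of SQL tokens to question tokens; higher = less literal task."""
    q_tokens = question.split()
    s_tokens = sql.split()
    return len(s_tokens) / max(len(q_tokens), 1)

# A near-literal (BIRD-style) pair sits near 1x:
literal = token_ratio(
    "How many accounts are delinquent?",
    "SELECT COUNT(*) FROM accounts WHERE delinquent = 1;",
)

# A business-intent question expands into far more SQL
# (a 60-token query stands in for a long multi-join analysis):
analytic = token_ratio(
    "Show early-warning signals for 30-to-90 DPD transitions.",
    " ".join(["SELECT"] * 60),
)
assert analytic > literal
```

Computed over a whole benchmark, the distribution of this ratio separates "translation" tasks (near 1×) from "inference" tasks (10×–55×).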
The gap: no benchmark tests the “search experience” itself
This is the heart of the paper’s contribution. They want to evaluate whether a system gives the right end result in natural language, with:
- relevance,
- safety/bias awareness,
- and conversational usefulness.
And that’s what they implement next.
How the Researchers Evaluated Tursio (and Fairly Compared It)
This paper doesn’t just run a model and report numbers. It builds a full evaluation pipeline, because otherwise you can’t trust the comparison.
Step 1: Generate realistic banking questions across difficulty levels
They start with 15 manually curated “golden” questions, then expand them into about 150 synthetic examples using an LLM-based pipeline.
They control three dimensions:
- Persona (P): 13 banking personas, each combining two or more roles (e.g., Risk & Credit Analytics Manager; CRO; Compliance Officer).
- Difficulty (D): three levels
- Simple: single-table metric, straightforward counts/percentages/rankings
- Medium: filtering, grouping, segmentation, small joins, time windows
- Hard: multi-step analyses (trends, concentration risk, behavioral risk patterns)
- Example context (N): sampled examples per persona-difficulty setup
Why personas and KPIs matter: it prevents the system from being evaluated on generic “bank facts.” Instead, questions are tied to KPIs like 90+ DPD Rate, Delinquency Ratio, and Non-Performing Loan %—so relevance can be judged in context.
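The persona-by-difficulty generation grid can be sketched as a small loop. The persona names and level labels follow the paper's setup (13 personas, 3 difficulty levels); the prompt structure and the `n_per_cell` default are illustrative assumptions, not the paper's actual pipeline:

```python
# Hypothetical sketch of the persona x difficulty generation grid.
# Only 3 of the paper's 13 personas are listed here for brevity.
from itertools import product

PERSONAS = ["Risk & Credit Analytics Manager", "CRO", "Compliance Officer"]
DIFFICULTIES = ["simple", "medium", "hard"]

def build_prompts(golden_examples, n_per_cell=4):
    """One generation prompt per (persona, difficulty) cell,
    each seeded with sampled golden-question context (the N dimension)."""
    prompts = []
    for persona, difficulty in product(PERSONAS, DIFFICULTIES):
        prompts.append({
            "persona": persona,
            "difficulty": difficulty,
            "examples": golden_examples[:n_per_cell],
        })
    return prompts

grid = build_prompts(["What is the 90+ DPD rate by branch?"])
assert len(grid) == len(PERSONAS) * len(DIFFICULTIES)
```

Each cell's prompt would then be sent to the LLM to expand the 15 golden questions into the ~150 synthetic ones.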
Step 2: Map proprietary-schema questions to public-web equivalents
Here’s a fairness challenge: ChatGPT and Perplexity can’t query the proprietary Symitar credit-union database.
So the researchers use another LLM (Claude Sonnet 4.5) to rephrase each synthetic banking question into an open-domain equivalent that a web-based system could answer.
This mapping step is crucial: it tries to preserve analytical intent and difficulty while making it answerable using public sources.
They also apply quality checks using:
- an LLM review pass (GPT-4),
- and deterministic filters (duplicate detection, similarity filtering, length thresholds, diversity checks, etc.).
Step 3: Ask three systems to answer (single-turn) and evaluate with LLM-as-judge
They test:
- Tursio: answers using the structured Symitar schema
- ChatGPT (Dec 2025): answers open-domain mapped questions
- Perplexity (Dec 2025): answers open-domain mapped questions
All systems are constrained to 3–5 sentences to keep output comparable, and responses are collected in a single-turn setup (no clarification allowed).
For evaluation, they use DeepEval with GPT-4.1 as the judge, scoring answers on a [0, 1] scale and using a baseline success threshold of τ = 0.5.
Metrics include:
- Answer Relevancy (primary)
- Safety (reported as 1 − bias for interpretability)
- Conversation Completeness
- plus custom conversation metrics:
- Focus
- Engagement
- Helpfulness
- Voice
They then compute success rates (fraction of outputs scoring ≥ 0.5) and also look at score distributions—not just pass/fail.
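Turning judge scores into success rates is then a one-liner. This sketch uses the paper's τ = 0.5 threshold; the scores themselves are invented:

```python
# Success rate = fraction of judge scores in [0, 1] that clear tau = 0.5.
TAU = 0.5

def success_rate(scores):
    return sum(s >= TAU for s in scores) / len(scores)

relevancy_scores = [0.9, 0.8, 0.3, 1.0, 0.55]
rate = success_rate(relevancy_scores)  # 4 of 5 clear the threshold
assert rate == 0.8
```

Keeping the raw score distributions alongside the pass/fail rates is what lets the authors distinguish "barely passing" from "confidently relevant" answers.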
Results: Relevance Is Comparable—But Not Everything Is Equal
Let’s talk numbers, because that’s what everyone will ask about first.
Answer relevancy success rates (the main headline)
On simple questions:
- Tursio: 97.8%
- Perplexity: 96.7%
- ChatGPT: 98.1%
On medium questions:
- Tursio: 90.0%
- Perplexity: 90.0%
- ChatGPT: 100.0%
On hard questions:
- Tursio: 89.5%
- Perplexity: 100.0%
- ChatGPT: 100.0%
At first glance, Tursio looks slightly behind on medium and hard questions. But the paper then does the important part: statistical testing.
Statistical significance: Tursio is “indistinguishable” on relevancy
They run pairwise chi-square tests comparing success/failure counts for relevancy, with Yates’ correction.
Across all comparisons (Tursio vs ChatGPT, Tursio vs Perplexity; across simple/medium/hard), no differences are statistically significant at α = 0.05.
Even on hard questions, where ChatGPT and Perplexity hit 100% and Tursio is at 89.5%, the sample size isn’t large enough to claim significance.
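A pairwise test of this kind can be sketched with the textbook chi-square formula plus Yates' continuity correction (df = 1 for a 2×2 table). The counts below are back-solved from the reported hard-question rates (89.5% ≈ 17 of 19) and are illustrative, not the paper's exact tallies:

```python
import math

def yates_chi2_2x2(a, b, c, d):
    """Chi-square with Yates' correction for a 2x2 table.
    Rows are systems, columns are [successes, failures]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for obs, row, col in [(a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)]:
        exp = row * col / n
        chi2 += (abs(obs - exp) - 0.5) ** 2 / exp
    # p-value for df = 1 via the chi-square survival function
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Illustrative counts: Tursio 17/19 (89.5%) vs a baseline at 19/19 (100%)
chi2, p = yates_chi2_2x2(17, 2, 19, 0)
assert p > 0.05  # not significant at alpha = 0.05
```

With only ~19 hard questions per system, even a 100% vs 89.5% gap doesn't clear α = 0.05, which is exactly the paper's point about sample size.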
Conclusion: Tursio’s answer relevancy is statistically comparable to consumer AI chat/search systems—despite a completely different data access model.
So where do differences show up?
They dig into per-metric performance and score distributions.
For Tursio, many metrics stay high (often 80–100%), with safety consistently perfect (100%) and voice typically strong (90–100%).
But Conversation Completeness is where Tursio struggles:
- 93.3% on simple
- 45.0% on medium
- 52.6% on hard
Meanwhile, Perplexity shows a different kind of issue: it can be relevant, but conversational metrics can tank—especially on simpler domain questions.
ChatGPT, as expected, is an upper bound: it scores near-perfect on most metrics across difficulties.
Failure Modes: The Bottleneck Isn’t the Model—It’s the Data
This is where the paper gets genuinely interesting, because it challenges a common assumption: “LLMs will fail because they can’t understand.”
Tursio’s two main failure categories
They identify two root causes:
1) Missing data coverage (dominant on medium/hard)
When the database doesn’t contain the columns or records needed, Tursio returns a fallback like:
“Unfortunately no data points were found…”
This is correct behavior—Tursio does not hallucinate. But the evaluator still scores it low on relevancy because the answer fails to address the analytical intent.
Example failure:
- Asking about behavioral traits predicting early warning signals for transitions 30 DPD → 90 DPD
- Response: no data points found
- Relevancy score: 0.0
The paper says this pattern accounts for the majority of medium/hard failures.
2) Semantic mismatch (less frequent)
Sometimes Tursio retrieves relevant data but answers something slightly different.
Example:
- Question asks distribution of credit card balances
- Response discusses credit card limits
- Relevancy: 0.0
So interpretation errors exist, but they’re not the main limiter.
Completeness drops because real questions are multi-part
Conversation Completeness is correlated with difficulty. Medium/hard questions often require:
- multiple breakdowns,
- trend comparisons,
- time-window reasoning,
- concentration or behavioral patterns.
If the database only partially supports those sub-pieces, Tursio may provide a partial answer that stays relevant (so relevancy score can remain high) but fails completeness expectations.
So the paper’s thesis holds: model comprehension isn’t the primary bottleneck; database completeness is.
Perplexity’s “open-data bifurcation”
Perplexity’s performance pattern is almost the opposite of what you might expect:
- On simple questions: relevancy can be good, but completeness/focus/helpfulness are very low (often around 20–30%).
- On hard questions: those conversational metrics jump to ~90%+.
Why? The web has plenty of material for more research-like, analytic questions, but not necessarily the exact domain-specific details needed for comprehensive “simple” operational metrics in banking.
Also, response length constraints (3–5 sentences) interact with completeness evaluation. A short, accurate explanation might still be scored as incomplete.
ChatGPT as an upper bound
ChatGPT hits near-perfect scores, including safety and conversational criteria.
This doesn’t mean it’s “better at databases.” It means that in this setup:
- it can leverage broad pretraining knowledge and web-access patterns (depending on its configuration),
- and it’s not blocked by missing columns in a specific schema.
It’s an upper-bound baseline, not a drop-in replacement for grounded database answers.
What to Do Next: Better Evaluation + Better Data Coverage
The paper ends with concrete directions. And they’re the right kind of directions—not just “try a new model.”
1) Upgrade the evaluation methodology
Three specific improvements:
Multi-turn assessment
Real users refine queries:
- “No, I meant last quarter”
- “Break it down by region”
- “Compare to the previous period”
Future evaluations should test 3–5 exchanges, not single-turn answers.
Remove format artifacts
The 3–5 sentence cap can systematically penalize completeness for any system that needs to cover multiple sub-results. They suggest experimenting with unconstrained lengths to disentangle “bad completeness” from “too-short response.”
Reduce judge bias
LLM-as-judge can be biased. They suggest incorporating reward model approaches (they mention RewardBench) to better align scoring with human preferences.
2) Fix the real bottleneck: data coverage
If database completeness drives performance, then work must shift to:
- auditing coverage across query patterns,
- enriching missing fields (especially account-level and temporal data),
- improving fallback responses so “no data” answers still help the user (e.g., suggesting what filters to change, or clarifying which fields are missing).
3) Automate the pipeline for continuous evaluation
They note the current pipeline has manual steps limiting scalability.
Future work should automate:
- question generation,
- answer collection,
- metric computation,
- dashboards for tracking regressions/improvements over time.
This matters because systems and schemas evolve. Without automation, evaluation becomes a one-off research exercise instead of an operational tool.
And if you want the full picture, the evaluation framework and experiments are described in detail in the original paper: Tursio Database Search: How far are we from ChatGPT?.
Key Takeaways
- Tursio’s answer relevancy is statistically comparable to ChatGPT and Perplexity across simple, medium, and hard enterprise-style banking questions—even though it queries a structured database.
- The biggest difference isn’t “the model can’t understand.” The biggest bottleneck is database completeness: missing columns/data cause partial or fallback responses.
- Tursio performs especially well on precision-style metrics (relevance, safety, voice) but can struggle with conversation completeness on multi-part analytical queries.
- Perplexity shows an open-data bifurcation: sometimes highly relevant but conversationally shallow on domain-specific simple questions; stronger on harder analytical questions.
- Future progress likely comes from:
- better end-to-end evaluation (multi-turn, less format bias, less judge subjectivity),
- improved schema/data coverage and smarter “no data” fallbacks,
- and automated continuous evaluation pipelines.
Sources & Further Reading
- Original Research Paper: Tursio Database Search: How far are we from ChatGPT?
- Authors: Sulbha Jain, Shivani Tripathi, Shi Qiao, Alekh Jindal