When Your Spreadsheets Speak Clearly: A Hybrid Engine that Reads Headers, Not Just Cells
If you’ve ever tried to ask a spreadsheet a question and ended up chasing the right cell through a tangle of merged headers, unit lines, and multi-row headings, you’re not alone. Real spreadsheets aren’t tidy databases; they’re living documents that carry meaning in their structure as much as in their numbers. That’s what makes accurate question answering (QA) over spreadsheets so hard—and why a new approach named SQuARE is such a big deal.
SQuARE (Structured Query & Adaptive Retrieval Engine For Tabular Formats) tackles the problem with a simple idea: don’t force every table through one back-end hammer. Instead, it looks at how a sheet is built, decides whether the core meaning lies in the header structure or in the numeric data itself, and then routes the question to the best retrieval method. It keeps header paths, time labels, and units intact, so the answers you get aren’t just correct in numbers but faithful to the original cells and their context. Think of it as giving the spreadsheet a voice and a reliable translator.
In this post, we’ll break down what SQuARE does, why it matters in real-world workflows, and what that could mean for you if you’re building or using AI-assisted data tools.
Why spreadsheets confuse QA—and how SQuARE fixes that
Spreadsheets are the de facto workspace for quantitative thinking across industries—from corporate finance to public datasets. But QA over them is tricky for two big reasons:
- Naive chunking loses structure. If you slice a table into fixed-size text chunks, you might break apart a header path or a unit line. A subtotal vs. a grand total, or “USD (millions)” – these pieces of meaning live in the layout, not just in the cells. When you separate them, you lose the thread that lets you answer with confidence.
- Rigid SQL expects tidy schemas. Traditional SQL queries assume stable columns and straightforward headers. Real-world sheets with multi-row headers, merged cells, and parenthetical units don’t play nice with a single, flat schema. You can end up querying the wrong column or ignoring a unit line entirely.
SQuARE’s answer is to be smart about structure. It makes a sheet-level, complexity-aware routing decision: if a sheet has deep headers and many merges (a Multi-Header sheet), it routes to structure-preserving chunk retrieval. If a sheet is relatively flat, it routes to a constrained SQL path over an automatically inferred relational view. An agent sits on top to refine results when confidence is low, merging evidence from both paths when helpful. The result is a system that preserves the integrity of header hierarchies, units, and time labels while still delivering fast, auditable answers.
To give a mental model: imagine you’re searching for “Total equity in FY2023” in a balance sheet that has three header rows and a line that says “USD (millions)”. The correct answer depends on following the full header path and respecting the unit line. A fixed-window chunk approach might mix the header with the data or miss the unit context. A pure SQL approach might map to the wrong column (or ignore the unit row). SQuARE’s hybrid strategy preserves what matters and uses the right tool for the job.
How SQuARE works, step by step
SQuARE isn’t a single trick; it’s a modular framework that can swap in different embeddings, vector stores, and databases while keeping a stable control flow. Here’s the gist of the core ideas, translated into plain language.
1) A structural complexity score guides the routing decision
Key idea: look at a sheet’s layout to decide how “hard” it is to interpret.
It quantifies two observable features:
- Header depth: how many nested header rows exist.
- Merge density: how many cells in the header region are merged or split (i.e., non-simple layouts).
The system then uses a sheet-normalized boundary to label sheets as:
- Multi-Header: complex structures where header paths and units matter.
- Flat: simpler, more SQL-friendly layouts.
This classification isn’t a blunt, one-size-fits-all threshold. Instead, SQuARE uses a threshold that scales with how wide the header span is, so it’s not biased toward small or large tables alone.
Why this matters: by knowing which path to trust for a given sheet, SQuARE avoids brittle behavior. It doesn’t waste time forcing a complex sheet through SQL or a simple sheet through heavy structure-preserving methods when one path is clearly safer.
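To make the routing idea concrete, here is a minimal sketch in Python. The post does not give SQuARE's exact formula, so the `HeaderRegion` type, the score, and the `boundary_per_column` value are all illustrative assumptions; only the two inputs (header depth and merge density) and the span-scaled boundary come from the description above.

```python
from dataclasses import dataclass


@dataclass
class HeaderRegion:
    """Hypothetical summary of a sheet's header area (not SQuARE's own type)."""
    depth: int         # number of stacked header rows
    span: int          # number of columns the header covers
    merged_cells: int  # header cells that belong to a merged range


def merge_density(h: HeaderRegion) -> float:
    """Fraction of header cells that sit inside a merged range."""
    total = h.depth * h.span
    return h.merged_cells / total if total else 0.0


def route_sheet(h: HeaderRegion, boundary_per_column: float = 0.5) -> str:
    """Label a sheet 'Multi-Header' or 'Flat' using a span-scaled boundary."""
    # Assumed score: cells in header rows beyond the first, plus merged cells.
    structural_cells = (h.depth - 1) * h.span + h.merged_cells
    # The boundary grows with header width, so table size alone does not
    # decide the label.
    boundary = boundary_per_column * h.span
    return "Multi-Header" if structural_cells > boundary else "Flat"


# Made-up example sheets.
balance_sheet = HeaderRegion(depth=3, span=6, merged_cells=4)
public_table = HeaderRegion(depth=1, span=8, merged_cells=0)
print(route_sheet(balance_sheet), round(merge_density(balance_sheet), 2))
print(route_sheet(public_table), merge_density(public_table))
```

In practice the depth and merge counts would come from whatever spreadsheet parser you use; the point of the sketch is only that the decision is made once per sheet, from layout features, before any question is answered.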
2) Two retrieval paths, chosen by structure, not by whim
Structure-preserving chunk retrieval (for Multi-Header sheets)
- Segmentation happens at header boundaries, not at arbitrary fixed rows.
- Each segment carries rich metadata: the header path (outer to inner labels), time labels (years, quarters), and the unit string if present.
- Small “descriptions” accompany each block to summarize what it contains, so embeddings can reason about content without indexing raw cells wholesale.
- At query time, the system fetches a small set of top blocks (based on cosine similarity between the query and block descriptions), and uses those blocks as context to answer. This keeps the structure intact and ensures numerical fidelity.
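The chunk path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not SQuARE's code: the `Block` type and its field names are hypothetical, the sample descriptions are made up, and `embed` is a toy bag-of-words stand-in for a real embedding model. What it does mirror is the idea of ranking block descriptions by cosine similarity while the header path, time labels, and unit travel with each block.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    """Hypothetical header-bounded segment with its metadata."""
    header_path: Tuple[str, ...]   # outer to inner labels
    time_labels: Tuple[str, ...]   # e.g. fiscal years
    unit: Optional[str]            # unit string, if the sheet has one
    description: str               # short summary that gets embedded


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_blocks(query: str, blocks: List[Block], k: int = 2) -> List[Block]:
    """Return the k blocks whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(blocks, key=lambda b: cosine(q, embed(b.description)), reverse=True)
    return ranked[:k]


blocks = [
    Block(("Balance Sheet", "Equity"), ("FY2022", "FY2023"), "USD (millions)",
          "Total equity and retained earnings for FY2022 and FY2023 in USD millions"),
    Block(("Balance Sheet", "Liabilities"), ("FY2022", "FY2023"), "USD (millions)",
          "Current and long term liabilities for FY2022 and FY2023 in USD millions"),
]

best = top_blocks("Total equity in FY2023", blocks, k=1)[0]
print(" > ".join(best.header_path), "|", best.unit)
```

Because each retrieved block keeps its own header path and unit, the answering step never has to guess which unit line applies to a number.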
Structure-aware SQL over a relational view (for Flat sheets)
- The sheet is turned into a cleansed schema: column names, types, and any detected units are recorded.
- An LLM proposes a SQL query (restricted to a safe, whitelisted set of operations: SELECT, FROM, WHERE, GROUP BY, etc.). The system then runs the query against a small, inferred relational view.
- The SQL path is careful and deterministic, providing exact, auditable results for filters and aggregations. It also has a built-in guard to avoid dangerous or unsupported operations.
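A guarded SQL path might look something like the sketch below. It uses Python's built-in `sqlite3` module as the relational view; the table name, columns, and rows are made up, and the keyword check is an assumed, deliberately conservative guard rather than SQuARE's actual whitelist. In the real system the query string would come from an LLM; here it is hard-coded.

```python
import re
import sqlite3

# Assumption: a conservative guard for illustration, not SQuARE's whitelist.
STARTS_WITH_SELECT = re.compile(r"^\s*select\b", re.IGNORECASE)
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|detach|pragma|replace)\b",
    re.IGNORECASE,
)


def is_safe(sql: str) -> bool:
    """Accept only a single SELECT statement with no write or DDL keywords."""
    body = sql.strip().rstrip(";").strip()
    if ";" in body:  # more than one statement
        return False
    return bool(STARTS_WITH_SELECT.match(body)) and not FORBIDDEN.search(body)


def run_guarded(conn: sqlite3.Connection, sql: str) -> list:
    """Run the query only if it passes the guard; otherwise refuse."""
    if not is_safe(sql):
        raise ValueError("query rejected by guard")
    return conn.execute(sql).fetchall()


# A tiny inferred relational view with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debt (country TEXT, year INTEGER, debt_usd_millions REAL)")
conn.executemany(
    "INSERT INTO debt VALUES (?, ?, ?)",
    [("A", 2022, 10.0), ("A", 2023, 12.0), ("B", 2023, 7.5), ("B", 2023, 2.5)],
)

QUERY = (
    "SELECT country, SUM(debt_usd_millions) FROM debt "
    "WHERE year = 2023 GROUP BY country ORDER BY country"
)
print(run_guarded(conn, QUERY))
```

The guard errs on the side of refusing: it would also reject a harmless SELECT that happens to contain a blocked word such as the `REPLACE()` function. For an auditable pipeline that trade-off is usually acceptable, since a rejected query can fall back to the chunk path.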
A lightweight agent oversees routing and fallbacks
- The agent uses cues from the question and the sheet type to decide whether to fetch chunks, run SQL, or blend both.
- If confidence is low, it can merge evidence from both paths or summarize to fit a token budget, always returning the exact rows/blocks that were used to answer.
3) Quality gates and confidence checks keep results honest
- For the chunk path, the system evaluates how well the retrieved blocks align with the query (e.g., cosine similarity, header/unit consistency) and whether the evidence covers the needed years or units.
- For the SQL path, it checks whether the result is non-empty, whether the results make sense given the schema, and whether the coverage of columns is adequate.
- If both paths produce weak evidence, SQuARE can abstain rather than guessing. If one path looks good, it uses that; if both look usable, it can merge them.
This “confidence-aware fallback” is a big win for reliability. It reduces the chance of hallucinated or mis-contextual answers and keeps an audit trail of exactly where the answer came from.
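The abstain, use-one, or merge logic can be expressed in a few lines. The sketch below assumes each path reports a single confidence value between 0 and 1 and that one shared threshold separates weak from usable evidence; the `PathResult` type, the `decide` function, and the 0.5 threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PathResult:
    """Hypothetical output of one retrieval path after its quality gate."""
    path: str                  # "chunk" or "sql"
    confidence: float          # assumed to lie in [0, 1]
    evidence: List[str] = field(default_factory=list)


def decide(chunk: PathResult, sql: PathResult, threshold: float = 0.5) -> Tuple[str, List[str]]:
    """Abstain, use one path, or merge both, based on per-path confidence."""
    usable = [r for r in (chunk, sql) if r.confidence >= threshold]
    if not usable:
        return "abstain", []                     # both weak: do not guess
    if len(usable) == 1:
        only = usable[0]
        return "use_" + only.path, only.evidence
    return "merge", chunk.evidence + sql.evidence  # both usable: combine


chunk = PathResult("chunk", 0.82, ["block: Balance Sheet > Equity"])
sql = PathResult("sql", 0.31, [])
print(decide(chunk, sql))
```

Whatever the decision, the evidence list is returned alongside it, so the caller always knows which blocks or rows stand behind the answer.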
4) Evidence-first results, with auditable traces
No matter which path is used, SQuARE returns the exact cells, rows, or blocks that contributed to the answer. That makes verification straightforward: you can trace back to the precise pieces of the sheet that support the claim.
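One simple way to carry that trace is to attach cell references to every answer. The types below (`CellRef`, `Answer`) and the sample cell address are hypothetical and only show the shape such a record might take.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CellRef:
    """Hypothetical pointer back to one supporting cell."""
    sheet: str
    address: str                   # A1-style address, e.g. "D14"
    header_path: Tuple[str, ...]   # outer to inner header labels
    unit: str


@dataclass
class Answer:
    text: str
    evidence: List[CellRef]


def audit_trail(answer: Answer) -> List[str]:
    """Render each supporting cell as one human-readable line."""
    return [
        f"{ref.sheet}!{ref.address} [{' > '.join(ref.header_path)}] ({ref.unit})"
        for ref in answer.evidence
    ]


ref = CellRef("Balance Sheet", "D14", ("Equity", "Total equity", "FY2023"), "USD (millions)")
answer = Answer("<value read from D14>", [ref])
print(audit_trail(answer))
```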
5) A modular, swap-friendly architecture
The system is designed so that embeddings, vector stores, and databases can be swapped without changing the overall control flow. The emphasis is on numerical precision and provenance, not on locking you into a single tech stack.
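In Python, one way to get this kind of swap-friendliness is to write the control flow against small interfaces. The sketch below uses `typing.Protocol`; the interface names, the letter-frequency embedder, and the in-memory store are all illustrative stand-ins, not components of SQuARE.

```python
import string
from typing import Dict, List, Protocol, Sequence


class Embedder(Protocol):
    def embed(self, text: str) -> List[float]: ...


class VectorStore(Protocol):
    def add(self, key: str, vector: Sequence[float]) -> None: ...
    def nearest(self, vector: Sequence[float], k: int) -> List[str]: ...


class LetterFrequencyEmbedder:
    """Toy embedder: counts of each letter a-z. Swap in a real model here."""
    def embed(self, text: str) -> List[float]:
        lowered = text.lower()
        return [float(lowered.count(c)) for c in string.ascii_lowercase]


class InMemoryStore:
    """Toy store: brute-force squared-distance search over a dict."""
    def __init__(self) -> None:
        self._items: Dict[str, List[float]] = {}

    def add(self, key: str, vector: Sequence[float]) -> None:
        self._items[key] = list(vector)

    def nearest(self, vector: Sequence[float], k: int) -> List[str]:
        def distance(key: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(self._items[key], vector))
        return sorted(self._items, key=distance)[:k]


def retrieve(query: str, embedder: Embedder, store: VectorStore, k: int = 1) -> List[str]:
    """Control flow depends only on the two interfaces, not on any backend."""
    return store.nearest(embedder.embed(query), k)


embedder = LetterFrequencyEmbedder()
store = InMemoryStore()
store.add("equity", embedder.embed("total equity"))
store.add("liabilities", embedder.embed("current liabilities"))
print(retrieve("total equity", embedder, store))
```

Replacing either class with a production embedding model or vector database leaves `retrieve` unchanged, which is the property the modular design is after.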
Real-world datasets and what the results look like
The researchers tested SQuARE across three broad categories to cover real-world spreadsheet variety:
- Complex multi-header corporate spreadsheets (think balance sheets with several header rows, units sitting between headers and data, etc.)
- A large World Bank workbook with many heterogeneous sheets (mixed styles, merged cells, various time labels)
- Flat, single-header public tables (more conventional, easier for SQL paths)
In total, they assembled hundreds of QA pairs per category, focusing on questions that require header paths, unit strings, and year columns. The aim was to test both retrieval quality and the ability to return exact evidence.
Key takeaways from the results:
- On complex multi-header balance sheets, the hybrid approach shined. A Gemma-based instance of SQuARE reached about 91% exact-match accuracy, outperforming a competing LLaMa-based variant and a public ChatGPT-4o baseline by a wide margin (96% vs. 29% in one challenging setting, for example). The big delta here isn’t just raw accuracy; it’s the fidelity to header paths and units that lets the model give auditable answers.
- On the merged World Bank workbook, SQuARE with the same Gemma backbone hit around 86% accuracy, again ahead of the other baselines. This underscores the value of preserving cross-sheet structure when you have diverse layouts in one project.
- On flat datasets (Health, Debt, GEP, Energy, Education), SQuARE achieved a strong overall performance (about 93%), with a notable bump in the Hard tier where precise filters and multi-column reasoning are essential. Traditional, generic language models (like the public ChatGPT-4o) lag more in these harder cases.
- Retrieval-path ablations showed the routing decisions matter. Removing the fallback mechanism or restricting to one path reduced accuracy by several points. The option to merge contexts or fall back to the alternative path is not just a nicety—it’s a substantial accuracy and reliability booster.
- Retrieval quality is pathway-dependent:
  - The chunk path excels at complex, header-driven questions where unit and time context matter.
  - The SQL path shines on flat tables requiring precise filters, groupings, and deterministic calculations.
  - The agent-guided hybrid approach provides the best of both worlds, especially on mixed or ambiguous questions.
Latency and resource use were kept practical. The team ran tests on modest hardware (quantized models on a T4 GPU with 15 GB VRAM). The end-to-end latency of SQuARE was generally in the same ballpark as a tool-free ChatGPT-4o run, with slight slowdowns on the most complex spreadsheets, and improvements when moving to more powerful GPUs. The key point: you don’t need a heavyweight, enterprise-scale setup to get a meaningful uplift in accuracy and verifiability.
Why this matters for real-world workflows
- Accuracy you can audit. For finance teams, auditors, and analysts, being able to point to the exact cells or rows used to generate an answer is invaluable. It simplifies audits, defends metrics in reports, and reduces the back-and-forth with stakeholders who question numbers.
- Safer, more reliable QA on messy data. Real-world spreadsheets aren’t cleaned up before a model sees them. A system that adapts to layout complexity—preserving headers and units when they matter—reduces the risk of wrong conclusions caused by layout quirks.
- Practical deployment with flexible tooling. The modularity means you’re not locked into one embedding model or one database approach. Swap in newer tabular foundation models (TFMs), bring your own vector store, or plug in your favorite SQL engine as the sheets and questions evolve.
- Clearer prompts, better results. For prompt engineers and AI practitioners, SQuARE shows the value of a structure-aware orchestration layer. Instead of trying to force a single model to reason over all variations, you let smart routing decide which tool is best for the job, and you use safeguards to keep the outputs trustworthy.
Limitations and avenues for the future
No system is perfect, and SQuARE points to several interesting directions for improvement:
- Router learning and calibration. Today, the routing agent is prompt-based. A lightweight, learned router with uncertainty estimates could reduce fallbacks and sharpen decisions, especially as you scale to more table types.
- Handling layout variability and OCR’d data. The current setup works well on structured spreadsheets. When you introduce scanned documents or tables with unusual layouts, you’ll need table detectors and layout parsers before the retrieval backbone can take over.
- Strengthening SQL robustness. Schema drift and header aliasing can trip up SQL generation. Future work could add stricter schema alignment, better alias handling, and safer join discovery.
- Multi-sheet and cross-workbook queries. Real analyses often span several sheets with joins. Extending SQuARE to handle cross-sheet reasoning while preserving evidence provenance will be a natural step.
- Broader evaluation. Expanding baselines to include tool-augmented assistants and more perturbation tests would give a fuller picture of robustness in diverse environments.
There’s also a note about the evolving landscape of large language models. As newer models arrive, swapping in tabular foundation models (TFMs) as the structure/encoding layer is a promising path. The core orchestration—RAG plus SQL with an agentic router—would still hold, but the encoding quality and reasoning capabilities could improve even further.
Real-world takeaways: what this means for you
- If you’re building AI tools for spreadsheet QA, a hybrid retrieval approach that respects table structure is worth the extra design effort. It can dramatically improve accuracy and make results auditable.
- Start by assessing structure: compute a simple complexity score based on header depth and merge density. Use that to decide whether to route to structure-preserving retrieval or to a SQL-based path.
- Consider an agent to manage routing and fallbacks. A lightweight decision-maker that can switch paths or merge evidence helps handle ambiguous questions and keeps latency predictable.
- Make evidence the default. Returning the exact cells/rows used for an answer builds trust and simplifies review.
- Keep it modular. Design with interchangeable embeddings, vector stores, and databases so you can adapt as technologies evolve.
- Practical workflow tip: for complex corporate sheets, lean on the structure-aware path; for clean, flat datasets, you’ll often gain speed and determinism with SQL.
Key Takeaways
- Real-world spreadsheets aren’t uniform; header complexity and merged cells mean a single retrieval method won’t always be best. SQuARE’s hybrid engine adapts to the sheet’s structure.
- The system uses a two-path routing strategy: structure-preserving chunk retrieval for complex headers, and schema-aware SQL for flat tables. An agent oversees routing and can merge evidence from both paths when helpful.
- A core strength is returning verifiable evidence: the exact cells or blocks used to answer, with header context and units preserved.
- Evaluation across diverse datasets shows significant accuracy gains over a baseline that uses a single path, and SQuARE also outperforms a public tool-free ChatGPT-4o baseline on structurally complex sheets.
- The design is modular and practical: embeddings, vector stores, and databases can be swapped; the approach works on modest hardware with reasonable latency.
- Limitations point to clear future work: better learned routing, OCR/layout handling, stronger schema alignment, and cross-sheet queries—areas where the field is actively evolving.
- For practitioners, the takeaway is simple: when dealing with messy tabular data, structure-aware routing combined with a safe SQL path can deliver more reliable, auditable results than any one-size-fits-all method.
If you’re curious about building smarter spreadsheet QA tools, SQuARE offers a compelling blueprint: respect the table’s anatomy, pick the right reasoning path, and keep a verifiable trail of evidence. That combination doesn’t just answer questions—it makes the answers trustworthy and easy to audit. And in the end, that’s what makes data-driven work far more powerful.