Cracking Real-World Spreadsheets: How SQuARE Bridges Structure-Aware Retrieval and SQL for Trustworthy Table Answers
Introduction: spreadsheets as the quiet engine of analysis
If you’ve ever built a budget, evaluated a quarterly balance sheet, or pulled a country-by-year indicator from a World Bank workbook, you’ve touched the reality SQuARE is trying to tame. Spreadsheets are the lingua franca of quantitative work across finance, research, and policy. They’re powerful, flexible, and human-friendly—until you ask a machine to answer a question about them. Real-world sheets aren’t neatly organized: you’ll find multi-row headers, merged cells that carry meaning (think “Subtotal” or “USD (millions)”), and unit lines that sit between the header and the numbers. Ask a naive QA system to slice and dice, and you get the wrong column, the wrong unit, or a blob of numbers without a trustworthy anchor.
Enter SQuARE (Structured Query and Adaptive Retrieval Engine): a hybrid system that doesn’t pretend every sheet is the same. It first reads a sheet’s structure, judges how complex it is, and then routes your query along the safest path. If the question hinges on a header path or a unit label, SQuARE uses structure-preserving chunk retrieval. If the question is more about filters and aggregations on a flat table, it builds a clean relational view and runs targeted SQL. When confidence is messy, a small agent supervises, blends evidence from both paths, and returns exact cells or rows as auditable proof. The result is answers that stay faithful to the original spreadsheet and easy to verify.
In this post, I’ll untangle how SQuARE works, why its hybrid approach matters, and what it could mean for real-world spreadsheet analysis—from finance teams chasing precise numbers to researchers cross-checking strategic indicators.
How SQuARE works at a high level: two routes, one goal
The core idea behind SQuARE is simple to state, a bit trickier to execute well: adapt retrieval to the sheet’s structure and the question’s intent. The system first estimates the sheet’s structural complexity and then routes the query to one of two engines (or a blend):
- Path A: Structure-preserving chunk retrieval for multi-header or merged-cell sheets. Here, the system segments the sheet along header boundaries, preserving the full header path, the time labels, and any unit strings. Each segment, or block, is indexed with a concise natural-language description that helps the embedding model capture what the block contains. When you ask a question, the system pulls the top few blocks that best match the query in embedding space, and the answer model reads those blocks directly to produce a result. This path keeps the semantics of headers and units intact, so you’re not accidentally comparing apples to oranges.
- Path B: Schema-aware SQL over a relational view for flat tables. If the sheet is relatively straightforward (single header row, no messy units in the header region), SQuARE builds a cleansed schema and then uses a constrained SQL path. A lightweight language model generates a safe SQL query (restricted to a fixed SELECT-FROM-WHERE-GROUP BY-ORDER BY-LIMIT pattern, with no DDL/DML changes) over this view. The system then executes the query and uses the resulting rows as evidence.
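To make the "safe SQL" constraint concrete, here is a minimal sketch of how such a gate could be enforced. The paper does not publish code; the function name `is_safe_sql` and the regular expressions are illustrative assumptions, not SQuARE's actual implementation.

```python
import re

# Hypothetical gate for the constrained SQL path described above: accept only
# a single read-only statement matching the fixed
# SELECT-FROM-WHERE-GROUP BY-ORDER BY-LIMIT template, and reject DDL/DML.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|ATTACH|PRAGMA|REPLACE)\b",
    re.IGNORECASE,
)
SHAPE = re.compile(
    r"^\s*SELECT\s+.+?\s+FROM\s+\S+"    # mandatory SELECT ... FROM ...
    r"(\s+WHERE\s+.+?)?"                # optional WHERE
    r"(\s+GROUP\s+BY\s+.+?)?"           # optional GROUP BY
    r"(\s+ORDER\s+BY\s+.+?)?"           # optional ORDER BY
    r"(\s+LIMIT\s+\d+)?\s*;?\s*$",      # optional LIMIT, optional trailing ;
    re.IGNORECASE | re.DOTALL,
)

def is_safe_sql(query: str) -> bool:
    """Return True only if the query fits the read-only template."""
    if ";" in query.strip().rstrip(";"):  # reject multi-statement input
        return False
    if FORBIDDEN.search(query):           # reject anything that mutates state
        return False
    return bool(SHAPE.match(query))
```

A generated query that fails this check would simply never reach the database, which is one way to keep a lightweight model's output from ever modifying the relational view.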
A small, confidence-aware agent oversees the routing. It looks at the sheet’s complexity score and the question’s cues to decide which path to use. If confidence looks shaky, it can switch modes, merge evidence from both paths, and summarize when the content would exceed a token budget. The result is an evidence-first answer: the exact cells/rows used are surfaced, with header context and units preserved when they matter.
Let’s break down the main ingredients that power this adaptive behavior.
The complexity score: when to choose structure vs SQL
SQuARE doesn’t rely on a single “one-size-fits-all” rule. Instead, it computes a sheet-level complexity score that hinges on two observable properties:
- Header depth (H): how many layers of headers are nested above the data.
- Merge density (M): how many cells are merged or split in the header region.
From these, SQuARE derives a sheet-normalized decision threshold. Think of it as a smart gate that says: if a sheet has a deep header stack and a lot of merged header cells, you’re in “Multi-Header territory” and should lean toward the structure-preserving path. If it’s a clean, flat table with a single header row, you can safely use the SQL path (and possibly still benefit from chunking).
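The gate can be pictured as a weighted combination of the two signals. The paper tunes its actual weights and threshold on held-out data, so the numbers below are purely illustrative assumptions:

```python
# Hypothetical sketch of the sheet-complexity gate. The weights (w_h, w_m),
# depth cap, and threshold are illustrative; SQuARE calibrates its own values.
def complexity_score(header_depth: int, merge_density: float,
                     w_h: float = 0.6, w_m: float = 0.4,
                     max_depth: int = 5) -> float:
    """Combine header depth H and merge density M into a [0, 1] score."""
    h_norm = min(header_depth, max_depth) / max_depth  # normalize depth
    return w_h * h_norm + w_m * merge_density          # merge_density in [0, 1]

def classify_sheet(header_depth: int, merge_density: float,
                   threshold: float = 0.35) -> str:
    """Route to 'multi_header' (chunk path) or 'flat' (SQL path)."""
    score = complexity_score(header_depth, merge_density)
    return "multi_header" if score >= threshold else "flat"
```

A single-header sheet with no merged cells lands comfortably below any reasonable threshold, while a three-deep header stack with heavy merging clears it easily.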
The routing decision: a guided, confidence-aware process
Once the sheet is classified, the system builds the necessary indices. For Multi-Header sheets, SQuARE focuses on the semantic index that encodes header paths, time labels, and units, along with a short descriptive summary of each block. For Flat sheets, it also builds a relational view with cleaned column names and types.
When a query arrives, the agent compares the query to the available blocks (via embeddings) and selects up to three blocks to present to the answer model. If those blocks provide strong, consistent evidence, you get an answer from the chunk path. If not, the system may switch to the SQL path or even merge evidence from both paths if the question benefits from it.
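The "select up to three blocks" step is a standard top-k nearest-neighbor lookup in embedding space. A real deployment would use a vector store; the array math below is a self-contained stand-in for that lookup, with illustrative names:

```python
import numpy as np

# Illustrative top-k block selection by cosine similarity. In SQuARE this
# lookup would be served by a vector store rather than raw NumPy.
def top_blocks(query_vec: np.ndarray, block_vecs: np.ndarray, k: int = 3):
    """Return indices of the k blocks closest to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    b = block_vecs / np.linalg.norm(block_vecs, axis=1, keepdims=True)
    sims = b @ q                           # cosine similarity per block
    return np.argsort(-sims)[:k].tolist()  # best-first block indices
```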
Quality gates and evidence-first results
A key feature is a strict acceptance test on the retrieved context. If the context isn’t solid—say, the blocks don’t cover the needed data, or the SQL result looks empty—the system falls back to the alternate path. If both paths can contribute, the system merges their contexts, trims the total content to fit the budget, and re-checks quality before answering. If everything still isn’t convincing, the system politely abstains rather than guessing.
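The gate-fallback-merge-abstain flow can be sketched as a small controller. All the callables here (`primary`, `fallback`, `passes_gate`, `merge_and_trim`) are hypothetical stand-ins for SQuARE's components, not its actual API:

```python
# Sketch of the confidence-gated control flow described above.
def answer(question, primary, fallback, passes_gate, merge_and_trim,
           token_budget=2048):
    ev1 = primary(question)                 # try the routed path first
    if passes_gate(ev1):
        return {"evidence": ev1, "path": "primary"}
    ev2 = fallback(question)                # gate failed: try the other path
    if passes_gate(ev2):
        return {"evidence": ev2, "path": "fallback"}
    merged = merge_and_trim(ev1, ev2, token_budget)  # blend both contexts
    if passes_gate(merged):
        return {"evidence": merged, "path": "merged"}
    return {"evidence": None, "path": "abstain"}     # abstain over guessing
```

The important design point is the final branch: when neither path nor their merge clears the quality gate, the controller returns nothing rather than a fabricated answer.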
Modularity and practicality: you can swap in tools without rewriting the flow
From a software engineering perspective, SQuARE is designed to be modular. Embeddings, vector stores, and even the back-end database for SQL can be swapped in and out without changing the control flow. This makes it easier to experiment with different embedding models, or different SQL engines, or to adapt the framework to new kinds of tabular data.
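One way to realize that swappability is to make the control flow depend on small interfaces rather than concrete backends. The protocol names and method signatures below are my own illustration, not SQuARE's published interfaces:

```python
from typing import Protocol, Sequence, runtime_checkable

# Illustrative component contracts: any object with matching methods
# satisfies the protocol, so backends can be swapped without touching
# the routing logic.
@runtime_checkable
class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...

@runtime_checkable
class VectorStore(Protocol):
    def add(self, ids: Sequence[str],
            vectors: Sequence[Sequence[float]]) -> None: ...
    def search(self, vector: Sequence[float], k: int) -> list[str]: ...

@runtime_checkable
class SQLBackend(Protocol):
    def execute(self, query: str) -> list[tuple]: ...
```

With structural typing like this, swapping in a different embedding model or SQL engine is a constructor change, not a control-flow rewrite.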
The methodology and what the researchers actually built
The authors formalize their approach with a practical algorithm (described as Algorithm 1 in their work) that:
- Estimates the sheet complexity (H and M) and uses a sheet-normalized threshold to classify sheets as Flat or Multi-Header.
- Builds a semantic index for all sheets, and only builds the relational (SQL) index when needed (Flat sheets).
- Chooses a retrieval path based on sheet type and query cues, with a quality gate to ensure evidence is strong enough to answer.
- Uses an agent to decide between chunk retrieval, SQL, or a blend, with fallback and merge options.
- Applies caching for often-asked queries and chunks, keeping runtime predictable on modest hardware.
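The caching step in the last bullet can be as simple as memoizing retrieval keyed on the sheet and question. The function below is a hypothetical stand-in for the retrieval call, shown only to illustrate the mechanism:

```python
from functools import lru_cache

# Illustrative query-level cache: a repeated question against the same sheet
# skips retrieval entirely. Results are returned as tuples so they stay
# hashable and immutable inside the cache.
@lru_cache(maxsize=256)
def cached_retrieve(sheet_id: str, question: str) -> tuple:
    # A real system would run chunk retrieval or SQL here; this placeholder
    # just echoes the key so the caching behavior is observable.
    return ("evidence-for", sheet_id, question)
```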
A note on the scoring thresholds: the researchers tuned parameters on held-out data, including a couple of thresholds that determine when a sheet is considered Flat vs Multi-Header and how many chunks to fetch. They emphasize that this is not a one-shot rule; it’s a dynamic, data-calibrated approach.
Real-world datasets and what SQuARE could do with them
To test their idea, the researchers used three broad kinds of spreadsheets:
- Complex multi-header corporate spreadsheets: quarterly or annual balance sheets from major tech companies (e.g., Microsoft, Meta, Alphabet, Amazon, etc.). These sheets feature nested headers, merged cells, and units like “USD (millions)” that sit between the headers and the values.
- A complex, merged World Bank workbook: a blend of World Bank Gender Statistics and World Development Indicators with non-uniform layouts across sheets.
- Flat, single-header public tables: five datasets such as Health and Nutrition, Public Sector Debt, Global Economic Prospects, Energy Consumption, and Education Attainment. These provided a clean contrast to test the SQL path.
For each dataset, they created substantial QA workloads: hundreds of questions per dataset, ranging from direct lookups to multi-predicate filters and cross-year comparisons. They evaluated two things:
- Exact-match accuracy for numeric and categorical answers (the gold standard here is exact numeric equality or exact categorical matches, not just “semantics” of phrasing).
- Retrieval recall (R@k): how often the system surfaced the actual evidence blocks or SQL rows needed to answer.
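For readers who want to reproduce these measurements on their own workloads, minimal versions of the two metrics look like this (string normalization here is a simplifying assumption; the paper grades exact numeric and categorical equality):

```python
# Minimal versions of the two evaluation metrics described above.
def exact_match(preds, golds):
    """Fraction of predicted answers that match the gold answer exactly."""
    hits = sum(str(p).strip() == str(g).strip() for p, g in zip(preds, golds))
    return hits / len(golds)

def recall_at_k(retrieved_ids, gold_ids, k=3):
    """1.0 if every needed evidence id appears in the top-k retrieved list."""
    return float(set(gold_ids) <= set(retrieved_ids[:k]))
```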
Key findings from the results
The results are striking enough to illustrate why a hybrid approach matters:
- On complex multi-header spreadsheets, SQuARE with a Gemma-based model achieved 91.3% exact-match accuracy, significantly outperforming both the Llama-based variant (80.7%) and a widely used baseline, ChatGPT-4o (28.7%), in the same setting. The improvement comes from preserving header paths and units throughout the retrieval and answer process.
- On the World Bank workbook, SQuARE reached 86% accuracy, beating the Llama-based version (74%) and ChatGPT-4o (54%). Here the ability to keep header semantics intact in the right blocks really paid off.
- On flat tables, the dual-index approach (vector + SQL) delivered strong performance overall (about 93%), with the Hard tier still reaching around 87%. ChatGPT-4o, while competitive on Easy/Medium, lagged on Hard items where deterministic SQL reasoning shines.
- Retrieval recall revealed the strength of the chunk path for complex sheets: the system surfaced the necessary evidence in three chunks or fewer for most queries. For flat tables, SQL-based evidence often landed on the first try, while the chunk path still contributed in tricky Hard cases.
- Ablations confirmed the value of the fallback and merge strategies: removing the fallback or forced-path decisions hurt accuracy, sometimes by several points. The ability to switch modes and merge evidence was material to performance, especially on diverse data.
- Latency and hardware: the team ran experiments with quantized models on a modest GPU (T4 15 GB). End-to-end latency was comparable to ChatGPT-4o in many cases, slightly slower on complex financial spreadsheets, but often faster or similar on flat tables. The modular design still keeps the system practical for typical data teams and does not demand exotic hardware.
Why this matters in practice
Accuracy, auditability, and trust are the three big wins here:
- Exact evidence, not fuzzy inferences: SQuARE returns the exact cells or rows used to compute the answer, along with the header context and units. This makes QA auditable and reduces the risk of “plausible but wrong” results that plague many purely text-based QA systems on tables.
- Structure first, then SQL: By respecting the grid’s structure, SQuARE avoids misalignments caused by header nesting or unit lines. In the world of real spreadsheets, when a row is a “Total” line or a unit line sits between blocks, preserving that context is crucial for correctness.
- Hybrid, not dogmatic: The system isn’t locked into one approach. It leverages the strengths of semantic chunking for structure-rich sheets and the precision of SQL for well-behaved tables. The lightweight agent coordinates these paths to maximize accuracy while controlling cost and latency.
- Practical deployment: The architecture is modular and swappable. Teams can plug in different embedding models, vector stores, or SQL backends. Caching and lazy indexing help keep costs predictable, an important consideration for teams processing thousands of spreadsheet questions.
Real-world implications and potential use cases
- Finance and accounting teams: quickly answer questions like “What was the year-over-year change in working capital in FY2023, in USD (millions)?” while preserving the exact rows and units used in the balance sheet.
- Policy and development analysts: pull indicators across years from merged World Bank workbooks without losing the semantic signals embedded in multi-row headers and units.
- Researchers and analysts working with public datasets: use the SQL path to perform precise filters and aggregations on flat tables, while still having a separate, structure-aware route for more complex, header-rich sheets.
Limitations and future directions
No system is perfect, and SQuARE acknowledges a few gaps and opportunities:
- Router learning and calibration: the current agent is prompt-based. A learned router with uncertainty estimates could reduce fallbacks and sharpen decision-making, especially as datasets grow more diverse.
- Layout variability and OCR: the presented work focuses on clean spreadsheets. Real-world scans and OCR’d tables add layout noise. Integrating robust table detectors and layout parsers would extend SQuARE’s reach to document-style inputs.
- SQL robustness: schema drift and more complex joins pose challenges. Future work could add schema alignment, header-to-column aliasing, and safe join discovery to improve resilience.
- Cross-sheet queries: many spreadsheets span multiple sheets or even multiple workbooks. Extending SQuARE to handle multi-sheet joins while maintaining the evidence-first paradigm would be a natural evolution.
- Evaluation breadth: the study used manual grading and a tool-free ChatGPT baseline. More baselines, perturbation tests, and public datasets would strengthen the evaluation and help the community compare approaches.
What this means for prompting and debugging
If you’re experimenting with complex spreadsheet QA prompts, SQuARE’s approach offers a blueprint:
- Don’t force one tool to do everything. Start with an awareness of structure: ask your model to look for header depth, units, and any merged cells before diving into answers.
- Build a fallback strategy. A lightweight supervisor that can switch modes or merge evidence often yields more robust results than a single path.
- Surface evidence, not just answers. If you can, return the specific cells or rows used to derive the answer—this builds trust and makes auditing straightforward.
Conclusion: a practical path to trustworthy spreadsheet QA
SQuARE represents a pragmatic advance in question answering over real spreadsheets. By coupling a simple yet powerful complexity metric with two specialized retrieval pathways—structure-preserving chunks for complex, header-rich sheets and constrained SQL over a cleansed schema for flat tables—SQuARE delivers answers that are both accurate and auditable. A small agent orchestrates the process, ready to blend evidence or switch paths when confidence wavers. The result is not only better performance on challenging data but also a framework that emphasizes provenance and verification in a world where numbers in spreadsheets often sit at the core of decisions.
Key Takeaways
- Real-world spreadsheets are messy by design. SQuARE recognizes this and uses a sheet-specific complexity score to decide how to retrieve information.
- There are two retrieval paths: structure-preserving chunks (for multi-header/merged-cell sheets) and SQL over a cleansed relational view (for flat tables). The system chooses the best path per question.
- An agent monitors confidence and can switch modes or merge evidence from both paths, reducing brittle failures and improving reliability.
- Evidence-first design matters: SQuARE surfaces the exact rows/blocks used to answer, with header context and units that matter for verification.
- The approach shines on complex corporate spreadsheets and multi-source World Bank workbooks, while also performing well on flat datasets with precise SQL filtering.
- The framework is modular and practical, designed to run on modest hardware with swappable components, which lowers barriers to adoption in real-world workflows.
- Limitations point to a clear research agenda: better routing via learned uncertainty, OCR/layout handling, stronger SQL robustness, and cross-sheet querying.
- For practitioners, the takeaway is to leverage structure-aware routing and to maintain explicit evidence, especially when the stakes require numeric fidelity and auditability.
If you’re someone who builds dashboards, audits quarterly reports, or mentors students wrangling data in spreadsheets, SQuARE offers a blueprint for turning the stubborn problem of real-world table QA into a reliable, auditable workflow.