Reading Corporate Narratives: How Large Language Models Decode Stance in SEC Filings and Earnings Calls

Reading Corporate Narratives investigates whether large language models can determine sentence-level stance toward debt, earnings per share (EPS), and sales in SEC filings and earnings call transcripts. The paper compares zero-shot, few-shot, and chain-of-thought prompting, finding that few-shot with CoT often yields the strongest performance, while outlining practical caveats for analysts, investors, and regulators.


Introduction: Why this matters

If you’ve ever skimmed a company’s annual report or listened to an earnings call, you know corporate language isn’t exactly consumer-friendly. Behind the numbers, there are shifts in strategy, risks, and forward-looking statements that investors and regulators care about. This research digs into a tricky but incredibly practical question: can modern large language models (LLMs) automatically tell us whether sentences in SEC filings and earnings call transcripts show a positive, negative, or neutral stance toward three key financial targets—debt, earnings per share (EPS), and sales?

In short, the answer is yes—and with some important caveats. The study shows that with careful prompting (zero-shot, few-shot, and especially when you add chain-of-thought reasoning), LLMs can perform stance detection at the sentence level without needing massive amounts of labeled data. That’s a big deal for analysts who want scalable, target-specific insights from lengthy corporate narratives.

This post breaks down what the researchers did, what they found, and what it means for real-world use—whether you’re an investor, auditor, regulator, or just someone curious about how AI can parse business talk.

What the study sets out to do

  • Build a sentence-level stance dataset focused on three concrete financial targets: debt, EPS, and sales.
  • Source sentences from SEC Form 10-K filings and quarterly earnings call transcripts (ECTs).
  • Label those sentences with stance (positive, negative, neutral) using an advanced ChatGPT variant, then validate with human reviewers.
  • Systematically evaluate several contemporary LLMs under different prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT).
  • Explore the impact of context (no background, full MD&A context, or summarized context) and the quality of few-shot exemplars (random vs semantically similar) on model performance.
  • Compare performance across two data streams: actual SEC filings (formal, number-dense) and ECTs (more conversational, narrative).

The authors also openly share prompts, data, and code to support reproducibility—a nice nudge toward practical, real-world research.

The data and how it was created (in plain language)

  • Target sentences: Those that explicitly reference debt, EPS, or sales, or are relevant to those targets.
  • Sources: Form 10-K annual reports (SEC filings) and quarterly earnings call transcripts.
  • Annotation: An advanced reasoning model (ChatGPT-o3-pro) produced a stance label and a justification for each sentence. Human validators then checked a sample of these annotations.
  • Ground truth quality: Across all tested targets (sales, EPS, debt), human validators agreed with the model’s justifications more than 97% of the time. That’s strong alignment between model reasoning and human judgment.
  • Datasets: Training and test splits were drawn from two companies—MATIV Holdings Inc. and 3M Co.—over different time periods (training: 2020–2021; test: 2022–2024). The study uses these two datasets separately to compare how models handle the two document styles: SEC filings vs. earnings calls.

Think of the data as two slightly different kinds of corporate speeches: one is the formal, number-heavy annual narrative, and the other is the more conversational quarterly updates. Both are important in financial decision-making, but they demand different kinds of reading and reasoning from algorithms.

The models and the prompting tricks the study tries

The researchers compare four large language models:

  • Llama 3.3 (70B parameters)
  • Gemma 3 (27B)
  • Mistral 3 Small (24B)
  • ChatGPT 4.1-mini (OpenAI’s smaller GPT-4.1 variant)

Some quick notes to orient you:
- The first three are open-weight models with instruction-following tuning.
- ChatGPT 4.1-mini is a very capable proprietary model with a large context window and multimodal input in this version.
- The study uses three prompting scenarios:
  - Zero-shot: No labeled examples in the prompt.
  - Few-shot: A handful of labeled examples included in the prompt, with two exemplar-selection strategies:
    - Random sampling.
    - Semantic similarity: choose the examples most similar to the test instance (cosine similarity over Sentence-BERT embeddings; see the sketch after this list).
  - Chain-of-Thought (CoT): The model is prompted to show intermediate reasoning steps before giving a final stance.
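To make the exemplar-selection idea concrete, here is a minimal sketch of how semantically similar few-shot examples could be retrieved with Sentence-BERT embeddings and cosine similarity. The checkpoint name, the tiny labeled pool, and the helper name are assumptions for illustration; the paper's exact embedding setup may differ.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed structure: a labeled pool of (sentence, stance) pairs from the training split.
labeled_pool = [
    ("Net sales increased 12% driven by volume growth.", "positive"),
    ("Total debt rose as we drew on the revolving credit facility.", "negative"),
    ("EPS was flat compared with the prior-year quarter.", "neutral"),
    # ... more labeled training sentences
]

# "all-MiniLM-L6-v2" is a common Sentence-BERT checkpoint; the study's exact model is not specified here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
pool_embeddings = encoder.encode([s for s, _ in labeled_pool], convert_to_tensor=True)

def select_exemplars(test_sentence: str, k: int = 5):
    """Return the k labeled examples most similar to the test sentence."""
    query_embedding = encoder.encode(test_sentence, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, pool_embeddings)[0]  # cosine similarity to each pool sentence
    top_indices = scores.topk(min(k, len(labeled_pool))).indices.tolist()
    return [labeled_pool[i] for i in top_indices]

exemplars = select_exemplars("Long-term debt declined by $200 million during the year.", k=3)
```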

Context usage (how much company background you feed into the prompt) matters too:
- No context: No MD&A (for SEC) or earnings call content is included as background.
- Full context: The entire MD&A section (for SEC) or the full earnings call transcript is included.
- Summarized context: The MD&A or ECT is summarized (via a prompted ChatGPT-pro reduction) to provide a concise background.
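As a rough illustration of how these pieces might fit together, here is a sketch of a prompt builder that supports the three context modes, few-shot exemplars, and an optional CoT instruction. The wording of the instructions and labels is assumed for illustration; the paper's released prompts are the authoritative reference.

```python
def build_prompt(sentence: str, target: str, context: str | None = None,
                 exemplars: list[tuple[str, str]] | None = None, use_cot: bool = False) -> str:
    """Assemble a stance-detection prompt; wording here is illustrative, not the paper's exact prompt."""
    parts = [f"Classify the stance of the sentence toward {target} as positive, negative, or neutral."]
    if context:  # full MD&A/ECT text or a short summary; omit entirely for the no-context setting
        parts.append(f"Company background:\n{context}")
    if exemplars:  # few-shot examples, e.g. chosen by semantic similarity
        parts.append("Examples:")
        parts.extend(f"Sentence: {s}\nStance: {label}" for s, label in exemplars)
    if use_cot:
        parts.append("Think step by step, then state the final stance on the last line.")
    parts.append(f"Sentence: {sentence}\nStance:")
    return "\n\n".join(parts)

prompt = build_prompt(
    sentence="Interest expense fell as we repaid $300 million of senior notes.",
    target="debt",
    exemplars=[("Total debt rose as we drew on the revolver.", "negative")],
    use_cot=True,
)
print(prompt)
```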

In sum, the study runs a fairly large matrix of experiments:
- Datasets: SEC filings vs ECTs
- Models: Llama 3.3, Gemma 3, Mistral 3 Small, GPT-4.1-mini
- Prompting: zero-shot vs few-shot
- Context: none vs full vs summarized
- Few-shot exemplar strategy: random vs semantically similar
- CoT: with vs without

This helps answer not just which model is best, but how to prompt and structure the prompt for best performance in finance-specific stance tasks.
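For bookkeeping, that grid can be enumerated mechanically. The sketch below simply spells out the combinations described above; the condition names and shot sizes are my own shorthand, not the paper's.

```python
from itertools import product

datasets = ["SEC_10K", "ECT"]
models = ["Llama-3.3-70B", "Gemma-3-27B", "Mistral-3-Small-24B", "GPT-4.1-mini"]
shots = [0, 1, 5, 10]                       # 0 = zero-shot; 1/5/10 = few-shot sizes
contexts = ["none", "full", "summarized"]
exemplar_strategies = ["random", "semantic"]
cot_options = [False, True]

runs = []
for ds, model, k, ctx, strategy, use_cot in product(
        datasets, models, shots, contexts, exemplar_strategies, cot_options):
    if k == 0 and strategy == "semantic":
        continue  # exemplar selection only matters when there are exemplars
    runs.append({"dataset": ds, "model": model, "k": k, "context": ctx,
                 "exemplar_strategy": strategy, "cot": use_cot})

print(len(runs), "experimental conditions")
```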

What they found: the big takeaways

1) Overall performance leaders
- GPT-4.1-mini came out on top with an average accuracy of about 87.8% across all setups.
- Llama 3.3 (70B) was a strong runner-up at around 83.0%.
- Gemma 3 (27B) followed at roughly 81.2%.
- Mistral 3 Small (24B) trailed the group at about 68.6%.

This is encouraging: even the open-weight models are quite capable, but the best results often still sit with the stronger proprietary option (GPT-4.1-mini in this study).

2) Chain-of-Thought (CoT) helps, especially for smaller models
- Across the board, adding CoT reasoning improved accuracy by about 4.23 percentage points on average.
- The biggest gains from CoT were seen with smaller models like Mistral 24B, particularly when full or summarized context was provided.
- Larger models like GPT-4.1-mini benefited from CoT too, but the improvement was smaller, suggesting they already reason quite well with prompts.

Put simply: if you’re using a smaller model, asking it to “think it through step by step” pays off. If you’re already using a very capable model, CoT helps, but the boost is less dramatic.

3) The power of semantically similar few-shot examples
- Choosing the most semantically similar few-shot examples consistently improved performance versus random selection.
- On average, this approach yielded about a 2% boost across both datasets. The gain was even higher for Mistral 24B (around 4% on average).
- This finding highlights a practical tip: a small set of targeted exemplars that closely resemble the test cases can go a long way.

4) Context matters, but not uniformly
- Providing background information (full MD&A or summarized context) helped some models, especially in zero-shot settings:
  - GPT-4.1-mini and Gemma 3 showed noticeable gains from context in zero-shot tasks.
  - Mistral 24B sometimes degraded with broader context in zero-shot settings.
- As few-shot examples increased (k = 1, 5, 10), the benefit of extra context diminished. When you have enough in-context examples, the model leans on those examples more than the surrounding background.
- In many cases, summarized context performed almost as well as full context, which is practical because summarizing can save you time and still deliver strong results.

5) SEC filings vs. earnings calls: different challenges
- Earnings call transcripts (ECTs) tended to be easier for stance detection than the SEC filings.
- ECTs are more conversational and often contain clearer statements about changes in debt, EPS, or sales.
- SEC filings are more formal and dense with numbers, which makes recognizing stance require deeper quantitative reasoning.
- A concrete example in the study showed a sentence about debt ratios over two years (a positive shift if leverage falls) that required the model to infer improvement from a relative metric. This kind of calculation is easier for humans and remains a sticking point for models—especially those not explicitly trained for numerical reasoning.
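To see why this is harder than it sounds, consider a toy version of that inference (the figures below are illustrative, not from the study): the stance only emerges after computing and comparing a ratio across years.

```python
# Illustrative figures only; the study's actual sentence and values are not reproduced here.
total_debt = {"2022": 5_200, "2023": 4_600}      # $ millions
total_equity = {"2022": 10_400, "2023": 10_900}  # $ millions

leverage = {year: total_debt[year] / total_equity[year] for year in total_debt}

# Falling leverage (debt relative to equity) reads as a positive stance toward debt.
stance = "positive" if leverage["2023"] < leverage["2022"] else "negative or neutral"
print(leverage, stance)  # {'2022': 0.5, '2023': ~0.42} -> positive
```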

6) Target-specific stance is doable without huge labeled datasets
- The study demonstrates that you can leverage modern LLMs to perform sentence-level stance detection focused on specific financial targets without collecting thousands of labeled examples per class.
- Ground-truth annotations were generated with a strong reasoning model and then validated by humans, striking a balance between scale and quality.

Why these findings matter in the real world

  • Faster, scalable analysis: Analysts can scan thousands of sentences across filings and calls to pinpoint where a company seems to be positively or negatively framing debt, EPS, or sales. This reduces manual slog and helps surface signals quickly.
  • Targeted insights: The ability to assess stance toward specific targets is more informative than general sentiment. For investors, regulators, or auditors, knowing if a sentence leans negative or positive about debt, for instance, can inform risk assessments and decision-making.
  • Prompt engineering as a practical skill: The study underscores that the way you prompt an LLM—especially whether you include CoT, how you select few-shot examples, and what background you provide—can materially affect results. Practitioners should treat prompting as a craft, not a one-size-fits-all step.
  • Data efficiency: Achieving good performance without massive labeled datasets makes this approach accessible for organizations that don’t have the resources for large-scale annotation campaigns.

Practical implications and takeaways for practitioners

  • Start with a strong, capable model, then layer prompting strategies
    • If you have access, GPT-4.1-mini is a solid starting point. If you’re using open-weight models for cost or privacy reasons, Llama 3.3 and Gemma 3 are competitive—especially when you use CoT and semantically similar few-shot exemplars.
  • Use CoT prompts for better reasoning, especially with smaller models
    • When you’re not using the absolute strongest model, asking the system to outline its reasoning steps before the final decision can noticeably boost accuracy.
  • Curate few-shot exemplars thoughtfully
    • Instead of random examples, pick exemplars that are semantically close to the test instances. This small change can yield meaningful gains, particularly for models with fewer parameters.
  • Optimize context use
    • Provide background context in zero-shot scenarios to help models catch nuances, but don’t overdo it. If you’re using few-shot prompts, the marginal benefit of extra background context often drops after a few examples.
    • Summarized context can be nearly as effective as full context. When bandwidth, latency, or token limits matter, summaries are a practical compromise.
  • Expect different performance across document types
    • Don’t assume a single approach will work equally well for all finance texts. ECTs and SEC filings present different challenges; tailor your prompts and maybe even model choice by document type.
  • Validate and monitor for biases
    • Although the agreement with human judgments was high, the dataset was annotated with a powerful model, which may introduce biases. It’s wise to periodically sample human-validated annotations to ensure models remain aligned with expert judgment.
  • Be transparent about limitations
    • The study used data from two companies. Broader generalization to other industries or firms may require additional validation. Also, numerical reasoning remains a persistent hurdle in formal financial prose.

Limitations worth noting (so you don’t over-interpret)

  • Scope of data: Only two companies were used for both SEC filings and ECTs. This limits generalizability. Real-world deployments should test across a broader set of firms and industries.
  • Annotation method: Ground truth relied heavily on ChatGPT-o3-pro annotations with human checks. While agreement was high, the possibility of model-induced biases or systematic errors isn’t zero.
  • Numerical reasoning challenges: SEC filings’ dense numerical information can trip up even strong models, particularly when stance depends on relative changes or ratios. This remains an active area for improvement.

A quick peek at the practical setup

If you’re thinking about applying this yourself, here’s a lightweight blueprint inspired by the study:

  • Data collection:
    • Gather Form 10-K MD&A sections and quarterly earnings call transcripts for companies of interest.
    • Split into sentences and filter for those that mention debt, EPS, or sales.
  • Annotation (ground truth):
    • Use a high-quality reasoning model to generate stance labels and justifications.
    • Have human annotators verify a subset to estimate agreement and correct errors.
  • Model choices:
    • Start with a strong prompt-capable model (e.g., GPT-4.1-mini) for baseline performance.
    • Experiment with open-weight options (Llama 3.3, Gemma 3, Mistral 3 Small) to balance cost and accuracy.
  • Prompting strategy:
    • Zero-shot with and without context to gauge baseline.
    • Few-shot with semantically similar exemplars; vary k (1, 5, 10) to find the sweet spot.
    • Include CoT prompts to encourage step-by-step reasoning.
  • Context management:
    • Test full MD&A or ECT transcripts vs summarized versions to find what works best for your models and constraints.
  • Evaluation:
    • Measure accuracy and consider confidence calibration if you plan to surface results in dashboards or reports.
    • Track performance by target (debt, EPS, sales) and by document type (SEC vs ECT) to pinpoint where improvements are needed.
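Putting the blueprint together, here is a minimal end-to-end sketch: filter sentences by target, classify each one with whatever model you choose, and score results by target and document type. The `classify_stance` stub, the keyword lists, and the example schema are placeholders of my own, not the paper's pipeline; in practice you would plug in the prompt builder and model call of your choice.

```python
from collections import defaultdict

# Deliberately simple keyword filter standing in for the paper's sentence-selection step.
TARGET_KEYWORDS = {
    "debt": ["debt", "borrowings", "notes payable", "leverage"],
    "EPS": ["earnings per share", "eps", "diluted earnings"],
    "sales": ["sales", "revenue", "net revenues"],
}

def find_targets(sentence: str) -> list[str]:
    """Return the financial targets a sentence appears to mention."""
    lowered = sentence.lower()
    return [t for t, kws in TARGET_KEYWORDS.items() if any(kw in lowered for kw in kws)]

def classify_stance(sentence: str, target: str) -> str:
    """Placeholder: build a prompt (see the earlier sketch) and call your chosen LLM here."""
    raise NotImplementedError("Wire this up to your model of choice.")

def evaluate(examples: list[dict]) -> dict:
    """examples: dicts with 'sentence', 'target', 'doc_type' ('SEC' or 'ECT'), and gold 'stance'."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        key = (ex["target"], ex["doc_type"])
        total[key] += 1
        if classify_stance(ex["sentence"], ex["target"]) == ex["stance"]:
            correct[key] += 1
    # Accuracy broken down by (target, document type), as suggested above.
    return {key: correct[key] / total[key] for key in total}
```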

Conclusion: what it all adds up to

This study offers a compelling demonstration that modern large language models, with careful prompting and example selection, can do targeted stance detection in financial texts at the sentence level. The key takeaway is not just which model performs best (GPT-4.1-mini often leads the pack) but that the way you prompt, the kind of background you provide, and how you choose your few-shot examples can make a meaningful difference—especially when you’re working with data that’s heavy on numbers and formal language.

For professionals who analyze corporate disclosures, this work suggests a practical, scalable path to surface meaningful stance signals about debt, EPS, and sales. It also provides concrete guidelines for prompt engineering in finance, highlighting two actionable strategies: use chain-of-thought reasoning and curate semantically similar few-shot exemplars. And because the researchers have made prompts, data, and code public, practitioners can adapt and build on this approach in their own analytic workflows.

If you’re curious about experimenting further, you could explore extending the dataset to more companies, broader time windows, or additional financial targets (like cash flow or liquidity ratios). You might also test more advanced numerical reasoning methods or hybrid approaches that combine LLMs with dedicated financial calculators to handle those tricky ratio-based stances more reliably.

Key Takeaways

  • Targeted stance detection is feasible: Large language models can determine stance toward specific financial targets (debt, EPS, sales) at the sentence level using SEC filings and earnings call transcripts.
  • CoT boosts performance, especially for smaller models: Prompting models to show reasoning steps helps improve accuracy, with the biggest gains for less-capable models.
  • Semantically similar few-shot exemplars win: Choosing few-shot examples that are most similar to the test instance consistently improves performance (not just random examples).
  • Context helps, but with diminishing returns: Providing company background helps in zero-shot tasks; as you add more few-shot examples, the extra context becomes less critical. Summarized context often performs nearly as well as full context.
  • ECTs vs. SEC filings present different challenges: Earnings calls are typically easier for stance detection due to their conversational style, while SEC filings require deeper numerical and logical reasoning.
  • You don’t need massive labeled data: The approach demonstrates practical viability for real-world finance tasks without exploding annotation costs.
  • Reproducibility matters: Sharing prompts, data, and code supports the broader community in refining and adapting these techniques.

If you’re diving into the world of financial text analytics, this study provides a practical, evidence-backed playbook for getting started with prompt-based stance detection and highlights the nuanced trade-offs you’ll encounter along the way.
