Skeletonization Attacks on Vision-Language Models for Math Text Recognition

Skeletonization reshapes adversarial attacks on math text recognition in vision-language models. By reducing 2D character strokes to a 1D skeleton, attackers can dramatically narrow the search space for tiny, targeted image changes that derail the LaTeX output. The results reveal surprising model limits, transfer effects to chat-based assistants, and practical safeguards for safer AI.

Introduction

Imagine trying to read a dense mathematical formula from an image, where the goal isn’t just to recognize characters but to convert them into precise LaTeX code. Now imagine a clever method that quietly makes minimal, targeted changes to the image to fool a powerful large vision-language model (think tools like ChatGPT that can parse visuals), without obvious visual glitches. That’s the essence of the latest research on skeletonization-based adversarial perturbations for mathematical text recognition. In short: a new way to strike at the “seeing and reasoning” part of vision-enabled LLMs by carving down the problem space and probing how the model interprets math formulas.

This work dives into the vulnerabilities of foundation models when their visual-text understanding is put to the test, especially for mathematical expressions that are notoriously tricky to parse and translate into LaTeX. The authors introduce a novel attack that leverages skeletonization—a technique that reduces characters to their essential, one-pixel-wide lines—to drastically narrow where an attacker can nudge an image. The combination of skeletonization with text-area focusing makes the attack practical in black-box settings, which is a big deal given that you often don’t have access to the model internals in real-world scenarios. The researchers validate their approach by analyzing both character-level changes and semantic shifts in the LaTeX output and even testing how the perturbations transfer to ChatGPT’s image-understanding flow. For those curious about the nuts and bolts, this study is grounded in the original paper Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition.

If you’re after a quick hook: the work shows that restricting the attack to skeletonized text regions, and then exploring perturbations with different search strategies (including CMA-ES, TPE, and Random Search), can dramatically affect both the syntactic output (the LaTeX code) and the meaning (the math it represents). It’s a sobering reminder that even high-performing, visually intelligent models can be sensitive to carefully crafted perturbations—especially in specialized domains like math OCR.


Why This Matters

This isn’t just an academic exercise in attacking models; it’s a wake-up call about how robust our AI systems really are when they “see” text in images. Here’s why this line of research is particularly timely and important right now:

  • Real-world AI relies on visual-text understanding for education, research, and professional work. If a math equation image can be subtly mangled to produce incorrect LaTeX or misleading semantic output, the consequences could range from minor errors in notes to serious academic honesty concerns in online learning environments.
  • The study probes black-box vulnerabilities. In practical terms, you don’t always get to peek under the hood of a model like ChatGPT or a math OCR tool; you only feed it an image and observe the output. Demonstrating effective attacks under such constraints highlights a gap between performance metrics (accuracy on standard benchmarks) and real-world robustness.
  • This research sharpens the conversation about robustness vs. efficiency. Skeletonization reduces search space, making attacks faster and more scalable. At the same time, it prompts developers to consider defenses that explicitly account for the sparse, line-based nature of written math, rather than treating all pixels as equally interchangeable.
  • It builds on a broader suite of AI safety and robustness work, tying into the conversation about how training and evaluation should account for visual reasoning and notation-specific intricacies. The paper extends prior ideas around adversarial perturbations (like FGSM in white-box contexts) into a practical, text-focused, vision-language setting.

A practical takeaway is that education and productivity tools that interpret math images—whether for homework assistance, online exams, or automated grading—need defenses that acknowledge the unique structure of mathematical notation and the fragility of LaTeX representations in OCR. The research also reinforces the importance of cross-model evaluation. The authors test the adversarial images against a real-world service (ChatGPT) to demonstrate black-box transferability, a crucial step toward understanding risk in deployed systems.

If you want to dive deeper, you can read more in the original paper here: Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition. It’s a great example of moving beyond generic image perturbations to domain-specific, structure-aware attacks.


The Attack Idea: Skeletonization and 1D Representation

From Pixels to One Pixel: The Skeletonization Trick

Think of a dense, noisy network of pixels forming a character. The idea behind skeletonization is to strip away the “fat” and keep only the essential spine of the character—the minimal structure that still conveys the identity of the glyph. In practice, this reduces each character to a 1-pixel-wide line that captures the essential strokes. Why does that matter for an attacker? It drastically reduces the search space for perturbations. Instead of nudging countless pixel values across a large image, the attacker can focus on a narrow, skeletonized representation that still preserves the critical information needed for recognition.
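
To make the idea concrete, here is a minimal sketch of the skeletonization step using scikit-image; the file name is a placeholder, and this is an illustration of the technique rather than the authors' exact pipeline.

```python
# Minimal skeletonization sketch (scikit-image), assuming a grayscale formula image.
import numpy as np
from skimage.io import imread
from skimage.filters import threshold_otsu
from skimage.morphology import skeletonize

image = imread("equation.png", as_gray=True)  # placeholder file name
binary = image < threshold_otsu(image)        # dark strokes on a light background
skeleton = skeletonize(binary)                # boolean mask of 1-pixel-wide strokes

# Only these skeleton pixels are candidate perturbation locations,
# which is what shrinks the attack's search space.
candidate_pixels = np.argwhere(skeleton)
print(f"{len(candidate_pixels)} candidate pixels out of {binary.size} total")
```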

For math expressions, where precise spatial relationships (like superscripts, subscripts, and fraction bars) matter, this skeletonization is a double-edged sword. It preserves the core geometry while removing extraneous texture. The attack then works in this lean space to perturb the locations that actually matter for OCR-to-LaTeX conversion, making the process more efficient and, in a black-box setting, more feasible.

Targeting Text Areas: Bounding Boxes and 1D Arrays

The pipeline is deliberately text-centric. The attacker first detects the bounding boxes for all characters in the input image. This step ensures we’re perturbing the right regions—those that actually contribute to the OCR result. Once the text zones are isolated, each character region undergoes skeletonization, turning it into those minimal one-pixel-wide lines.

But the clever part isn’t just skeletonization by itself; it’s how they transform the resulting structure into a compact, searchable form. The skeletonized regions are converted into a 1D array by concatenating pixel values row by row. Then, all the character arrays are concatenated in a top-to-bottom, left-to-right order. The result is a unique 1D representation for each image that captures the essential textual layout without the noise. This 1D representation makes it much easier to apply optimization methods to find perturbations that meaningfully drift the LaTeX output.
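
As a rough illustration of that pipeline, the sketch below uses connected components as a stand-in for the character bounding-box detector and builds the 1D fingerprint from the `binary` mask of the previous sketch; the authors' actual detection and ordering details may differ.

```python
# Minimal sketch: per-character skeletons concatenated into one 1D array.
import numpy as np
from skimage.measure import label, regionprops
from skimage.morphology import skeletonize

def skeleton_fingerprint(binary):
    """Flatten skeletonized character crops row by row and concatenate them."""
    regions = regionprops(label(binary))
    # Reading order: top-to-bottom, then left-to-right, as described above.
    regions = sorted(regions, key=lambda r: (r.bbox[0], r.bbox[1]))
    pieces = []
    for region in regions:
        r0, c0, r1, c1 = region.bbox
        crop = binary[r0:r1, c0:c1]
        pieces.append(skeletonize(crop).ravel())  # row-by-row flattening
    return np.concatenate(pieces)

fingerprint = skeleton_fingerprint(binary)  # `binary` from the earlier sketch
```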

In short: skeletonize to a lean, text-focused representation, convert to a concise 1D fingerprint, and then search for perturbations within that fingerprint.

If you want a sense of the numbers behind the approach, the researchers tested multiple narrowing strategies (ranging from using the full image down to just the skeletonized text region) and found that the more the search space was narrowed, the more effective the attack tended to be. That's a strong hint that for vision-language models, the critical levers really lie in the text regions rather than the image background.

As a reminder, the method is described in more detail in the original work, which you can read here: Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition.


Experiment Setup and Key Findings

A Small, Focused Dataset: 40 Equations at 50px

To keep things controlled and avoid data leakage from large public datasets, the researchers built a compact dataset of 40 digital images of mathematical equations. Each image was resized to a height of 50 pixels to standardize evaluation across test cases. The goal here wasn’t to beat a mega-benchmark but to rigorously test the attack mechanics in a clean, controlled environment and to explore how different narrowing strategies affect attack effectiveness.
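
For reference, the resizing step described above amounts to something like the following minimal sketch (Pillow is assumed, file names are placeholders, and the authors' preprocessing may differ in detail):

```python
# Minimal sketch: resize an equation image to a fixed height of 50 px,
# preserving aspect ratio. Pillow is assumed; the file name is a placeholder.
from PIL import Image

def resize_to_height(path, target_height=50):
    img = Image.open(path).convert("L")
    new_width = round(img.width * target_height / img.height)
    return img.resize((new_width, target_height), Image.LANCZOS)

resized = resize_to_height("equation_01.png")
```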

This dataset choice matters. In math OCR, a lot of the difficulty comes from the precise spatial language of formulas, not merely character detection. By focusing on a compact, carefully curated set, the team could isolate the impact of their skeletonization pipeline and the optimization loop on LaTeX outputs.

Cosine Similarity and TF-IDF: Measuring Syntactic Shifts

A core part of the evaluation is how they quantify changes in the resulting LaTeX code. They borrowed a classic information-retrieval trick: represent LaTeX sequences with TF-IDF vectors and then compute cosine similarity between the clean and adversarial LaTeX outputs. The less similar the two sequences are, the more the attack has managed to perturb syntactic structure.

This TF-IDF + cosine similarity approach is a practical proxy for “how differently do the two LaTeX strings read?”—not just visually but in terms of the textual content that LaTeX encodes. It also serves as the loss function that guides the optimization loop (more on that next).
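
A minimal sketch of that loss, assuming scikit-learn and character-level n-grams over the LaTeX strings (the paper's exact tokenization is not reproduced here):

```python
# Minimal sketch: TF-IDF vectors over LaTeX strings + cosine similarity as the loss.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def latex_similarity(clean_latex, adversarial_latex):
    """Lower values mean the adversarial LaTeX reads more differently from the clean one."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    vectors = vectorizer.fit_transform([clean_latex, adversarial_latex])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

# The attack tries to drive this similarity down.
print(latex_similarity(r"\frac{a}{b} + c^{2}", r"\frac{a}{b} - c_{2}"))
```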

To validate the robustness of the attack strategy beyond raw syntax, the researchers also performed a manual semantic check. That is, they asked: do these syntactic changes actually alter the meaning of the math expression, or are they minor, equivalent reformulations? This dual approach helps separate purely superficial alterations from genuine shifts in mathematical meaning.

For readers who want to see the numbers, the paper reports that narrower search spaces correlate with higher attack effectiveness, as shown in their early tables. And for a sense of transferability, they tested the perturbed images against ChatGPT to see if the deception survives the black-box path into a real-world vision-capable LLM.

Search Space Narrowing vs. Optimization: What Worked?

The optimization step is where the rubber meets the road. They experimented with three optimization strategies:

  • Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
  • Tree-structured Parzen Estimator (TPE)
  • Random Search

All three were used in conjunction with different narrowing approaches to the search space, but one surprising result stood out: Random Search consistently outperformed CMA-ES and TPE in their experiments. The authors suggest that random sampling might better explore a wider portion of the narrowed space and avoid getting stuck in local optima, especially in the high-variance, sparse perturbation landscape created by skeletonized text.

This finding is a little reminder that the most sophisticated optimization tool isn’t always the best tool for every problem. In a black-box setting with a constrained, sparse perturbation space, brute-force-ish exploration via Random Search can reveal weaknesses more reliably than more “intelligent” methods that assume smoother landscapes.
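
To make the mechanics concrete, here is a minimal sketch of a greedy Random Search loop over the skeleton pixels. The `query_model` callable (the black-box OCR/LLM query) and `similarity_fn` (e.g., the TF-IDF helper above) are assumed interfaces introduced for illustration, not the authors' implementation.

```python
# Minimal greedy Random Search sketch over skeleton pixel locations.
# `image` is a uint8 grayscale array; `candidate_pixels` is an (N, 2) array of
# skeleton coordinates; `query_model(image) -> latex` and
# `similarity_fn(a, b) -> float` are assumed, illustrative interfaces.
import numpy as np

rng = np.random.default_rng(0)

def random_search_attack(image, candidate_pixels, clean_latex, query_model,
                         similarity_fn, n_queries=200, pixels_per_try=20, delta=80):
    best_image, best_score = image, 1.0
    for _ in range(n_queries):
        trial = best_image.copy()
        # Perturb a small random subset of skeleton pixels.
        idx = rng.choice(len(candidate_pixels), size=pixels_per_try, replace=False)
        for r, c in candidate_pixels[idx]:
            trial[r, c] = np.clip(int(trial[r, c]) + rng.choice([-delta, delta]), 0, 255)
        score = similarity_fn(clean_latex, query_model(trial))  # black-box query
        if score < best_score:  # lower similarity = stronger syntactic drift
            best_image, best_score = trial, score
    return best_image, best_score
```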

Transfer to ChatGPT: Real-World Black-Box Impacts

To test real-world relevance, the researchers fed the adversarially perturbed images into ChatGPT (via the GPT-4 model on the web) and compared the LaTeX outputs to those from clean images. The results showed notable semantic changes, signaling that the perturbations could propagate beyond a controlled OCR setting into actual deployed vision-enabled services.

In the paper’s tables, the “upper row” corresponds to the LaTeX recognized from original images, while the “lower row” shows the LaTeX from attacked images. The semantic shifts were substantial, underscoring that black-box transfer attacks on foundation models with vision capabilities are not just theoretical curiosities but practical threats in real-world workflows.

If you’d like to see the exact comparisons, the paper includes visual examples and a clear narrative of how the same perturbation can derail the downstream LaTeX interpretation by a sophisticated model.

Reading tip: the same paper that introduces these methods also discusses performance metrics like PSNR (peak signal-to-noise ratio) to quantify how much the image changed visually, as well as the cosine-similarity-derived success rate metric (the proportion of images where the similarity dropped below a threshold). This combination of perceptual and semantic metrics strengthens the case that these perturbations matter on multiple levels.
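
A minimal sketch of those two metrics, with an illustrative threshold (the paper's exact threshold is not reproduced here):

```python
# Minimal sketch: PSNR between clean and attacked images, plus a
# cosine-similarity-based success rate. The 0.5 threshold is illustrative.
import numpy as np

def psnr(clean, attacked, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means the images look more alike."""
    mse = np.mean((clean.astype(np.float64) - attacked.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

def attack_success_rate(similarities, threshold=0.5):
    """Fraction of images whose clean-vs-attacked LaTeX similarity fell below the threshold."""
    return float(np.mean(np.asarray(similarities) < threshold))
```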

For those who want to explore the core idea in depth, you can consult the original publication here: Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition.


Practical Implications and Safeguards

This research isn't doom-and-gloom for its own sake; it's a practical invitation to strengthen the robustness of vision-enabled models in high-stakes domains like mathematics. Here are some takeaways for practitioners, educators, and researchers:

  • Domain-specific defenses matter. Since the attack targets the skeletonized, text-centric regions of math images, defenses could focus on robust skeletonization-aware OCR pipelines. Techniques like adversarial training that specifically incorporate skeletonized perturbations in the math OCR stage could improve resilience against these attacks.
  • Robust evaluation should go beyond generic benchmarks. This study demonstrates that a system’s true robustness isn’t captured by standard accuracy metrics alone; it also depends on how the model behaves under stylized, domain-specific perturbations that preserve visuals but disrupt meaning.
  • Cross-model testing is essential. The fact that the attack transfers to a real-world service like ChatGPT signals that defenses must be evaluated across both purpose-built OCR systems and general foundation models with visual capabilities.
  • Practical safeguards in education and content creation. In a world where students might use AI tools to interpret or generate LaTeX from images, educators and platform designers should be aware of potential manipulation channels. This could inform policies around image submissions, prompts used for OCR, and post-processing checks that validate the semantic equivalence of LaTeX outputs (see the sketch after this list).
  • A call for broader, open benchmarks. The authors’ choice to construct a new, small dataset highlights a gap: reproducible, domain-relevant benchmarks for math OCR robustness. The community could benefit from shared datasets that explicitly test semantic fidelity under perturbations.
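
As one example of the post-processing check mentioned above, here is a minimal sketch that compares two LaTeX strings for mathematical equivalence using SymPy's LaTeX parser (which requires the antlr4 runtime); this is an illustrative safeguard, not something implemented in the paper.

```python
# Minimal sketch: flag LaTeX outputs whose mathematical meaning has drifted.
# Assumes SymPy with the antlr4-python3-runtime installed; illustrative only.
from sympy import simplify
from sympy.parsing.latex import parse_latex

def semantically_equivalent(latex_a, latex_b):
    """True if the two LaTeX expressions simplify to the same mathematical object."""
    try:
        return simplify(parse_latex(latex_a) - parse_latex(latex_b)) == 0
    except Exception:
        return False  # unparseable output is treated as a mismatch

print(semantically_equivalent(r"\frac{a}{b}", r"\frac{2a}{2b}"))  # equivalent rewrite
print(semantically_equivalent(r"c^{2}", r"c_{2}"))                # changed meaning
```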

In sum, the work provides a blueprint for how to think about vulnerabilities in math-aware vision-language systems and suggests practical directions for hardening these models. If you’re building or deploying math-augmented AI tools, consider these skeleton-aware perturbations as a stress test that complements traditional evaluation methods.


Main Content Sections (Digestible Breakdown)

1) The Attack Idea: Skeletonization and 1D Representation
- Why skeletonization helps: it compresses text to its essential stroke structure, reducing the number of pixels attackers need to perturb.
- How the attack is targeted: it focuses on character bounding boxes, then skeletonizes each, producing a concise 1D representation that captures the sequence and layout of text.

2) Measuring Impact: Character vs. Semantic Change
- Character-level changes are tracked using cosine similarity on TF-IDF representations of LaTeX strings.
- Semantic shifts are assessed manually to determine whether perturbations alter the mathematical meaning.
- The dual metric approach reveals attacks that erode not only syntactic accuracy but also semantic correctness.

3) Black-Box Transfer and Real-World Implications
- The transfer test with ChatGPT demonstrates that perturbations aren’t just a lab artifact; they can degrade outputs in widely used AI services.
- Random Search’s surprising effectiveness shows that even simple strategies can exploit fundamental weaknesses in text recognition pipelines when the search space is narrow.

4) Implications for Robustness and Future Defenses
- The study spotlights the need for not just more data but smarter, domain-aware defenses.
- Future work could explore score-based query attacks and more resilient architectures that explicitly model the structure of mathematical notation.

For deeper context, the original paper lays out the methodological details and results: Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition.


Key Takeaways

  • Skeletonization dramatically narrows the perturbation search space by reducing characters to essential strokes, enabling efficient black-box attacks on math text recognition.
  • The attack uses a 1D array representation derived from skeletonized text regions, which helps guide perturbations to impactful pixels.
  • TF-IDF-based cosine similarity between clean and attacked LaTeX outputs serves as a robust syntactic loss, while manual semantic checks reveal meaningful meaning changes in the math expressions.
  • Among optimization strategies, Random Search often outperformed CMA-ES and TPE in this narrowed space, highlighting that simpler, broader exploration can beat more sophisticated methods in certain adversarial contexts.
  • The attacks transfer to real-world vision-capable models like ChatGPT, signaling tangible risks for educational tools, OCR services, and automated math reasoning platforms.
  • This work motivates the development of domain-aware defenses and more robust evaluation practices to ensure math notation integrity in AI-assisted workflows.

Practical takeaway: if your workflow depends on accurate LaTeX extraction from images, especially for math-heavy content, you should consider skeletonization-aware defenses and cross-model validation to mitigate these kinds of perturbations.


Sources & Further Reading

If you want to explore more about this topic or keep an eye on how defenses evolve in vision-enabled LLMs, following this line of work is a great starting point. It sits at a critical junction of OCR robustness, adversarial machine learning, and the practical reliability of AI tools in education and research.
