AI Geography Explained: How ChatGPT “Sees” the World
Table of Contents
- Introduction
- Why This Matters
- How Generative AI Forms “Defaults” (and Gets Brittle)
- When Bias Slips Back In Through Composite Tasks
- The “Model Knows vs. Model Shows” Problem
- A Research Agenda for Safer Geographic AI
Introduction
If you’ve ever asked ChatGPT (or another generative AI) “What’s a country?” or “Name a major city,” you’ve probably gotten an answer that felt confident—even if it wasn’t exactly what you expected. New research digs into a surprisingly important question: how geography looks according to generative AI systems, especially large language models like ChatGPT. This work is based on the new article Geography According to ChatGPT -- How Generative AI Represents and Reasons about Geography.
At the heart of the paper is an idea many people miss when evaluating AI: accuracy isn’t the whole story. Even if a model gives facts that are “right,” it can still represent the world in a skewed way—overemphasizing some places, underrepresenting others, and sometimes reacting strongly to tiny changes in how you phrase a question. The authors argue that we need to study representation, not just correctness: not only what the model says, but how it constructs a geographic picture.
The paper lays out three exploratory “vignettes” (mini investigations) to provoke follow-up work. First, do these models form strong defaults—and how fragile are those defaults under paraphrasing? Second, can distributional shifts sneak back in when you combine multiple “benign” tasks—like generating AI personas and then assigning roles? Third, do models merely recall geographic rules, or do they truly apply them when reasoning independently? Let’s walk through what they found and why it should make anyone who cares about places—planning, tourism, mapping, or even everyday decision-making—pay attention.
Why This Matters
This research is significant right now because generative AI is no longer just a chatbot hobby. People increasingly use these systems to make real decisions: where to travel, how to plan, what to buy, how to imagine cities and neighborhoods, and what risks might exist in certain places. When AI becomes an everyday “geographic narrator,” even small representational quirks can steer real-world behavior—quietly and at scale.
Here’s a scenario you could face today: a tourism recommender or content generator uses ChatGPT-like systems to draft itineraries. If the model has strong defaults that repeatedly point people toward a small set of iconic places (think: “Japan” as the default country exemplar), it can contribute to over-tourism in some destinations while making others feel “invisible.” The paper doesn’t claim this is the only cause of tourism patterns—but it shows how easily a model can produce a narrowing lens, even when you don’t ask for it.
And it builds on earlier AI research in a meaningful way. We already know AI can be biased along human categories like gender or race. What’s new here is the insistence that geographic representation deserves similar scrutiny—and that the “bias story” may be incomplete if we only test single prompts or check for straightforward “correct vs incorrect.” The authors also push back on a common assumption from explainable AI: even if you can inspect internal reasoning, you may still miss what deployed systems do to people in practice (“in-the-wild” effects). In other words, this isn’t just about model internals; it’s about the ecosystem-level consequences of how AI depicts geography.
How Generative AI Forms “Defaults” (and Gets Brittle)
One of the most vivid findings in the paper is the tendency of generative models to generate defaults—a small set of prototypical answers that show up again and again for a given category.
Defaults: the “San Diego Zoo effect”
Earlier work the authors cite (studying multiple LLMs across ~300 geographic feature types from GeoNames) found patterns like:
- Every one of the 11 models consistently picked San Diego Zoo when asked to name a zoo.
- 10 out of 11 would consistently name the Everglades for “wetland.”
- 8 out of 11 leaned heavily toward Paris as an example city.
So the model’s “knowledge” may be accurate in a narrow sense—San Diego Zoo is real, Everglades is real—but the representation is lopsided. A city is not “Paris-shaped,” and a wetland is not “Everglades-shaped,” yet the model repeatedly acts like these are the default prototypes.
The paper further claims something that matters for safety and robustness: defaults can increase in strength across newer generations. For instance, older ChatGPT-3.5 appeared more diverse than later models, matching observations that model changes aimed at reducing hallucinations can also lead to more convergence.
Brittleness: tiny wording changes → big geographic swings
Even more interesting is how fragile these outputs can be. The authors show that prompts that are nearly identical semantically can trigger different default countries.
In one experiment involving 200 independent queries to GPT-5.1 at temperature 0.3, the system favored:
- Japan in 168 cases for the prompt: “Name a country, please.”
- Canada in 104 cases for: “Please name a country.”
Both prompts express essentially the same intent, but the model’s distribution shifts dramatically. And it doesn’t always stop at two countries: raising the temperature to 1.0 made a third option (Brazil) appear occasionally.
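This kind of paraphrase sensitivity is straightforward to quantify once you have repeated answers. Here is a minimal sketch (the answer lists are toy stand-ins that mirror the reported counts, not the paper’s raw data), scoring the gap between two prompts with total variation distance:

```python
from collections import Counter

def answer_distribution(answers):
    """Normalize a list of model answers into a probability distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two answer distributions
    (0 = identical, 1 = completely disjoint)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

# Toy answers standing in for 200 real model responses per paraphrase.
prompt_a = ["Japan"] * 168 + ["Canada"] * 32    # "Name a country, please."
prompt_b = ["Canada"] * 104 + ["Japan"] * 96    # "Please name a country."

shift = total_variation(answer_distribution(prompt_a),
                        answer_distribution(prompt_b))
print(f"distributional shift under paraphrase: {shift:.2f}")
```

A distance of 0 would mean the two phrasings produce identical answer distributions; values near 1 mean the answers almost never overlap.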
There’s a deeper implication here: you might expect an “intelligent system” to produce stable answers under paraphrasing. But instead, it behaves more like a system with strong attractors—geographic exemplars it locks onto.
Why this matters in real life
If you build tools on top of these systems—summaries, itinerary drafts, outreach emails, even “seed ideas” for planners—you can accidentally hard-code geographic stereotypes. Users will think they’re getting personalized recommendations, while the model is quietly repeating a small learned subset.
Even the paper’s framing is pointed: it’s not enough to ask whether the model is “correct.” The question becomes: does the model paint the world with a broad brush or a narrow brush—and does that brush change when the user’s phrasing changes by a single word?
When Bias Slips Back In Through Composite Tasks
If defaults describe how models choose examples, this section is about how distributions can shift later—especially when you combine AI steps that individually seem harmless.
Debiasing isn’t a magic eraser
The authors note that efforts to debias generative models (and the data behind them, like knowledge graphs) can be controversial. Even if debiasing works on paper, it may:
- over-correct into weird distortions, and
- embed a specific set of norms that vary by region and culture.
They cite real-world examples like image generation cases where attempts to improve diversity or avoid stereotypes can produce unexpected or offensive outcomes. The takeaway is not “debiasing is pointless,” but rather: debiasing changes behavior through an additional policy layer, and that policy can itself become a source of new distortions.
Distributional shift can re-enter via “benign” scaffolding
Here’s the vignette idea: AI agents are often used in chains. You don’t just ask one question—you run step 1, then step 2, and so on. A system may pass a bias test in isolation, but a composite workflow can generate a new distribution.
The authors tested a two-stage setup (with sensitive topics, so they explicitly caution that they’re not claiming this shows “racial bias” in a simple way):
Step 1: Generate 50 realistic personas for the Greater Los Angeles area.
GPT-4o produced hypothetical residents with fields like names, occupations, ethnicity/race, and age. Across 8 independent runs (50 personas requested each time), 381 hypothetical residents were created correctly. They compared that synthetic distribution to publicly available LA demographic statistics (2020).
They found mismatches—for ethnicity/race, their generated sample looked like:
- Hispanic or Latino: 35.43%
- White non-Hispanic: 26.77%
- Black or African American alone: 19.69%
- Asian alone: 17.59%
- Other Race alone: 0.52%
The sample underrepresented Hispanics/Latinos compared to the reference distribution and overrepresented some other groups.
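Checking a workflow like this reduces to comparing generated shares against a reference distribution, group by group. A minimal sketch, where the reference values are approximate 2020 census figures used purely for illustration (not the paper’s exact reference data):

```python
# Generated shares reported from the GPT-4o persona runs (percent).
generated = {"Hispanic or Latino": 35.43, "White non-Hispanic": 26.77,
             "Black or African American": 19.69, "Asian": 17.59,
             "Other": 0.52}

# Illustrative reference shares for the Greater LA area (approximate
# 2020 census figures; NOT the paper's exact reference distribution).
reference = {"Hispanic or Latino": 48.6, "White non-Hispanic": 25.3,
             "Black or African American": 7.9, "Asian": 15.4,
             "Other": 2.8}

# Percentage-point gap per group: generated minus reference.
gaps = {group: generated[group] - reference[group] for group in generated}
for group, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{group:>28s}: {gap:+6.2f} percentage points")
```

Negative gaps flag underrepresented groups; positive gaps flag overrepresented ones.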
Step 2: Use the generated personas in another cold-start AI run.
They fed the ~50 personas from a run back into GPT-4o to create a plot premise for a future book, asking which characters had a past criminal record so that it could work as a crime story.
The resulting “criminal role” patterns differed from what you’d expect from certain pre-COVID arrest statistics; for example, White characters accounted for less than 6% of the criminal roles assigned in their runs (details shown in their figure).
The key point: even safeguards can’t guarantee geographic “fairness”
Again, the authors are careful: crime and arrest distributions have their own complexities and biases, and no reference distribution can be treated as perfect ground truth given confounds such as age, prompt wording, and causal factors.
The scientific punchline is broader than race: the evaluation problem is hard. Even when a system is trained with safeguards (like refusing some racially loaded questions), it can still generate a shifted distribution through the composition of tasks. If you’re building tools that generate personas for cities, neighborhoods, or demographic narratives, you need to test not just each step—but the entire workflow.
The authors also suggest that symbolic systems like knowledge graphs might provide reference data when you need “grounded” distributions. But doing that robustly across geography is still an open challenge.
The “Model Knows vs. Model Shows” Problem
This vignette tackles a subtle but important distinction: models can talk about geographic laws and principles, while failing to apply them under new constraints.
Lots of models can explain Zipf’s law—until you make them use it
The paper discusses city population size distributions. Across many countries, the sizes of the largest cities roughly follow Zipf’s law (the rank-size rule): the r-th largest city tends to have about 1/r the population of the largest, so the biggest city towers over the rest and subsequent cities shrink in a characteristic way.
Many models will happily respond if you ask, “Do you know about Zipf’s law?” They can provide explanations, references, and examples.
But the authors changed the task to test application. They asked models to imagine a new island nation (“Novaterra”) with a given total population (about 60 million) and a development level similar to the US or Japan. Then they asked for the names and sizes of its 30 largest cities, with sizes that respect the total population constraint and roughly follow the rank-size pattern.
Across 25 runs (5 runs × 5 models), only 2 outputs—both from GPT-5—explicitly referenced the rank-size rule and produced city sizes close to expectations. Even more striking: none of the systems maintained the total population constraint, frequently exceeding 60 million before even reaching rank 10.
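For contrast, here is a minimal sketch of what a constraint-respecting answer could look like: allocate the fixed total across ranks in proportion to 1/r. It assigns the entire national population to the 30 cities, which no real country does, so treat it as a reference baseline rather than a realistic model:

```python
def rank_size_populations(total_population, n_cities, exponent=1.0):
    """Allocate a fixed national population across n cities so sizes
    follow a Zipf-like rank-size pattern: size(r) ~ 1 / r**exponent."""
    weights = [1.0 / r ** exponent for r in range(1, n_cities + 1)]
    scale = total_population / sum(weights)
    return [round(w * scale) for w in weights]

sizes = rank_size_populations(60_000_000, 30)
# Largest city is ~2x the second-largest; total stays at ~60 million.
print(sizes[0], sizes[1], sum(sizes))
```

The point of such a baseline is that the rank-size shape and the 60-million cap can be honored simultaneously, which is exactly what the tested systems failed to do.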
Why this is more than “math accuracy”
You might say: “Well, city populations don’t strictly follow Zipf’s law in every fictional country.” True. The point isn’t to force a single law onto every case.
The point is the difference between:
- Model Knows: can retrieve and describe geographic principles when prompted.
- Model Shows: can consistently apply principles to a constrained generation task without being explicitly told.
In geography, this matters because many downstream uses (urban analytics narratives, disaster planning stories, demographic scenario generation) rely on constraints and consistency, not just on plausible-sounding explanations.
The Allestone analogy: latent understanding isn’t guaranteed
The paper closes the vignette with an intriguing comparison to a child artist’s fictional map (“Allestone” by Thomas Williams Malkin). Researchers analyzed whether that child’s map adhered to real geographic principles like fractal coastline properties, Horton’s law, and central place theory. Remarkably, it approximated these patterns—suggesting something like “latent understanding.”
Could generative AI replicate this kind of principle adherence autonomously? That’s the question the vignette raises: can we detect deeper structural reasoning, not just surface-level correctness?
A Research Agenda for Safer Geographic AI
The paper’s conclusions connect these vignettes into a larger message: representation is not neutral, and AI systems behave like “herds” at scale. Even small biases—about which cities count, which countries “come to mind,” which narratives get plausible roles—can compound as AI becomes a reference layer for more AI.
One particularly worrying future mechanism is model collapse: future AI models trained on outputs generated by earlier AI systems may inherit and amplify the same narrow geographic viewpoints. If the training data increasingly reflects AI’s default exemplars, the geographic “needle” gets stuck more easily.
Correctness is necessary, but representation is equally crucial
Traditional evaluation often focuses on whether a response is right. The authors argue that we should also evaluate:
- geographic coverage (what regions are visible?),
- representational diversity (how narrow are the defaults?),
- robustness (does paraphrasing break the distribution?), and
- deeper reasoning (can principles be applied under constraints?).
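The representational-diversity criterion, for instance, can be operationalized with a simple entropy score over repeated answers. This is a sketch of one possible metric, not one taken from the paper:

```python
import math
from collections import Counter

def normalized_entropy(answers):
    """Shannon entropy of the answer distribution, normalized to [0, 1].
    0 = a single default dominates; 1 = answers spread uniformly."""
    counts = Counter(answers)
    total = sum(counts.values())
    probs = [n / total for n in counts.values()]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

# A heavily defaulted exemplar set vs. a more diverse one (toy data).
defaulted = ["Paris"] * 18 + ["London", "Tokyo"]
diverse = ["Paris", "Lagos", "Jakarta", "Lima", "Cairo"] * 4

print(f"{normalized_entropy(defaulted):.2f} vs {normalized_entropy(diverse):.2f}")
```

Tracking a score like this across model versions would make claims such as “newer generations converge more strongly on defaults” directly measurable.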
The neurosymbolic twist—and the risk of hidden problems
Finally, the paper points out a growing trend: newer AI systems increasingly call tools—like retrieval augmented generation (RAG) or code execution—rather than relying only on latent reasoning. For geographic reasoning, this might mean loading polygon geometry into Python and using Shapely for topology operations.
This can reduce visible errors. But the authors warn: it may also obscure unresolved latent representation issues. If the model can “cheat” by calling tools, you might miss that it still fails to understand the underlying spatial relationships in contexts where tool use breaks or is unavailable.
So even as systems become more capable, the representational questions remain—and they may become harder to detect.
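To make the tool-use point concrete, here is a deliberately simple stand-in for the kind of topology operation such calls offload: a ray-casting point-in-polygon test in plain Python (Shapely’s actual predicates are far more robust and handle edge cases this sketch ignores):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from (x, y) crosses. An odd count means the point lies inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y-coordinate?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square), point_in_polygon(5, 2, square))
```

If a model can delegate this check to a tool, a correct answer tells you nothing about whether the model itself represents the containment relationship.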
Key Takeaways
- Generative AI forms strong geographic defaults (e.g., specific exemplars like San Diego Zoo for “zoo”), often reducing geographic diversity.
- Outputs can be brittle: tiny prompt changes (“Name a country” vs “Please name a country”) can swing which country appears most (Japan vs Canada in the reported GPT-5.1 tests at temperature 0.3).
- Distributional shift can re-enter through multi-step workflows, even when safeguards exist—like generating AI personas for LA and then assigning sensitive attributes (criminal roles) in a follow-up step.
- There’s a gap between “knowing” and “showing”: models may explain geographic principles but fail to apply them consistently under constraints (e.g., city population allocations ignoring a total population constraint even when rank-size concepts were expected).
- Evaluation must go beyond correctness toward representation, coverage, robustness, and constraint-consistent reasoning—especially because AI outputs increasingly shape how people think and act about places.
Sources & Further Reading
- Original Research Paper: Geography According to ChatGPT -- How Generative AI Represents and Reasons about Geography
- Authors: Krzysztof Janowicz, Gengchen Mai, Rui Zhu, Song Gao, Zhangyu Wang, Yingjie Hu, Lauren Bennett