Diversity, Novelty & Popularity Bias in ChatGPT Recommendations

ChatGPT isn't just about accuracy anymore. This post investigates how its recommendations in books, films, and music balance diversity, novelty, and popularity bias. Drawing on benchmarks like Facebook Books, MovieLens, and Last.FM, it reveals both ChatGPT's strengths and where it lags on beyond-accuracy dimensions.



Introduction

ChatGPT has evolved from a conversational assistant into a potential driver of recommendations across books, movies, and music. But as the Recommender Systems (RSs) community expands its gaze beyond mere accuracy, three questions rise to the top: How diverse are its suggestions? Are the recommendations genuinely novel or just a clever re-packaging of popular items? And how much do popularity biases creep in? This post dives into these “beyond-accuracy” dimensions, drawing on new research that puts ChatGPT to the test across three domains—Books, Movies, and Music—using well-known benchmarks like Facebook Books, MovieLens, and Last.FM. The study compares ChatGPT-3.5 and ChatGPT-4, and also weighs it against traditional baselines. For a full deep dive, you can check the original work here: Exploring Diversity, Novelty, and Popularity Bias in ChatGPT's Recommendations.

Why this matters now is simple: as AI-powered assistants become more embedded in everyday tools—video streaming, shopping assistants, and even personal wellness apps—the quality of recommendations shapes what people discover, consume, and even learn. If a model overemphasizes popular items, users miss out on hidden gems; if it’s too cautious, users may tire of the same handful of recommendations. The new study argues that ChatGPT’s multi-domain reach offers an intriguing blend of accuracy and beyond-accuracy traits, but also reveals important biases and limitations that product teams should understand before deploying these capabilities at scale.

In the sections that follow, I’ll translate the research into an approachable, practical lens: what the researchers found, what it means for real-world applications, and how this builds on the broader arc of AI-driven personalization.

Why This Matters

Beyond-accuracy dimensions—diversity, novelty, and popularity bias—are not luxuries. They’re the difference between a recommendation feed that feels stale and one that invites genuine exploration. Consider a streaming service that nudges you toward the same handful of popular titles. You may think you’re getting good suggestions, but your long-tail interests and occasional serendipitous finds suffer. Conversely, a feed that exposes you to a broader set of items can boost long-term engagement, satisfaction, and the sense that the system truly “gets” you.

This study is especially timely because it probes ChatGPT’s behavior in real-world-like settings, not just standard accuracy metrics. It’s one thing to claim high precision, recall, or nDCG; it’s another to show that a model can actively diversify recommendations, surface novel options you wouldn’t easily find on your own, and yet avoid leaning too heavily on the most-recommended, most-popular items. The work also builds on a broader AI research trajectory: earlier RS work emphasized utility and usefulness beyond raw accuracy; modern research blends classic methods with large language models to evaluate how prompt wording, prompt engineering, and interaction styles influence outcomes. In short, this isn’t just about whether ChatGPT can predict well; it’s about whether it can offer a healthier, more enriching mix of suggestions that respects user curiosity and fairness.

If you’re curious about the experimental setup and the numbers behind these claims, the original paper explains the methodologies in detail, including the prompting strategies they tested (including zero-shot, few-shot, and role-playing prompts) and the post-processing steps used to align ChatGPT’s outputs with real catalogs. You’ll also find a transparent discussion of limitations, like potential memorization or hallucination risks, and how the authors mitigated them. A natural next step is to compare results with other LLMs and across more domains, something the authors point out as future work.
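To make those prompting styles concrete, here is an illustrative Python sketch of the three kinds of templates. The wording, the placeholder names, the 50-item list length, and the few-shot example are all hypothetical; they are not the paper's exact prompts.

```python
# Illustrative prompt templates for zero-shot, few-shot, and role-playing prompting.
# Everything below (wording, list length, example titles) is an assumption for
# demonstration purposes, not the study's actual prompts.

ZERO_SHOT = (
    "Given a user who liked the following books: {liked_items}, "
    "recommend 50 other books, one title per line."
)

FEW_SHOT = (
    "Example:\n"
    "User liked: The Hobbit; Dune\n"
    "Recommendations: The Name of the Wind; Hyperion\n\n"
    "Now, user liked: {liked_items}\n"
    "Recommendations:"
)

ROLE_PLAYING = (
    "You are a book recommender system. Suggest titles the user has not "
    "interacted with yet, avoiding duplicates.\n"
    "User liked: {liked_items}\n"
    "Return a ranked list of 50 recommendations, one title per line."
)

def build_prompt(template: str, liked_items: list[str]) -> str:
    """Fill a template with the user's interaction history."""
    return template.format(liked_items="; ".join(liked_items))
```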

Link to the underlying study for deeper reading: Exploring Diversity, Novelty, and Popularity Bias in ChatGPT's Recommendations.

Diversity in ChatGPT Recommendations

Diversity in RSs measures how broad the set of recommended items is, rather than how consistently the same few items appear across lists. In this study, two key metrics are used:

  • Gini coefficient (lower means more even distribution across items; higher signals concentration on a few items)
  • Item Coverage (how many unique items from the catalog appear across all recommendations)
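As a reference point, here is a minimal Python sketch of how these two diversity metrics are commonly computed over a set of recommendation lists. The handling of never-recommended items and the exact normalization in the paper's evaluation framework may differ slightly.

```python
from collections import Counter

def gini_and_coverage(rec_lists, catalog_size):
    """Aggregate diversity metrics over all users' recommendation lists.

    rec_lists: iterable of per-user lists of recommended item IDs.
    Returns (gini, item_coverage). A common formulation of the exposure
    Gini; 0 means perfectly even exposure, values near 1 mean a few items
    dominate.
    """
    exposure = Counter(item for recs in rec_lists for item in recs)
    # Items never recommended count as zero exposure (an assumption here).
    counts = list(exposure.values()) + [0] * (catalog_size - len(exposure))
    counts.sort()
    n = len(counts)
    total = sum(counts)
    gini = sum((2 * (i + 1) - n - 1) * c for i, c in enumerate(counts)) / (n * total)
    coverage = len(exposure)  # number of unique catalog items recommended
    return gini, coverage
```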

Take the Facebook Books dataset as an anchor. The results show that ChatGPT-4 tends to offer more diverse lists than ChatGPT-3.5, with a Gini of about 0.1050 and an Item Coverage of 1,004 out of 2,234 items. That means GPT-4 repeated items less often and touched a larger portion of the catalog, although it did not top every baseline. In other words, ChatGPT-4 widens the palette compared with GPT-3.5, though some baselines (e.g., dedicated content-based or neighbor-based methods) can push diversity even further in some scenarios.

Last.FM paints a similar but nuanced picture: GPT-4’s Gini is around 0.2023, with coverage of 944 out of 1,507 items. That suggests a broader spread than GPT-3.5 and shows ChatGPT’s capacity to surface a wider range of music items, not just the most popular tracks. MovieLens tells a slightly different story: GPT-4’s Gini is about 0.0853, with coverage of 553 out of 1,862 items. While this is a solidly diverse footprint, the study notes that some strong diversity baselines (e.g., certain graph-based or hybrid methods) still push the envelope further.

So, what does this mean in practice? ChatGPT’s diversity is domain-sensitive. In Books and Music, GPT-4 demonstrates appreciable variety beyond a single motif, which can help users discover long-tail items and new genres. In Movies, the diversity picture is more mixed; the model delivers a reasonable spread but doesn’t consistently beat the strongest diversity-focused baselines. Across domains, the headline takeaway is that the prompting approach and the model version matter: GPT-4 generally offers more diverse recommendations than GPT-3.5, and role-playing prompts often deliver cleaner, more varied candidate lists.

Practical implication: if you’re building a mixed-media recommender, you might pair ChatGPT with a diversity-focused post-filter. For instance, after generating a candidate list with an LLM, you could apply a diversification step that prioritizes underrepresented genres or release years, ensuring long-tail exposure without sacrificing relevance. The study’s results provide a factual backbone for those design decisions and highlight where ChatGPT’s diversity strengths lie.
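As an illustration of such a post-filter, the sketch below greedily re-ranks an LLM-generated candidate list so that items introducing not-yet-covered genres get a small boost. The blend weight and the genre-based bonus are hypothetical design choices, not something prescribed by the study; it assumes relevance scores are roughly normalized to [0, 1].

```python
def diversify(candidates, item_genres, k=10, penalty=0.3):
    """Greedy diversification of an LLM-generated candidate list.

    candidates: list of (item_id, relevance_score) pairs, scores in [0, 1].
    item_genres: dict mapping item_id -> set of genre labels.
    Each pick boosts later items that still add genres not yet covered.
    """
    selected, covered = [], set()
    pool = dict(candidates)
    while pool and len(selected) < k:
        def adjusted(item):
            genres = item_genres.get(item, set())
            new_genre_share = len(genres - covered) / (len(genres) or 1)
            return (1 - penalty) * pool[item] + penalty * new_genre_share
        best = max(pool, key=adjusted)
        selected.append(best)
        covered |= item_genres.get(best, set())
        del pool[best]
    return selected
```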

For a broader sense of the methodology and the framing around diversity, novelty, and bias, you can read the original paper. It also situates these results within the evolution of ChatGPT-based recommendation approaches and how beyond-accuracy metrics have gained prominence in RS research.

Novelty and Discovery

Novelty captures how surprising or unexpected recommended items are to the user—essential for serendipity and lifelong engagement. The study uses two complementary measures:

  • EPC (Expected Popularity Complement): higher values indicate a tilt toward items that are less mainstream
  • EFD (Expected Free Discovery): higher values reflect a greater chance that users encounter items they wouldn’t actively seek out on their own
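For intuition, here is a simplified Python sketch of both measures computed for a single recommendation list. It drops the rank discounts that full evaluation frameworks typically apply, so treat it as illustrative rather than a reproduction of the paper's exact computation.

```python
import math

def epc_efd(rec_list, item_popularity, num_users):
    """Simplified (rank-discount-free) novelty metrics for one list.

    item_popularity: dict item_id -> number of users who interacted with it.
    EPC rewards items few users already know; EFD is its log-scaled cousin.
    """
    epc, efd = 0.0, 0.0
    for item in rec_list:
        p = item_popularity.get(item, 0) / num_users  # estimated P(known | item)
        epc += 1.0 - p
        efd += -math.log2(max(p, 1e-12))  # clamp to avoid log(0) for unseen items
    n = len(rec_list)
    return epc / n, efd / n
```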

Across datasets, ChatGPT shows noteworthy novelty, with some consistent patterns:

  • Facebook Books: ChatGPT-4 exhibits high novelty (EPC around 0.0353 and EFD around 0.3486). These figures outpace most baselines, including many collaborative-filtering and content-based methods. In other words, GPT-4 tends to surface book picks that feel fresh, not simply the most discussed titles.
  • Last.FM: Both GPT-3.5 and GPT-4 score above average on EPC and EFD, but GPT-4 edges out GPT-3.5. The gap isn’t huge, but it’s meaningful: GPT-4 shifts a bit toward less mainstream items compared with GPT-3.5 and many baselines.
  • MovieLens: GPT-4 again leads the way among the ChatGPT variants, with EPC around 0.1453 and EFD around 1.6010. While these numbers are lower than the best-performing accuracy-oriented models, they place GPT-4 on par with several strong baselines in terms of novelty.

Put simply: in terms of novelty, ChatGPT-4 tends to be more adventurous than GPT-3.5, and in some domains, it rivals or surpasses traditional RS methods that prioritize relevance. The general takeaway is that ChatGPT, when prompted thoughtfully, can push beyond returning only safe, obvious picks and instead offer items that broaden a user’s experiential horizon.

Analogy: novelty in recommendations is like a festival lineup. If every show is the same style, you might get comfortable—but you miss the thrill of discovering an unexpected act that redefines your musical taste. GPT-4 seems more willing to throw in a curveball, while still keeping the overall experience relevant.

Practical implication: for brands seeking to boost user discovery and engagement, leveraging ChatGPT-4 with a design that rewards exploration (while maintaining relevance) could help users uncover new authors, genres, or artists they might love but wouldn’t seek out on their own. The study provides concrete novelty indicators you can monitor as you tune prompts and post-processing steps.

For a deeper dive into the novelty results and their interpretation, see the original paper’s detailed tables and discussion. The authors explicitly connect novelty outcomes to the practical aim of fostering serendipity in user experiences.

Popularity Bias in Practice

Popularity bias is the tendency to favor well-known, frequently interacted-with items at the expense of niche or long-tail options. The study uses two complementary metrics:

  • APLT (Average Percentage of Long-Tail items): higher means more long-tail exposure
  • ARP (Average Recommendation Popularity): lower values indicate less bias toward popular items
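The sketch below shows one common way to compute both metrics from training-set interaction counts. The 20% short-head cutoff is a frequently used convention and an assumption here, not necessarily the split used in the paper.

```python
def aplt_arp(rec_lists, item_popularity, short_head_fraction=0.2):
    """Popularity-bias metrics averaged over users.

    rec_lists: iterable of per-user recommendation lists.
    item_popularity: dict item_id -> interaction count in the training data.
    """
    ranked = sorted(item_popularity, key=item_popularity.get, reverse=True)
    short_head = set(ranked[: int(len(ranked) * short_head_fraction)])

    aplt_vals, arp_vals = [], []
    for recs in rec_lists:
        long_tail_hits = sum(1 for i in recs if i not in short_head)
        aplt_vals.append(long_tail_hits / len(recs))
        arp_vals.append(sum(item_popularity.get(i, 0) for i in recs) / len(recs))
    return sum(aplt_vals) / len(aplt_vals), sum(arp_vals) / len(arp_vals)
```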

Facebook Books presents an interesting dynamic: ChatGPT-3.5 shows APLT around 0.1870 with ARP about 46; GPT-4 nudges APLT up to about 0.2424 while reducing ARP to around 40. This indicates that GPT-4 expands long-tail exposure a bit but still isn’t as biased toward the most popular items as some baselines. In other words, GPT-4 reduces some reliance on the crowd-pleasers but doesn’t flip the bias on its head.

Last.FM provides a mixed picture. GPT-3.5 has APLT ~0.1391 and ARP ~99, while GPT-4 moves to APLT ~0.1267 and ARP ~102. Here, GPT-4 suggests a somewhat smaller share of the long tail and a tilt toward more popular items relative to GPT-3.5. That said, both ChatGPT variants stay in the mid-range compared with strongly popularity-driven baselines (e.g., MostPop, which shows ARP well above 100). It’s a nuanced outcome: ChatGPT is not relentlessly chasing popularity, but in this domain, GPT-4 shows a proverbial “popularity ambivalence”—less tail exposure than GPT-3.5 in this dataset, but also not as biased toward mainstream items as some alternatives.

MovieLens yields a similar pattern: ARP values around 90 for GPT-3.5 and 95 for GPT-4, which are lower than the MostPop baseline (ARP around 182) but higher than some graph-based or neighbor-based methods. APLT values follow the same trend, reinforcing a cautious stance toward deep long-tail coverage. The upshot: ChatGPT does not perfectly solve popularity bias, but it also does not blindly chase popularity. GPT-4 tends to be modestly less biased toward the mainstream lineup than its predecessor in certain domains, indicating progress in balancing item exposure.

The researchers emphasize that while ChatGPT’s bias toward popular items is present, it’s not as pronounced as the most extreme baselines. This suggests a middle ground where users still see relevant, well-known options, but there’s room for improving long-tail exposure and fairness.

One practical takeaway for product teams: if your goal is to maximize discovery without sacrificing satisfaction, you can use ChatGPT as a starting point for recommendations and then apply a post-filter or diversification step that deliberately elevates underrepresented items. The study’s findings provide a credible benchmark for how far such strategies can push the needle across domains.

For additional context and the exact numeric landscape, the original paper details the ARP and APLT values by dataset and model version (GPT-3.5 vs GPT-4), and it discusses how these biases compare to various baselines.

Cold-Start Scenarios and Beyond-Accuracy

Cold-start is a notorious challenge in RSs: how to seed a good set of recommendations when a user has only a handful of interactions. The study simulates this by giving each user a maximum of ten interactions and then evaluating ChatGPT’s performance against strong collaborative filtering and content-based baselines.
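A minimal sketch of that kind of simulation is below; whether the retained interactions are sampled randomly or chosen by recency is an assumption on my part, not a detail confirmed above.

```python
import random

def simulate_cold_start(user_histories, max_interactions=10, seed=42):
    """Build a cold-start profile set by capping each user's history.

    user_histories: dict user_id -> list of interacted item IDs.
    Here the cap is applied by random sampling (an assumption); keeping the
    most recent interactions would be an equally plausible choice.
    """
    rng = random.Random(seed)
    capped = {}
    for user, items in user_histories.items():
        if len(items) <= max_interactions:
            capped[user] = list(items)
        else:
            capped[user] = rng.sample(items, max_interactions)
    return capped
```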

Accuracy under cold-start shows that ChatGPT remains competitive, even strong, in identifying relevant items with limited data. In Facebook Books, GPT-4 achieves nDCG around 0.0538 and Recall around 0.0873, outperforming several baselines that rely on longer user histories. In Last.FM, ChatGPT maintains solid performance (nDCG around 0.28 or higher, Recall around 0.34+), surpassing some non-personalized and content-based approaches and holding its own vs. more complex baselines. For MovieLens, GPT-4 not only achieves the highest nDCG among the compared methods (about 0.1405) but also maintains strong Recall, illustrating its ability to surface genuinely relevant items even with sparse data.

Beyond accuracy, the cold-start scenario is where the study highlights ChatGPT’s capabilities in diversity and novelty. GPT-4 generally improves diversity and item coverage compared with GPT-3.5 in all three datasets, indicating a broader exploratory tendency when user history is shallow. Novelty, as reflected by EPC and EFD, also stays favorable for GPT-4 vs. GPT-3.5, suggesting that even with little data, ChatGPT can push items that users might not discover on their own. As with the main results, the balance between novelty/diversity and accuracy is domain-dependent: some domains benefit more from novelty, others from accuracy, and a few from a more cautious approach to popularity bias.

An important note from the methodology: to ensure fair evaluation, the authors implemented a post-processing pipeline to map ChatGPT’s outputs to in-catalog items when possible, using Gestalt pattern matching with a 90% similarity threshold. This helps avoid evaluating “External Items” the model may hallucinate. The practical implication is clear: when using LLM-powered recommendations in real systems, a robust reconciliation step with the catalog is essential to maintain a credible evaluation and user experience. The study reports that out-of-catalog items tended to appear late in lists (beyond top-10 in most cases), which helps protect rank-sensitive metrics.
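Python's difflib module implements Gestalt (Ratcliff-Obershelp) pattern matching, so a reconciliation step in that spirit can be sketched as follows. The lowercase normalization and the way out-of-catalog items are flagged are my assumptions, not the authors' exact pipeline.

```python
import difflib

def map_to_catalog(generated_titles, catalog_titles, threshold=0.90):
    """Reconcile LLM-generated titles with an item catalog.

    Titles with no catalog match above the similarity threshold are kept
    aside as out-of-catalog ("external") items rather than evaluated.
    """
    catalog_lower = {t.lower(): t for t in catalog_titles}
    mapped, external = [], []
    for title in generated_titles:
        match = difflib.get_close_matches(
            title.lower(), catalog_lower.keys(), n=1, cutoff=threshold
        )
        if match:
            mapped.append(catalog_lower[match[0]])
        else:
            external.append(title)
    return mapped, external
```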

If you want to see the cold-start results in full detail, the original paper lays out the numbers by dataset and model version for both accuracy and beyond-accuracy metrics, offering a transparent view of where ChatGPT shines and where it still struggles.

Key Takeaways

  • ChatGPT, especially the GPT-4 variant, shows strong beyond-accuracy performance across multiple domains, balancing diversity, novelty, and to a reasonable extent, popularity bias.
  • Domain differences matter: Books tend to yield the strongest balance of novelty and diversity with GPT-4, Last.FM shows solid results, and MovieLens presents more mixed outcomes with a relatively larger tilt toward popular items.
  • Cold-start scenarios reveal that ChatGPT is surprisingly effective with limited user history, maintaining solid accuracy and improving diversity and novelty compared with many baselines.
  • Role-Playing prompts emerged as an effective prompting strategy in this study, reducing duplicates and improving the practicality of generated lists. However, all prompting approaches require careful post-processing to align model outputs with real catalogs and avoid evaluation and user experience pitfalls.
  • Memorization and potential data leakage are important caveats when using LLM-based recommendations. The authors emphasize the need for caution and continued research into how memorization interacts with recommendation quality and fairness.
  • Practical deployments can benefit from a hybrid approach: start with ChatGPT to surface diverse, novel candidates and then apply diversification and ranking strategies that preserve accuracy while expanding exposure to long-tail items.

If you’re considering applying ChatGPT in a live recommender system, this research offers a solid evidence base for what to expect and what to watch out for, plus a clear methodology for evaluating beyond-accuracy metrics in your own domain. For a deeper, numbers-rich treatment, see the original work.

Sources & Further Reading

Notes for readers who want to dive deeper: the paper lays out a thorough methodology, including the prompting strategies (Zero-Shot, Few-Shot, Chain-of-Thought, and Role-Playing prompts), post-processing with Gestalt matching to keep results within catalog bounds, and the detailed evaluation framework across accuracy and beyond-accuracy metrics. The authors also discuss limitations related to memorization and hallucination risks, which are critical considerations as we move toward more AI-driven personalization in real-world applications.

In short, ChatGPT is not just a clever generator of recommendations; it’s a tool that, with thoughtful prompting and careful system design, can offer a richer, more exploratory user experience. The key is to balance relevance with discovery, all while staying mindful of biases and the risks of over-reliance on popular items.

