From Jargon to Clarity: How Language Models Create Readable Labels for Scientific Paper Clusters
If you’ve ever dived into a big library of papers and tried to get your bearings from the labels (or lack thereof) on clusters of topics, you know the struggle: some labels read like a spammy list of buzzwords, while others are vague enough to fit almost anything. A recent study tackles this head-on, exploring how large language models (LLMs) can automatically generate descriptive, human-friendly labels for clusters of scientific documents. The goal? Make it easier for researchers and curious readers to grasp what a group of papers is really about—without needing to be a domain expert.
In this post, I’ll distill the authors’ ideas, methods, and findings into a practical guide you can use for your own bibliometric projects. We’ll break down the big distinctions, walk through the labeling workflow in plain language, and highlight what works (and what to watch out for) when you’re using language models to label clusters of science papers. Finally, you’ll find a practical “Key Takeaways” section to boost your own prompting and labeling efforts.
Two Kinds of Labels: Characteristic vs Descriptive
First, a quick vocabulary check. The paper makes a clean distinction between two types of cluster labels:
Characteristic labels: These are built by pulling out distinguishing terms from the documents themselves and concatenating them into a label. Think “standard model; Higgs boson; particle collider; LHC.” They’re precise and directly tied to the cluster’s content, but they can read as technical jargon and sometimes lack a coherent, high-level takeaway.
Descriptive labels: Generated by language models, these aim to summarize the cluster in a human-readable, intuitive way. They resemble the kinds of labels a human annotator would produce, such as “Particle Physics,” “Galactic Dynamics,” or “Black Hole Physics.” Descriptive labels are designed to be legible and interpretable, even to non-experts, and to convey the gist of the cluster without relying on the exact jargon of the underlying papers.
Why this distinction matters: descriptive labels aren’t just prettier; they’re meant to help a broad audience quickly understand what a cluster covers. But they’re also harder to pin down in a formal, replicable way, which is why the authors spend time defining the task, proposing a workflow, and building an evaluative framework.
From Idea to Practice: A Structured Descriptive Labeling Workflow
Here’s the practical way the authors approached “descriptive labeling” and how you can think about it in your own projects.
1) Start with solid clusters
- Data: They pulled a large set of English-language papers from Dimensions (a bibliographic database), spanning four fields: Plant biology, Oncology and carcinogenesis, Artificial Intelligence, and Applied and developmental psychology.
- Clustering pipeline: Each paper gets a dense vector representation via SPECTER (a SciBERT-based model). Those vectors are then reduced to two dimensions with UMAP (for efficiency), and clusters are found with HDBSCAN. The aim is to end up with roughly 100 clusters per field, with clusters not being too tiny.
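The shape of that pipeline is easy to sketch. The minimal stand-in below uses synthetic 768-dimensional vectors in place of SPECTER embeddings, PCA in place of UMAP, and DBSCAN in place of HDBSCAN, so it runs without the specialized libraries; the structure (embed, reduce to 2-D, density-cluster) mirrors the paper's steps, but the components and parameters here are illustrative stand-ins:

```python
# Stand-in sketch of the embed -> reduce -> cluster pipeline. The paper
# uses SPECTER embeddings, UMAP, and HDBSCAN; synthetic vectors, PCA,
# and DBSCAN stand in here so the sketch runs anywhere.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Simulate 768-d "paper embeddings" falling into three topical groups
# (real SPECTER vectors are also 768-dimensional).
X, _ = make_blobs(n_samples=300, n_features=768, centers=3,
                  cluster_std=0.5, random_state=0)

# Reduce to two dimensions for efficient density-based clustering
# (the paper uses UMAP for this step).
X2 = PCA(n_components=2, random_state=0).fit_transform(X)

# Density-based clustering (HDBSCAN in the paper); -1 marks noise points.
cluster_ids = DBSCAN(eps=2.0, min_samples=5).fit_predict(X2)

n_clusters = len(set(cluster_ids) - {-1})
print(n_clusters)
```

On well-separated synthetic groups like these, the density-based step recovers all three clusters; on real paper embeddings, the minimum-cluster-size knobs are what keep you from ending up with "too tiny" clusters.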
2) Extract cluster characteristics (the Fi)
For each cluster, you extract a few high-signal features that will feed into the prompt sent to the language model. They use three kinds of characteristics:
- Characteristic terms: They use Dimensions’ “concepts” field (noun phrases) and compute TF-IDF within the cluster to identify the top 12 terms that best distinguish that cluster from others.
- Prominent venue titles: The three most frequent journals or conferences within the cluster.
- Prominent documents: The three papers with the highest field-normalized citation scores within the cluster (the top documents that tend to summarize or epitomize the cluster).
These three feature sets are assembled into a structured prompt for the language model.
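The characteristic-term step can be approximated with a small TF-IDF ranking over each cluster's concept list. The function and toy concepts below are illustrative, not the paper's exact implementation (which uses Dimensions' "concepts" field and keeps the top 12 terms per cluster):

```python
import math
from collections import Counter

def top_terms(cluster_concepts, k=12):
    """Rank each cluster's concepts by a TF-IDF-style score.

    cluster_concepts: dict mapping cluster id -> list of concept strings,
    one entry per occurrence. The paper keeps the top 12 per cluster.
    """
    n_clusters = len(cluster_concepts)
    # Document frequency: how many clusters mention each concept at all.
    df = Counter()
    for concepts in cluster_concepts.values():
        df.update(set(concepts))
    ranked = {}
    for cid, concepts in cluster_concepts.items():
        tf = Counter(concepts)
        scores = {c: tf[c] * math.log(n_clusters / df[c]) for c in tf}
        ranked[cid] = sorted(scores, key=lambda c: (-scores[c], c))[:k]
    return ranked

# Toy example: "model" appears in both clusters, so its IDF (and score)
# is zero and it drops out of the top terms.
clusters = {
    "c1": ["higgs boson", "higgs boson", "collider", "model"],
    "c2": ["neuron", "neuron", "cortex", "model"],
}
print(top_terms(clusters, k=2))
```

The same pattern extends to the other two feature sets: venue titles and top papers are just frequency counts and citation-score sorts within the cluster.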
3) Turn characteristics into a prompt for a language model
- The model is asked to generate a label based on the cluster’s Fi features. The labeling function is GenerateLabel(Fi, model, template, gamma), where:
  - model is the language model (the study mainly used OpenAI’s ChatGPT family; other models were tried but produced more formatting issues in practice).
  - template encodes the task instructions plus the cluster characteristics in a readable form.
  - gamma covers model-specific knobs (like temperature).
- The authors emphasize the process is iterative: you generate a label, then you validate it and may re-prompt with clarifications if needed.
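In code, a GenerateLabel call reduces to a template fill plus a model call. The template wording and helper below are hypothetical stand-ins (the paper's exact prompt differs, and generate_fn would wrap a real LLM API):

```python
# Hedged sketch of GenerateLabel(Fi, model, template, gamma). The template
# text is invented for illustration; the paper's actual prompt differs.
TEMPLATE = (
    "You are labelling a cluster of scientific papers.\n"
    "Characteristic terms: {terms}\n"
    "Prominent venues: {venues}\n"
    "Prominent papers: {papers}\n"
    "Reply with one short descriptive label and nothing else."
)

def generate_label(fi, generate_fn, template=TEMPLATE, gamma=None):
    """fi: dict with 'terms', 'venues', 'papers' lists (the Fi features).
    generate_fn: callable wrapping an LLM API; gamma: model knobs."""
    prompt = template.format(
        terms="; ".join(fi["terms"]),
        venues="; ".join(fi["venues"]),
        papers="; ".join(fi["papers"]),
    )
    return generate_fn(prompt, **(gamma or {})).strip()

# Usage with a stubbed model so the sketch stays offline:
fi = {
    "terms": ["higgs boson", "collider"],
    "venues": ["Physical Review D"],
    "papers": ["Observation of a new boson at the LHC"],
}
label = generate_label(fi, lambda prompt, **kw: " Particle Physics ",
                       gamma={"temperature": 0.2})
print(label)
```

Keeping the model behind a plain callable like this makes the iterative part easy: a re-prompt is just another call with an amended template.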
4) The validation loop: keeping labels honest and non-duplicative
Two big challenges with LLM-generated labels are:
- Within-label validity: the label should be coherent and actually describe the cluster.
- Across-label validity: the label shouldn’t duplicate or be too vague across many clusters.
To handle this, they add a validation function, Validate(L), that checks all generated labels and can return alternative labels L′. They implemented three practical checks:
- Format: labels should be between 3 and 50 characters (not too long, not too short).
- Duplicated: ensure we don’t have the same label used for multiple clusters.
- Non-specific: ensure each label is specific to its cluster. They do this by embedding the label and a concise sentence summarizing the cluster’s distinguishing features, then comparing cosine similarities. If the label is not sufficiently distinct, it’s regenerated.
If necessary, the prompt is updated with a clause describing the problem and an example to avoid in the future. The loop runs up to 10 iterations; if nothing improves, the most recent labels are kept.
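A minimal Validate(L) might look like the sketch below. The embedding function is pluggable, and the similarity rule is one plausible reading of the specificity check (flag a label whose embedding matches another cluster's summary at least as well as its own cluster's); the re-prompting loop itself is left out:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def validate(labels, summaries, embed):
    """Run the three checks; return {cluster_id: failure_reason}.

    labels:    {cluster_id: generated label}
    summaries: {cluster_id: sentence summarizing distinguishing features}
    embed:     any text -> vector function (e.g. a sentence embedder)
    """
    problems, seen = {}, set()
    for cid, label in labels.items():
        if not 3 <= len(label) <= 50:       # format check
            problems[cid] = "format"
        elif label in seen:                 # duplicate check
            problems[cid] = "duplicated"
        seen.add(label)
    for cid, label in labels.items():       # non-specific check
        if cid in problems:
            continue
        own = cosine(embed(label), embed(summaries[cid]))
        rival = max((cosine(embed(label), embed(summaries[o]))
                     for o in summaries if o != cid), default=-1.0)
        if rival >= own:
            problems[cid] = "non-specific"
    return problems

# Toy embedding: fixed 2-d vectors keyed by text.
vecs = {
    "Particle Physics": np.array([1.0, 0.0]),
    "Neuroscience": np.array([0.1, 1.0]),
    "quark and collider studies": np.array([0.9, 0.1]),
    "neuron and cortex studies": np.array([0.0, 1.0]),
}
labels = {"c1": "Particle Physics", "c2": "Neuroscience"}
summaries = {"c1": "quark and collider studies",
             "c2": "neuron and cortex studies"}
print(validate(labels, summaries, vecs.__getitem__))  # prints {}
```

In the full loop, each failure reason would be turned into an extra prompt clause ("avoid labels like …") and the offending labels regenerated, up to the 10-iteration cap.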
5) System prompts and model choices: what actually mattered
The study tested several ChatGPT variants (ranging from older, cheaper models like gpt-3.5-turbo to newer, more capable options) and compared prompts with and without a system prompt. Their findings suggest:
- The inclusion of a system prompt (e.g., “You are a librarian with expertise in taxonomy…”) moderately influences results but isn’t a slam dunk.
- Relying solely on certain parts of the prompt (e.g., only the top papers, or only the concepts) can lead to semantic shifts in the labels. The most robust effect came from including the concepts, papers, and journals together.
- Model choice matters. Newer models generally produce labels more consistent with the baseline, while older ones can yield more divergent results.
6) Evaluation: comparing descriptive vs characteristic labels in a fair way
Instead of asking humans to pick between descriptive and characteristic labels, the researchers designed an evaluation that tests whether a label can unambiguously map to the correct cluster. They ran a 50-cluster annotation exercise across the four fields, where annotators tried to pick the correct label from a short list (the real label plus three randomly chosen distractors) given the cluster’s example journals or papers.
Key metrics used in the study:
- Label-shift: a way to quantify semantic differences between two labeling workflows. You generate vector representations for each label, compute cosine similarity, and average across clusters. A lower similarity between workflows means more drastic changes; higher similarity means more consistency.
- First-pass metrics: count how many generated labels failed the validation criteria before looping (useful for early signals about prompt quality).
- Human-based identification accuracy: whether human annotators can correctly map a cluster to its label, comparing descriptive vs characteristic labels.
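The label-shift metric is straightforward to express in code. The cosine-averaging below follows the description above, with a toy lookup table standing in for a real sentence encoder:

```python
import numpy as np

def label_shift(labels_a, labels_b, embed):
    """Average cosine similarity between two workflows' labels for the
    same clusters. Values near 1 mean the workflows agree; lower values
    mean a larger semantic shift."""
    sims = []
    for cid in labels_a:
        va, vb = embed(labels_a[cid]), embed(labels_b[cid])
        sims.append(float(np.dot(va, vb) /
                          (np.linalg.norm(va) * np.linalg.norm(vb))))
    return sum(sims) / len(sims)

# Toy embedding: workflow B keeps one label semantically identical but
# swaps the other for an unrelated topic, halving the average similarity.
vecs = {"Plant Genomics": np.array([1.0, 0.0]),
        "Crop Genetics": np.array([1.0, 0.0]),
        "Deep Learning": np.array([0.0, 1.0])}
a = {"c1": "Plant Genomics", "c2": "Plant Genomics"}
b = {"c1": "Crop Genetics", "c2": "Deep Learning"}
print(label_shift(a, b, vecs.__getitem__))  # prints 0.5
```

Because it compares whole workflows rather than individual labels, this metric is also handy for quantifying how much a prompt tweak or a model swap moves your labels.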
Takeaways from the findings:
- Descriptive labels performed at least as well as characteristic labels in uniquely identifying clusters, and in some fields they outperformed them (e.g., Plant biology showed a clear edge for descriptive labels).
- In other fields (Artificial Intelligence, Oncology, Psychology), descriptive labels were on par with characteristic labels.
- The quality of the input cluster matters. When the top papers weren’t representative of the cluster, a descriptor based on those papers tended to mislead. In such cases, relying on the concepts and venue terms tended to yield better, more robust labels.
- The language model choice and prompt design matter. The study found that a minimal but well-structured prompt often works well, and that including the full combination of concepts, papers, and journals generally produced more stable results than using any single component alone.
- Jargon vs readability: descriptive labels are more legible and approachable, which is a big win for dashboards and non-expert audiences, without sacrificing the ability to distinguish clusters.
What this means for real-world use
If you’re building or refining a bibliometric workflow, here are practical implications and tips drawn from the study:
Prefer descriptive labels for accessibility, but check for disambiguation. People reading dashboards or doing quick scans will benefit from readable summaries that still differentiate clusters. Use a validation step to prevent vague or duplicate labels across clusters.
Build a robust feature set before prompting the model. The three-pronged Fi approach (characteristic terms, venues, and representative papers) gives the model a solid, multi-faceted view of each cluster. In practice, you can tailor these to your data sources:
- Terms/concepts from abstracts or titles
- Venue names that capture the disciplinary context
- Representative or highly-cited papers that embody the cluster’s focus
Don’t rely on a single highlighted document. The study warns that top-cited papers may not be fully representative of a cluster. It’s better to provide a mix of concepts and examples that cover the cluster’s breadth.
Prompt design matters, but keep it practical. A template plus multiple clauses works well. Including a system prompt can help steer the model, but expect diminishing returns in some cases. The key is to present the cluster’s characteristics in a clear, succinct, and extractable way.
Use a validation loop to maintain global consistency. Across-label validity is a real risk with LLM-generated labels. The iterative validation loop (with the three checks) is a practical way to enforce consistency and quality across a whole set of clusters.
Expect some variability. Language models are probabilistic. Running multiple iterations or multiple seeds helps you understand the stability of the labels and quantify how much prompt design influences results (label-shift gives you a statistical handle on this).
Balance cost and quality. The study shows that newer, larger models can produce more consistent results, but cheaper models (like older ChatGPT variants) can still be effective, especially when you pair them with solid prompts and validation. If cost is a concern, you may be pleasantly surprised by a well-tuned cheaper model.
Plan for scale and future improvements. The authors acknowledge that their study looks at a subset of model types and fields. As you scale up, you’ll want to test across different datasets, clustering schemes, and hierarchies (if you have nested clusters).
A few behind-the-scenes insights that are worth knowing
The abstraction in descriptive labels is both a strength and a challenge. They’re meant to summarize, not to reproduce exact terms, which helps readability but makes direct benchmarking against characteristic labels harder. The evaluation framework in the study is designed to account for this.
Descriptive labels aren’t meant to replace human curation entirely. They can speed up labeling at scale and make clusters more understandable, but for high-stakes bibliometric analyses you may still want human-in-the-loop checks for the final labeling.
The authors’ framework is modular. If you’re building your own workflow, you can swap in different clustering methods, different feature extractors, or different LLMs without changing the core logic of the labeling and validation steps.
Limitations and scope
The study’s scale is substantial but not exhaustive. They tested a subset of models and prompts and looked at clusters within four fields. Real-world deployments with much larger scales and more diverse datasets may reveal additional challenges.
The quality of input clusters matters. If clusters are poorly formed (low topical cohesion or highly imbalanced sizes), the downstream labeling quality will suffer.
They focused on clusters from a single discipline at a time (though four fields were sampled) rather than cross-disciplinary clusters. Cross-disciplinary labeling could introduce new vocabulary and require adjustments to prompts and validation criteria.
While they showed promising results for descriptive labels, ongoing work is needed to refine evaluation methods, especially for more granular or hierarchical cluster structures.
Concluding thought
Descriptive labeling with language models is more than a fancy gimmick; it’s a practical way to bring clarity to large collections of scientific literature. By formalizing what “descriptive” means, building a repeatable workflow, and setting up robust evaluation, the study provides a solid foundation for researchers and information professionals who want readable, interpretable labels without sacrificing the precision needed to navigate complex topic spaces.
If you’re curious about incorporating these ideas into your own bibliometric project, start with a clear set of cluster features (concepts, venues, and representative documents), craft a concise, informative prompt, and put a validation loop in place to keep labels unique and meaningful. With careful design, descriptive labels can make science more accessible, helping researchers and curious readers alike find the right cluster at a glance.
Key Takeaways
Descriptive vs. characteristic labels: Descriptive labels aim for human-readable summaries that convey the cluster’s topic, while characteristic labels pull from the documents’ own jargon. Descriptive labels are often more legible and accessible, with comparable effectiveness in distinguishing clusters.
A practical labeling workflow: Cluster papers, extract three feature types (concepts, venues, and top papers), convert features into a prompt, generate a label with an LLM, and validate the label through a loop that checks length, duplication, and specificity.
Validation is essential: Across-label validity prevents duplicate or vague labels. The validation process uses iterative prompting and embedding-based checks to ensure labels stay accurate and distinct.
Model choice and prompts matter: System prompts can help steer the model, and including all three feature types (concepts, papers, journals) generally yields more robust labels. Differences among models can influence label quality and consistency.
Evaluation framework: Label-shift and first-pass metrics help compare different labeling designs, while human annotation confirms whether descriptive labels can uniquely identify clusters as well as traditional labels.
Real-world impact: Readable, interpretable cluster labels enhance bibliometric dashboards, aid non-experts in navigating literature, and support scalable labeling across large document corpora.