Last updated 9 May 2026

What Are LLM Tokens? The Complete 2026 Guide

Tokens are the units large language models use to process text, images, audio, video and tool payloads — and the way they shape context limits, pricing and latency has changed materially since 2023. Here is the up-to-date 2026 mental model, with a side-by-side look at GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro and DeepSeek-V4.

TL;DR — LLM tokens in 2026

  • A token is the model’s context-and-billing unit, usually a subword fragment for text but increasingly applied to images, audio, video, files and tool schemas.
  • The classic rule still works for plain English: ~4 characters or ~0.75 words per token. It is a heuristic, not a guarantee.
  • Frontier context windows are now ~1M tokens across GPT-5.5, Claude 4.7 / Sonnet 4.6, Gemini 3.1 Pro and DeepSeek-V4.
  • Pricing is now multi-bucket: input, cached input, output, and reasoning / thinking tokens are billed separately on most APIs.
  • Tokenizer versions matter. Anthropic says Claude Opus 4.7 may use up to 35% more tokens on the same text than older models.
  • Multimodal counts vary by provider: Gemini bills video at 263 tokens/sec and audio at 32 tokens/sec; Claude images run up to ~4,784 tokens on Opus 4.7.
  • For accuracy, use provider-native counters. Local estimators miss tools, images, files and reasoning overhead.

This is the 2026 refresh of our original explainer. The earlier beginner-friendly version is still available: Demystifying Tokens (2023 edition).

What an LLM token actually is in 2026

A token is best understood as a model-readable unit drawn from a vocabulary of recurring fragments. In practice a token might be a whole short word, part of a longer word, punctuation with surrounding whitespace, or a byte-derived fragment. OpenAI’s docs and the official tiktoken reference both emphasise that tokens are commonly occurring character sequences, not words in the human sense — a leading space, for example, is often attached to the following word, which is exactly why raw word counts and token counts diverge as soon as you measure them.

The familiar beginner rule still holds for plain English: OpenAI documents roughly 4 characters or 0.75 words per token, and Google documents roughly 4 characters and 60–80 English words per 100 tokens for Gemini. But both providers treat that as an approximation. DeepSeek goes further by quantifying language variance: its API docs estimate ~0.3 tokens per English character and ~0.6 tokens per Chinese character. Token counts depend on the tokenizer, the language, and the exact byte / character patterns in the input.
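
If you only need a planning-grade figure, those heuristics are easy to encode. The sketch below turns the ratios quoted above into a rough estimator; the function names and the averaging step are our own choices, and real counts from a provider tokenizer will differ.

```python
# Rough token estimators built from the heuristics quoted above.
# Planning aids only: real counts depend on the tokenizer version.

def estimate_tokens_english(text: str) -> float:
    """OpenAI's rule of thumb: ~4 characters, or ~0.75 words, per token."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return (by_chars + by_words) / 2  # average the two heuristics (our choice)

def estimate_tokens_deepseek(text: str, chinese: bool = False) -> float:
    """DeepSeek's documented ratios: ~0.3 tokens per English character,
    ~0.6 tokens per Chinese character."""
    return len(text) * (0.6 if chinese else 0.3)

print(estimate_tokens_english("Tokens are the units large language models use to process text."))
```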

The most important conceptual upgrade for a 2026 audience is this: a token is now better described as the model’s fundamental accounting unit for context and billing, not merely its basic text “building block”. That broader definition matches how modern APIs expose usage metadata and pricing across text, non-text modalities, reasoning, and caches.

2026 working definition: a token is the model’s context-and-billing unit for representing and processing inputs and outputs — usually via subword pieces for text, but increasingly across multimodal content and structured agent workflows as well.

How tokenization works under the hood

Modern tokenization is, in practice, a subword problem. Whole-word vocabularies do a poor job with rare words, misspellings, morphology, names and multilingual text. Sennrich, Haddow and Birch’s 2015 BPE paper framed subwords as an open-vocabulary fix for translation: rare or unknown words can be encoded as sequences of smaller units instead of collapsing to <unk>.

Three families dominate production tokenizers:

  • BPE (Byte Pair Encoding) — merges frequent symbol pairs into larger units. This is what powers OpenAI’s tokenizer and many open models.
  • WordPiece — uses a longest-match-first strategy over a learned vocabulary. Famous from BERT.
  • SentencePiece — trains subword models directly from raw sentences without assuming whitespace-delimited words. Critical for multilingual systems because it does not hard-code English-style preprocessing.

OpenAI’s tiktoken library is a useful production reference because its docs make the engineering tradeoffs explicit: BPE is reversible and lossless, works on arbitrary text, and compresses input so the token sequence is shorter than the original byte sequence — averaging about 4 bytes per token in practice. That is closer to how deployed LLM systems actually behave than “one token equals one word”.
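
You can see those properties directly with tiktoken. The snippet below uses the publicly available o200k_base encoding as a stand-in; the tokenizer actually paired with GPT-5.5 may differ, so treat the counts as illustrative rather than billing-grade.

```python
# Inspecting BPE behaviour with OpenAI's tiktoken library. o200k_base is a
# published encoding for recent OpenAI models; the tokenizer paired with
# GPT-5.5 may differ, so these counts are illustrative, not billing-grade.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

text = "Tokenization splits text into subword pieces."
token_ids = enc.encode(text)
pieces = [enc.decode([tid]) for tid in token_ids]

print(token_ids)   # the integer IDs a model actually consumes
print(pieces)      # note the leading spaces attached to following words
print(len(text.encode("utf-8")) / len(token_ids))  # roughly 4 bytes per token for English prose

assert enc.decode(token_ids) == text  # BPE here is reversible and lossless
```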

OpenAI tokenizer breaking text into colour-coded subword pieces
Tokenization in action — visualised with OpenAI’s tokenizer.

How models use tokens after tokenization

After tokenization the model does not “see words”. It sees integer token IDs. Those IDs are mapped into vectors, positional information is added so token order matters, and transformer layers contextualise each token relative to the others through attention. This is the bridge between a string in your prompt and the internal numerical state the model actually manipulates. The original transformer paper introduced positional encodings precisely because attention on its own does not encode order.

For decoder-style chat models, generation is autoregressive: the model predicts one token, appends it to the sequence, then predicts the next, repeating until it hits a stop condition or output limit. That iterative next-token loop is why token budgets matter twice — once for the input context, and again for the output the model is allowed to generate.
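
As a mental model, the loop looks roughly like the sketch below. The predict_next_token stub is purely hypothetical (a real model runs a transformer forward pass and samples from a probability distribution over the vocabulary), but the budget structure, an input context plus a capped output loop, is the part worth internalising.

```python
# Conceptual sketch of the autoregressive loop. `predict_next_token` is a
# hypothetical stub standing in for a full transformer forward pass; a real
# model returns a distribution over the vocabulary and samples from it.

def predict_next_token(token_ids: list[int]) -> int:
    # Placeholder only: real models attend over all of token_ids here.
    return sum(token_ids) % 50_000

def generate(prompt_ids: list[int], max_output_tokens: int, stop_id: int) -> list[int]:
    context = list(prompt_ids)          # input tokens spend the first budget
    generated: list[int] = []
    for _ in range(max_output_tokens):  # output tokens spend the second budget
        next_id = predict_next_token(context)
        if next_id == stop_id:
            break
        context.append(next_id)         # each new token re-enters the context
        generated.append(next_id)
    return generated
```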

Long context is also why so much frontier engineering is now about tokens. Million-token windows, prompt caching, context compression and sparse-attention claims are all responses to the cost of storing, attending over and serving very long token sequences. “How many tokens?” is still one of the most practical questions in LLM engineering, not just a beginner curiosity.

A second example of subword tokenization with token IDs
Each token gets an integer ID before any model layer touches it.

What changed in current frontier models

OpenAI — GPT-5.5

GPT-5.5 is listed with a 1,050,000-token context window and 128,000 max output tokens. Standard API pricing is $5 / M input tokens, $0.50 / M cached input tokens, and $30 / M output tokens. OpenAI now provides an exact input-token counting endpoint for the same payload format used in the Responses API, and the docs explicitly recommend it over character-based estimates whenever images, files, tools or model-specific behaviour are involved. OpenAI’s prompt caching is automatic for eligible long prompts, kicks in from 1,024 tokens upward, and the docs say it can reduce latency by up to 80% and input-token cost by up to 90%. Reasoning models use internal reasoning tokens, discard them from later turns, and bill them as output tokens; the GPT-5.5 migration guide says the model reaches strong results with fewer reasoning tokens than prior generations.
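
To make those buckets concrete, here is a back-of-the-envelope cost function using the listed GPT-5.5 rates. The split between cached and uncached input and the billing of reasoning tokens at the output rate follow the description above; the example token counts are invented for illustration.

```python
# Back-of-the-envelope GPT-5.5 request cost using the listed rates:
# $5 / M input, $0.50 / M cached input, $30 / M output, with reasoning
# tokens billed at the output rate. Example token counts are invented.

def gpt55_cost(uncached_input: int, cached_input: int, output: int, reasoning: int) -> float:
    return (
        uncached_input * 5.00 / 1e6
        + cached_input * 0.50 / 1e6
        + (output + reasoning) * 30.00 / 1e6  # reasoning billed as output
    )

# 200k-token prompt with a 150k-token cached prefix, 2k visible output tokens,
# 8k reasoning tokens:
print(gpt55_cost(50_000, 150_000, 2_000, 8_000))  # ≈ 0.625 dollars
```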

Anthropic — Claude Opus 4.7 & Claude Sonnet 4.6

Claude Opus 4.7 and Claude Sonnet 4.6 both offer 1M-token context windows; Opus 4.7 has a 128k max output, Sonnet 4.6 has 64k. Anthropic prices Opus 4.7 at $5 / M input and $25 / M output, and Sonnet 4.6 at $3 / M input and $15 / M output. Anthropic’s most important token-specific 2026 change is unusually explicit: the company says Opus 4.7 uses a new tokenizer that may consume up to 35% more tokens for the same fixed text. That is a concrete example of why a token-focused blog in 2026 should treat token counts as model-version-specific rather than timeless constants.

Anthropic’s docs are also unusually revealing about token overhead outside plain text:

  • The token-counting API supports tools, images and PDFs.
  • Counts may include system-added optimisation tokens that are not billed.
  • Tool use carries documented system-prompt overhead.
  • Image-token heuristic: roughly width × height / 750.
  • Most Claude models cap image token usage at around 1,568 tokens per image; Opus 4.7 is the first Claude model with high-resolution image support, up to roughly 4,784 tokens per image (~3× prior models). See the sketch after this list.
  • Thinking is token-budgeted: current thinking tokens count toward output usage, while previous thinking blocks are stripped from future context.
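
Those image rules are simple enough to sanity-check in code. The sketch below applies the documented width × height / 750 heuristic with the per-generation caps; the function name and the rounding are our own simplifications, and Anthropic may downscale large images before counting.

```python
# Hedged sketch of Anthropic's documented image heuristic: roughly
# width × height / 750 tokens, capped per model generation. The cap values
# come from the figures above; rounding and any server-side downscaling
# are our simplifications.

def claude_image_tokens(width_px: int, height_px: int, opus_4_7: bool = False) -> int:
    estimate = width_px * height_px / 750
    cap = 4_784 if opus_4_7 else 1_568  # Opus 4.7 high-res vs. most Claude models
    return min(round(estimate), cap)

print(claude_image_tokens(1024, 1024))                  # ≈ 1,398 tokens
print(claude_image_tokens(2048, 2048, opus_4_7=True))   # capped at 4,784 tokens
```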

Google — Gemini 3.1 Pro Preview & Gemini 3 Flash Preview

Gemini 3.1 Pro Preview and Gemini 3 Flash Preview both expose 1,048,576 input tokens and 65,536 output tokens. Gemini 3.1 Pro Preview is priced at $2 / M input and $12 / M output for prompts under 200k tokens, rising to $4 / M input and $18 / M output above that threshold. Gemini 3 Flash Preview is listed at $0.50 / M input and $3 / M output.

Two Google design choices are worth flagging:

  1. Gemini 3 models use dynamic thinking by default, and output pricing explicitly includes thinking tokens.
  2. Gemini usage metadata now surfaces multiple token categories: input, output, thought, cached content, and tool-use tokens. Token usage on Gemini is a structured accounting object, not a single integer.

Google is also one of the clearest providers on multimodal tokenization. Per its docs, images at or below 384 px on both dimensions count as 258 tokens, larger images are tiled into 768×768 patches at 258 tokens each, video is counted at 263 tokens per second, and audio at 32 tokens per second. Caching now includes implicit and explicit modes, with explicit caches defaulting to a one-hour TTL. Note that Google retired Gemini 3 Pro Preview on 9 March 2026 in favour of Gemini 3.1 Pro Preview, so use the newer naming when you write integrations.
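
Here is that arithmetic as a short sketch, using the documented per-tile and per-second rates. The tiling is deliberately naive (real preprocessing may scale or crop before counting), so treat it as an estimator rather than the provider's exact algorithm.

```python
# Sketch of Gemini's documented multimodal accounting: 258 tokens for images
# at or below 384 px on both sides, 258 tokens per 768×768 tile for larger
# images, 263 tokens/second of video, 32 tokens/second of audio. The tiling
# here is naive; real preprocessing (scaling, cropping) may change tile counts.
import math

def gemini_image_tokens(width_px: int, height_px: int) -> int:
    if width_px <= 384 and height_px <= 384:
        return 258
    tiles = math.ceil(width_px / 768) * math.ceil(height_px / 768)
    return tiles * 258

def gemini_av_tokens(video_seconds: float = 0.0, audio_seconds: float = 0.0) -> int:
    return round(video_seconds * 263 + audio_seconds * 32)

print(gemini_image_tokens(1920, 1080))      # 6 tiles → 1,548 tokens
print(gemini_av_tokens(video_seconds=60))   # 15,780 tokens for a minute of video
```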

DeepSeek — DeepSeek-V4-Pro & DeepSeek-V4-Flash

DeepSeek-V4-Pro and DeepSeek-V4-Flash both list a 1M-token context window and up to 384k output tokens — notably larger on the output side than the mainstream OpenAI, Anthropic or Google figures. DeepSeek states that thinking mode is supported on both and is the default. Context caching is enabled by default, uses overlapping-prefix cache hits, and the pricing page separates cache-hit and cache-miss input rates aggressively. Legacy names deepseek-chat and deepseek-reasoner are being phased out and currently map to the non-thinking and thinking modes of V4 Flash until 24 July 2026.

2026 token-economics comparison table

Quick side-by-side of the four major frontier API families. Pricing is per million tokens unless noted, snapshot taken 9 May 2026 — always re-check the provider docs before going live.

Frontier LLM token economics — May 2026 snapshot
Model | Context window | Max output | Input price | Cached input | Output price | Reasoning / thinking
GPT-5.5 (OpenAI) | 1,050,000 | 128,000 | $5 / M | $0.50 / M | $30 / M | Reasoning tokens billed as output; auto-discarded from later turns
Claude Opus 4.7 (Anthropic) | 1,000,000 | 128,000 | $5 / M | Per Anthropic caching docs | $25 / M | Thinking tokens count toward output; previous blocks stripped
Claude Sonnet 4.6 (Anthropic) | 1,000,000 | 64,000 | $3 / M | Per Anthropic caching docs | $15 / M | Thinking supported; lighter compute footprint
Gemini 3.1 Pro Preview (Google) | 1,048,576 | 65,536 | $2 / M (<200k) · $4 / M (≥200k) | Implicit + explicit (1h TTL default) | $12 / M (<200k) · $18 / M (≥200k) | Dynamic thinking by default; thought tokens included in output price
Gemini 3 Flash Preview (Google) | 1,048,576 | 65,536 | $0.50 / M | Implicit + explicit (1h TTL default) | $3 / M | Dynamic thinking by default
DeepSeek-V4-Pro | 1,000,000 | 384,000 | Cache-hit / cache-miss split | Default-on overlapping-prefix cache | Cache-hit / cache-miss split | Thinking is the default mode
DeepSeek-V4-Flash | 1,000,000 | 384,000 | Cache-hit / cache-miss split | Default-on overlapping-prefix cache | Cache-hit / cache-miss split | Maps to legacy deepseek-chat / deepseek-reasoner until 24 July 2026

Numbers compiled from each provider’s official documentation as of 9 May 2026. See the Sources section for direct links.
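
To get a feel for what those rates mean per request, here is a toy comparison at one fixed workload using the snapshot prices above. DeepSeek is omitted because its cache-split rates are not listed as flat numbers, and Gemini's sub-200k tier and all caching discounts are ignored for simplicity.

```python
# Toy comparison at one fixed workload (400k uncached input tokens, 20k output
# tokens), using the snapshot prices in the table above. Gemini's sub-200k
# tier and all caching discounts are ignored; DeepSeek is omitted because its
# cache-split rates are not listed as flat numbers here.

PRICES_PER_M = {  # (input $, output $) per million tokens, 9 May 2026 snapshot
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 3.1 Pro Preview (>200k tier)": (4.00, 18.00),
    "Gemini 3 Flash Preview": (0.50, 3.00),
}

def request_cost(input_tokens: int, output_tokens: int, rates: tuple[float, float]) -> float:
    input_rate, output_rate = rates
    return input_tokens * input_rate / 1e6 + output_tokens * output_rate / 1e6

for model, rates in PRICES_PER_M.items():
    print(f"{model}: ${request_cost(400_000, 20_000, rates):.2f}")
```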

Why token counts now vary more than they used to

The biggest reason token counts feel slippery in 2026 is that providers now change token behaviour at the model-version level. Anthropic says so directly for Opus 4.7 (up to 35% more tokens on the same text). OpenAI’s token-counting docs warn that model-specific behaviour can affect counts, including reasoning and caching. Google’s Gemini 3.1 Pro page even markets “improved token efficiency” — a reminder that providers are now optimising not only model quality but also how much internal or external token budget they need to finish a task.

A second reason is payload structure. OpenAI’s token-counting docs say local tokenizers are inadequate for images, files, tools and schemas. Anthropic’s docs show a simple “Hello, Claude” example at 14 input tokens, but a request with a modest weather tool schema at 403 tokens. Anthropic’s pricing docs also document tool-use system-prompt overhead, and OpenAI’s realtime docs note that “special tokens aside from the content of a message” can make a 10-token message count as 12. Developers who only count user-visible characters can still miss costs by a large margin in agentic or tool-heavy systems.

A third reason is multilingual and multimodal variance. SentencePiece was specifically designed to support language-independent subword training from raw sentences. DeepSeek quantifies how English and Chinese can produce different token-to-character ratios. Google extends the idea beyond text with token-per-image-tile, token-per-second-audio and token-per-second-video rules. Once you combine language with modalities and tool schemas, “1 token ≈ 4 characters” becomes what it always should have been: a rough English-prose heuristic, not a general law of LLM usage.

Strategies for managing tokens in 2026

The classic 2023 advice still applies, but it now needs a few 2026 upgrades:

  • Cache aggressively. Move stable instructions, retrieval boilerplate and tool schemas to the very start of your prompt so prompt caching (OpenAI, Anthropic, Google, DeepSeek) can hit them; the layout sketch after this list shows the idea.
  • Treat reasoning tokens as a budget. On GPT-5.5, Claude 4.7 and Gemini 3, you pay for thinking. Expose a knob for “thinking effort” rather than letting it default for every call.
  • Prefer one big in-context document over many roundtrips on million-token models — it is often cheaper than orchestrating small calls.
  • Strip prior thinking blocks from history if your provider does not auto-strip them.
  • Tile or downscale images deliberately. Knowing Gemini bills 258 tokens per 768×768 patch and Opus 4.7 can spend ~4,784 tokens on a single high-res image changes how you preprocess inputs.
  • Always measure with the provider’s native counter before you ship pricing models or quotas.
  • Pin tokenizer versions in tests. A vendor swap or a new tokenizer rev can move your bill 20–35% overnight.
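
On the caching point, the key discipline is keeping a byte-identical stable prefix. The sketch below shows the layout idea with a generic chat-style payload; the field names are illustrative, not any specific provider’s API.

```python
# Cache-friendly prompt layout, assuming a generic chat-style payload. The
# field names are illustrative, not any specific provider's API. The point is
# that the stable prefix stays byte-identical across calls so prefix caching
# can hit it; only the trailing user turn changes.

LONG_SYSTEM_INSTRUCTIONS = "..."  # hypothetical: your stable system prompt
TOOL_SCHEMA_BOILERPLATE = "..."   # hypothetical: serialized tool schemas

STABLE_PREFIX = [
    {"role": "system", "content": LONG_SYSTEM_INSTRUCTIONS + TOOL_SCHEMA_BOILERPLATE},
]

def build_messages(user_query: str) -> list[dict]:
    # Volatile content goes last; interleaving it with the stable prefix
    # shrinks the shared prefix and hurts cache hit rates.
    return STABLE_PREFIX + [{"role": "user", "content": user_query}]
```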

How to count tokens accurately

The most defensible 2026 recommendation is to skip generic estimators and use provider-native counting tools:

  • OpenAI: an exact input-token counting endpoint that accepts the same payload format as the Responses API.
  • Anthropic: a token-counting API that supports tools, images and PDFs.
  • Google: Gemini usage metadata, which breaks token usage into input, output, thought, cached-content and tool-use buckets.
  • DeepSeek: a downloadable tokenizer package for offline estimates.

Estimate locally if you want; measure with provider-native counters if accuracy matters. That is the production-era message that most distinguishes 2026 from 2023.
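
In code, the difference looks something like this: a quick local figure from tiktoken next to the provider’s own number from the Anthropic SDK’s count_tokens call, which per Anthropic’s docs accounts for tool and system overhead. The model id is a placeholder for whatever current model you target.

```python
# Local estimate vs. provider-native count. The local figure uses tiktoken's
# o200k_base encoding, which may not match the serving model's tokenizer; the
# native figure uses the Anthropic SDK's count_tokens call. The model id below
# is a placeholder; check the current model list before relying on it.
import tiktoken
import anthropic

text = "Summarise the attached contract in three bullet points."

local_estimate = len(tiktoken.get_encoding("o200k_base").encode(text))

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
native = client.messages.count_tokens(
    model="claude-opus-4-7",  # hypothetical model id for illustration
    messages=[{"role": "user", "content": text}],
)

print("local estimate:", local_estimate)
print("provider count:", native.input_tokens)  # includes system/tool overhead
```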

Frequently asked questions

What is an LLM token?

An LLM token is the basic unit a language model uses to read and generate text. It is usually a subword fragment, a short word, or punctuation drawn from the model’s vocabulary. In English prose, ~1 token is roughly 4 characters or 0.75 words, but the exact count depends on the tokenizer, the language and the content type.

How many tokens are in 1,000 words?

For typical English prose, 1,000 words is roughly 1,300 tokens because OpenAI documents about 0.75 words per token. The figure can be higher for code, non-English languages, structured data or anything heavy on punctuation and whitespace.

How are images, audio and video counted as tokens?

Each provider counts multimodal inputs differently. Google’s Gemini docs count images at 384 px or below as 258 tokens and tile larger images into 768×768 patches at 258 tokens each, video at 263 tokens / second and audio at 32 tokens / second. Anthropic uses a width × height / 750 heuristic for images, capping most Claude models around 1,568 tokens per image, with Opus 4.7 going up to roughly 4,784 tokens per image.

What are reasoning or thinking tokens?

Reasoning tokens (also called thinking tokens) are model-internal tokens that frontier models like GPT-5.5, Claude 4.7 and Gemini 3 generate while they “think” before producing a final answer. They are billed as output tokens and they consume your output budget, but providers typically discard them from later turns so they do not eat into future context.

Does prompt caching reduce token costs?

Yes. OpenAI’s prompt caching activates automatically for prompts of 1,024 tokens or longer and can cut input-token cost by up to 90% and latency by up to 80%. Google offers implicit and explicit caching on Gemini 2.5+ with a one-hour default TTL on explicit caches. DeepSeek enables overlapping-prefix context caching by default with separate cache-hit and cache-miss input rates.

How big are context windows in 2026?

In 2026 most frontier APIs offer roughly 1 million input tokens. GPT-5.5 supports 1,050,000 input and 128,000 output tokens. Claude Opus 4.7 and Sonnet 4.6 both support 1M input with 128k and 64k output respectively. Gemini 3.1 Pro Preview and Gemini 3 Flash Preview support 1,048,576 input and 65,536 output. DeepSeek-V4-Pro and V4-Flash support 1M input and up to 384,000 output tokens.

Why do token counts vary between models?

Token counts depend on the tokenizer, the language, content type and any tools, files or system overhead the API adds. Anthropic explicitly says Claude Opus 4.7 uses a new tokenizer that may consume up to 35% more tokens for the same fixed text. Tool schemas, system prompts and special tokens can also add invisible token cost on top of your visible message.

How do I count tokens accurately?

Use the provider’s own counter rather than a generic estimator. OpenAI exposes an exact input-token endpoint matching the Responses API. Anthropic has a token-counting API that supports tools, images and PDFs. Google’s Gemini API surfaces input, output, thought, cached-content and tool-use tokens in usage metadata. DeepSeek ships a downloadable tokenizer package.

Sources

Verification note: this article cites provider documentation as of 9 May 2026. The space changes fast — Google retired Gemini 3 Pro Preview on 9 March 2026 and DeepSeek’s V4 Pro discount runs only until 31 May 2026 UTC — so always re-check the provider docs for current model names, windows and prices before going to production.
