Word Count vs Tokens: The Hidden Gap Between AI and Language Services
Large Language Models (LLMs) have disrupted the world of language services. If we have such powerful AI tools by our side, can’t they do everything through prompting?
The answer to this question is not straightforward. Human translators work with words and word counts. LLMs work with tokens. This word count vs tokens divide can cause misunderstandings between translation providers and their clients.
In the first article of this series, we saw how language is evolving through word drift. This time, we explain the tokens vs word count difference. Knowing it can make your translation operations more efficient and less prone to mistakes.
Two Different Ways of Perceiving Language
For centuries (an audacious translator might even say millennia), translators have worked with words. We priced our translation work by word and character counts, and our workflows revolved around them. The translation industry was founded on the word as a unit of meaning. A human translator can recognise, translate, review, and approve words.
LLMs base their translation on tokens. A token isn’t necessarily a word: one word can be split into several tokens.
Also, relying on tokenization without understanding how it works can lead to incorrect or even dangerous translation output. Take, for instance, the aforementioned word drift, a situation in which a lexeme or expression changes its meaning due to mistranslation. Before LLMs came to power, this lexical drift typically happened in a cultural context, over the course of time.
We have the forthcoming World Cup in the USA, Mexico, and Canada. A good part of the world will use the word ‘striker’ many times in the time ahead. What originally denoted someone who strikes or hits anything has gradually acquired a much narrower reference to an attacker in a football lineup.
False friends are also a good example of shifted meaning: embarrassed in English and embarazada in Spanish (meaning pregnant). Not every such mismatch is dangerous. If an LLM translates football as ‘soccer’ (the US term) for readers in the rest of the world, who say ‘football’, it shouldn’t lead to any fatal outcomes.
But if language loops and literal translations occur in the field of medical device translation, the consequences can be far more serious. Confusing embarrassed with embarazada in a medical translation could misstate a patient’s condition.
That’s why understanding what word tokens are and how LLMs use them is important for every translation service provider.
Word Count vs Tokens: Meaning vs Structure
Where we see meaning and its potential layers, LLMs see structure that needs to be followed to break text into meaningful units, only to put them together in a new structure in another language. This new structure is called the translation.
But how do tokens really work, and how many tokens does a word usually take?
Tokens often follow word morphology, although the exact splits depend on the tokenizer each model uses. One word can equal one token, but it doesn’t have to. For instance, the verb ‘believe’ may be one token. Now things get complicated: ‘believable’ is still one word, but it can be two tokens. An LLM might divide this adjective into ‘believ’ and ‘able’. And if we say ‘unbelievable’, an AI-powered translation tool might break it down into three tokens: ‘un’, ‘believ’, and ‘able’. So, morphology plays a vital role in tokenization.
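To make the splitting concrete, here is a minimal Python sketch of greedy longest-match subword tokenization. The vocabulary is hypothetical and hand-picked to reproduce the splits described above. Real LLM tokenizers learn their vocabularies from data, so their actual splits may differ.

```python
# Hypothetical, hand-picked vocabulary for illustration only.
# Real tokenizers learn tens of thousands of subword pieces from data.
TOY_VOCAB = {"believe", "believ", "able", "un"}


def tokenize(word, vocab=TOY_VOCAB):
    """Split a word into the longest known vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


for word in ("believe", "believable", "unbelievable"):
    print(word, "->", tokenize(word))
```

With this toy vocabulary, the three words come out as one, two, and three pieces respectively. One word on the invoice can therefore mean several units for the machine.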
And this is only one example in the English language. Typically, other writing systems, such as Chinese characters or the Cyrillic alphabet, take more tokens per word than English.
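One contributing factor, assuming a tokenizer that builds its tokens up from UTF-8 bytes (many LLM tokenizers do, though not all), is that non-Latin characters take more bytes than English letters. The sketch below only counts bytes. It is not a tokenizer, and the sample words are our own illustration.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes used per character of text."""
    return len(text.encode("utf-8")) / len(text)


# The word "translation" in three writing systems (illustrative samples).
samples = {
    "English": "translation",
    "Cyrillic (Russian)": "перевод",
    "Chinese": "翻译",
}

for script, word in samples.items():
    print(f"{script}: {word} uses {utf8_bytes_per_char(word):.1f} bytes per character")
```

More bytes per character does not translate one-to-one into more tokens, but it helps explain why the same content can cost more tokens outside English.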
So, in human translation services, we read between the lines, scan the text for the context, and adapt the terminology to certain cultural or regulatory environments. For instance, life sciences translation services require that the language service provider work in compliance with various domestic and international standards and legal regulations.
AI tools can attempt that, but they lack a proper understanding of the essential details that make a translation compliant with legal and regulatory requirements.
However, the gap between human translators’ expertise and AI’s technical translation is getting narrower.
Tokens vs Word Count in Practical Matters
When using AI for translation, you paste your text into an AI tool and get the output in a matter of seconds. It’s straightforward and efficient.
Still, when you try to plug AI into real-world translation workflows, certain mismatches and misunderstandings start to happen.
Example 1: Pricing and Expectations
Until a few years ago, a client would typically ask: “How many words?”
The word count defined the budget, the deadlines, and the expectations.
Today, clients might still ask about the word count, but the AI tools used in translation operations work with tokens. This means that the basic unit of the technology and the basic unit of the industry don’t fully match.
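A rough way to reconcile the two units is to convert a word count into an estimated token count. The sketch below assumes a hypothetical tokens-per-word ratio and a hypothetical price per 1,000 tokens. Both vary by language, tokenizer, and provider, so the figures are placeholders, not real rates.

```python
import math


def estimate_tokens(word_count, tokens_per_word):
    """Convert a word count into an estimated token count, rounded up."""
    return math.ceil(word_count * tokens_per_word)


def estimate_cost(token_count, price_per_1k_tokens):
    """Price a token count at a per-1,000-token rate."""
    return token_count / 1000 * price_per_1k_tokens


# Placeholder figures, not real rates or measured ratios.
WORDS = 1000
TOKENS_PER_WORD = 1.5        # assumed; differs by language and tokenizer
PRICE_PER_1K_TOKENS = 0.02   # assumed; differs by provider

tokens = estimate_tokens(WORDS, TOKENS_PER_WORD)
cost = estimate_cost(tokens, PRICE_PER_1K_TOKENS)
print(f"{WORDS} words is roughly {tokens} tokens, estimated cost {cost:.2f}")
```

An estimate like this is only a starting point for a conversation with a client. The actual token count is known only once a specific tokenizer has processed the specific text.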
Example 2: Segmentation
Take a sentence we often see: “The system must comply with all applicable regulations.”
In a TMS, it is a single segment and one unit of responsibility:
- One translation
- One review
- One approved version.
There’s a certified translator or a team of translators responsible for delivering such a translation.
LLMs don’t process such a sentence in the same way. For an AI translation solution, this is still only a sequence of tokens. That’s why it’s more difficult to track, reuse, and control a shared AI translation.
Example 3: Consistency
In traditional translation workflows, the same segment typically leads to the same translation. Such consistency is especially important in heavily regulated industries, such as the aforementioned life sciences or technology and manufacturing. In other words, wherever safety is paramount, translation consistency and accuracy mustn’t be compromised.
In AI workflows, however, the same input may lead to similar, but not identical, output. In creative, more flexible contexts, such as marketing or game localization services, this can be inconvenient, but not fatal. In highly regulated or documentation-heavy contexts, such variation can lead to serious legal issues, let alone harmful outcomes in the field.
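The variation comes from how output is generated: a model assigns probabilities to candidate continuations and typically samples from them. The sketch below uses made-up candidate phrasings and made-up probabilities to contrast sampling with always picking the most likely option. It simulates the idea and calls no real model.

```python
import random

# Made-up candidate phrasings with made-up probabilities (illustration only).
CANDIDATES = {
    "must comply with": 0.6,
    "shall comply with": 0.3,
    "has to follow": 0.1,
}


def pick_greedy(candidates):
    """Always return the most probable candidate: same input, same output."""
    return max(candidates, key=candidates.get)


def pick_sampled(candidates, rng):
    """Draw a candidate at random, weighted by its probability."""
    options = list(candidates)
    weights = list(candidates.values())
    return rng.choices(options, weights=weights, k=1)[0]


rng = random.Random()
print("greedy :", [pick_greedy(CANDIDATES) for _ in range(5)])
print("sampled:", [pick_sampled(CANDIDATES, rng) for _ in range(5)])
```

The greedy picks are always identical, while the sampled picks can differ from run to run. This is why regulated workflows usually pair AI output with approved terminology and human review.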
Example 4: Traceability
Translators and translation managers must make sure that in every translation, all the stakeholders know what has been changed, where the changes have been implemented, and why they have been applied in the first place.
Every modern TMS, based on words and segments, makes this traceability possible. In token-based systems, such transparency isn’t built in, which can lead to discrepancies.
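As a minimal sketch of the what, where, and why described above, a change record for one segment might look like this. The field names and sample values are hypothetical and are not taken from any particular TMS.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SegmentChange:
    """One traceable change to one segment (hypothetical structure)."""
    segment_id: str   # where the change was made
    old_text: str     # what was there before
    new_text: str     # what it was changed to
    reason: str       # why the change was applied
    changed_by: str   # who is responsible for it
    changed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def history_for(changes, segment_id):
    """Return all recorded changes for one segment, oldest first."""
    matching = [c for c in changes if c.segment_id == segment_id]
    return sorted(matching, key=lambda c: c.changed_at)


log = [
    SegmentChange(
        segment_id="seg-001",
        old_text="The system has to follow all applicable regulations.",
        new_text="The system must comply with all applicable regulations.",
        reason="Align with approved terminology",
        changed_by="reviewer-1",
    )
]

for change in history_for(log, "seg-001"):
    print(change.segment_id, "|", change.reason, "|", change.changed_by)
```

Raw AI output carries no record of this kind on its own, so it has to be attached to the output by the workflow around the model.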
Where Structure Becomes Critical
Translation management systems are still here because they ensure one critical factor: structure. LLMs and these systems aren’t mutually exclusive, even though they operate on different engines.
Translation must be consistent; terminology must be constantly controlled; and changes must be traceable. Retaining a certain operational and translation structure is essential. Again, highly regulated fields of work, as highlighted above, require a deeper awareness of the cost of error. This awareness translates (pun intended) into a properly edited, proofread, and adapted output. Localisation also plays an important part in those specific niches.
All these features pose a demanding challenge: how to make AI translation solutions complement TMSs rather than function as two separate units. Bridging the gap between these fields will reduce the word count vs tokens difference.
Also, international frameworks such as ISO 18587 (post-editing of machine translation output) and ISO 13485 (medical device quality management) help bridge the gap between flexible, token-based AI output and structured, word-based workflows. Working to these international standards helps LSPs establish and maintain control, consistency, and traceability over their procedures and output.
Such international standards and frameworks prevent small inconsistencies from becoming larger problems at scale.
A Different Way to Look at the Future
After more than two decades in the language service industry, we can discuss to what extent AI is changing the translation industry. Tokens vs word counts, LLMs juxtaposed with human translation, and stricter international regulations are all challenges that translation companies face every day.
To be honest, LLMs can significantly speed up the translation process. They are machines. They process language statistically and handle words as numbers rather than as meanings. That’s why the token is the unit of measure in machine translation.
A computer sees a word from the outside. A human translator understands the interior of the word. Together, they make a powerful translation transformer that can do more per unit of time than translators in the past. What we need are proper procedures and operations for maintaining quality.
As one of the leading LSPs for regulated industries in Europe, Ciklopea stands at the front of the tectonic changes the entire industry has been experiencing in recent years.