Demystifying Tokens and Embeddings in Large Language Models
- Andrew Butson
- 14 Jan, 2024
Understanding the inner workings of large language models (LLMs) can feel like unraveling a complex tapestry of algorithms and data. Two foundational threads in this tapestry are tokens and embeddings. They are not just buzzwords; they are the bedrock upon which modern NLP applications are built. In this article, we’ll look at what tokens and embeddings are, how they function within LLMs, and work through some examples that highlight their significance.
What Are Tokens?
At the most basic level, a token is a piece of text that the model considers as a single unit. While humans read text as a sequence of characters forming words and sentences, LLMs process text by breaking it down into these smaller units.
Tokenization: Breaking Down Text
Tokenization is the process of converting raw text into tokens. This is a crucial step because LLMs do not understand text in its raw form. They require numerical representations to perform computations.
Consider the sentence:
“Life is like a box of chocolates.”
A tokenizer might break this down into the following tokens:
- “Life”
- “is”
- “like”
- “a”
- “box”
- “of”
- “choco”
- “lates”
- ”.”
Notice that “chocolates” might be split into “choco” and “lates” depending on the tokenizer’s strategy. Tokenization affects the model’s understanding and generation of text, making it a vital component of the LLM pipeline.
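To see this in practice, here is a minimal sketch using the Hugging Face transformers library and the GPT-2 tokenizer (any subword tokenizer would do; the exact splits depend on its vocabulary):

from transformers import AutoTokenizer

# Load a pretrained subword tokenizer (GPT-2 uses byte-level BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Life is like a box of chocolates."

# Split the text into tokens, then map each token to its integer vocabulary ID
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)     # token strings; "chocolates" may or may not be split, depending on the vocabulary
print(token_ids)  # the integer IDs the model actually consumes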
Different Tokenization Methods
There are several tokenization methods, each with its advantages and trade-offs:
- Word Tokenization: Splits text into words based on spaces. Simple but can’t handle out-of-vocabulary words effectively.
- Subword Tokenization: Breaks words into meaningful subword units, handling rare or new words better.
- Character Tokenization: Treats every character as a token, useful for languages without clear word boundaries but can result in very long sequences.
- Byte Pair Encoding (BPE): A popular subword tokenization method that iteratively merges the most frequent pairs of adjacent symbols (characters, or bytes in byte-level variants) to build a subword vocabulary; a toy sketch of the merge loop follows this list.
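To make the merge idea concrete, here is a toy, character-level sketch of the BPE training loop (heavily simplified; real implementations operate on bytes, handle symbol boundaries more carefully, and are far more efficient). The corpus below is the classic low/lower/newest/widest example:

from collections import Counter

# Toy corpus: each word is pre-split into characters, mapped to its frequency
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

def get_pair_counts(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the chosen pair with its merged symbol
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"Step {step + 1}: merged {best}")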
Embeddings: Giving Meaning to Tokens
Once text is tokenized, each token needs a numerical representation. This is where embeddings come into play.
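Under the hood, an embedding layer is essentially a lookup table that maps each token ID to a learned vector. A minimal PyTorch sketch (the vocabulary size, dimension, and IDs below are arbitrary illustrations):

import torch
import torch.nn as nn

# A toy embedding table: 10,000-token vocabulary, 64-dimensional vectors
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=64)

# Token IDs as a tokenizer might produce them (arbitrary example values)
token_ids = torch.tensor([[101, 2054, 2024, 19204, 2015, 102]])

vectors = embedding(token_ids)  # look up one vector per token
print(vectors.shape)            # torch.Size([1, 6, 64])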
Word Embeddings and Beyond
Embeddings are dense vector representations of tokens that capture semantic meaning. Traditional word embeddings like Word2Vec and GloVe assign a fixed vector to each word, capturing relationships like:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
However, these embeddings are static; the representation of a word doesn’t change with context.
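You can reproduce the analogy above with gensim and a pretrained set of static vectors; a quick sketch, assuming the downloadable glove-wiki-gigaword-50 model (results vary by embedding model and training corpus):

import gensim.downloader as api

# Download pretrained GloVe vectors (a relatively small model)
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # "queen" typically appears at or near the top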
Contextualized Embeddings with Language Models
Modern LLMs generate contextualized embeddings, where the representation of a word varies depending on its context. For example, the word “bank” would have different embeddings in:
- “She sat on the river bank.”
- “He went to the bank to deposit money.”
This context-awareness is crucial for tasks like disambiguation and understanding nuanced language.
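One way to see this is to extract the hidden state for “bank” in each sentence and compare them. A rough sketch with bert-base-uncased (layer choice and pooling are simplified here; the exact similarity value will vary):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Encode the sentence and grab the last-layer hidden state for the "bank" token
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.tokenize(sentence)
    return hidden[tokens.index("bank") + 1]  # +1 to skip the [CLS] token

v1 = bank_vector("She sat on the river bank.")
v2 = bank_vector("He went to the bank to deposit money.")

similarity = F.cosine_similarity(v1, v2, dim=0).item()
print(f"Cosine similarity between the two 'bank' vectors: {similarity:.3f}")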
Exploring Tokenization with Examples
To truly grasp tokenization, let’s dive into a hands-on example.
Tokenizing a Multilingual Joke
Consider the multilingual text:
“Why did the chicken cross the road? 为了到达另一边! 🐔🚶♂️”
This sentence combines English, Chinese (为了到达另一边 roughly means “in order to get to the other side”), and emoji. Let’s see how different tokenizers handle it.
from transformers import AutoTokenizer

# Load different tokenizers
tokenizers = {
    "BERT": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "GPT-2": AutoTokenizer.from_pretrained("gpt2"),
    "XLNet": AutoTokenizer.from_pretrained("xlnet-base-cased"),
}

# The multilingual text
text = "Why did the chicken cross the road? 为了到达另一边! 🐔🚶♂️"

# Tokenize and display tokens
for name, tokenizer in tokenizers.items():
    tokens = tokenizer.tokenize(text)
    print(f"{name} Tokens:")
    print(tokens)
    print("---")
Output:
BERT Tokens:
['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]']

GPT-2 Tokens:
['Why', ' did', ' the', ' chicken', ' cross', ' the', ' road', '?', ' 为', '了', '到', '达', '另', '一', '边', '!', ' 🐔', '🚶', '\u200d', '♂', '️']

XLNet Tokens:
['▁Why', '▁did', '▁the', '▁chicken', '▁cross', '▁the', '▁road', '?', '▁为了到达', '另一', '边', '!', '▁', '🐔', '🚶♂️']
Analysis:
- BERT treats unknown characters (Chinese characters and emojis) as [UNK].
- GPT-2 handles Chinese characters and emojis by breaking them into sub-units.
- XLNet uses SentencePiece and can handle multilingual text more gracefully.
This demonstrates how the choice of tokenizer impacts the model’s ability to process and understand different languages and symbols.
Comparing Tokenizers from Popular Models
Understanding how different models tokenize the same text can shed light on their capabilities.
BERT vs. GPT vs. Custom Models
Let’s consider the sentence:
“The COVID-19 pandemic impacted global economies in 2020.”
BERT Tokenization:
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = bert_tokenizer.tokenize("The COVID-19 pandemic impacted global economies in 2020.")
print(bert_tokens)
Output:
['the', 'cov', '##id', '-', '19', 'pandemic', 'impacted', 'global', 'economies', 'in', '2020', '.']
- BERT breaks “COVID-19” into “cov”, “##id”, ”-”, “19”, using the “##” prefix to mark word-internal pieces.
GPT-2 Tokenization:
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2_tokens = gpt2_tokenizer.tokenize("The COVID-19 pandemic impacted global economies in 2020.")
print(gpt2_tokens)
Output:
['The', 'COVID', '-', '19', 'pandemic', 'impacted', 'global', 'economies', 'in', '2020', '.']
- GPT-2 keeps “COVID” as a single token (preceded by a space marker) but splits “-19” into “-” and “19”.
Custom Medical Model Tokenization:
Imagine a custom tokenizer trained on medical texts.
- It might keep “COVID-19” as a single token because it’s a frequent term in medical literature.
This comparison highlights how tokenizers adapted to specific domains can capture domain-specific terminology more effectively.
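As a sketch of how such a domain-specific tokenizer could be built, here is a toy example using the Hugging Face tokenizers library. The corpus is hypothetical and far too small to be realistic, but it shows the mechanism:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import BpeTrainer

# Hypothetical in-domain corpus in which "COVID-19" occurs frequently
corpus = [
    "COVID-19 vaccines reduce severe COVID-19 outcomes",
    "COVID-19 transmission and COVID-19 variants",
    "long COVID-19 symptoms after COVID-19 infection",
] * 100

# Train a small BPE tokenizer; splitting only on whitespace lets
# "COVID-19" survive pre-tokenization as a single learnable unit
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("COVID-19 pandemic").tokens)
# With enough in-domain data, "COVID-19" should come out as a single token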
Applications of Tokens and Embeddings
Tokens and embeddings are not just theoretical concepts; they have practical applications that touch our daily lives.
Building a Music Recommendation System
Let’s explore how embeddings can power a music recommendation engine.
Step 1: Treat Songs as Tokens
Imagine a dataset where each playlist is a ‘sentence’ and each song is a ‘word’. By applying Word2Vec, we can learn embeddings for songs based on the playlists they appear in.
Step 2: Train the Model
from gensim.models import Word2Vec
# Sample playlists (each playlist is a list of song IDs)
playlists = [
    ['song_1', 'song_2', 'song_3'],
    ['song_2', 'song_4', 'song_5'],
    # ... more playlists
]
# Train Word2Vec model
model = Word2Vec(sentences=playlists, vector_size=50, window=5, min_count=1, workers=4)
Step 3: Make Recommendations
# Get similar songs to 'song_1'
similar_songs = model.wv.most_similar('song_1')
print(similar_songs)
Output:
[('song_3', 0.95), ('song_5', 0.90), ('song_2', 0.88)]
Analysis:
- Songs that often appear together in playlists have similar embeddings.
- This method captures the collective taste of listeners.
Semantic Search with Text Embeddings
Embeddings can transform search by capturing the semantic meaning of queries and documents.
Step 1: Embed Documents
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
    "How to bake a chocolate cake",
    "Best practices for Python programming",
    "Understanding quantum physics",
    # ... more documents
]
doc_embeddings = model.encode(documents)
Step 2: Embed the Query and Compute Similarity
query = "Python coding tips"
query_embedding = model.encode(query)
# Compute cosine similarity
import numpy as np
cosine_scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
# Find the most similar document
most_similar_doc = documents[np.argmax(cosine_scores)]
print(f"Most similar document: {most_similar_doc}")
Output:
Most similar document: Best practices for Python programming
Analysis:
- The embeddings capture semantic similarity beyond keyword matching.
- This enables more intuitive and accurate search results.
Tokens and embeddings are fundamental to the functionality of large language models. Tokens break text into digestible pieces, while embeddings give those pieces meaningful numerical representations. Together, they enable models to understand and generate human-like language.
From processing multilingual texts to building recommendation systems and enhancing search technologies, the applications of tokens and embeddings are vast and impactful. As LLMs continue to evolve, so will the techniques for tokenization and embedding, pushing the boundaries of what’s possible in natural language processing.