Unveiling the Power of Text Clustering and Topic Modeling with BERTopic

In today’s data-driven world, organizations grapple with massive amounts of unstructured text data—from customer reviews and social media posts to research papers and news articles. While supervised learning techniques like classification have dominated the industry in recent years, unsupervised methods like text clustering and topic modeling offer untapped potential for uncovering hidden patterns and insights in textual data.

In this article, we’ll explore how modern language models have revolutionized text clustering and how BERTopic, a state-of-the-art topic modeling framework, leverages these advancements to provide meaningful clusters and topics. We’ll delve into a common pipeline for text clustering, enrich it with examples, and demonstrate how combining different language models can lead to powerful new techniques.

The Importance of Text Clustering and Topic Modeling

Text clustering aims to group similar documents based on their semantic content, meaning, and relationships. Imagine sorting through thousands of customer reviews for an e-commerce platform. Manually categorizing them would be a Herculean task. Text clustering automates this process by grouping reviews that express similar sentiments or discuss similar products, enabling quick exploratory data analysis and efficient categorization.

Topic modeling takes this a step further by discovering abstract “topics” within a collection of documents. These topics are typically represented by keywords or key phrases that encapsulate the essence of the cluster. For instance, a topic might be characterized by words like “battery life,” “screen resolution,” and “camera quality,” indicating that the cluster is about smartphone features.

Harnessing Language Models for Clustering

Recent advancements in language models, especially Transformer-based models like BERT and GPT, have empowered us to capture the contextual and semantic nuances of text better than ever before. Language is not just a bag of words; it’s a complex structure where context matters. By converting text into embeddings—numerical representations that capture meaning—we can leverage these models for more effective clustering.
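
To make this concrete, here's a small, hypothetical illustration (the sentences and the model name are just examples): semantically related sentences end up close together in embedding space, while unrelated ones do not.

from sentence_transformers import SentenceTransformer, util

# Embed a few example sentences with a general-purpose sentence embedding model
demo_model = SentenceTransformer('all-MiniLM-L6-v2')
demo_sentences = [
    "The movie was fantastic.",
    "I really enjoyed the film.",
    "The stock market fell sharply today.",
]
demo_embeddings = demo_model.encode(demo_sentences)

# Pairwise cosine similarities: the two movie sentences score much higher
# with each other than either does with the finance sentence
print(util.cos_sim(demo_embeddings, demo_embeddings))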

A Common Pipeline for Text Clustering

A popular approach to text clustering involves a three-step pipeline:

  1. Embedding Documents: Convert each document into a numerical embedding using a language model.
  2. Dimensionality Reduction: Reduce the high-dimensional embeddings to a lower-dimensional space to simplify clustering.
  3. Clustering: Apply a clustering algorithm to group the reduced embeddings into meaningful clusters.

Let’s walk through each step with an example.

1. Embedding Documents

Suppose we have a dataset of movie reviews. Our goal is to cluster these reviews to understand common themes, such as reviews about acting, plot, or cinematography.

First, we need to convert each review into an embedding. We’ll use a Transformer-based embedding model optimized for semantic similarity from the sentence-transformers library, such as all-MiniLM-L6-v2. Here’s how we might proceed:

from sentence_transformers import SentenceTransformer

# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample reviews
reviews = [
    "The cinematography was breathtaking and the visuals stunning.",
    "An amateurish plot with mediocre acting.",
    "A masterpiece of storytelling with brilliant performances.",
    # ... more reviews
]

# Generate embeddings
embeddings = embedding_model.encode(reviews, show_progress_bar=True)

Each review is now represented as a high-dimensional vector capturing its semantic meaning.
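
For reference, the all-MiniLM-L6-v2 model used above produces 384-dimensional vectors, which we can confirm by checking the shape of the embedding matrix:

# One row per review, one column per embedding dimension (384 for all-MiniLM-L6-v2)
print(embeddings.shape)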

2. Reducing Dimensionality

High-dimensional data can be challenging for clustering algorithms due to the curse of dimensionality. We can use techniques like Uniform Manifold Approximation and Projection (UMAP) to reduce the embeddings to a lower-dimensional space while preserving their meaningful structure.

from umap import UMAP

# Reduce embeddings to 5 dimensions
umap_model = UMAP(n_components=5, min_dist=0.0, metric='cosine', random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)

3. Clustering the Reduced Embeddings

With reduced embeddings, we can apply a clustering algorithm. Density-based algorithms like HDBSCAN are well-suited here because they can discover clusters of varying shapes and sizes and can identify outliers.

from hdbscan import HDBSCAN

# Apply HDBSCAN clustering
# (min_cluster_size=2 only because our toy dataset is tiny; use a larger value on real data)
hdbscan_model = HDBSCAN(min_cluster_size=2, metric='euclidean', cluster_selection_method='eom')
clusters = hdbscan_model.fit_predict(reduced_embeddings)
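
One practical detail before inspecting the results: HDBSCAN labels documents it considers noise with -1, so a quick count per label shows how many reviews were grouped and how many were treated as outliers. A small sketch using the objects defined above:

from collections import Counter

# Label -1 marks outliers/noise; all other labels are cluster IDs
print(Counter(clusters))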

Inspecting the Clusters

Now, let’s examine the clusters:

import numpy as np

# Print reviews in the first cluster
cluster = 0
for index in np.where(clusters == cluster)[0]:
    print(reviews[index])

Suppose cluster 0 contains reviews praising visual aspects:

"The cinematography was breathtaking and the visuals stunning."
"Visually a masterpiece, every frame is art."

Cluster 1 might include critiques of the plot:

"An amateurish plot with mediocre acting."
"The story was predictable and lacked depth."

By inspecting the clusters, we gain insights into common themes in the reviews without prior labeling.

From Clustering to Topic Modeling

While text clustering groups similar documents, topic modeling aims to extract meaningful topics represented by keywords or key phrases from these clusters. This is particularly useful when dealing with large datasets where manual inspection is impractical.

Introducing BERTopic

BERTopic is a flexible topic modeling framework that leverages BERT embeddings and clustering algorithms to produce interpretable topics. Its modularity allows for customization at each step, enabling the integration of different models and techniques.

How BERTopic Works

  1. Document Embedding: Similar to our earlier pipeline, BERTopic starts by embedding documents using a language model.
  2. Dimensionality Reduction: It reduces the embeddings’ dimensionality, often using UMAP.
  3. Clustering: BERTopic clusters the documents using an algorithm like HDBSCAN.
  4. Topic Representation: It generates topic representations using a class-based variant of TF-IDF called c-TF-IDF, which scores words by how characteristic they are of an entire cluster rather than of a single document (a rough sketch of this idea follows the list).
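
BERTopic’s actual implementation includes a few extra refinements, but the core idea behind c-TF-IDF can be sketched roughly as follows (an illustration of the weighting scheme, not the library’s code):

import numpy as np

def c_tf_idf_sketch(counts):
    """Rough sketch of class-based TF-IDF (c-TF-IDF).

    `counts` is an (n_clusters, n_terms) matrix of raw word counts,
    where each row pools all documents belonging to one cluster.
    """
    tf = counts                                   # how often each word appears in each cluster
    freq_across_clusters = counts.sum(axis=0)     # how often each word appears overall
    avg_words_per_cluster = counts.sum() / counts.shape[0]
    idf = np.log(1 + avg_words_per_cluster / freq_across_clusters)
    return tf * idf

Words with high scores appear often within one cluster but are comparatively rare across clusters, which is exactly what makes them good topic descriptors.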

Applying BERTopic to Real-World Data

Let’s consider a more exciting example: analyzing social media posts about a global sporting event, such as the Olympics. Our dataset contains thousands of tweets expressing opinions, excitement, or critiques.

Setting Up the Data

# Sample tweets
tweets = [
    "Can't believe the world record was broken again!",
    "The opening ceremony was a dazzling spectacle.",
    "Disappointed with the judging in gymnastics.",
    # ... more tweets
]

Running BERTopic

from bertopic import BERTopic

# Initialize BERTopic
topic_model = BERTopic()

# Fit the model on the tweets
topics, _ = topic_model.fit_transform(tweets)

Exploring the Topics

We can retrieve the topics and see the associated keywords:

# Overview of all discovered topics
topic_model.get_topic_info()

# Keywords and their weights for a specific topic, e.g. topic 0
topic_model.get_topic(0)

Suppose one of the topics is characterized by keywords like:

('ceremony', 0.1), ('opening', 0.08), ('spectacle', 0.06), ('dazzling', 0.04)

This topic clearly relates to the opening ceremony.
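
Once the model is fitted, we can also ask which of the discovered topics a new, unseen tweet would be assigned to (the example tweet below is made up):

# Assign new documents to the existing topics
new_topics, new_probs = topic_model.transform(
    ["That opening ceremony gave me goosebumps!"]
)
print(new_topics)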

Visualizing the Topics

BERTopic offers various visualization tools to understand the relationships between topics and the distribution of documents within them.

# Visualize topics
topic_model.visualize_topics()
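
Beyond the inter-topic distance map above, two other commonly used views are a bar chart of the top words per topic and a document-level map; both return interactive Plotly figures:

# Top keywords per topic as bar charts
topic_model.visualize_barchart()

# Individual tweets projected to two dimensions and colored by topic
topic_model.visualize_documents(tweets)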

Enhancing Topic Representations

While the initial topic representations provide valuable insights, we can leverage additional techniques to refine them.

Using KeyBERT for Keyword Extraction

KeyBERT is a minimalist keyword extraction technique built on BERT embeddings. BERTopic ships a KeyBERT-inspired representation model that applies the same idea to refine topic representations.

from bertopic.representation import KeyBERTInspired

# Update topics with KeyBERT
representation_model = KeyBERTInspired()
topic_model.update_topics(tweets, representation_model=representation_model)

This approach re-ranks keywords based on their semantic similarity to the topic, providing more coherent and relevant keywords.

Incorporating Generative Models

Generative language models, such as OpenAI's GPT models, can be used to produce more descriptive topic labels. By providing the top keywords and sample documents from a topic, we can prompt the model to generate a concise label.

import openai
from bertopic.representation import OpenAI

# Create an OpenAI client (recent openai/BERTopic versions expect a client object)
client = openai.OpenAI(api_key='YOUR_API_KEY')

# Define the prompt template
prompt = """
I have a topic with the following keywords: [KEYWORDS]

And these example tweets: [DOCUMENTS]

Please provide a short label for this topic.
"""

# Update topics using a GPT chat model to generate the labels
representation_model = OpenAI(client, model='gpt-3.5-turbo', chat=True, prompt=prompt)
topic_model.update_topics(tweets, representation_model=representation_model)

Now, instead of a set of keywords, each topic has a clear, human-readable label generated by the language model.
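
To check the result, the generated labels show up in the topic overview (the exact column names can vary slightly across BERTopic versions):

# The generated labels become part of each topic's representation
topic_model.get_topic_info()[['Topic', 'Count', 'Name']]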

The Power of Modularity

One of BERTopic’s strengths is its modularity. Each component—the embedding model, dimensionality reduction, clustering algorithm, and representation model—can be swapped out or configured to suit specific needs.

For instance:

  • Embedding Models: Replace the default embedding model with domain-specific models for biomedical text, legal documents, or code.
  • Clustering Algorithms: Use k-means clustering if you prefer fixed cluster sizes or when the number of topics is known in advance.
  • Representation Models: Stack representation models to further refine topics, such as combining KeyBERT with MMR (Maximal Marginal Relevance) to reduce redundancy; a configuration sketch follows this list.
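
As a sketch of what such a configuration might look like (the model names and parameters below are illustrative choices, not recommendations):

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Illustrative swaps: a different embedding model, a fixed number of clusters,
# and a stacked representation pipeline (KeyBERT-style re-ranking followed by MMR)
embedding_model = SentenceTransformer('all-mpnet-base-v2')
cluster_model = KMeans(n_clusters=10, random_state=42)
representation_model = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)]

topic_model = BERTopic(
    embedding_model=embedding_model,
    hdbscan_model=cluster_model,  # BERTopic also accepts scikit-learn clustering models here
    representation_model=representation_model,
)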

Conclusion

Text clustering and topic modeling open doors to uncovering hidden structures and themes within vast textual datasets. With modern language models and frameworks like BERTopic, we can move beyond superficial analysis and derive deep, actionable insights.

Whether you’re analyzing customer feedback, organizing research papers, or monitoring social media trends, these techniques provide powerful tools to make sense of unstructured text. By harnessing the modularity of frameworks like BERTopic, you can tailor the process to your specific domain and continuously integrate advancements in language AI.

As we continue to innovate and combine different models and techniques, the potential for discovering new insights from text data grows exponentially. It’s an exciting time for anyone working with natural language processing, and the tools have never been more accessible.