## How to understand the differences in texts by categories

Nowadays, working in product analytics, we face a lot of free-form texts:

- Users leave comments in AppStore, Google Play or other services;
- Clients reach out to our Customer Support and describe their problems using natural language;
- We launch surveys ourselves to get even more feedback, and in most cases, there are some free-form questions to get a better understanding.

We have hundreds of thousands of texts. It would take years to read them all and get some insights. Luckily, there are a lot of DS tools that could help us automate this process. One such tool is Topic Modelling, which I would like to discuss today.

Basic Topic Modelling can give you an understanding of the main topics in your texts (for example, reviews) and their mixture. But it’s challenging to make decisions based on one point. For example, 14.2% of reviews are about too many ads in your app. Is it bad or good? Should we look into it? To tell the truth, I have no idea.

But if we try to segment customers, we may learn that this share is 34.8% for Android users and 3.2% for iOS. Then, it’s apparent that we need to investigate whether we show too many ads on Android or why Android users’ tolerance to ads is lower.

That’s why I would like to share not only how to build a topic model but also how to compare topics across categories. In the end we will get such insightful graphs for each topic.

The most common real-life cases of free-form texts are some kind of reviews. So, let’s use a dataset with hotel reviews for this example.

I’ve filtered comments related to several hotel chains in London.

Before starting text analysis, it’s worth getting an overview of our data. In total, we have 12 890 reviews on 7 different hotel chains.

Now we have data and can apply our new fancy tool Topic Modeling to get insights from it. As I mentioned in the beginning, we will use Topic Modelling and a powerful and easy-to-use `BERTopic`

package (documentation) for this text analysis.

You might wonder what Topic Modelling is. It is an unsupervised ML technique related to Natural Language Processing. It allows you to find hidden semantic patterns in texts (usually called documents) and assign “topics” to them. You don’t need to have a list of topics beforehand. The algorithm will define them automatically — usually in the form of a bag of the most important words (tokens) or N-grams.

`BERTopic`

is a package for Topic Modelling using HuggingFace transformers and class-based TF-IDF. `BERTopic`

is a highly flexible modular package so that you can tailor it to your needs.

If you want to understand how it works better, I advise you to watch this video from the author of the library.

You can find the full code on GitHub.

According to the documentation, we typically don’t need to preprocess data unless there is a lot of noise, for example, HTML tags or other markdowns that don’t add meaning to the documents. It’s a significant advantage of `BERTopic`

because, for many NLP methods, there is a lot of boilerplate to preprocess your data. If you are interested in how it could look like, see this guide for Topic Modelline using LDA.

You can use `BERTopic`

with data in multiple languages specifying `BERTopic(language= "multilingual")`

. However, from my experience, the model works a bit better with texts translated into one language. So, I will translate all comments into English.

For translation, we will use `deep-translator`

package (you can install it from PyPI).

Also, it could be interesting to see distribution by languages, for that we could use `langdetect`

package.

`import langdetect`

from deep_translator import GoogleTranslatordef get_language(text):

try:

return langdetect.detect(text)

except KeyboardInterrupt as e:

raise(e)

except:

return '<-- ERROR -->'

def get_translation(text):

try:

return GoogleTranslator(source='auto', target='en')

.translate(str(text))

except KeyboardInterrupt as e:

raise(e)

except:

return '<-- ERROR -->'

df['language'] = df.review.map(get_language)

df['reviews_transl'] = df.review.map(get_translation)

In our case, 95+% of comments are already in English.

To understand our data better, let’s look at the distribution of reviews’ length. It shows that there are a lot of extremely short (and most likely not meaningful comments) — around 5% of reviews are less than 20 symbols.

We can look at the most common examples to ensure that there’s not much information in such comments.

`df.reviews_transl.map(lambda x: x.lower().strip()).value_counts().head(10)`reviews

none 74

<-- error --> 37

great hotel 12

perfect 8

excellent value for money 7

good value for money 7

very good hotel 6

excellent hotel 6

great location 6

very nice hotel 5

So we can filter out all comments shorter than 20 symbols — 556 out of 12 890 reviews (4.3%). Then, we will analyse only long statements with more context. It’s an arbitrary threshold based on examples, you can try a couple of levels and see what texts are filtered out.

It’s worth checking whether this filter disproportionally affects some hotels. Shares of short comments are pretty close for different categories. So, the data looks OK.

Now, it’s time to build our first topic model. Let’s start simple with the most basic one to understand how library works, then we will improve it.

We can train a topic model in just a few code lines that could be easily understood by anyone who has used at least one ML package before.

`from bertopic import BERTopic`

docs = list(df.reviews.values)

topic_model = BERTopic()

topics, probs = topic_model.fit_transform(docs)

The default model returned 113 topics. We can look at top topics.

`topic_model.get_topic_info().head(7).set_index('Topic')[`

['Count', 'Name', 'Representation']]

The biggest group is `Topic -1`

, which corresponds to outliers. By default, `BERTopic`

uses `HDBSCAN`

for clustering, and it doesn’t force all data points to be part of clusters. In our case, 6 356 reviews are outliers (around 49.3% of all reviews). It is almost a half of our data, so we will work with this group later.

A topic representation is usually a set of most important words specific to this topic and not others. So, the best way to understand a topic is to look at the main terms (in `BERTopic`

, a class-based TF-IDF score is used to rank the words).

`topic_model.visualize_barchart(top_n_topics = 16, n_words = 10)`

`BERTopic`

even has Topics per Class representation that can solve our task of understanding the differences in course reviews.

`topics_per_class = topic_model.topics_per_class(docs, `

classes=filt_df.hotel)topic_model.visualize_topics_per_class(topics_per_class,

top_n_topics=10, normalize_frequency = True)

If you are wondering how to interpret this graph, you are not alone — I also wasn’t able to guess. However, the author kindly supports this package, and there are a lot of answers on GitHub. From the discussion, I learned that the current normalisation approach doesn’t show the share of different topics for classes. So, it hasn’t completely solved our initial task.

However, we did the first iteration in less than 10 rows of code. It’s fantastic, but there’s some room for improvement.

As we saw earlier, almost 50% of data points are considered outliers. It’s quite a lot, let’s see what we could do with it.

The documentation provides four different strategies to deal with the outliers:

- based on topic-document probabilities,
- based on topic distributions,
- based on c-TF-IFD representations,
- based on document and topic embeddings.

You can try different strategies and see which one fits your data the best.

Let’s look at examples of outliers. Even though these reviews are relatively short, they have multiple topics.

`BERTopic`

uses clustering to define topics. It means that not more than one topic is assigned to each document. In most real-life cases, you can have a mixture of topics in your texts. We may be unable to assign a topic to the documents because they have multiple ones.

Luckily, there’s a solution for it — use Topic Distributions. With such an approach, each document will be split into tokens. Then, we will form subsentences (defined by sliding window and stride) and assign a topic for each such subsentence.

Let’s try this approach and see whether we will be able to reduce the number of outliers without topics.

However, Topic Distributions are based on the fitted topic model, so let’s enhance it.

First of all, we can use CountVectorizer. It defines how a document will be split into tokens. Also, it can help us to get rid of meaningless words like `to`

, `not`

or `the`

(there are a lot of such words in our first model).

Also, we could improve topics’ representations and even try a couple of different models. I used the `KeyBERTInspired`

model (more details), but you could try other options (for example, LLMs).

`from sklearn.feature_extraction.text import CountVectorizer`

from bertopic.representation import KeyBERTInspired, PartOfSpeech, MaximalMarginalRelevancemain_representation_model = KeyBERTInspired()

aspect_representation_model1 = PartOfSpeech("en_core_web_sm")

aspect_representation_model2 = [KeyBERTInspired(top_n_words=30),

MaximalMarginalRelevance(diversity=.5)]

representation_model = {

"Main": main_representation_model,

"Aspect1": aspect_representation_model1,

"Aspect2": aspect_representation_model2

}

vectorizer_model = CountVectorizer(min_df=5, stop_words = 'english')

topic_model = BERTopic(nr_topics = 'auto',

vectorizer_model = vectorizer_model,

representation_model = representation_model)

topics, ini_probs = topic_model.fit_transform(docs)

I specified `nr_topics = 'auto'`

to reduce the number of topics. Then, all topics with a similarity over threshold will be merged automatically. With this feature, we got 99 topics.

I’ve created a function to get top topics and their shares so we could analyse it easier. Let’s look at the new set of topics.

`def get_topic_stats(topic_model, extra_cols = []):`

topics_info_df = topic_model.get_topic_info().sort_values('Count', ascending = False)

topics_info_df['Share'] = 100.*topics_info_df['Count']/topics_info_df['Count'].sum()

topics_info_df['CumulativeShare'] = 100.*topics_info_df['Count'].cumsum()/topics_info_df['Count'].sum()

return topics_info_df[['Topic', 'Count', 'Share', 'CumulativeShare',

'Name', 'Representation'] + extra_cols]get_topic_stats(topic_model, ['Aspect1', 'Aspect2']).head(10)

.set_index('Topic')

We can also look at the Interoptic distance map to better understand our clusters, for example, which are close to each other. You can also use it to define some parent topics and subtopics. It’s called Hierarchical Topic Modelling and you can use other tools for it.

`topic_model.visualize_topics()`

Another insightful way to better understand your topics is to look at `visualize_documents`

graph (documentation).

We can see that the number of topics has reduced significantly. Also, there are no meaningless stop words in topics’ representations.

However, we still see similar topics in the results. We can investigate and merge such topics manually.

For this, we can draw a Similarity matrix. I specified `n_clusters`

, and our topics were clustered to visualise them better.

`topic_model.visualize_heatmap(n_clusters = 20)`

There are some pretty close topics. Let’s calculate the pair distances and look at the top topics.

`from sklearn.metrics.pairwise import cosine_similarity`

distance_matrix = cosine_similarity(np.array(topic_model.topic_embeddings_))

dist_df = pd.DataFrame(distance_matrix, columns=topic_model.topic_labels_.values(),

index=topic_model.topic_labels_.values())tmp = []

for rec in dist_df.reset_index().to_dict('records'):

t1 = rec['index']

for t2 in rec:

if t2 == 'index':

continue

tmp.append(

{

'topic1': t1,

'topic2': t2,

'distance': rec[t2]

}

)

pair_dist_df = pd.DataFrame(tmp)

pair_dist_df = pair_dist_df[(pair_dist_df.topic1.map(

lambda x: not x.startswith('-1'))) &

(pair_dist_df.topic2.map(lambda x: not x.startswith('-1')))]

pair_dist_df = pair_dist_df[pair_dist_df.topic1 < pair_dist_df.topic2]

pair_dist_df.sort_values('distance', ascending = False).head(20)

I found guidance on how to get the distance matrix from GitHub discussions.

We can now see the top pairs of topics by cosine similarity. There are topics with close meanings that we could merge.

`topic_model.merge_topics(docs, [[26, 74], [43, 68, 62], [16, 50, 91]])`

df['merged_topic'] = topic_model.topics_

** Attention:** after merging, all topics’ IDs and representations will be recalculated, so it’s worth updating if you use them.

Now, we’ve improved our initial model and are ready to move on.

With real-life tasks, it’s worth spending more time on merging topics and trying different approaches to representation and clustering to get the best results.

The other potential idea is splitting reviews into separate sentences because comments are rather long.

Let’s calculate topics’ and tokens’ distributions. I’ve used a window equal to 4 (the author advised using 4–8 tokens) and stride equal 1.

`topic_distr, topic_token_distr = topic_model.approximate_distribution(`

docs, window = 4, calculate_tokens=True)

For example, this comment will be split into subsentences (or sets of four tokens), and the closest of existing topics will be assigned to each. Then, these topics will be aggregated to calculate probabilities for the whole sentence. You can find more details in the documentation.

Using this data, we can get the probabilities of different topics for each review.

`topic_model.visualize_distribution(topic_distr[doc_id], min_probability=0.05)`

We can even see the distribution of terms for each topic and understand why we got this result. For our sentence, `best very beautiful`

was the main term for `Topic 74`

, while `location close to`

defined a bunch of location-related topics.

`vis_df = topic_model.visualize_approximate_distribution(docs[doc_id], `

topic_token_distr[doc_id])

vis_df

This example also shows that we might have spent more time merging topics because there are still pretty similar ones.

Now, we have probabilities for each topic and review. The next task is to select a threshold to filter irrelevant topics with too low probability.

We can do it as usual using data. Let’s calculate the distribution of selected topics per review for different threshold levels.

`tmp_dfs = []`# iterating through different threshold levels

for thr in tqdm.tqdm(np.arange(0, 0.35, 0.001)):

# calculating number of topics with probability > threshold for each document

tmp_df = pd.DataFrame(list(map(lambda x: len(list(filter(lambda y: y >= thr, x))), topic_distr))).rename(

columns = {0: 'num_topics'}

)

tmp_df['num_docs'] = 1

tmp_df['num_topics_group'] = tmp_df['num_topics']

.map(lambda x: str(x) if x < 5 else '5+')

# aggregating stats

tmp_df_aggr = tmp_df.groupby('num_topics_group', as_index = False).num_docs.sum()

tmp_df_aggr['threshold'] = thr

tmp_dfs.append(tmp_df_aggr)

num_topics_stats_df = pd.concat(tmp_dfs).pivot(index = 'threshold',

values = 'num_docs',

columns = 'num_topics_group').fillna(0)

num_topics_stats_df = num_topics_stats_df.apply(lambda x: 100.*x/num_topics_stats_df.sum(axis = 1))

# visualisation

colormap = px.colors.sequential.YlGnBu

px.area(num_topics_stats_df,

title = 'Distribution of number of topics',

labels = {'num_topics_group': 'number of topics',

'value': 'share of reviews, %'},

color_discrete_map = {

'0': colormap[0],

'1': colormap[3],

'2': colormap[4],

'3': colormap[5],

'4': colormap[6],

'5+': colormap[7]

})

`threshold = 0.05`

looks like a good candidate because, with this level, the share of reviews without any topic is still low enough (less than 6%), while the percentage of comments with 4+ topics is also not so high.

This approach has helped us to reduce the number of outliers from 53.4% to 5.8%. So, assigning multiple topics could be an effective way to handle outliers.

Let’s calculate the topics for each doc with this threshold.

`threshold = 0.13`# define topic with probability > 0.13 for each document

df['multiple_topics'] = list(map(

lambda doc_topic_distr: list(map(

lambda y: y[0], filter(lambda x: x[1] >= threshold,

(enumerate(doc_topic_distr)))

)), topic_distr

))

# creating a dataset with docid, topic

tmp_data = []

for rec in df.to_dict('records'):

if len(rec['multiple_topics']) != 0:

mult_topics = rec['multiple_topics']

else:

mult_topics = [-1]

for topic in mult_topics:

tmp_data.append(

{

'topic': topic,

'id': rec['id'],

'course_id': rec['course_id'],

'reviews_transl': rec['reviews_transl']

}

)

mult_topics_df = pd.DataFrame(tmp_data)

Now, we have multiple topics mapped to each review and we can compare topics’ mixtures for different hotel chains.

Let’s find cases when a topic has too high or low share for a particular hotel. For that, we will calculate for each pair topic + hotel share of comments related to the topic for this hotel vs. all others.

`tmp_data = []`

for hotel in mult_topics_df.hotel.unique():

for topic in mult_topics_df.topic.unique():

tmp_data.append({

'hotel': hotel,

'topic_id': topic,

'total_hotel_reviews': mult_topics_df[mult_topics_df.hotel == hotel].id.nunique(),

'topic_hotel_reviews': mult_topics_df[(mult_topics_df.hotel == hotel)

& (mult_topics_df.topic == topic)].id.nunique(),

'other_hotels_reviews': mult_topics_df[mult_topics_df.hotel != hotel].id.nunique(),

'topic_other_hotels_reviews': mult_topics_df[(mult_topics_df.hotel != hotel)

& (mult_topics_df.topic == topic)].id.nunique()

})mult_topics_stats_df = pd.DataFrame(tmp_data)

mult_topics_stats_df['topic_hotel_share'] = 100*mult_topics_stats_df.topic_hotel_reviews/mult_topics_stats_df.total_hotel_reviews

mult_topics_stats_df['topic_other_hotels_share'] = 100*mult_topics_stats_df.topic_other_hotels_reviews/mult_topics_stats_df.other_hotels_reviews

However, not all differences are significant for us. We can say that the difference in topics’ distribution is worth looking at if there are

**statistical significance**— the difference is not just by chance,**practical significance**— the difference is bigger than X% points (I used 1%).

`from statsmodels.stats.proportion import proportions_ztest`mult_topics_stats_df['difference_pval'] = list(map(

lambda x1, x2, n1, n2: proportions_ztest(

count = [x1, x2],

nobs = [n1, n2],

alternative = 'two-sided'

)[1],

mult_topics_stats_df.topic_other_hotels_reviews,

mult_topics_stats_df.topic_hotel_reviews,

mult_topics_stats_df.other_hotels_reviews,

mult_topics_stats_df.total_hotel_reviews

))

mult_topics_stats_df['sign_difference'] = mult_topics_stats_df.difference_pval.map(

lambda x: 1 if x <= 0.05 else 0

)

def get_significance(d, sign):

sign_percent = 1

if sign == 0:

return 'no diff'

if (d >= -sign_percent) and (d <= sign_percent):

return 'no diff'

if d < -sign_percent:

return 'lower'

if d > sign_percent:

return 'higher'

mult_topics_stats_df['diff_significance_total'] = list(map(

get_significance,

mult_topics_stats_df.topic_hotel_share - mult_topics_stats_df.topic_other_hotels_share,

mult_topics_stats_df.sign_difference

))

We have all the stats for all topics and hotels, and the last step is to create a visualisation comparing topic shares by categories.

`import plotly`# define color depending on difference significance

def get_color_sign(rel):

if rel == 'no diff':

return plotly.colors.qualitative.Set2[7]

if rel == 'lower':

return plotly.colors.qualitative.Set2[1]

if rel == 'higher':

return plotly.colors.qualitative.Set2[0]

# return topic representation in a suitable for graph title format

def get_topic_representation_title(topic_model, topic):

data = topic_model.get_topic(topic)

data = list(map(lambda x: x[0], data))

return ', '.join(data[:5]) + ',

' + ', '.join(data[5:])

def get_graphs_for_topic(t):

topic_stats_df = mult_topics_stats_df[mult_topics_stats_df.topic_id == t]

.sort_values('total_hotel_reviews', ascending = False).set_index('hotel')

colors = list(map(

get_color_sign,

topic_stats_df.diff_significance_total

))

fig = px.bar(topic_stats_df.reset_index(), x = 'hotel', y = 'topic_hotel_share',

title = 'Topic: %s' % get_topic_representation_title(topic_model,

topic_stats_df.topic_id.min()),

text_auto = '.1f',

labels = {'topic_hotel_share': 'share of reviews, %'},

hover_data=['topic_id'])

fig.update_layout(showlegend = False)

fig.update_traces(marker_color=colors, marker_line_color=colors,

marker_line_width=1.5, opacity=0.9)

topic_total_share = 100.*((topic_stats_df.topic_hotel_reviews + topic_stats_df.topic_other_hotels_reviews)

/(topic_stats_df.total_hotel_reviews + topic_stats_df.other_hotels_reviews)).min()

print(topic_total_share)

fig.add_shape(type="line",

xref="paper",

x0=0, y0=topic_total_share,

x1=1, y1=topic_total_share,

line=dict(

color=colormap[8],

width=3, dash="dot"

)

)

fig.show()

Then, we can calculate the top topics list and make graphs for them.

`top_mult_topics_df = mult_topics_df.groupby('topic', as_index = False).id.nunique()`

top_mult_topics_df['share'] = 100.*top_mult_topics_df.id/top_mult_topics_df.id.sum()

top_mult_topics_df['topic_repr'] = top_mult_topics_df.topic.map(

lambda x: get_topic_representation(topic_model, x)

)for t in top_mult_topics_df.head(32).topic.values:

get_graphs_for_topic(t)

Here are a couple of examples of resulting charts. Let’s try to make some conclusions based on this data.

We can see that Holiday Inn, Travelodge and Park Inn have better prices and value for money compared to Hilton or Park Plaza.

The other insight is that in Travelodge noise may be a problem.

It’s a bit challenging for me to interpret this result. I’m not sure what this topic is about.

The best practice for such cases is to look at some examples.

*We stayed in the East tower where**the lifts are under renovation**, only one works, but there are signs showing the way to service lifts which can be used also.**However, the carpet and the furniture could have a**refurbishment**.**It’s built right over Queensway station. Beware that this tube stop will be closed for**refurbishing**for one year! So you might consider noise levels.*

So, this topic is about the cases of temporary issues during the hotel stay or furniture not in the best condition.

You can find the full code on GitHub.

Today, we’ve done an end-to-end Topic Modelling analysis:

- Build a basic topic model using the BERTopic library.
- Then, we’ve handled outliers, so only 5.8% of our reviews don’t have a topic assigned.
- Reduced the number of topics both automatically and manually to have a concise list.
- Learned how to assign multiple topics to each document because, in most cases, your text will have a mixture of topics.

Finally, we were able to compare reviews for different courses, create inspiring graphs and get some insights.

Thank you a lot for reading this article. I hope it was insightful to you. If you have any follow-up questions or comments, please leave them in the comments section.

*Ganesan, Kavita and Zhai, ChengXiang. (2011). OpinRank Review Dataset. UCI Machine Learning Repository. *

*https://doi.org/10.24432/C5QW4W*