Now that we have cleaned, tokenised and generated n-grams for our documents (claims), we are ready to perform topic modelling.
First we import the required modules.
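A minimal sketch of these imports, assuming pandas handles the data and gensim the modelling:

```python
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
```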
Next we extract the documents that we wish to use in our model. Here I am using all rows/documents, since I only have 30K and the resulting training takes only 3 minutes. For a larger dataset I would take a sample.
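For example, assuming the tokenised claims live in a tokens column (a hypothetical name) of the dataframe df:

```python
# Use every row; for a larger dataset replace this with e.g. df.sample(10_000).
docs = df["tokens"].tolist()  # each document is a list of tokens/n-grams
```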
Before training a topic model we need to perform some additional processing:
- create a dictionary that maps each token to an integer id (you can look up a token in the dictionary collection by accessing one of its keys), and
- convert each document into a bag-of-words representation; together these form the corpus.
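A sketch of this processing with gensim, where the filter_extremes thresholds are assumptions:

```python
# Map each token to an integer id.
dictionary = Dictionary(docs)

# Drop very rare and very common tokens (assumed thresholds).
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Convert each document into a bag-of-words: a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(doc) for doc in docs]
```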
The dictionary and corpus collections can be inspected using the usual Python code:
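For example:

```python
print(dictionary[0])    # the token stored under id 0
print(len(dictionary))  # vocabulary size
print(corpus[0][:10])   # first ten (token_id, count) pairs of the first document
```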
Latent Dirichlet Allocation (LDA) is a way of extracting latent, or hidden, topics from a set of documents. LDA assumes that documents are composed of a set of topics sampled from a probability distribution, and topics are composed of words sampled from a probability distribution. It is important to note that LDA does not choose the number of topics for you, nor does it label the topics; labelling topics is up to you based on the most frequent words in the topic.
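A sketch of training an LDA model with gensim; the hyperparameter values here (number of topics, passes, random seed) are assumptions:

```python
model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,    # assumed; remember that LDA will not choose this for you
    passes=5,         # number of passes over the corpus
    random_state=42,  # for reproducibility
)
```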
Using model.show_topics() we can see which tokens are used to determine the topics (note that the topics are identified by integers).
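For example:

```python
# Print the ten most probable words for every topic.
for topic_id, words in model.show_topics(num_topics=-1, num_words=10):
    print(topic_id, words)
```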
An LDA model's performance is typically measured by its coherence score.
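A sketch using gensim's CoherenceModel, assuming the c_v measure:

```python
coherence_model = CoherenceModel(
    model=model,
    texts=docs,
    dictionary=dictionary,
    coherence="c_v",  # assumed coherence measure
)
print(coherence_model.get_coherence())
```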
The above model has a score of ... but what does it mean? For this dataset we already have topics specified in the column category, so let's compare our generated topics against them. First we need to set the topic for each document (claim) in our dataset. To do this I use the following function, which selects the most likely topic for a given document:
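A sketch of such a function, assuming gensim's get_document_topics and that we store the prediction in a new column topic:

```python
def most_likely_topic(bow):
    """Return the id of the most probable topic for one bag-of-words document."""
    topic_probs = model.get_document_topics(bow)  # list of (topic_id, probability)
    return max(topic_probs, key=lambda tp: tp[1])[0]

df["topic"] = [most_likely_topic(bow) for bow in corpus]
```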
Now a crosstab of the given topic label (category) against the predicted topic number (topic) should give us a nice clean table with only one non-zero value in each row and in each column (i.e. a permutation of a diagonal matrix) ...
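For example:

```python
# Rows: the given category labels; columns: the predicted topic numbers.
pd.crosstab(df["category"], df["topic"])
```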
NOTE: When generating the steps for this practical I see now that I left in the 169 claims with a missing category. I don't have time to fix this, so ignore this row. Sorry.
Well ... this is disappointing ... it seems that the topic modelling was throwing everything into topic 8. Can we do better? In the words of Obama, "Yes We Can".
This is easily improved by tuning the model hyperparameters, in particular the number of topics and the number of training passes.
Since LDA does not determine the number of topics, we could follow our standard procedure: perform a parameter sweep and select the number of topics that maximises the coherence score (a sketch of such a sweep is shown below). We won't do this here since it would take too long (around 30 minutes), and we know that the number of topics should be 18.
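For reference, such a sweep could look like the following sketch; the topic range and coherence measure are assumptions:

```python
# Parameter sweep over the number of topics (slow: roughly 30 minutes).
scores = {}
for k in range(5, 30):
    m = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                 passes=5, random_state=42)
    cm = CoherenceModel(model=m, texts=docs, dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)  # number of topics with the highest coherence
```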
There is an excellent library for interactively visualising the output of an LDA model.
Import using
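Assuming the library is pyLDAvis, the usual choice for gensim models:

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # pyLDAvis.gensim in older versions

pyLDAvis.enable_notebook()  # when running inside a Jupyter notebook
```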
and create/open the visualisation using
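For example:

```python
vis = gensimvis.prepare(model, corpus, dictionary)
pyLDAvis.display(vis)
```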
For time reasons I won't cover this here, but see the resources at the start of this practical.