Task 6: Topic Modelling (~40%)

Create a section in your notebook called Text Preprocessing.

Finally, we get to the topic modelling part. The dataframe df_submissions has two text columns, title and text. I want to see whether topic modelling based on the title column matches topic modelling based on the text column.

I'm making a big assumption here, namely: for any two rows, the titles should be in the same topic whenever the texts are in the same topic.

If this assumption is true, then after carrying out the following steps:

Using column title as initial documents:
- Generate tokens and store in column title_tokens.
- Generate ngrams, based on the tokens in title_tokens, and store in column title_ngrams.
- Generate an LDA model, title_model, from title_ngrams with num_topics=5.
- Generate topics from the model title_model and store in column title_topic.

Using column text as initial documents:
- Generate tokens and store in column text_tokens.
- Generate ngrams, based on the tokens in text_tokens, and store in column text_ngrams.
- Generate an LDA model, text_model, from text_ngrams with num_topics=5.
- Generate topics from the model text_model and store in column text_topic.

a crosstab of df.title_topic and df.text_topic should have a single large count in each row and in each column.
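As a toy illustration of what this means (made-up labels, not the real data), suppose the two models group the rows identically but number their topics differently. The small dataframe toy below is hypothetical and only stands in for df:

```python
import pandas as pd

# Made-up topic labels for six rows: the two models group the rows
# identically but use different topic numbers.
toy = pd.DataFrame({
    "title_topic": [0, 0, 1, 1, 2, 2],
    "text_topic":  [2, 2, 0, 0, 1, 1],
})

table = pd.crosstab(toy.title_topic, toy.text_topic)
print(table)

# Count the non-zero cells in every row and every column.
counts = table.values
nonzero_per_row = (counts > 0).sum(axis=1).tolist()
nonzero_per_col = (counts > 0).sum(axis=0).tolist()
print(nonzero_per_row, nonzero_per_col)
```

Here every row and every column has exactly one non-zero cell. With real data the other cells will not be zero, and the large counts need not sit on the diagonal, because the topic numbers assigned by each model are arbitrary.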

Result of crosstab of topics based on title and on text.

If we rearrange (permute) the rows/cols, we might get a better alignment between title_topic and text_topic. This is a simple optimisation problem, which is implemented in the following function:

import numpy as np
import pandas as pd

def align_topics(verbose=False):
    """Find the best relabelling (permutation) to maximise alignment of title_topic and text_topic.

    Uses the global dataframe df and assumes the crosstab is square.
    """

    from itertools import permutations

    # Rows are title topics, columns are text topics.
    data = pd.crosstab(df.title_topic, df.text_topic).values
    n = data.shape[0]

    # Brute-force search over all n! row orderings; the score is the diagonal sum.
    max_score = np.finfo(float).min
    max_permutation = None
    for permutation in permutations(range(n)):
        tmp = data[list(permutation)]
        score = tmp.trace()
        if score > max_score:
            max_score = score
            max_permutation = permutation

    if verbose:
        mapping = ", ".join([f"{k}->{v}" for k, v in enumerate(max_permutation)])
        # Note: data printed here is the crosstab before relabelling;
        # the relabelled table is the third value returned below.
        print(f"Max score {max_score} obtained with relabeling:\n\t {mapping}\nand resulting cross table of\n{data}")

    return max_score, max_permutation, data[list(max_permutation)]

score, permutation, data = align_topics(True)

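Running this with verbose=True we get the output below. Note that the table printed is the crosstab before relabelling; the relabelled table is the third value returned by the function.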

Max score 898 obtained with relabeling:
     0->0, 1->1, 2->4, 3->3, 4->2
and resulting cross table of
[[454 182 323   1   0]
 [178 229  41   4  38]
 [ 34  60  22   2   0]
 [  0  14   3  12   0]
 [  2   4 203   5  58]]
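
To check the reported score by hand, the sketch below reruns the same brute-force search on the table printed above, copied in as a plain list of lists. The function name best_row_order is hypothetical and no dataframe is needed.

```python
from itertools import permutations

# Crosstab copied from the output above (rows: title topics, columns: text topics).
table = [
    [454, 182, 323, 1, 0],
    [178, 229, 41, 4, 38],
    [34, 60, 22, 2, 0],
    [0, 14, 3, 12, 0],
    [2, 4, 203, 5, 58],
]

def best_row_order(rows):
    """Return (score, order) for the row order with the largest diagonal sum."""
    n = len(rows)
    best_score, best_order = -1, None
    for order in permutations(range(n)):
        # Diagonal sum after placing original row order[j] at position j.
        score = sum(rows[order[j]][j] for j in range(n))
        if score > best_score:
            best_score, best_order = score, order
    return best_score, best_order

score, order = best_row_order(table)
aligned = [table[k] for k in order]  # rows reordered by the winning permutation

print(score, order)
for row in aligned:
    print(row)
```

The table as printed has a diagonal sum of 775; swapping rows 2 and 4 raises it to 898. Brute force is fine for 5 topics (120 permutations), but the number of permutations grows factorially with num_topics.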

All of the code to train the models is given in the Trump false claims practical, but the following might be of use:

- To speed up model fitting, replace LdaModel with LdaMulticore. You then need to remove the alpha keyword argument if it is set to 'auto' (LdaMulticore does not support auto-tuning of alpha).
- To see the list of words used in each topic, run:

for idx, topic in model.print_topics(-1):
    print("Topic: {}\nWords: {}".format(idx, topic))

I would use the above lists of words to grow extra_stopwords, in the hope of improving the topic alignment.
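
One possible way to pick candidates, sketched below, is to look for words that appear among the top words of several topics, since such words do little to separate the topics. The helper shared_top_words and the toy word lists are hypothetical. The input is assumed to be one list of top words per topic, which you would need to extract from the model yourself (in gensim, show_topics with formatted=False should give word and weight pairs rather than a formatted string).

```python
from collections import Counter

def shared_top_words(topic_words, min_topics=3):
    """Return words that appear in at least min_topics of the topic word lists."""
    counts = Counter()
    for words in topic_words:
        counts.update(set(words))  # count each word once per topic
    return sorted(w for w, c in counts.items() if c >= min_topics)

# Toy word lists for three topics (made up for illustration).
toy_topics = [
    ["like", "game", "play", "would"],
    ["like", "work", "would", "job"],
    ["would", "like", "school", "friend"],
]

candidates = shared_top_words(toy_topics, min_topics=3)
print(candidates)  # ['like', 'would']
```

Candidates still need a manual check before being added to extra_stopwords, since a frequent word can also be a meaningful one.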

Objective

Assuming num_topics=5 and using the LDA model LdaMulticore (or LdaModel if your computer does not support LdaMulticore), optimise the topic alignment between title and text by:

- Modifying the text preprocessing steps (tokenising, etc.).
- Modifying default_stopwords and extending extra_stopwords.
- Modifying the parameters used to generate the ngrams.
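
The three levers above can be seen together in the following dependency-free sketch. It is not the practical's code: the function names, the regular expression, the tiny stopword sets and the min_count rule are all assumptions for illustration, and the bigram step is a plain frequency threshold rather than the scoring a library phrase model would use.

```python
import re
from collections import Counter

default_stopwords = {"the", "a", "to", "of", "and", "is", "in"}  # stand-in set
extra_stopwords = {"like", "would"}                              # grown by hand

def tokenise(text, stopwords, min_len=2):
    """Lower-case, keep alphabetic runs, drop short words and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in stopwords]

def add_bigrams(docs, min_count=2):
    """Join adjacent token pairs that occur at least min_count times overall."""
    pair_counts = Counter(p for doc in docs for p in zip(doc, doc[1:]))
    out = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second half of the joined pair
            else:
                merged.append(doc[i])
                i += 1
        out.append(merged)
    return out

texts = [
    "I would like to learn machine learning",
    "Machine learning is hard",
    "The job market in machine learning",
]
stopwords = default_stopwords | extra_stopwords
tokens = [tokenise(t, stopwords) for t in texts]
ngrams = add_bigrams(tokens, min_count=2)
print(ngrams)
```

Raising min_count yields fewer joined ngrams, while changing the stopword sets or min_len changes which tokens reach the model. After each change, both models need refitting and align_topics rerunning to see whether the score improves.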