Task 6: Topic Modelling (~40%)

Create a section in your notebook called Text Preprocessing.


Finally we get to the topic modelling part. The dataframe df_submissions has two text columns, title and text. I want to see whether topic modelling based on the title column matches topic modelling based on the text column.

I'm making a big assumption here, namely: for any two rows, the titles should be in the same topic whenever the texts are in the same topic.

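If this assumption is true, then, up to a relabelling of the topics, the cross-tabulation of title topics against text topics should have most of its counts on the diagonal.

Figure: Result of crosstab of topics based on title and on text.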
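To make the assumption concrete, here is a minimal sketch on made-up labels (the `toy` dataframe and its values are hypothetical, not taken from df_submissions). When the text topics are just a relabelling of the title topics, every row of the cross table has a single non-zero entry, and a suitable reordering of the rows moves all counts onto the diagonal.

```python
import pandas as pd

# Hypothetical labels: text_topic is a pure relabelling of title_topic
# (0 -> 2, 1 -> 0, 2 -> 1), so the assumption holds exactly.
toy = pd.DataFrame({
    "title_topic": [0, 0, 1, 1, 2, 2, 2],
    "text_topic":  [2, 2, 0, 0, 1, 1, 1],
})

# Rows are title topics, columns are text topics.
table = pd.crosstab(toy.title_topic, toy.text_topic)
print(table)

# Each title topic maps to exactly one text topic.
nonzero_per_row = (table.values > 0).sum(axis=1)

# Reordering the rows puts every count on the diagonal.
aligned = table.values[[1, 2, 0]]
print(aligned.trace(), "of", len(toy), "rows on the diagonal")
```

With real topic models the off-diagonal counts will not vanish, so the trace of the best reordering is only a rough measure of how well the two models agree.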

If we rearrange (permute) the rows/cols we might get a better alignment between title_topic and text_topic. This is a simple optimisation problem, which is implemented in the following function:

from itertools import permutations

import numpy as np
import pandas as pd


def align_topics(verbose=False):
    """Find the relabelling (permutation) of title topics that maximises
    the alignment of title_topic and text_topic.
    """
    # Rows are title topics, columns are text topics.
    # df is the global dataframe holding both topic columns.
    data = pd.crosstab(df.title_topic, df.text_topic).values
    n = data.shape[0]

    # Brute force over all n! row orderings (120 for n = 5).
    max_score = np.finfo(float).min
    max_permutation = None
    for permutation in permutations(range(n)):
        tmp = data[list(permutation)]  # reorder the rows
        score = tmp.trace()  # counts that land on the diagonal
        if score > max_score:
            max_score = score
            max_permutation = permutation

    if verbose:
        mapping = ", ".join(f"{k}->{v}" for k, v in enumerate(max_permutation))
        # Note: data here is the original (unpermuted) cross table;
        # the aligned table is the third return value.
        print(f"Max score {max_score} obtained with relabeling:\n\t {mapping}\nand resulting cross table of\n{data}")

    return max_score, max_permutation, data[list(max_permutation)]

score, permutation, data = align_topics(True)

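Running this with verbose=True gives the following output. Note that the table printed is the original cross table, before the rows are reordered; the aligned table is the third return value.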

Max score 898 obtained with relabeling:
     0->0, 1->1, 2->4, 3->3, 4->2
and resulting cross table of
[[454 182 323   1   0]
 [178 229  41   4  38]
 [ 34  60  22   2   0]
 [  0  14   3  12   0]
 [  2   4 203   5  58]]
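The brute-force search is fine for five topics, but the number of permutations grows factorially. As a hedged alternative (not part of the original function), the same problem can be treated as a linear assignment problem and solved with `scipy.optimize.linear_sum_assignment`. The sketch below assumes SciPy 1.4 or later for the `maximize` argument and uses the cross table printed above as input.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Cross table copied from the verbose output above
# (rows: title topics, columns: text topics).
data = np.array([
    [454, 182, 323,   1,   0],
    [178, 229,  41,   4,  38],
    [ 34,  60,  22,   2,   0],
    [  0,  14,   3,  12,   0],
    [  2,   4, 203,   5,  58],
])

# Choose one row per column so that the selected counts sum to a maximum.
row_ind, col_ind = linear_sum_assignment(data, maximize=True)

# permutation[k] is the original row placed at position k,
# the same convention as max_permutation in align_topics.
permutation = np.empty(len(col_ind), dtype=int)
permutation[col_ind] = row_ind

aligned = data[permutation]
score = int(aligned.trace())
print(score, permutation.tolist())
```

On this table the sketch reproduces the score of 898 and the relabelling 0->0, 1->1, 2->4, 3->3, 4->2 reported above.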

All of the code to train the models is given in the Trump false claims practical, but the following might be of use:

for idx, topic in model.print_topics(-1):
    print("Topic: {}\nWords: {}".format(idx, topic))

Objective

Assuming num_topics=5 and using the LDA model LdaMulticore (or LdaModel if your computer does not support LdaMulticore), optimise the topic alignment between title and text by: