This week's practical is based on the dataset compiled by the Washington Post of false claims made by Trump while in office. Of the 30,573 claims, 30,404 have been categorised into 18 categories (Immigration (3,225 claims), Foreign policy, Election, ..., Terrorism (72 claims)). Each claim has also been dated, tagged with a location (Campaign rally, Remarks, Interview, Twitter (of course), etc.) and a rationale for why it is false, and is cross-linked to related claims.
The typical use of this dataset is to build a classifier — using only the claim, location, and date columns plus the target category — that predicts a claim's category. Evaluation uses the macro-F1 metric, since this is an unbalanced multi-class (18-way) problem and we can assume the costs of Type I and Type II errors are the same.
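A minimal sketch of such a baseline, assuming the claim texts and category labels have been loaded into parallel lists (the toy examples below stand in for the real dataset); TF-IDF features feed a linear classifier, and macro-F1 averages the per-class F1 scores so small categories like Terrorism count as much as Immigration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the claim and target-category columns.
claims = [
    "We are building the wall faster than ever",
    "Millions of illegal votes were cast",
    "Our trade deal with China is the best in history",
    "The border wall is nearly finished",
    "The election was stolen by fraud",
    "China pays us billions in tariffs",
]
categories = ["Immigration", "Election", "Foreign policy",
              "Immigration", "Election", "Foreign policy"]

X_train, X_test, y_train, y_test = train_test_split(
    claims, categories, test_size=0.5, stratify=categories, random_state=0)

# TF-IDF over word uni- and bigrams, then a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# average="macro" gives each of the 18 classes equal weight,
# regardless of how many claims it contains.
macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
```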
However, I instead want to use a Latent Dirichlet Allocation (LDA) topic model to try to identify the topics of the claims, and to see how well the discovered topics match the assigned categories (spoiler: badly, but perhaps you will do a better job than I).

You might find the following useful:
Determining a Quote’s Source Using Scikit-Learn
This is a similar (and easier) problem of classifying tweets as from Bernie Sanders or Donald Trump, but it has some ideas that you should consider, such as:
n-grams
Understanding Fortnite's Reddit Community using Unsupervised Topic Modeling
pyLDAvis: Topic Modelling Exploration Tool That Every NLP Data Scientist Should Know