This week's practical is based on the dataset compiled by the Washington Post of false claims made by Trump while in office. Of the 30,573 claims, 30,404 have been categorised into 18 categories (Immigration (3,225 claims), Foreign policy, Election, ..., Terrorism (72 claims)). Each claim has also been dated, tagged with a location (Campaign rally, Remarks, Interview, Twitter (of course), etc.) and a rationale for why it is false, and is cross-linked to related claims.
The typical use of this dataset is to build a classifier — using only the claim, location, and date columns plus the target category — that predicts a claim's category. Evaluation uses the macro-F1 metric, since this is an unbalanced multi-class (18-way) problem and we can assume the costs of Type I and Type II errors are the same.
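A minimal sketch of such a baseline, assuming the claim texts and category labels have been loaded into parallel lists (the toy examples below stand in for the real dataset); TF-IDF features feed a linear classifier, and macro-F1 averages the per-class F1 scores so small categories like Terrorism count as much as Immigration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the claim and target-category columns.
claims = [
    "We are building the wall faster than ever",
    "Millions of illegal votes were cast",
    "Our trade deal with China is the best in history",
    "The border wall is nearly finished",
    "The election was stolen by fraud",
    "China pays us billions in tariffs",
]
categories = ["Immigration", "Election", "Foreign policy",
              "Immigration", "Election", "Foreign policy"]

X_train, X_test, y_train, y_test = train_test_split(
    claims, categories, test_size=0.5, stratify=categories, random_state=0)

# TF-IDF over word uni- and bigrams, then a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# average="macro" gives each of the 18 classes equal weight,
# regardless of how many claims it contains.
macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
```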
However, I instead want to use a Latent Dirichlet Allocation (LDA) topic model to try to identify the topics of the claims, and to see how well the discovered topics match the assigned categories (spoiler: badly, but perhaps you will do a better job than I).

You might find the following useful:
Determining a Quote’s Source Using Scikit-Learn
This is a similar (and easier) problem of classifying tweets as from Bernie Sanders or Donald Trump, but it has some ideas that you should consider, such as:
n-grams
Understanding Fortnite's Reddit Community using Unsupervised Topic Modeling
pyLDAvis: Topic Modelling Exploration Tool That Every NLP Data Scientist Should Know