Classifying Trump's false claims while in office

Outline

This week's practical is based on the dataset compiled by the Washington Post of the false claims made by Trump while in office. Of the 30,573 claims, 30,404 have been categorised using 18 categories (Immigration (3,225 claims), Foreign policy, Election, ..., Terrorism (72 claims)). Each claim has also been dated, tagged by location (Campaign rally, Remarks, Interview, Twitter (of course), etc.), given a rationale for its falsehood, and cross-linked to related claims.

The typical use of this dataset is to build a classifier that predicts the claim category, using only the columns claim, location, date and the target category. For evaluation the macro-F1 metric is used, since this is an unbalanced, multi-class (18 classes) problem and we can assume that the costs of Type I and Type II errors are the same.
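To see why macro-F1 suits an unbalanced problem, here is a minimal sketch in plain Python. The function name `macro_f1` and the toy labels are illustrative assumptions, not part of the dataset or of any library. Macro-F1 is the unweighted mean of the per-class F1 scores, so a rare class such as Terrorism counts as much as a common one such as Immigration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (made-up labels): a classifier that always predicts the majority class.
y_true = ["Immigration"] * 8 + ["Terrorism"] * 2
y_pred = ["Immigration"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case the accuracy looks respectable because the majority class dominates, while macro-F1 is pulled down by the zero F1 score on the minority class. In practice you would probably use scikit-learn's `f1_score` with `average="macro"`, which computes the same quantity.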

However, I instead want to use a Latent Dirichlet allocation (LDA) topic model to try to identify the topics of the claims. So we are going to:

Resources

You might find the following useful: