Task 5: Sample Dataset (~5%)

Create a section in your notebook called Sample Dataset.

The dataset is too large to analyse in full, so we will perform the topic modelling analysis on a single subreddit from the df_submissions dataframe. Pick one of the top 20 subreddits ranked by number of submissions. I picked TruthLeaks and suggest you use it too, but feel free to try another subreddit.
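One way to list the candidate subreddits is sketched below. It assumes df_submissions has a column named subreddit (as used in the filtering steps of this task); the small dataframe here is only a toy stand-in so the snippet runs on its own, and the name top_subreddits is illustrative.

```python
import pandas as pd

# Toy stand-in for df_submissions; in the notebook, use the real dataframe instead.
df_submissions = pd.DataFrame({"subreddit": ["a", "b", "a", "c", "a", "b"]})

# value_counts() sorts by count in descending order, so head(20)
# returns (up to) the 20 subreddits with the most submissions.
top_subreddits = df_submissions["subreddit"].value_counts().head(20)
print(top_subreddits)
```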

Once you have picked your subreddit, create a new dataframe called df from the df_submissions dataframe by applying the following steps:

keep only rows where subreddit matches the one you selected
drop rows where text is NA
drop rows where title==text
(We are going to compare topic modelling based on title against topic modelling based on text; rows where title==text would bias this comparison.)
drop rows where title=="[removed]" or text=="[removed]"
(These rows do not have a meaningful entry for text.)
drop duplicate rows based on subset ["title", "text"]
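
The steps above could be sketched in pandas roughly as follows. This is a minimal sketch, not the only valid solution: the toy df_submissions is a stand-in so the snippet runs on its own, the variable selected_subreddit is an illustrative name, and the final reset_index call is an optional extra that the task does not ask for.

```python
import pandas as pd

# Toy stand-in for df_submissions; in the notebook, use the real dataframe instead.
df_submissions = pd.DataFrame({
    "subreddit": ["TruthLeaks"] * 6 + ["other"],
    "title": ["A", "B", "C", "D", "E", "E", "F"],
    "text": ["body a", None, "C", "[removed]", "body e", "body e", "body f"],
})

selected_subreddit = "TruthLeaks"  # illustrative name; swap in your own choice

# Keep only rows from the chosen subreddit.
df = df_submissions[df_submissions["subreddit"] == selected_subreddit]

# Drop rows where text is NA.
df = df.dropna(subset=["text"])

# Drop rows where title and text are identical.
df = df[df["title"] != df["text"]]

# Drop rows where either title or text is "[removed]".
df = df[(df["title"] != "[removed]") & (df["text"] != "[removed]")]

# Drop duplicate rows, comparing only the title and text columns.
df = df.drop_duplicates(subset=["title", "text"])

# Optional (an assumption, not required by the task): tidy the index.
df = df.reset_index(drop=True)

print(df)
```

Printing df.shape after each step is a simple way to check how many rows each filter removes.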