Create a section in your notebook called Sample Dataset.
The dataset is too large to analyse fully, so we will perform the topic modelling analysis for a single subreddit in the df_submissions dataframe. Pick one of the top 20 subreddits ordered by number of submissions. I picked TruthLeaks and suggest you use this, but feel free to try another subreddit.
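One way to list the top 20 subreddits is sketched below. It assumes the subreddit name is stored in a column called subreddit, which is not stated above, so adjust the name to match your dataframe. The helper top_subreddits and the toy dataframe are hypothetical and only illustrate the call.

```python
import pandas as pd


def top_subreddits(df_submissions: pd.DataFrame, n: int = 20) -> pd.Series:
    """Return the n subreddits with the most submissions, largest first."""
    # Assumption: the subreddit name lives in a column called "subreddit".
    # value_counts() sorts by count in descending order by default.
    return df_submissions["subreddit"].value_counts().head(n)


# Toy stand-in for df_submissions, used only to demonstrate the helper.
toy = pd.DataFrame(
    {"subreddit": ["TruthLeaks", "TruthLeaks", "other", "TruthLeaks", "other", "misc"]}
)
top = top_subreddits(toy, n=2)
print(top)
```

In the notebook you would call top_subreddits(df_submissions) on the real dataframe and pick a name from the index of the result.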
Once you have picked your subreddit, create a new dataframe called df from the df_submissions dataframe by selecting all rows for that subreddit and then applying these steps:
drop rows where text is NA
drop rows where title==text
(Rows where title==text will bias a comparison of title against text.)
drop rows where title="[removed]" or text="[removed]"
(These rows do not have a meaningful entry for text.)
drop duplicate rows based on subset ["title", "text"]
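The filtering steps can be sketched as below. This is one reading of the list, not a definitive implementation: it treats "text is NA" and "title==text" as conditions for dropping rows, and it assumes columns named subreddit, title and text. The helper make_sample and the toy dataframe are hypothetical and exist only to show the steps on a small input.

```python
import pandas as pd


def make_sample(df_submissions: pd.DataFrame, subreddit: str = "TruthLeaks") -> pd.DataFrame:
    """Build the sample dataframe for one subreddit."""
    # Assumption: the subreddit name lives in a column called "subreddit".
    df = df_submissions[df_submissions["subreddit"] == subreddit]
    # Drop rows where text is NA.
    df = df.dropna(subset=["text"])
    # Drop rows where the title and the text are identical.
    df = df[df["title"] != df["text"]]
    # Drop rows where either field was removed.
    df = df[(df["title"] != "[removed]") & (df["text"] != "[removed]")]
    # Drop duplicate rows based on the (title, text) pair.
    df = df.drop_duplicates(subset=["title", "text"])
    return df


# Toy stand-in for df_submissions, used only to demonstrate the helper.
toy = pd.DataFrame(
    {
        "subreddit": ["TruthLeaks"] * 5 + ["other", "TruthLeaks"],
        "title": ["a", "b", "c", "d", "a", "e", "[removed]"],
        "text": ["body a", None, "c", "[removed]", "body a", "body e", "body f"],
    }
)
df = make_sample(toy)
print(df)
```

In the notebook you would call make_sample(df_submissions, "TruthLeaks") on the real dataframe, or pass the name of whichever subreddit you picked.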