Create a section in your notebook called Text Preprocessing.
We are going to perform some topic modelling, but before we can get to that we need to clean our text. As discussed in the topic modelling of Trump's false claims practical, I like to create a function that performs most of the text cleaning, and I typically have two functions (note the name changes, intended to make their use clearer):
generate_tokens(text, **kwargs)
Takes in raw text (and optional keyword arguments) and converts the raw text into a bag (list) of cleaned tokens. The steps are: cleaning, tokenising, stripping stop-words, and standardising (lemmatisation/stemming).
(Note in the practical I called this function clean_text.)
generate_ngrams(tokens, **kwargs)
Generates n-grams (usually bigrams or trigrams) from a bag (list) of tokens.
generate_tokens(text, **kwargs)

In the practical I gave you the template (note the name change).
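The original template listing is not reproduced here; a minimal sketch of the sort of skeleton intended (a docstring plus TODO placeholders for each cleaning step) might look like this:

```python
def generate_tokens(text, **kwargs):
    """Convert raw text into a bag (list) of cleaned tokens.

    Steps: clean the raw text, tokenise it, strip stop-words,
    and standardise the tokens (lemmatisation/stemming).
    """
    # TODO: clean the raw text (lowercase, strip punctuation, URLs, ...)
    # TODO: tokenise the cleaned text
    # TODO: remove stop-words
    # TODO: lemmatise or stem the remaining tokens
    tokens = []
    return tokens
```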
and during the lab we filled this template in with a working implementation.
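The exact code from the lab is not reproduced here; a sketch of a basic NLTK-based implementation of the same steps might look like the following (it assumes the punkt, stopwords and wordnet NLTK data have been downloaded):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def generate_tokens(text, **kwargs):
    """Convert raw text into a bag (list) of cleaned tokens."""
    min_length = kwargs.get("min_length", 2)

    # Clean: lowercase and keep only letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())

    # Tokenise the cleaned text.
    tokens = nltk.word_tokenize(text)

    # Strip stop-words and very short tokens.
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) >= min_length]

    # Standardise via lemmatisation.
    return [LEMMATIZER.lemmatize(t) for t in tokens]
```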
If we define a small set of sample texts, such as
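(The documents below are illustrative; the original sample texts are not shown.)

```python
sample_docs = [
    "The quick brown fox jumps over the lazy dog!",
    "Dogs and foxes are both members of the order Carnivora.",
    "I really, really like topic modelling :)",
]
```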
we can test this function and inspect the tokens it returns.
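A sketch of such a test, run over the illustrative sample documents above; the expected output is shown in the comments and will vary with your choice of stop-word list and lemmatiser:

```python
for doc in sample_docs:
    print(generate_tokens(doc))

# Roughly:
# ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
# ['dog', 'fox', 'member', 'order', 'carnivora']
# ['really', 'really', 'like', 'topic', 'modelling']
```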
The above implementation is OK, but it is still rather limited. You CAN use it for the rest of this assignment, but you might get better results if you implement the following version.
generate_tokens(text, **kwargs)

First we load and define alternative tokenizers, default stop-words, etc. (feel free to add more).
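A sketch of this setup step; the particular tokenizers, stop-word lists and stemmers chosen here are illustrative, not necessarily the ones used in the lab:

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, word_tokenize, wordpunct_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer

# Alternative tokenizers: the standard NLTK word tokenizer, a simple
# punctuation-splitting tokenizer, and one that copes better with
# social-media text.
TOKENIZERS = {
    "word": word_tokenize,
    "wordpunct": wordpunct_tokenize,
    "tweet": TweetTokenizer(preserve_case=False).tokenize,
}

# Default stop-word list, plus a place for extra domain-specific
# stop-words (e.g. words that dominate every topic in an LDA model).
DEFAULT_STOPWORDS = set(stopwords.words("english"))
EXTRA_STOPWORDS = set()

# Alternative standardisers: a lemmatizer and two stemmers.
LEMMATIZER = WordNetLemmatizer()
STEMMERS = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
}
```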
Now the shell of our improved generate_tokens looks something like the following.
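The original listing is not reproduced here; this sketch builds on the names defined above, and the argument names are illustrative:

```python
def generate_tokens(text,
                    tokenizer=word_tokenize,
                    stop_words=DEFAULT_STOPWORDS,
                    extra_stop_words=None,
                    standardiser=None,
                    min_length=2):
    """Convert raw text into a bag (list) of cleaned tokens.

    tokenizer        -- callable mapping a string to a list of tokens
    stop_words       -- default stop-word set to strip
    extra_stop_words -- additional, domain-specific stop-words
    standardiser     -- callable applied to each token (e.g. a lemmatizer's
                        lemmatize or a stemmer's stem method), or None
    min_length       -- drop tokens shorter than this
    """
    stop = set(stop_words) | set(extra_stop_words or [])

    # Clean: lowercase and keep only letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())

    # Tokenise with the chosen tokenizer.
    tokens = tokenizer(text)

    # Strip stop-words and short tokens.
    tokens = [t for t in tokens if t not in stop and len(t) >= min_length]

    # Standardise (lemmatise or stem) if a standardiser was supplied.
    if standardiser is not None:
        tokens = [standardiser(t) for t in tokens]

    return tokens
```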
The following examples show how the function behaves for various different keyword arguments. The original calls and their outputs are not reproduced here; the exact tokens returned depend on the tokenizers, stop-word lists and standardisers you load.
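For example (these calls assume the sketch versions of the setup and function above):

```python
# Default settings: word_tokenize, NLTK stop-words, no standardisation.
print(generate_tokens(sample_docs[0]))

# Stem with the Porter stemmer and drop tokens shorter than 3 characters.
print(generate_tokens(sample_docs[0],
                      standardiser=STEMMERS["porter"].stem,
                      min_length=3))

# Tweet-aware tokenizer plus an extra, domain-specific stop-word.
print(generate_tokens(sample_docs[2],
                      tokenizer=TOKENIZERS["tweet"],
                      extra_stop_words={"really"}))
```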
TODO
OK, now we have a decent token generator function, but there are still improvements that could be made.
generate_ngrams(tokens, **kwargs)

This function is almost exactly the same (only the name has changed) as the one given in the Trump false claims practical.
A sketch of such a function, together with the n-grams it generates for a short list of tokens, is shown below.
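This sketch uses nltk.util.ngrams and joins the words of each n-gram with underscores; the exact listing from the practical may differ:

```python
from nltk.util import ngrams


def generate_ngrams(tokens, n=2, sep="_"):
    """Generate n-grams (bigrams by default) from a bag (list) of tokens,
    joining the words of each n-gram with `sep`."""
    return [sep.join(gram) for gram in ngrams(tokens, n)]


tokens = ["quick", "brown", "fox", "jump", "lazy", "dog"]
print(generate_ngrams(tokens))
# ['quick_brown', 'brown_fox', 'fox_jump', 'jump_lazy', 'lazy_dog']
print(generate_ngrams(tokens, n=3))
# ['quick_brown_fox', 'brown_fox_jump', 'fox_jump_lazy', 'jump_lazy_dog']
```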
Now, if you:
Select a set of documents (here we are going to take rows from the df_submissions dataframe, using either the title or the text column).
Pick a suitable tokenizer, a suitable set of default stop-words, a suitable set of extra stop-words (you could pick these based on the output of the LDA model), and a lemmatizer or stemmer,
then generate_tokens can be used to generate the tokens from the raw text, and generate_ngrams can be used to generate the n-grams from those tokens.
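Putting the pieces together, a sketch of how this might look on the df_submissions dataframe (assuming it is already loaded and that you are working with the title column):

```python
# Tokens from the raw titles, then bigrams from those tokens.
df_submissions["tokens"] = df_submissions["title"].astype(str).apply(generate_tokens)
df_submissions["bigrams"] = df_submissions["tokens"].apply(generate_ngrams)
```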