Task 4: Text Preprocessing (~15%)

Create a section in your notebook called Text Preprocessing.


We are going to perform some topic modelling, but before we can get to that we need to clean our text. As discussed in the topic modelling of Trump's false claims practical, I like to create a function that performs most of the text cleaning, and I typically have two functions (note the name change of the functions to make their use clearer):

Function generate_tokens(text, **kwargs)

In the practical I gave you the following template (note the name change):

def generate_tokens(text, lower=True, min_token_length=0):
    """
    Given a string, separate into tokens and clean. Return list of tokens.
    """

    # lower case
    text = text.lower() if lower else text 

    # TODO 
    # tokenize text
    # drop stop-words and punctuation (see string.punctuation)
    # lemmatize each token
    # drop tokens less than min_token_length

    return clean_text

and during the lab we implemented the following

import string
import nltk 

# stopwords
default_stopwords = nltk.corpus.stopwords.words('english')

# lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


def generate_tokens(text, lower=True, min_token_length=0):
    """
    Given a string, separate into tokens and clean. Return list of tokens.
    """

    # lower case
    text = text.lower() if lower else text 

    # tokenize text
    tokens = nltk.word_tokenize(text)

    # drop stop-words and punctuation (see string.punctuation)
    tokens = [t for t in tokens if t not in default_stopwords and t not in string.punctuation ]

    # lemmatize each token
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # drop tokens less than min_token_len
    if min_token_length>0:
        tokens = [t for t in tokens if len(t)>=min_token_length]

    return tokens

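Note: the stopword list, tokenizer, and lemmatizer above rely on NLTK data packages that are not always installed by default. If you hit a LookupError, something along the following lines should fetch them (the exact resource names can vary slightly between NLTK versions):

import nltk

# one-off downloads of the NLTK data used by generate_tokens
nltk.download('stopwords')   # stopword list used by nltk.corpus.stopwords
nltk.download('punkt')       # model used by nltk.word_tokenize
nltk.download('wordnet')     # data used by WordNetLemmatizer
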
If we define a small set of sample texts, such as

sample_text = [
    "The movie was crap ",
    "The book was kind of good.",
    "A really bad, horrible book. ๐Ÿคจ", 
    "Today sux", 
    "Today kinda sux! But I'll get by, lol ๐Ÿ˜ƒ",
    "sentiment analysis is shit. ",
    "sentiment analysis is the shit.",
    "I like to hate Michael Bay films, but I couldn't fault this one",
]

we can test this function as follows

for text in sample_text:
    tokens = generate_tokens(text, min_token_length=4)
    print(f"{text}\n\t{tokens}")

to get the output

The movie was crap 
    ['movie', 'crap']
The book was kind of good.
    ['book', 'kind', 'good']
A really bad, horrible book. 🤨
    ['really', 'horrible', 'book']
Today sux
    ['today']
Today kinda sux! But I'll get by, lol 😃
    ['today', 'kinda']
sentiment analysis is shit. 
    ['sentiment', 'analysis', 'shit']
sentiment analysis is the shit.
    ['sentiment', 'analysis', 'shit']
I like to hate Michael Bay films, but I couldn't fault this one
    ['like', 'hate', 'michael', 'film', 'could', 'fault']

The above implementation is OK, but it is still somewhat limited. You CAN use it for the rest of this assignment, but you might get better results if you implement the following version:

Optional — Improved function generate_tokens(text, **kwargs)

First we load and define alternative tokenizers, default stopwords, etc. — feel free to add more

import emoji, string, nltk

# define three possible tokenizers
tokenizer_1 = str.split
tokenizer_2 = nltk.word_tokenize
from nltk.tokenize import RegexpTokenizer
tokenizer_3 = RegexpTokenizer(r"\w+|\$[\d\.]+|http\S+").tokenize

# define two possible default_stopwords
default_stopwords_1 = nltk.corpus.stopwords.words('english')
import en_core_web_sm
nlp = en_core_web_sm.load()
default_stopwords_2 = nlp.Defaults.stop_words

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

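The emoji package and the spaCy model en_core_web_sm are extra dependencies; if they are missing from your environment, something like the following (run once in a notebook cell) should install them:

# one-off installs for the emoji package and the small English spaCy model
!pip install emoji spacy
!python -m spacy download en_core_web_sm
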
Now the shell of our improved generate_tokens looks like

def generate_tokens(text, lower=True, 
    tokenizer=None,
    default_stopwords=None, extra_stopwords=None, 
    lemmatizer=None, stemmer=None,
    min_token_length=0):

    """
    Given a string, separate into tokens and clean.
    """

    # lower case
    text = text.lower() if lower else text 

    # remove emoji 
    text = emoji.replace_emoji(text, replace='')

    # tokenize text (always needed)
    assert tokenizer is not None, "Need to specify a tokenizer to split text into tokens"
    tokens = tokenizer(text)

    # TODO drop default_stopwords and extra_stopwords (if parameters are provided) and punctuation

    # TODO lemmatize each token (only if lemmatizer parameter is provided)

    # TODO stem each token (only if stemmer parameter is provided)

    if min_token_length>0:
        tokens = [t for t in tokens if len(t)>=min_token_length]

    return tokens

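For reference, one possible way to fill in the three TODO steps is sketched below; this is only a reasonable sketch, not necessarily the exact code used to produce the outputs that follow. These lines go inside generate_tokens in place of the TODO comments:

    # combine the two stopword lists (either or both may be None)
    stopwords = set(default_stopwords or []) | set(extra_stopwords or [])

    # drop stopwords and punctuation
    tokens = [t for t in tokens if t not in stopwords and t not in string.punctuation]

    # lemmatize each token (only if a lemmatizer is provided)
    if lemmatizer is not None:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # stem each token (only if a stemmer is provided)
    if stemmer is not None:
        tokens = [stemmer.stem(t) for t in tokens]
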
The following examples show my implementation of this function with various keyword arguments.

Example: uses str.split (tokenizer_1), with no stopword filtering and no stemming/lemmatization

for text in sample_text:
    tokens = generate_tokens(text, tokenizer=tokenizer_1)
    print(f"{text}\n\t{tokens}")

has output

The movie was crap 
    ['the', 'movie', 'was', 'crap']
The book was kind of good.
    ['the', 'book', 'was', 'kind', 'of', 'good.']
A really bad, horrible book. 🤨
    ['a', 'really', 'bad,', 'horrible', 'book.']
Today sux
    ['today', 'sux']
Today kinda sux! But I'll get by, lol 😃
    ['today', 'kinda', 'sux!', 'but', "i'll", 'get', 'by,', 'lol']
sentiment analysis is shit. 
    ['sentiment', 'analysis', 'is', 'shit.']
sentiment analysis is the shit.
    ['sentiment', 'analysis', 'is', 'the', 'shit.']
I like to hate Michael Bay films, but I couldn't fault this one
    ['i', 'like', 'to', 'hate', 'michael', 'bay', 'films,', 'but', 'i', "couldn't", 'fault', 'this', 'one']

Example: uses str.split (tokenizer_1), filters stopwords, and no stemming/lemmatization

for text in sample_text:
    tokens = generate_tokens(text, tokenizer=tokenizer_1, default_stopwords=default_stopwords_1)
    print(f"{text}\n\t{tokens}")

has output

The movie was crap 
    ['movie', 'crap']
The book was kind of good.
    ['book', 'kind', 'good.']
A really bad, horrible book. 🤨
    ['really', 'bad,', 'horrible', 'book.']
Today sux
    ['today', 'sux']
Today kinda sux! But I'll get by, lol 😃
    ['today', 'kinda', 'sux!', "i'll", 'get', 'by,', 'lol']
sentiment analysis is shit. 
    ['sentiment', 'analysis', 'shit.']
sentiment analysis is the shit.
    ['sentiment', 'analysis', 'shit.']
I like to hate Michael Bay films, but I couldn't fault this one
    ['like', 'hate', 'michael', 'bay', 'films,', 'fault', 'one']

Example: uses the nltk regex tokenizer (tokenizer_3), filters stopwords, and applies lemmatization

for text in sample_text:
    tokens = generate_tokens(text, tokenizer=tokenizer_3, default_stopwords=default_stopwords_1, lemmatizer=lemmatizer)
    print(f"{text}\n\t{tokens}")

has output

The movie was crap 
    ['movie', 'crap']
The book was kind of good.
    ['book', 'kind', 'good']
A really bad, horrible book. 🤨
    ['really', 'bad', 'horrible', 'book']
Today sux
    ['today', 'sux']
Today kinda sux! But I'll get by, lol 😃
    ['today', 'kinda', 'sux', 'get', 'lol']
sentiment analysis is shit. 
    ['sentiment', 'analysis', 'shit']
sentiment analysis is the shit.
    ['sentiment', 'analysis', 'shit']
I like to hate Michael Bay films, but I couldn't fault this one
    ['like', 'hate', 'michael', 'bay', 'film', 'fault', 'one']

TODO

OK, now we have a decent token generator function, but there are still possible improvements, such as:

Function generate_ngrams

This function is nearly identical (only the name has changed) to the one given in the Trump false claims practical.

import gensim
from gensim.models import Phrases

# create tokens of sample_text for testing
docs = [
    generate_tokens(doc, tokenizer=tokenizer_3, lemmatizer=lemmatizer, default_stopwords=default_stopwords_1, min_token_length=3)
    for doc in sample_text]

# generate bigrams and trigrams using tokens in docs
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

def generate_ngrams(tokens): 
    return trigram[bigram[tokens]]

ngrams = [generate_ngrams(doc) for doc in docs]

ngrams

which generates

[['movie', 'crap'],
 ['book', 'kind', 'good'],
 ['really', 'bad', 'horrible', 'book'],
 ['today', 'sux'],
 ['today', 'kinda', 'sux', 'get', 'lol'],
 ['sentiment', 'analysis', 'shit'],
 ['sentiment', 'analysis', 'shit'],
 ['like', 'hate', 'michael', 'bay', 'film', 'fault', 'one']]

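With min_count=10 on such a tiny sample, no pair of tokens occurs often enough to be merged, so the n-grams above are identical to the input tokens. On a real corpus, frequently co-occurring tokens are joined with an underscore; purely as an illustration, more permissive (hypothetical) settings such as the following may merge pairs like 'sentiment' and 'analysis' into 'sentiment_analysis':

# illustrative only: lower min_count/threshold so phrases can form on a tiny sample
bigram_demo = Phrases(docs, min_count=1, threshold=1)
trigram_demo = Phrases(bigram_demo[docs], min_count=1, threshold=1)

[trigram_demo[bigram_demo[doc]] for doc in docs]
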
Now, if we (you)

then the