Task 4: Text Preprocessing (~15%)

Create a section in your notebook called Text Preprocessing.


We are going to perform some topic modelling, but before we can get to that we need to clean our text. As discussed in the topic modelling of Trump's false claims practical, I like to create a function that performs most of the text cleaning, and I typically have two functions (note the name change of the functions to make their use clearer):

Function generate_tokens(text, **kwargs)

In the practical I gave you the following template (note the name change):

def generate_tokens(text, lower=True, min_token_length=0):
    """
    Given a string, separate into tokens and clean. Return list of tokens.
    """

    # lower case
    text = text.lower() if lower else text 

    # TODO 
    # tokenize text
    # drop stop-words and punctuation (see string.punctuation)
    # lemmatize each token
    # drop tokens less than min_token_length

    return clean_text

and during the lab we implemented the following

import string
import nltk 

# stopwords
default_stopwords = nltk.corpus.stopwords.words('english')

# lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


def generate_tokens(text, lower=True, min_token_length=0):
    """
    Given a string, separate into tokens and clean. Return list of tokens.
    """

    # lower case
    text = text.lower() if lower else text 

    # tokenize text
    tokens = nltk.word_tokenize(text)

    # drop stop-words and punctuation (see string.punctuation)
    tokens = [t for t in tokens if t not in default_stopwords and t not in string.punctuation ]

    # lemmatize each token
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # drop tokens less than min_token_len
    if min_token_length>0:
        tokens = [t for t in tokens if len(t)>=min_token_length]

    return tokens

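Note: the stopword list, tokenizer, and lemmatizer above rely on NLTK data packages that are not always installed by default. If you hit a LookupError, something along the following lines should fetch them (the exact resource names can vary slightly between NLTK versions):

import nltk

# one-off downloads of the NLTK data used by generate_tokens
nltk.download('stopwords')   # stopword list used by nltk.corpus.stopwords
nltk.download('punkt')       # model used by nltk.word_tokenize
nltk.download('wordnet')     # data used by WordNetLemmatizer
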
If we define a small set of sample texts, such as

sample_text = [
    "The movie was crap ",
    "The book was kind of good.",
    "A really bad, horrible book. ๐Ÿคจ", 
    "Today sux", 
    "Today kinda sux! But I'll get by, lol ๐Ÿ˜ƒ",
    "sentiment analysis is shit. ",
    "sentiment analysis is the shit.",
    "I like to hate Michael Bay films, but I couldn't fault this one",
]

we can test this function as follows

for text in sample_text:
    tokens = generate_tokens(text, min_token_length=4)
    print(f"{text}\n\t{tokens}")

to get the output

The movie was crap 
    ['movie', 'crap']
The book was kind of good.
    ['book', 'kind', 'good']
A really bad, horrible book. 🤨
    ['really', 'horrible', 'book']
Today sux
    ['today']
Today kinda sux! But I'll get by, lol 😃
    ['today', 'kinda']
sentiment analysis is shit. 
    ['sentiment', 'analysis', 'shit']
sentiment analysis is the shit.
    ['sentiment', 'analysis', 'shit']
I like to hate Michael Bay films, but I couldn't fault this one
    ['like', 'hate', 'michael', 'film', 'could', 'fault']

The above implementation is OK, but it is still somewhat limited. You CAN use it for the rest of this assignment, but you might get better results if you implement the following version:

Optional — Improved function generate_tokens(text, **kwargs)

First we load and define alternative tokenizers, default stopwords, etc. — feel free to add more

import emoji, string, nltk

# define three possible tokenizers
tokenizer_1 = str.split
tokenizer_2 = nltk.word_tokenize
from nltk.tokenize import RegexpTokenizer
tokenizer_3 = RegexpTokenizer(r"\w+|\$[\d\.]+|http\S+").tokenize

# define two possible default_stopwords
default_stopwords_1 = nltk.corpus.stopwords.words('english')
import en_core_web_sm
nlp = en_core_web_sm.load()
default_stopwords_2 = nlp.Defaults.stop_words

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

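The emoji package and the spaCy model en_core_web_sm are extra dependencies; if they are missing from your environment, something like the following (run once in a notebook cell) should install them:

# one-off installs for the emoji package and the small English spaCy model
!pip install emoji spacy
!python -m spacy download en_core_web_sm
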
Now the shell of our improved generate_tokens looks like

def generate_tokens(text, lower=True, 
    tokenizer=None,
    default_stopwords=None, extra_stopwords=None, 
    lemmatizer=None, stemmer=None,
    min_token_length=0):

    """
    Given a string, separate into tokens and clean.
    """

    # lower case
    text = text.lower() if lower else text 

    # remove emoji 
    text = emoji.replace_emoji(text, replace='')

    # tokenize text (always needed)
    assert tokenizer is not None, "Need to specify a tokenizer to split text into tokens"
    tokens = tokenizer(text)

    # TODO drop default_stopwords and extra_stopwords (if parameters are provided) and punctuation

    # TODO lemmatize each token (only if lemmatizer parameter is provided)

    # TODO stem each token (only if stemmer parameter is provided)

    if min_token_length>0:
        tokens = [t for t in tokens if len(t)>=min_token_length]

    return tokens

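For reference, one possible way to fill in the three TODO steps is sketched below; this is only a reasonable sketch, not necessarily the exact code used to produce the outputs that follow. These lines go inside generate_tokens in place of the TODO comments:

    # combine the two stopword lists (either or both may be None)
    stopwords = set(default_stopwords or []) | set(extra_stopwords or [])

    # drop stopwords and punctuation
    tokens = [t for t in tokens if t not in stopwords and t not in string.punctuation]

    # lemmatize each token (only if a lemmatizer is provided)
    if lemmatizer is not None:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # stem each token (only if a stemmer is provided)
    if stemmer is not None:
        tokens = [stemmer.stem(t) for t in tokens]
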
The following examples show my implementation of this function with various keyword arguments.

Example: uses str.split (tokenizer_1), with no stopword filtering and no stemming/lemmatization

for text in sample_text:
    tokens = generate_tokens(text, tokenizer=tokenizer_1)
    print(f"{text}\n\t{tokens}")

has output

The movie was crap 
    ['the', 'movie', 'was', 'crap']
The book was kind of good.
    ['the', 'book', 'was', 'kind', 'of', 'good.']
A really bad, horrible book. 🤨
    ['a', 'really', 'bad,', 'horrible', 'book.']
Today sux
    ['today', 'sux']
Today kinda sux! But I'll get by, lol 😃
    ['today', 'kinda', 'sux!', 'but', "i'll", 'get', 'by,', 'lol']
sentiment analysis is shit. 
    ['sentiment', 'analysis', 'is', 'shit.']
sentiment analysis is the shit.
    ['sentiment', 'analysis', 'is', 'the', 'shit.']
I like to hate Michael Bay films, but I couldn't fault this one
    ['i', 'like', 'to', 'hate', 'michael', 'bay', 'films,', 'but', 'i', "couldn't", 'fault', 'this', 'one']

Example: uses str.split (tokenizer_1), filters stopwords, and no stemming/lemmatization

for text in sample_text:
    tokens = generate_tokens(text, tokenizer=tokenizer_1, default_stopwords=default_stopwords_1)
    print(f"{text}\n\t{tokens}")

has output

The movie was crap 
    ['movie', 'crap']
The book was kind of good.
    ['book', 'kind', 'good.']
A really bad, horrible book. 🤨
    ['really', 'bad,', 'horrible', 'book.']
Today sux
    ['today', 'sux']
Today kinda sux! But I'll get by, lol 😃
    ['today', 'kinda', 'sux!', "i'll", 'get', 'by,', 'lol']
sentiment analysis is shit. 
    ['sentiment', 'analysis', 'shit.']
sentiment analysis is the shit.
    ['sentiment', 'analysis', 'shit.']
I like to hate Michael Bay films, but I couldn't fault this one
    ['like', 'hate', 'michael', 'bay', 'films,', 'fault', 'one']

Example: uses the nltk regex tokenizer (tokenizer_3), filters stopwords, and applies lemmatization

for text in sample_text:
    tokens = generate_tokens(text, tokenizer=tokenizer_3, default_stopwords=default_stopwords_1, lemmatizer=lemmatizer)
    print(f"{text}\n\t{tokens}")

has output

The movie was crap 
    ['movie', 'crap']
The book was kind of good.
    ['book', 'kind', 'good']
A really bad, horrible book. 🤨
    ['really', 'bad', 'horrible', 'book']
Today sux
    ['today', 'sux']
Today kinda sux! But I'll get by, lol 😃
    ['today', 'kinda', 'sux', 'get', 'lol']
sentiment analysis is shit. 
    ['sentiment', 'analysis', 'shit']
sentiment analysis is the shit.
    ['sentiment', 'analysis', 'shit']
I like to hate Michael Bay films, but I couldn't fault this one
    ['like', 'hate', 'michael', 'bay', 'film', 'fault', 'one']

TODO

OK, now we have a decent token generator function, but there are still possible improvements, such as:

Function generate_ngrams

This function is nearly identical (only the name has changed) to the one given in the Trump false claims practical.

import gensim
from gensim.models import Phrases

# create tokens of sample_text for testing
docs = [
    generate_tokens(doc, tokenizer=tokenizer_3, lemmatizer=lemmatizer, default_stopwords=default_stopwords_1, min_token_length=3)
    for doc in sample_text]

# generate bigrams and trigrams using tokens in docs
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

def generate_ngrams(tokens): 
    return trigram[bigram[tokens]]

ngrams = [generate_ngrams(doc) for doc in docs]

ngrams

which generates

[['movie', 'crap'],
 ['book', 'kind', 'good'],
 ['really', 'bad', 'horrible', 'book'],
 ['today', 'sux'],
 ['today', 'kinda', 'sux', 'get', 'lol'],
 ['sentiment', 'analysis', 'shit'],
 ['sentiment', 'analysis', 'shit'],
 ['like', 'hate', 'michael', 'bay', 'film', 'fault', 'one']]

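With min_count=10 on such a tiny sample, no pair of tokens occurs often enough to be merged, so the n-grams above are identical to the input tokens. On a real corpus, frequently co-occurring tokens are joined with an underscore; purely as an illustration, more permissive (hypothetical) settings such as the following may merge pairs like 'sentiment' and 'analysis' into 'sentiment_analysis':

# illustrative only: lower min_count/threshold so phrases can form on a tiny sample
bigram_demo = Phrases(docs, min_count=1, threshold=1)
trigram_demo = Phrases(bigram_demo[docs], min_count=1, threshold=1)

[trigram_demo[bigram_demo[doc]] for doc in docs]
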
Now, if we (you)

then the