Preprocessing

Setup Project

As usual:

Dataframe after loading.
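If you are starting from scratch, a minimal loading step might look like the following sketch; the filename is a placeholder and the column names (claim, category) are assumptions based on how they are used later in this practical.

import pandas as pd

# Load the claims dataset (filename is a placeholder)
df = pd.read_csv('claims.csv')
df.head()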

Examine the target topic

Using

df.category.value_counts(dropna=False)

you should see that there are 18 topics ranging from 3,225 claims on Immigration down to 72 claims on Terrorism. There are also 169 unallocated claims. We could give them a separate topic, say Unknown, but to keep things easy we will just drop them.
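Assuming the unallocated claims show up as missing values in the category column, they can be dropped with something like:

# Keep only claims that have a category assigned
df = df.dropna(subset=['category'])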

Cleaning text

We are going to use NLTK to perform the text processing, so first you need to install it. When using NLTK you will also be prompted to download additional resources as you work through the practical.

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stopwords_eng = stopwords.words('english')
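If you prefer to fetch the NLTK resources up front rather than waiting for the prompts, something like the following should cover what is used below (exact resource names can vary slightly between NLTK versions, and the prompts will tell you if anything else is needed):

nltk.download('punkt')      # tokeniser models used by word_tokenize
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # WordNet data used by the lemmatizer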

The column claim contains the source text that we need to clean. This is a standard task, so I would typically wrap the steps in a function called clean_text with the following outline:

def clean_text(text, lower=True, min_token_len=3):
    """
    Given a string, separate into tokens and clean. Return list of tokens.
    """

    text = text.lower() if lower else text

    # TODO
    # tokenize text
    # drop stopwords and punctuation (see string.punctuation)
    # lemmatize each token
    # drop tokens less than min_token_len

    return text
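One possible completion of this outline, assuming NLTK's word_tokenize for the tokenisation step (other tokenisers would work equally well):

def clean_text(text, lower=True, min_token_len=3):
    """
    Given a string, separate into tokens and clean. Return list of tokens.
    """
    text = text.lower() if lower else text

    # tokenize text
    tokens = nltk.word_tokenize(text)

    # drop stopwords and punctuation (see string.punctuation)
    tokens = [t for t in tokens if t not in stopwords_eng and t not in string.punctuation]

    # lemmatize each token
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # drop tokens less than min_token_len
    tokens = [t for t in tokens if len(t) >= min_token_len]

    return tokens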

Test this function using sample entries from the claim column. For example
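For instance, cleaning the first claim in the dataframe (the exact call used to produce the output below may differ):

clean_text(df.claim.iloc[0])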

Output of clean_text on row[0] data.

Next we create a column in the dataframe to contain the tokens (this takes about 18s)

df['tokens'] = df.claim.apply(clean_text)

Generating n-grams

In natural language processing, an n-gram is a contiguous sequence of n items from a given sample of text, where the items can be characters or words and n is a positive integer (1, 2, 3, and so on).
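As a quick illustration in plain Python (not gensim), word-level bigrams can be formed by pairing each token with its successor:

tokens = ['the', 'claim', 'was', 'fact', 'checked']
list(zip(tokens, tokens[1:]))
# [('the', 'claim'), ('claim', 'was'), ('was', 'fact'), ('fact', 'checked')]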

We could generate n-grams with NLTK, but since we are going to use a separate library, gensim, for topic modelling, we will use gensim to generate the n-grams as well.

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

import gensim
from gensim.models import Phrases

First we select the (cleaned/tokenised) documents that we wish to process

docs = df['tokens']

Next we generate the bigrams (pairs of tokens that frequently occur together) using

bigram = Phrases(docs, min_count=10)

Using bigram.export_phrases() we can see the generated bigrams.

Generated bigrams.
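For example, to peek at a handful of the detected phrases and their scores (with gensim 4.x, export_phrases() returns a dict mapping each phrase to its score; older gensim versions have a different signature):

# Print a few detected bigrams and their scores
for phrase, score in list(bigram.export_phrases().items())[:10]:
    print(phrase, score)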

The trigrams are generated from the bigrams using

trigram = Phrases(bigram[docs])

Next, wrap the bigram and trigram models in a function such as

def ngrams(tokens):
    # return bigram[tokens]          # use this line instead to keep only bigrams
    return trigram[bigram[tokens]]   # apply the bigram model first, then the trigram model

and then create a new column in the dataframe using

df['ngrams'] = df.tokens.apply(ngrams)

To see the bi/tri-grams for a single document (claim), use the following (note that no trigrams were generated for this example).

Generated bi/tri-grams for a single document (claim).
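To pull out just the detected multi-word phrases for a single document, you can filter on the underscore delimiter that Phrases uses by default to join tokens (an assumption; change the character if you set a different delimiter):

doc = df.ngrams.iloc[0]
[t for t in doc if '_' in t]   # only the joined bi/tri-grams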