Temporal View

Task: Given a list of topics and the dates on which they were posted, generate a visualisation that illustrates the relative prevalence of each topic over time.

The result should look something like the following. Notice the rise of the coronavirus topic, followed later by the election.

Topic frequency over time

Hints (takes about 10 lines):

Scroll down for solution

Solution

Step 1:

Group the data by month and category. Grouping by month is the harder part, since the date column holds individual days. However, pandas has functions for dealing with dates: to group the date column by month we use pd.Grouper(key='date', freq='1M'). Note that from pandas 2.2 onwards the month-end alias 'M' is deprecated in favour of 'ME'.

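import pandas as pd

# Group by calendar month (month-end bins) and by category, then count the posts in each group
columns = [pd.Grouper(key='date', freq='1M'), "category"]
df_tmp = df.groupby(columns)["id"].count().reset_index()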

df_tmp

which results in a dataframe containing the count of each category by month:

Count of each category by month.
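
To make the grouping concrete, here is a minimal sketch on a toy dataframe. The data, the topic names and the variable name counts are hypothetical and only the column names (id, date, category) follow the solution above. The sketch uses freq="MS" (month start) instead of '1M', on the assumption that you want an alias that does not trigger a deprecation warning on newer pandas versions; the only visible difference is that each bucket is labelled with the first day of the month rather than the last.

```python
import pandas as pd

# Hypothetical toy data: one row per post, with an id, a posting date and a topic.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "date": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-01-25",
        "2020-02-03", "2020-02-14", "2020-02-28",
    ]),
    "category": ["health", "health", "politics",
                 "health", "politics", "politics"],
})

# "MS" buckets the dates by month and labels each bucket with the month's first day.
columns = [pd.Grouper(key="date", freq="MS"), "category"]
counts = df.groupby(columns)["id"].count().reset_index()

# One row per (month, category) pair that occurs in the data.
# The "id" column now holds the counts: 2, 1, 1, 2.
print(counts)
```

Only combinations that actually occur get a row, which is why the pivot in the next step may have to fill gaps.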

Step 2:

Pivot the column category so that each category label is a separate column:

df_tmp = df_tmp.pivot(index='date',columns="category", values="id").fillna(0)

df_tmp

which results in the following cross table. Note that missing values are filled with zero.

Cross table of date (month) vs category.
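
The effect of fillna(0) is easiest to see when a month is missing a category. The following is a small sketch with hypothetical counts (the variable names counts and wide are illustrative, not part of the solution):

```python
import pandas as pd

# Hypothetical long-format counts: "politics" has no row for February.
counts = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-01-31", "2020-02-29"]),
    "category": ["health", "politics", "health"],
    "id": [2, 1, 3],
})

# One row per month, one column per category; absent pairs become NaN, then 0.
wide = counts.pivot(index="date", columns="category", values="id").fillna(0)
print(wide)
```

Without the fill, the February entry for politics would stay NaN and propagate through the cumulative sum in Step 4, leaving gaps in the plot.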

Step 3:

We want to divide each row by its row total, so that every entry becomes that topic's share of the month's posts:

df_tmp = df_tmp.div(df_tmp.sum(axis=1), axis=0)

df_tmp

so we have

Cross table of data with rows divided by row total.
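
The axis=0 argument is what makes the division line up the row totals with the rows. A minimal sketch with hypothetical numbers (wide, row_totals and shares are illustrative names):

```python
import pandas as pd

# Hypothetical monthly counts per topic.
wide = pd.DataFrame(
    {"health": [2.0, 3.0], "politics": [1.0, 0.0], "sport": [1.0, 2.0]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29"]),
)

row_totals = wide.sum(axis=1)          # one total per month: 4.0 and 5.0
shares = wide.div(row_totals, axis=0)  # axis=0 matches the totals on the row index

# Every row of shares now sums to 1.
print(shares)
```

A plain wide / row_totals would instead try to match the totals against the column labels, which is not what we want here.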

Step 4:

We want to calculate the cumulative sum along each row:

df_tmp = df_tmp.cumsum(axis=1)

df_tmp

so we have

Cross table of data with cumulative sum along each row.
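
As a quick sanity check of what the cumulative sum produces, here is a sketch on hypothetical shares (the names shares and cumulative are illustrative). Each column becomes the upper edge of that topic's band, and the last column is 1 in every row:

```python
import pandas as pd

# Hypothetical row-normalised shares; each row sums to 1.
shares = pd.DataFrame(
    {"health": [0.5, 0.6], "politics": [0.25, 0.0], "sport": [0.25, 0.4]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29"]),
)

# Running total from left to right within each row.
cumulative = shares.cumsum(axis=1)
print(cumulative)
```

Because the values never decrease from left to right, each column's area covers the areas of all the columns before it.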

Step 5:

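Finally, generate the plot. Because the values are cumulative, the last column is the tallest in every row. We therefore draw the topics in reverse order, so that each smaller area is painted over the larger ones and every topic's band stays visible.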

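import matplotlib.pyplot as plt

fig = plt.figure(figsize=(14,8))
# Draw the largest cumulative area first so that the smaller ones are painted on top of it
for c in df_tmp.columns[::-1]:
    df_tmp[c].plot.area(label=c)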
plt.legend(bbox_to_anchor=(1, 1))

plt.title("Topic frequency over time")
plt.show()
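
For comparison, here is a shorter alternative sketch that is not the solution above. It assumes you are happy to let pandas do the counting, normalising and stacking: pd.crosstab with normalize="index" covers Steps 1 to 3, and DataFrame.plot.area with stacked=True replaces the manual cumulative sum and reverse drawing order. The toy data and the function names topic_shares and plot_topic_shares are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def topic_shares(df):
    """Return each category's share of posts per month (rows sum to 1)."""
    months = df["date"].dt.to_period("M")
    shares = pd.crosstab(months, df["category"], normalize="index")
    # Convert the monthly periods to timestamps (first day of each month).
    shares.index = shares.index.to_timestamp()
    return shares


def plot_topic_shares(shares):
    """Draw a stacked area chart of the shares and return the axes."""
    ax = shares.plot.area(stacked=True, figsize=(14, 8),
                          title="Topic frequency over time")
    ax.legend(bbox_to_anchor=(1, 1))
    return ax


# Hypothetical toy data with the same columns as in the solution.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "date": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-01-25",
        "2020-02-03", "2020-02-14", "2020-02-28",
    ]),
    "category": ["health", "health", "politics",
                 "health", "politics", "politics"],
})

shares = topic_shares(df)
ax = plot_topic_shares(shares)
# Call plt.show() here when running as a script.
```

The manual version in Steps 4 and 5 is still worth knowing, since it shows what a stacked area chart actually computes.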