Temporal View

Task: Given a list of topics and the dates they were posted, generate a visualisation that illustrates the relative prevalence of each topic over time.

I want you to generate something like the following. Notice the rise of the coronavirus topic, followed later by the election.

Hints (takes about 10 lines):

Group dates by month (otherwise the data is too noisy).
Group by date and category, then reset_index.
Pivot and fillna.
Scale by row totals and generate the cumsum (along each row).
Plot using plot.area in reverse order.

Scroll down for solution

Solution

Step 1:

Group the data by month and category. Grouping by month is the harder part, since the date column has daily resolution. However, pandas has functions for dealing with dates: to group the date column by month we use pd.Grouper(key='date', freq='1M'). Note that from pandas 2.2 the 'M' alias is deprecated in favour of 'ME' (month end).

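import pandas as pd
# Group by month (month-end dates) and category, then count the posts in each group
columns = [pd.Grouper(key='date', freq='1M'),"category"]
df_tmp = df.groupby(columns)["id"].count().reset_index()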

df_tmp

which results in a dataframe containing the count of each category by month:
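
As a small, self-contained illustration of this step, the sketch below uses a hypothetical toy dataframe (posts, with made-up ids, dates and categories). It groups by monthly period via dt.to_period("M") rather than pd.Grouper, which sidesteps the frequency alias differences between pandas versions. Treat it as a variant of the call above, not the same thing.

```python
import pandas as pd

# Hypothetical toy data: one row per post (values are made up for illustration).
posts = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "date": pd.to_datetime([
        "2020-01-03", "2020-01-20", "2020-01-25", "2020-02-02", "2020-02-14",
    ]),
    "category": ["health", "health", "politics", "health", "politics"],
})

# Map each daily date to its calendar month, then count posts per (month, category).
month = posts["date"].dt.to_period("M").rename("month")
monthly_counts = posts.groupby([month, "category"])["id"].count().reset_index()

print(monthly_counts)
```

The id column now holds the number of posts per month and category, which is the long-format table that the pivot in the next step expects.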

Step 2:

Pivot the column category so that each category label is a separate column:

df_tmp = df_tmp.pivot(index='date',columns="category", values="id").fillna(0)

df_tmp

which results in the following cross table. Note the filling of missing values by zero.

Cross table of date (month) vs category.
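
To see why the fillna(0) matters, here is a minimal sketch on hypothetical counts (the counts dataframe and its values are made up). One month has no posts in one category, so the pivot leaves a gap there, which we then fill with zero.

```python
import pandas as pd

# Hypothetical long-format counts: "politics" has no row for February.
counts = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-01-31", "2020-02-29"]),
    "category": ["health", "politics", "health"],
    "id": [2, 1, 3],
})

# One row per month, one column per category; missing combinations become NaN.
wide_raw = counts.pivot(index="date", columns="category", values="id")

# Replace the gaps with 0 so that later sums and divisions behave as expected.
wide = wide_raw.fillna(0)

print(wide)
```

Without the fill, the NaN would carry through the cumulative sum in Step 4 and leave holes in the plot.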

Step 3:

We want to divide each row by its total, so that each entry becomes that category's share of the month's posts:

df_tmp = df_tmp.div(df_tmp.sum(axis=1), axis=0)

df_tmp

so we have

Cross table of data with rows divided by row total.
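
The axis arguments are easy to mix up here, so a tiny worked example may help. The wide dataframe below is hypothetical. sum(axis=1) gives one total per row, and div(..., axis=0) matches those totals to the rows by index label.

```python
import pandas as pd

# Hypothetical cross table of counts: rows are months, columns are categories.
wide = pd.DataFrame(
    {"health": [2.0, 3.0], "politics": [1.0, 0.0], "sport": [1.0, 1.0]},
    index=["2020-01", "2020-02"],
)

# One total per row (both months have 4 posts in this toy example).
row_totals = wide.sum(axis=1)

# Divide every row by its own total; axis=0 aligns the totals with the row index.
shares = wide.div(row_totals, axis=0)

print(shares)
```

After this step every row sums to 1, so the months are comparable even when the total number of posts differs.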

Step 4:

We want to calculate the cumulative sum along each row:

df_tmp = df_tmp.cumsum(axis=1)

df_tmp

so we have

Cross table of data with cumulative sum along each row.
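
Continuing with hypothetical shares (made-up values, not the real data), the sketch below shows what the cumulative sum does. Each column becomes the upper edge of its band, and the last column is always 1.

```python
import pandas as pd

# Hypothetical row-normalised shares; each row sums to 1.
shares = pd.DataFrame(
    {"health": [0.5, 0.75], "politics": [0.25, 0.0], "sport": [0.25, 0.25]},
    index=["2020-01", "2020-02"],
)

# Running total from left to right within each row.
cumulative = shares.cumsum(axis=1)

print(cumulative)
```

Because the values grow from left to right, an area drawn for a later column would cover every earlier one. That is why the plotting step draws the columns in reverse.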

Step 5:

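Finally, generate the plot. Since the values are cumulative, each column is at least as large as the one before it. We therefore draw the topics in reverse order, largest first, so that the smaller areas are painted on top and every topic remains visible.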

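import matplotlib.pyplot as plt

fig = plt.figure(figsize=(14,8))
# Draw the last (largest) cumulative column first so earlier ones end up on top
for c in df_tmp.columns[::-1]:
    df_tmp[c].plot.area(label=c)
plt.legend(bbox_to_anchor=(1, 1))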

plt.title("Topic frequency over time")
plt.show()
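
As an aside, a similar picture can be produced without the manual cumulative sum. DataFrame.plot.area stacks its columns by default, so it can be applied to the row-normalised shares from Step 3 directly. The sketch below uses hypothetical shares and is an alternative, not the solution above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical row-normalised shares (the output of Step 3, before the cumsum).
shares = pd.DataFrame(
    {"health": [0.5, 0.75, 0.4], "politics": [0.25, 0.0, 0.4], "sport": [0.25, 0.25, 0.2]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
)

# plot.area stacks the columns by default, so no cumsum or reverse loop is needed.
ax = shares.plot.area(figsize=(14, 8))
ax.legend(bbox_to_anchor=(1, 1))
ax.set_title("Topic frequency over time")

# Call plt.show() here when running as a script; notebooks render the figure inline.
```

The manual version is still worth working through, because it makes clear what a stacked area chart actually draws.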