(Minimal Data Mining) Project Setup

There are many workflows/project structures you can use when analysing datasets. The following structure is easy to implement and helps avoid repeating unneeded computation, so we will use it throughout this module.

Use a separate folder/project for each dataset, and use the following project structure:

datasets
    ├── 01-Tips
    │   ├── ...
    └── 05-Churn
        ├── orig                            Location of data before any editing/cleaning
        ├── data                            Location of cleaned/processed/etc. data
        ├── output                          All plots, tables, etc. generated
        ├── Churn_-_01_-_Import.ipynb       Download, parse, and initial clean/standardisation of data
        ├── Churn_-_02_-_EDA.ipynb          Exploratory Data Analysis (EDA)
        ├── Churn_-_03_-_Baseline.ipynb     Build baseline model
        └── Churn_-_??_-_TASK.ipynb         Notebook to do ...
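
If you prefer to script this setup rather than create the folders by hand, a minimal sketch is given below. The helper name scaffold and the root folder name are hypothetical, not part of the module's required structure; only the standard library is used.

from pathlib import Path

def scaffold(dataset, root="datasets"):
    """Create the orig/data/output skeleton for a new dataset folder."""
    base = Path(root) / dataset
    for sub in ("orig", "data", "output"):
        # parents=True creates datasets/<dataset> as needed;
        # exist_ok=True makes the call safe to re-run
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base

scaffold("05-Churn")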

Each notebook uses Markdown cells for sectioning, and its first cell contains at least the following code:

import sys, os, yaml

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display, Markdown

sns.set_style("darkgrid")
pd.set_option('display.max_columns', None)   # show all columns when displaying dataframes

DATASET = "Churn"

# Resolve the project root: local folder by default, Google Drive when on Colab
ROOT = "./"
COLAB = 'google.colab' in sys.modules
if COLAB:
    from google.colab import drive
    drive.mount('/content/gdrive')           # make Drive available at the path below
    ROOT = f"/content/gdrive/MyDrive/datasets/{DATASET.replace(' ','_')}/"

DEBUG = False   # set True to enable extra diagnostic output
SEED = 666      # fixed seed so runs are reproducible
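
With this cell in place, later cells can build paths relative to ROOT so the same notebook runs unchanged locally and on Colab. A minimal sketch of typical follow-on usage (the filename churn.csv is a placeholder, not a file shipped with the module):

np.random.seed(SEED)                         # seed NumPy's global RNG for reproducibility

df = pd.read_csv(ROOT + "data/churn.csv")    # placeholder filename; use your cleaned file
if DEBUG:
    display(df.head())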