(Minimal Data Mining) Project Setup
There are many workflows and project structures you can use when analysing datasets. The following structure is easy to implement and helps avoid repeating computation unnecessarily, so we will use it throughout this module.
Use a separate folder/project for each dataset, with the following project structure:
```text
datasets
├── 01-Tips
│   ├── ...
└── 05-Churn
    ├── orig                         Location of data before any editing/cleaning
    ├── data                         Location of cleaned/processed/etc. data
    ├── output                       All plots, tables, etc. generated
    ├── Churn_-_01_-_Import.ipynb    Download, parse, and initial clean/standardisation of data
    ├── Churn_-_02_-_EDA.ipynb       Exploratory Data Analysis (EDA)
    ├── Churn_-_03_-_Baseline.ipynb  Build baseline model
    └── Churn_-_??_-_TASK.ipynb      Notebook to do ...
```
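If you prefer to scaffold this layout programmatically rather than by hand, here is a minimal sketch using only the standard library (the folder name `05-Churn` is just the example from the tree above):

```python
from pathlib import Path

# Hypothetical scaffolding snippet; "05-Churn" is the example dataset above
dataset_dir = Path("datasets") / "05-Churn"

# Create the three per-dataset folders used throughout this module
for sub in ("orig", "data", "output"):
    (dataset_dir / sub).mkdir(parents=True, exist_ok=True)
```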
Each notebook uses Markdown cells for sectioning, and its first cell contains at least the following code:
```python
# Standard imports used in every notebook of this module
import sys, os, yaml
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

# Plot and display defaults
sns.set_style("darkgrid")
pd.set_option('display.max_columns', None)

# Per-dataset configuration
DATASET = "Churn"

# Paths - when running on Google Colab, point ROOT at the mounted Drive folder
ROOT = "./"
COLAB = 'google.colab' in sys.modules
if COLAB:
    ROOT = f"/content/gdrive/MyDrive/datasets/{DATASET.replace(' ','_')}/"

DEBUG = False   # toggle extra diagnostic output
SEED = 666      # fixed seed for reproducibility
```
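With these constants in place, later cells can build every path from `ROOT`, so the same notebook runs unchanged locally and on Colab. A sketch of typical usage follows; the file names `churn.csv` and `example.png` are purely illustrative:

```python
# Hypothetical later cell - file names here are illustrative only
df = pd.read_csv(ROOT + "data/churn.csv")
display(df.head())

# Fix NumPy's global seed so any sampling/shuffling below is reproducible
np.random.seed(SEED)

# Save generated figures into the output folder
fig, ax = plt.subplots()
df.head().plot(ax=ax)  # assumes the dataset has numeric columns
fig.savefig(os.path.join(ROOT, "output", "example.png"))
```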