Preprocessing

To keep the process flexible across all datasets, we do only minimal preprocessing: fill missing values (NA) with zero, and convert multi-class targets to binary for XGBoost.

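print(df.shape)  # shape before filling missing values
df.fillna(0, inplace=True)  # replace every NA with 0, in place
print(df.shape)  # unchanged: fillna does not drop rows or columns
X, y = df[attribute_names].values, df[target].values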
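
The multi-class-to-binary conversion mentioned above is not shown in this section. Here is a minimal sketch of one way to do it, assuming a one-vs-rest rule in which the most frequent class becomes 1 and every other class becomes 0. The helper name `to_binary` and that choice of positive class are illustrative assumptions, not necessarily the rule used here:

```python
from collections import Counter


def to_binary(labels, positive=None):
    """Map multi-class labels to 0/1 (hypothetical helper, one-vs-rest)."""
    labels = list(labels)
    if positive is None:
        # Assumption: treat the most frequent class as the positive class.
        positive = Counter(labels).most_common(1)[0][0]
    return [1 if label == positive else 0 for label in labels]


example = ['3', '7', '3', '1', '3']
print(to_binary(example))                # '3' is the most frequent class
print(to_binary(example, positive='7'))  # choose the positive class explicitly
```

The helper returns a plain list; wrapping the result in `np.asarray(...)` would give an array like the `y` built above.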

Split the dataset into training, validation, and test sets (60/20/20):

# Hold out 20% of the data as the test set.
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=.2, random_state=SEED)

# Take 25% of the remaining 80% (20% of the total) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=.25, random_state=SEED)

X_train.shape, X_val.shape, X_test.shape
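
The two fractions compound: 25% of the 80% left after the first split is 20% of the whole, leaving 60% for training. The arithmetic can be sketched in plain Python. The function `split_sizes` is a hypothetical helper, and it assumes a fractional held-out size is rounded up, which is how scikit-learn appears to treat a float `test_size`:

```python
import math


def split_sizes(n_samples, test_size=0.2, val_size=0.25):
    """Return (n_train, n_val, n_test) for two successive splits."""
    # Assumption: a fractional held-out size is rounded up.
    n_test = math.ceil(test_size * n_samples)
    n_train_val = n_samples - n_test
    # The validation fraction applies to what is left, not to the total.
    n_val = math.ceil(val_size * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


print(split_sizes(70_000))  # 70,000 is the number of rows in mnist_784
```

For 70,000 rows this gives 42,000 training, 14,000 validation, and 14,000 test samples, which should match the first entries of the three shapes printed above.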

Visualize the first element in X_train:

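if name == 'mnist_784':
    # Each MNIST row is a flattened 28x28 grayscale image.
    plt.imshow(X_train[0].reshape(28, 28), cmap='Greys')
    plt.show()
else:
    # For other datasets, show the raw feature vector (display is a Jupyter/IPython built-in).
    display(X_train[0])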
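
The `reshape(28, 28)` call works because each mnist_784 row stores one 28x28 image as 784 consecutive values, and NumPy regroups them row by row by default. A small pure-Python sketch of that regrouping, with `to_rows` as a hypothetical helper and a list of integers standing in for real pixel values:

```python
def to_rows(flat, width):
    """Split a flat pixel list into rows of `width` values (row-major)."""
    if len(flat) % width != 0:
        raise ValueError("length must be a multiple of width")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


pixels = list(range(784))          # stand-in for one flattened image
image = to_rows(pixels, 28)
print(len(image), len(image[0]))   # 28 28
print(image[1][0])                 # 28: the first pixel of the second row
```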

And the first target:

y_train[0]