Base Learners/Estimators
First I load a selection of heterogeneous learners. There are two reasons for this:
- Performance of heterogeneous learners can give insights into the data.
- Stacked learners are typically built from heterogeneous learners.
So I imported the following:
| from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
|
but ended up not using SVC because I am impatient and it took too long to train. Then I created a dictionary of the models:
| models = {
"LR": LogisticRegression(max_iter=1000),
"DT": DecisionTreeClassifier(),
"KNN": KNeighborsClassifier(),
"RF": RandomForestClassifier(),
"ET": ExtraTreesClassifier(),
"XGB": XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=SEED)
}
|
and imported the usual functions to score and report on performance:
| from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
|
So now I run over each of the individual learners to see how they perform:
| cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)
for name, model in models.items():
    # default scoring for classifiers is accuracy
    scores = cross_val_score(model, df_train, y_train, cv=cv)
    print(f"{name}: {scores.mean():.3f}")
|
Surprisingly, LogisticRegression is the best-performing learner with an accuracy of 79.3%, followed by XGBoost at 78.8% and then RandomForest at 77.5%.
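Before moving on, it's worth putting the reporting functions imported earlier (classification_report, confusion_matrix) to use, since cross_val_score only gives a single number per model. Here is a minimal sketch of how one could inspect per-class performance; the held-out validation split is my addition, not part of the CV loop above:
| from sklearn.model_selection import train_test_split

# Hypothetical holdout split to inspect per-class performance of the
# best learner so far; assumes df_train, y_train and SEED from above.
X_tr, X_val, y_tr, y_val = train_test_split(
    df_train, y_train, test_size=0.2, stratify=y_train, random_state=SEED
)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = lr.predict(X_val)
print(classification_report(y_val, pred))
print(confusion_matrix(y_val, pred))
|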
Next we will build a stacked model and see if we can improve on this.
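As a preview, here is one way such a stack could be put together with scikit-learn's StackingClassifier, reusing the models dictionary and cv splitter from above. This is only a sketch: the choice of LogisticRegression as the meta-learner (final_estimator) is my assumption.
| from sklearn.ensemble import StackingClassifier

# Sketch: stack the heterogeneous base learners and let a
# LogisticRegression meta-learner combine their predictions.
stack = StackingClassifier(
    estimators=list(models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=cv,        # out-of-fold predictions feed the meta-learner
    n_jobs=-1,
)
scores = cross_val_score(stack, df_train, y_train, cv=cv)
print("Stack:", scores.mean())
|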