Base Learners/Estimators
First I load a selection of heterogeneous learners. There are two reasons for this:
- Performance of heterogeneous learners can give insights into the data.
- Stacked learners are typically built from heterogeneous learners.
So I imported the following:
| from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
|
but ended up not using SVC because I am impatient and it took too long to train. Then I created a dictionary of the models:
| models = {
"LR": LogisticRegression(max_iter=1000),
"DT": DecisionTreeClassifier(),
"KNN": KNeighborsClassifier(),
"RF": RandomForestClassifier(),
"ET": ExtraTreesClassifier(),
"XGB": XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=SEED)
}
|
and imported the usual functions to score and report on performance:
| from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
|
So now I run over each of the individual learners to see how they perform:
| cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)
for name, model in models.items():
    # default scoring for classifiers is accuracy
    scores = cross_val_score(model, df_train, y_train, cv=cv)
    print(f"{name}: {scores.mean():.3f}")
|
Surprisingly, LogisticRegression is the best-performing learner with an accuracy of 79.3%, followed by XGBoost at 78.8% and then RandomForest at 77.5%.
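Before moving on, it's worth putting the reporting functions imported earlier (classification_report, confusion_matrix) to use, since cross_val_score only gives a single number per model. Here is a minimal sketch of how one could inspect per-class performance; the held-out validation split is my addition, not part of the CV loop above:
| from sklearn.model_selection import train_test_split

# Hypothetical holdout split to inspect per-class performance of the
# best learner so far; assumes df_train, y_train and SEED from above.
X_tr, X_val, y_tr, y_val = train_test_split(
    df_train, y_train, test_size=0.2, stratify=y_train, random_state=SEED
)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = lr.predict(X_val)
print(classification_report(y_val, pred))
print(confusion_matrix(y_val, pred))
|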
Next we will build a stacked model and see if we can improve on this.
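As a preview, here is one way such a stack could be put together with scikit-learn's StackingClassifier, reusing the models dictionary and cv splitter from above. This is only a sketch: the choice of LogisticRegression as the meta-learner (final_estimator) is my assumption.
| from sklearn.ensemble import StackingClassifier

# Sketch: stack the heterogeneous base learners and let a
# LogisticRegression meta-learner combine their predictions.
stack = StackingClassifier(
    estimators=list(models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=cv,        # out-of-fold predictions feed the meta-learner
    n_jobs=-1,
)
scores = cross_val_score(stack, df_train, y_train, cv=cv)
print("Stack:", scores.mean())
|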