Decision Trees and Random Forests

This notebook is a revised version of the outstanding lecture notebook by Professor Hug.

Imports

As with other notebooks, we will use the same set of standard imports.

In [1]:
import numpy as np
import pandas as pd
np.random.seed(23)
In [2]:
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

Loading the Data

For this notebook we will use the classic Iris dataset. The goal is to predict the species of iris based on measurements of the flower.

In [4]:
iris = datasets.load_iris()
column_names = [n.replace("(cm)", "").strip() for n in iris['feature_names']]
iris_data = pd.DataFrame(iris['data'], columns=column_names)
iris_data['species'] = iris['target_names'][iris['target']]
iris_data['target'] = iris['target']
iris_data
Out[4]:
sepal length sepal width petal length petal width species target
0 5.1 3.5 1.4 0.2 setosa 0
1 4.9 3.0 1.4 0.2 setosa 0
2 4.7 3.2 1.3 0.2 setosa 0
3 4.6 3.1 1.5 0.2 setosa 0
4 5.0 3.6 1.4 0.2 setosa 0
... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 2
146 6.3 2.5 5.0 1.9 virginica 2
147 6.5 3.0 5.2 2.0 virginica 2
148 6.2 3.4 5.4 2.3 virginica 2
149 5.9 3.0 5.1 1.8 virginica 2

150 rows × 6 columns

In [5]:
px.scatter(iris_data, x="petal length", y="petal width", color="species")

Notice that there are three classes of flower. This is not a binary classification problem but instead a multiclass classification problem. There are several simple extensions of the logistic regression model to support multiple classes.

Multiclass Logistic Regression

The logistic regression model can be extended to multiple classes using several techniques. Perhaps the simplest is the one-versus-rest approach where the multiclass prediction problem is divided into separate binary prediction problems.

In [6]:
from sklearn.linear_model import LogisticRegression

lr_setosa = LogisticRegression(solver='lbfgs')
lr_setosa.fit(iris_data[['petal length', 'petal width']], 
              iris_data['species'] == 'setosa')

lr_versicolor = LogisticRegression(solver='lbfgs')
lr_versicolor.fit(iris_data[['petal length', 'petal width']], 
                  iris_data['species'] == 'versicolor')

lr_virginica = LogisticRegression(solver='lbfgs')
lr_virginica.fit(iris_data[['petal length', 'petal width']], 
                iris_data['species'] == 'virginica');
In [7]:
def predict_class(X):
    most_likely_class = np.argmax(np.vstack([
        lr_setosa.predict_proba(X)[:,1],
        lr_versicolor.predict_proba(X)[:,1],
        lr_virginica.predict_proba(X)[:,1]
    ]), axis=0)
    return iris['target_names'][most_likely_class]
In [8]:
predict_class(iris_data[['petal length', 'petal width']])
Out[8]:
array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'virginica', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'virginica', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'setosa', 'versicolor', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'versicolor',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica'], dtype='<U10')

How accurate is the model?

In [9]:
Y = iris_data['species']
Y_hat = predict_class(iris_data[['petal length', 'petal width']]) 
accuracy = np.mean(Y == Y_hat)
print("Prediction Accuracy:", accuracy)
Prediction Accuracy: 0.9666666666666667

Scikit-learn has a built-in implementation of one-versus-rest.

In [10]:
lr_model = LogisticRegression(multi_class = 'ovr', solver='lbfgs')
lr_model.fit(iris_data[["petal length", "petal width"]], 
             iris_data["species"])
Out[10]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [11]:
Y_hat = lr_model.predict(iris_data[['petal length', 'petal width']])
accuracy = np.mean(Y == Y_hat)
print("Prediction Accuracy:", accuracy)
Prediction Accuracy: 0.9666666666666667
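
Another common extension is multinomial (softmax) regression, which fits all classes jointly instead of training separate binary models. Below is a minimal sketch using scikit-learn's multi_class='multinomial' option; the variable names are illustrative, and the resulting accuracy may differ slightly from the one-versus-rest model.

# Hedged sketch: softmax (multinomial) logistic regression instead of one-versus-rest.
lr_softmax = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lr_softmax.fit(iris_data[["petal length", "petal width"]], iris_data["species"])
Y_hat_softmax = lr_softmax.predict(iris_data[["petal length", "petal width"]])
print("Softmax Prediction Accuracy:", np.mean(Y == Y_hat_softmax))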

We can also visualize the predictions. The following code constructs a plot illustrating the decision we would make for each possible value of petal length and petal width. You don't need to understand the details of the code, but the basic idea is to evaluate the model on a grid (meshgrid) of feature values and color each grid point by the integer code of the predicted class.

In [12]:
def plot_decision_boundaries(model, X, n=50):
    # Evaluate the model on an n x n grid covering the (padded) feature space.
    u = np.linspace(X[:,0].min()-0.5, X[:,0].max()+0.5, n)
    v = np.linspace(X[:,1].min()-0.5, X[:,1].max()+0.5, n)
    us, vs = np.meshgrid(u, v)
    X_test = np.c_[us.ravel(), vs.ravel()]
    # Convert the string predictions to integer codes for the contour plot.
    z_str = model.predict(X_test)
    categories, z_int = np.unique(z_str, return_inverse=True)
    return go.Contour(x=X_test[:,0], y=X_test[:,1], z=z_int,
                      colorscale=px.colors.qualitative.Plotly[:3],
                      showscale=False)

In the following plot we can see the data points and the predicted class assignment for every combination of petal width and petal length.

In [13]:
fig = px.scatter(iris_data, x="petal length", y="petal width", color="species")
fig.update_traces(marker=dict(size=12, line=dict(width=2, color='black')),
                  selector=dict(mode='markers'))
fig.add_trace(
    plot_decision_boundaries(lr_model,
                             iris_data[["petal length", "petal width"]].to_numpy())
)

Decision Tree Classification

In lecture we introduced decision trees. In this part of the notebook, we walk through the construction of a decision tree using scikit-learn. The scikit-learn decision tree documentation gives a good overview and is worth skimming.

In [14]:
from sklearn import tree
dt_model = tree.DecisionTreeClassifier()
dt_model.fit(iris_data[["petal length", "petal width"]], 
             iris_data["species"])
Out[14]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

Notice that there are many hyperparameters we can configure for decision trees. The ones you will most often want to tune are max_depth and min_samples_split, both of which control overfitting: max_depth caps how many times the feature space can be divided (deeper trees are more prone to overfitting), and min_samples_split limits depth indirectly by refusing to split a node that has too few samples. A constrained tree is sketched below.
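
As a sketch of how these hyperparameters might be used (the values and variable names below are illustrative, not tuned), we can fit a depth-limited tree and compare cross-validated accuracy against an unrestricted tree:

# Hedged sketch: a depth-limited tree; max_depth=2 and min_samples_split=5 are illustrative values.
shallow_dt_model = tree.DecisionTreeClassifier(max_depth=2, min_samples_split=5)
X_2d = iris_data[["petal length", "petal width"]]
print("shallow tree CV accuracy:     ",
      cross_val_score(shallow_dt_model, X_2d, iris_data["species"], cv=5).mean())
print("unrestricted tree CV accuracy:",
      cross_val_score(tree.DecisionTreeClassifier(), X_2d, iris_data["species"], cv=5).mean())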

As with logistic regression we can make predictions using the predict function:

In [15]:
dt_model.predict(iris_data[["petal length", "petal width"]])
Out[15]:
array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'virginica', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica'], dtype=object)

One of the big advantages of decision trees is that, as long as they are not too deep, they remain interpretable while still fitting complex data. The following code creates a visualization of the tree:

In [16]:
import graphviz
dot_data = tree.export_graphviz(dt_model, out_file=None, 
                      feature_names=["petal length", "petal width"],  
                      class_names=["setosa", "versicolor", "virginica"],  
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)
#graph.render(format="png", filename="iris_tree")
graph
Out[16]:
[Graphviz rendering of the fitted tree. The root splits on petal length ≤ 2.45 (gini = 0.667, 150 samples, value = [50, 50, 50]); the left child is a pure setosa leaf with 50 samples, and the right subtree splits repeatedly on petal width and petal length until nearly every leaf is pure.]
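
If graphviz is not available, a plain-text rendering can also be produced; here is a sketch using tree.export_text (available in scikit-learn 0.21 and later):

# Hedged sketch: text rendering of the same tree (requires scikit-learn >= 0.21).
print(tree.export_text(dt_model, feature_names=["petal length", "petal width"]))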

Notice at the first level that if $\textbf{petal length} \leq 2.45$ then the iris is always a setosa. As we move further down the tree, we are able to split the data into leaf nodes containing just one type.

If you look carefully, you can find a leaf (the seventh from the left) whose value array has non-zero entries for multiple classes. Why didn't that leaf get divided further? Let's examine it more carefully. We can extract the tree and identify the data points that land in impure leaves:

In [17]:
t = dt_model.tree_
leaves = t.apply(iris_data[["petal length", "petal width"]].to_numpy().astype('float32'))
impure_ind = t.impurity[leaves] > 0
iris_data.loc[impure_ind, ["petal length", "petal width", "species"]]
Out[17]:
petal length petal width species
70 4.8 1.8 versicolor
126 4.8 1.8 virginica
138 4.8 1.8 virginica

Or, we can use the predict_proba function to return the class probabilities. Note that only points falling in impure leaves will have a maximum probability less than 1:

In [18]:
impure_ind = dt_model.predict_proba(iris_data[["petal length", "petal width"]]).max(axis=1) < 1.
iris_data.loc[impure_ind, ["petal length", "petal width", "species"]]
Out[18]:
petal length petal width species
70 4.8 1.8 versicolor
126 4.8 1.8 virginica
138 4.8 1.8 virginica

In either case we see that it would not be possible to divide the leaf further since all the flowers have the same features but different classes.

We can also visualize the decision surface.

In [19]:
fig = px.scatter(iris_data, x="petal length", y="petal width", color="species")
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.add_trace(
    plot_decision_boundaries(dt_model,
                             iris_data[["petal length", "petal width"]].to_numpy())
)

Notice that the decision boundaries are axis-aligned. Why is this? Recall that we are dividing on one dimension at each node in the tree.

We can also compute the final accuracy using scikit-learn.

In [20]:
from sklearn.metrics import accuracy_score
predictions = dt_model.predict(iris_data[["petal length", "petal width"]])
accuracy_score(predictions, iris_data["species"])
Out[20]:
0.9933333333333333

Overfitting

Let's examine overfitting with decision trees:

In [21]:
from sklearn.model_selection import train_test_split
train_iris_data, test_iris_data = train_test_split(iris_data, test_size=0.25, random_state=42)
In [22]:
#sort so that the color labels match what we had in the earlier part of lecture
train_iris_data = train_iris_data.sort_values(by="species")
test_iris_data = test_iris_data.sort_values(by="species")
In [23]:
from sklearn import tree
dt_model = tree.DecisionTreeClassifier()
dt_model.fit(train_iris_data[["petal length", "petal width"]], train_iris_data["species"])
Out[23]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [24]:
fig = px.scatter(train_iris_data, x="petal length", y="petal width", color="species")
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.add_trace(
    plot_decision_boundaries(dt_model,
                             iris_data[["petal length", "petal width"]].to_numpy())
)
fig.update_layout(title="Training Data")
In [25]:
fig = px.scatter(test_iris_data, x="petal length", y="petal width", color="species")
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.add_trace(
    plot_decision_boundaries(dt_model,
                             iris_data[["petal length", "petal width"]].to_numpy())
)
fig.update_layout(title="Test Data")
In [26]:
accuracy_score(dt_model.predict(train_iris_data[["petal length", "petal width"]]), train_iris_data["species"])
Out[26]:
0.9910714285714286
In [27]:
predictions = dt_model.predict(test_iris_data[["petal length", "petal width"]])
accuracy_score(predictions, test_iris_data["species"])
Out[27]:
1.0
In [28]:
from sklearn import tree
sepal_dt_model = tree.DecisionTreeClassifier()
sepal_dt_model.fit(train_iris_data[["sepal length", "sepal width"]], train_iris_data["species"])
Out[28]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [29]:
sns.scatterplot(data = iris_data, x = "sepal length", y="sepal width", hue="species", legend=False)
fig = plt.gcf()
fig.savefig("iris_scatter_plot_with_petal_data_sepal_only.png", dpi=300, bbox_inches = "tight")
In [30]:
from matplotlib.colors import ListedColormap
sns_cmap = ListedColormap(np.array(sns.color_palette())[0:3, :])

xx, yy = np.meshgrid(np.arange(4, 8, 0.02),
                     np.arange(1.9, 4.5, 0.02))

Z_string = sepal_dt_model.predict(np.c_[xx.ravel(), yy.ravel()])
categories, Z_int = np.unique(Z_string, return_inverse=True)
Z_int = Z_int.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z_int, cmap=sns_cmap)
fig = plt.gcf()
fig.savefig("iris_sepal_decision_boundaries_no_data.png", dpi=300, bbox_inches = "tight")
In [31]:
from matplotlib.colors import ListedColormap
sns_cmap = ListedColormap(np.array(sns.color_palette())[0:3, :])

xx, yy = np.meshgrid(np.arange(4, 8, 0.02),
                     np.arange(1.9, 4.5, 0.02))

Z_string = sepal_dt_model.predict(np.c_[xx.ravel(), yy.ravel()])
categories, Z_int = np.unique(Z_string, return_inverse=True)
Z_int = Z_int.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z_int, cmap=sns_cmap)
sns.scatterplot(data = train_iris_data, x = "sepal length", y="sepal width", hue="species", legend=False)
fig = plt.gcf()
fig.savefig("iris_sepal_decision_boundaries_model_training_only.png", dpi=300, bbox_inches = "tight")
In [32]:
from matplotlib.colors import ListedColormap
sns_cmap = ListedColormap(np.array(sns.color_palette())[0:3, :])

xx, yy = np.meshgrid(np.arange(4, 8, 0.02),
                     np.arange(1.9, 4.5, 0.02))

Z_string = sepal_dt_model.predict(np.c_[xx.ravel(), yy.ravel()])
categories, Z_int = np.unique(Z_string, return_inverse=True)
Z_int = Z_int.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z_int, cmap=sns_cmap)
sns.scatterplot(data = test_iris_data, x = "sepal length", y="sepal width", hue="species", legend=False)
fig = plt.gcf()
fig.savefig("iris_sepal_decision_boundaries_model_test_only.png", dpi=300, bbox_inches = "tight")
#fig = plt.gcf()
#fig.savefig("iris_decision_boundaries_model_train_test_split.png", dpi=300, bbox_inches = "tight")
In [33]:
dot_data = tree.export_graphviz(sepal_dt_model, out_file=None, 
                      feature_names=["sepal_length", "sepal_width"],  
                      class_names=["setosa", "versicolor", "virginica"],  
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)
# graph.render(format="png", filename="sepal_tree")
graph
Out[33]:
[Graphviz rendering of the sepal-only tree. The root splits on sepal_length ≤ 5.45 (gini = 0.666, 112 samples, value = [35, 39, 38]), and the tree grows very deep, splitting dozens of times on sepal_length and sepal_width and producing many one- and two-sample leaves.]
In [34]:
accuracy_score(sepal_dt_model.predict(train_iris_data[["sepal length", "sepal width"]]), 
               train_iris_data["species"])
Out[34]:
0.9553571428571429
In [35]:
accuracy_score(sepal_dt_model.predict(test_iris_data[["sepal length", "sepal width"]]), 
               test_iris_data["species"])
Out[35]:
0.6578947368421053
In [36]:
dt_model_4d = tree.DecisionTreeClassifier()
all_features = ["petal length", "petal width", "sepal length", "sepal width"]
dt_model_4d.fit(train_iris_data[all_features], train_iris_data["species"])
Out[36]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [37]:
predictions = dt_model_4d.predict(train_iris_data[all_features])
accuracy_score(predictions, train_iris_data["species"])
Out[37]:
1.0
In [38]:
predictions = dt_model_4d.predict(test_iris_data[all_features])
accuracy_score(predictions, test_iris_data["species"])
Out[38]:
1.0
In [39]:
dot_data = tree.export_graphviz(dt_model_4d, out_file=None, 
                      feature_names=all_features,  
                      class_names=["setosa", "versicolor", "virginica"],  
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)
graph
Out[39]:
[Graphviz rendering of the four-feature tree. The root splits on petal length ≤ 2.45 (gini = 0.666, 112 samples, value = [35, 39, 38]); most later splits use petal length and petal width, with occasional splits on sepal length and sepal width, and the tree reaches pure leaves within a few levels.]
In [40]:
graph.render(format="png", filename="iris_4d_tree")
Out[40]:
'iris_4d_tree.png'

Creating Decision Trees

In [41]:
def entropy(x):
    # Entropy (in bits) of a list of class counts; assumes all counts are positive.
    normalized_x = x / np.sum(x)
    return sum(-normalized_x * np.log2(normalized_x))
In [42]:
-np.log2(0.33)*0.33
Out[42]:
0.5278224832373695
In [43]:
-np.log2(0.36)*0.36
Out[43]:
0.5306152277996684
In [44]:
entropy([34, 36, 40])
Out[44]:
1.581649163979848
In [45]:
entropy([149, 1, 1])
Out[45]:
0.11485434496175385
In [46]:
entropy([50, 50])
Out[46]:
1.0
In [47]:
entropy([50, 50, 50])
Out[47]:
1.584962500721156
In [48]:
entropy([31, 4, 1])
Out[48]:
0.6815892897202809
In [49]:
#entropy([50, 46, 3])
#entropy([4, 47])
#entropy([41, 50])
#entropy([50, 50])
In [50]:
def weighted_average_entropy(x1, x2):
    # Entropy of a split: each child's entropy weighted by its share of the samples.
    N1 = sum(x1)
    N2 = sum(x2)
    return (N1 * entropy(x1) + N2 * entropy(x2)) / (N1 + N2)
In [51]:
weighted_average_entropy([50, 46, 3], [4, 47])
Out[51]:
0.9033518322003758
In [52]:
weighted_average_entropy([50, 9], [41, 50])
Out[52]:
0.8447378399375686
In [53]:
weighted_average_entropy([2, 50, 50], [48])
Out[53]:
0.761345106024134
In [54]:
weighted_average_entropy([50, 50], [50])
Out[54]:
0.6666666666666666
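
To connect these quantities to tree construction, here is a small sketch that scores the root split used above, petal length ≤ 2.45, on the full iris data (the threshold is taken from the fitted tree; this illustrates how a single split is evaluated, not the full tree-building algorithm, and the variable names are illustrative):

# Hedged sketch: score one candidate split (petal length <= 2.45) by weighted average entropy.
left_counts = iris_data[iris_data["petal length"] <= 2.45]["species"].value_counts()
right_counts = iris_data[iris_data["petal length"] > 2.45]["species"].value_counts()
print("Parent entropy:        ", entropy([50, 50, 50]))
print("Weighted split entropy:", weighted_average_entropy(left_counts.values, right_counts.values))

The drop from the parent entropy to the weighted split entropy is the information gain; when entropy is used as the split criterion, the tree is grown by greedily choosing the split with the largest drop.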

Random Forests

In [55]:
ten_decision_tree_models = []
ten_training_sets = []
for i in range(10):
    current_model = tree.DecisionTreeClassifier()
    temp_iris_training_data, temp_iris_test_data = np.split(iris_data.sample(frac=1), [110])
    temp_iris_training_data = temp_iris_training_data.sort_values("species")
    current_model.fit(temp_iris_training_data[["sepal length", "sepal width"]], temp_iris_training_data["species"])
    ten_decision_tree_models.append(current_model)
    ten_training_sets.append(temp_iris_training_data)
In [56]:
def plot_decision_tree(decision_tree_model, data = None, disable_axes = False):
    from matplotlib.colors import ListedColormap
    sns_cmap = ListedColormap(np.array(sns.color_palette())[0:3, :])

    xx, yy = np.meshgrid(np.arange(4, 8, 0.02),
                     np.arange(1.9, 4.5, 0.02))

    Z_string = decision_tree_model.predict(np.c_[xx.ravel(), yy.ravel()])
    categories, Z_int = np.unique(Z_string, return_inverse=True)
    Z_int = Z_int.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z_int, cmap=sns_cmap)
    if data is not None:
        sns.scatterplot(data = data, x = "sepal length", y="sepal width", hue="species", legend=False)

    if disable_axes:
        plt.axis("off")
#    if disable_axes:
#        
#        plt.gca().xaxis.label.set_visible(False)
#        plt.gca().yaxis.label.set_visible(False)        
In [57]:
m_num = 0
plot_decision_tree(ten_decision_tree_models[m_num], ten_training_sets[m_num])
plt.savefig("random_forest_model_1_example.png", dpi = 300, bbox_inches = "tight")
In [58]:
m_num = 7
plot_decision_tree(ten_decision_tree_models[m_num], ten_training_sets[m_num])
plt.savefig("random_forest_model_2_example.png", dpi = 300, bbox_inches = "tight")
In [59]:
import matplotlib.gridspec as gridspec
gs1 = gridspec.GridSpec(3, 3)
gs1.update(wspace=0.025, hspace=0.025) # set the spacing between axes. 

for i in range(0, 9):
    plt.subplot(gs1[i]) #3, 3, i)
    plot_decision_tree(ten_decision_tree_models[i], None, True)    
    
plt.savefig("random_forest_model_9_examples.png", dpi = 300, bbox_inches = "tight")    
In [ ]: