Python for Data Science

Joe McCarthy, Data Scientist, Indeed

In [1]:
from IPython.display import display, Image, HTML

Notebooks in this primer:

4. Using Python to Build and Use a Simple Decision Tree Classifier

Decision Trees

Wikipedia offers the following description of a decision tree (with italics added to emphasize terms that will be elaborated below):

A decision tree is a flowchart-like structure in which each internal node represents a test of an attribute, each branch represents an outcome of that test and each leaf node represents a class label (a decision taken after testing all attributes in the path from the root to the leaf). Each path from the root to a leaf can also be represented as a classification rule.

[Decision trees can also be used for regression, wherein the goal is to predict a continuous value rather than a class label, but we will focus here solely on their use for classification.]

The image below depicts a decision tree created from the UCI mushroom dataset that appears on Andy G's blog post about Decision Tree Learning, where

  • a white box represents an internal node (and the label represents the attribute being tested)
  • a blue box represents an attribute value (an outcome of the test of that attribute)
  • a green box represents a leaf node with a class label of edible
  • a red box represents a leaf node with a class label of poisonous

It is important to note that the UCI mushroom dataset consists entirely of categorical variables, i.e., every variable (or attribute) has an enumerated set of possible values. Many datasets include numeric variables that can take on int or float values. Tests for such variables typically use comparison operators, e.g., $age < 65$ or $36,250 < adjusted\_gross\_income <= 87,850$. [Aside: Python supports boolean expressions containing multiple comparison operators, such as the expression comparing adjusted_gross_income in the preceding example.]
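A chained comparison like the one above can be written directly in Python, without an explicit and; the values below are just the ones from the example:

```python
adjusted_gross_income = 50000  # hypothetical value for illustration
# chained comparison: equivalent to (36250 < x) and (x <= 87850)
in_bracket = 36250 < adjusted_gross_income <= 87850
print(in_bracket)  # → True
```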

Our simple decision tree will only accommodate categorical variables. We will closely follow a version of the decision tree learning algorithm implementation offered by Chris Roach.

Our goal in the following sections is to use Python to

  • create a simple decision tree using a set of training instances
  • classify (predict class labels) for a set of test instances using a simple decision tree
  • evaluate the performance of a simple decision tree on classifying a set of test instances

First, we will explore some concepts and algorithms used in building and using decision trees.

Entropy

When building a supervised classification model, the frequency distribution of attribute values is a potentially important factor in determining the relative importance of each attribute at various stages in the model building process.

In data modeling, we can use frequency distributions to compute entropy, a measure of disorder (impurity) in a set.

We compute the entropy of a set by multiplying the proportion of instances with each class label by the log of that proportion, and then taking the negative sum of those terms.

More precisely, for a 2-class (binary) classification task:

$entropy(S) = - p_1 \log_2(p_1) - p_2 \log_2(p_2)$

where $p_i$ is the proportion (relative frequency) of class $i$ within the set $S$.
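As a quick sanity check on this formula (the binary_entropy() helper below is just for illustration, and is not part of the primer's simple_ml module), a 50/50 split yields the maximum entropy of 1.0, while a pure set (all one class) yields 0:

```python
import math

def binary_entropy(p1):
    '''Entropy of a 2-class set in which class 1 has proportion p1'''
    # skip zero proportions, since 0 * log(0) is taken to be 0
    terms = [p * math.log(p, 2) for p in (p1, 1 - p1) if p > 0]
    return -sum(terms)

print(binary_entropy(0.5))  # → 1.0 (a 50/50 split is maximally impure)
print(binary_entropy(0.9))  # a highly skewed split has much lower entropy
```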

From the output above, we know that the proportion of clean_instances that are labeled 'e' (class edible) in the UCI dataset is $3488 \div 5644 = 0.618$, and the proportion labeled 'p' (class poisonous) is $2156 \div 5644 = 0.382$.

After importing the Python math module, we can use the math.log(x[, base]) function in computing the entropy of the clean_instances of the UCI mushroom data set as follows.

In [ ]:
import math

entropy = \
    - (3488 / 5644) * math.log(3488 / 5644, 2) \
    - (2156 / 5644) * math.log(2156 / 5644, 2)
print(entropy)

Exercise 6: define entropy()

Define a function, entropy(instances), that computes the entropy of instances. You may assume the class label is in position 0; we will later see how to specify default parameter values in function definitions.

[Note: the class label in many data files is the last rather than the first item on each line.]

In [ ]:
# your function definition here

# delete 'simple_ml.' in the function call below to test your function
print(simple_ml.entropy(clean_instances))

Information Gain

Informally, a decision tree is constructed from a set of instances using a recursive algorithm that

  • selects the best attribute
  • splits the set into subsets based on the values of that attribute (each subset is composed of instances from the original set that have the same value for that attribute)
  • repeats the process on each of these subsets until a stopping condition is met (e.g., a subset has no instances or has instances which all have the same class label)

Entropy is a metric that can be used in selecting the best attribute for each split: the best attribute is the one resulting in the largest decrease in entropy for a set of instances. [Note: other metrics can be used for determining the best attribute]

Information gain measures the decrease in entropy that results from splitting a set of instances based on an attribute.

$IG(S, a) = entropy(S) - [p(s_1) \times entropy(s_1) + p(s_2) \times entropy(s_2) + ... + p(s_n) \times entropy(s_n)]$

Where

  • $n$ is the number of distinct values of attribute $a$
  • $s_i$ is the subset of $S$ where all instances have the $i$th value of $a$
  • $p(s_i)$ is the proportion of instances in $S$ that have the $i$th value of $a$
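To make the formula concrete, here is a tiny hand-rolled computation of information gain on a 4-instance toy set (not the mushroom data), where each instance is a [class_label, attribute_value] list; since the attribute perfectly separates the two classes, the gain equals the full initial entropy of 1.0:

```python
import math
from collections import Counter, defaultdict

def entropy_of_labels(labels):
    '''Entropy of a list of class labels'''
    total = len(labels)
    return -sum((count / total) * math.log(count / total, 2)
                for count in Counter(labels).values())

# toy instances: class label in position 0, a single attribute in position 1
toy = [['e', 'a'], ['e', 'a'], ['p', 'b'], ['p', 'b']]

# partition the class labels by the value of attribute index 1
subsets = defaultdict(list)
for instance in toy:
    subsets[instance[1]].append(instance[0])

# IG(S, a) = entropy(S) - sum over subsets of p(s_i) * entropy(s_i)
gain = entropy_of_labels([instance[0] for instance in toy]) - sum(
    (len(labels) / len(toy)) * entropy_of_labels(labels)
    for labels in subsets.values())
print(gain)  # → 1.0
```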

We'll use the definition of information_gain() in simple_ml to print the information gain for each of the attributes in the mushroom dataset ... before asking you to write your own definition of the function.

In [ ]:
print('Information gain for different attributes:', end='\n\n')
for i in range(1, len(attribute_names)):
    print('{:5.3f}  {:2} {}'.format(
        simple_ml.information_gain(clean_instances, i), i, attribute_names[i]))

We can sort the attributes in decreasing order of information gain, which shows that odor is the best attribute for the first split in a decision tree that models the instances in this dataset.

In [ ]:
print('Information gain for different attributes:', end='\n\n')
sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i)
                                          for i in range(1, len(attribute_names))], 
                                         reverse=True)
for gain, i in sorted_information_gain_indexes:
    print('{:5.3f}  {:2} {}'.format(gain, i, attribute_names[i]))

[The following variation does not use a list comprehension]

In [ ]:
print('Information gain for different attributes:', end='\n\n')

information_gain_values = []
for i in range(1, len(attribute_names)):
    information_gain_values.append((simple_ml.information_gain(clean_instances, i), i))
    
sorted_information_gain_indexes = sorted(information_gain_values, 
                                         reverse=True)
for gain, i in sorted_information_gain_indexes:
    print('{:5.3f}  {:2} {}'.format(gain, i, attribute_names[i]))

Exercise 7: define information_gain()

Define a function, information_gain(instances, i), that returns the information gain achieved by selecting the ith attribute to split instances. It should exhibit the same behavior as the simple_ml version of the function.

In [ ]:
# your definition of information_gain(instances, i) here

# delete 'simple_ml.' in the function call below to test your function
sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) 
                                          for i in range(1, len(attribute_names))], 
                                         reverse=True)

print('Information gain for different attributes:', end='\n\n')
for gain, i in sorted_information_gain_indexes:
    print('{:5.3f}  {:2} {}'.format(gain, i, attribute_names[i]))

Building a Simple Decision Tree

We will implement a modified version of the ID3 algorithm for building a simple decision tree.

ID3 (Examples, Target_Attribute, Candidate_Attributes)
    Create a Root node for the tree
    If all examples have the same value of the Target_Attribute, 
        Return the single-node tree Root with label = that value 
    If the list of Candidate_Attributes is empty,
        Return the single node tree Root,
            with label = most common value of Target_Attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples (most information gain)
        Decision Tree attribute for Root = A.
        For each possible value, v_i, of A,
            Add a new tree branch below Root, corresponding to the test A = v_i.
            Let Examples(v_i) be the subset of examples that have the value v_i for A
            If Examples(v_i) is empty,
                Below this new branch add a leaf node 
                    with label = most common target value in the examples
            Else 
                Below this new branch add the subtree 
                    ID3 (Examples(v_i), Target_Attribute, Attributes – {A})
    End
    Return Root

[Note: the algorithm above is recursive, i.e., there is a recursive call to ID3 within the definition of ID3. Covering recursion is beyond the scope of this primer, but there are a number of other resources on using recursion in Python. Familiarity with recursion will be important for understanding both the tree construction and classification functions below.]

In building a decision tree, we will need to split the instances based on the index of the best attribute, i.e., the attribute that offers the highest information gain. We will use separate utility functions to handle these subtasks. To simplify the functions, we will rely exclusively on attribute indexes rather than attribute names.

First, we will define a function, split_instances(instances, attribute_index), to split a set of instances based on any attribute. This function will return a dictionary where each key is a distinct value of the specified attribute_index, and the value of each key is a list representing the subset of instances that have that attribute_index value.

We will use a defaultdict, a specialized dictionary class in the collections module, which automatically creates an appropriate default value for a new key. For example, a defaultdict(int) automatically initializes a new dictionary entry to 0 (zero); a defaultdict(list) automatically initializes a new dictionary entry to the empty list ([]).
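A brief illustration of both default factories (the list of words here is arbitrary):

```python
from collections import defaultdict

counts = defaultdict(int)           # new keys start at 0
for word in ['spam', 'ham', 'spam']:
    counts[word] += 1               # no KeyError on a word's first occurrence

groups = defaultdict(list)          # new keys start as []
for word in ['spam', 'ham', 'spam']:
    groups[word[0]].append(word)    # group the words by their first letter

print(dict(counts))  # → {'spam': 2, 'ham': 1}
print(dict(groups))  # → {'s': ['spam', 'spam'], 'h': ['ham']}
```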

In [ ]:
from collections import defaultdict

def split_instances(instances, attribute_index):
    '''Returns a dictionary splitting a list of instances 
        according to their values of a specified attribute index
    
    The key of each entry is a distinct value of attribute_index,
    and the value of each entry is a list representing 
       the subset of instances that have that value for the attribute
    '''
    partitions = defaultdict(list)
    for instance in instances:
        partitions[instance[attribute_index]].append(instance)
    return partitions

To test the function, we will partition the clean_instances based on the odor attribute (index position 5) and print out the size (number of instances) in each partition rather than the lists of instances in each partition.

In [ ]:
partitions = split_instances(clean_instances, 5)
print([(partition, len(partitions[partition])) for partition in partitions])

Now that we can split instances based on a particular attribute, we would like to be able to choose the best attribute with which to split the instances, where best is defined as the attribute that provides the greatest information gain if instances were split based on that attribute. We will want to restrict the candidate attributes so that we don't bother trying to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate).

Exercise 8: define choose_best_attribute_index()

Define a function, choose_best_attribute_index(instances, candidate_attribute_indexes), that returns the index in the list of candidate_attribute_indexes that provides the highest information gain if instances are split based on that attribute index.

In [ ]:
# your function here

# delete 'simple_ml.' in the function call below to test your function
print('Best attribute index:', 
      simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names))))

A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. We will need a function that determines the majority value for the class index among a set of instances. One way to do this is to use the Counter class introduced above.

In [ ]:
from collections import Counter  # in case Counter has not yet been imported

class_counts = Counter([instance[0] for instance in clean_instances])
print('class_counts: {}\n  most_common(1): {}\n  most_common(1)[0][0]: {}'.format(
    class_counts,  # the Counter object
    class_counts.most_common(1),  # returns a list in which the 1st element is a tuple with the most common value and its count
    class_counts.most_common(1)[0][0]))  # the most common value (1st element in that tuple)

[The following variation does not use a list comprehension]

In [ ]:
class_counts = Counter()  # create an empty counter
for instance in clean_instances:
    class_counts[instance[0]] += 1
    
print('class_counts: {}\n  most_common(1): {}\n  most_common(1)[0][0]: {}'.format(
    class_counts,
    class_counts.most_common(1), 
    class_counts.most_common(1)[0][0]))

It is often useful to compute the number of unique values and/or the total number of values in a Counter.

The number of unique values is simply the number of dictionary entries.

The total number of values can be computed by taking the sum() of all the counts (the value of each key: value pair ... or key, value tuple, if we use Counter().most_common()).

In [ ]:
print('Number of unique values: {}'.format(len(class_counts)))
print('Total number of values:  {}'.format(sum([v 
                                                for k, v in class_counts.most_common()])))

Before putting all this together to define a decision tree construction function, we will cover a few additional aspects of Python used in that function.

Truth values in Python

Python offers a very flexible mechanism for the testing of truth values: in an if condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as False (i.e., not True):

In [ ]:
for x in [False, None, 0, 0.0, "", [], {}, ()]:
    print('"{}" is'.format(x), end=' ')
    if x:
        print(True)
    else:
        print(False)

Sometimes, particularly with function parameters, it is helpful to differentiate None from empty lists and other data structures with a False truth value (one common use case is illustrated in create_decision_tree() below).

In [ ]:
for x in [False, None, 0, 0.0, "", [], {}, ()]:
    print('"{} is None" is'.format(x), end=' ')
    if x is None:
        print(True)
    else:
        print(False)

Conditional expressions (ternary operators)

Python also offers a conditional expression (ternary operator) that allows the functionality of an if/else statement that returns a value to be implemented as an expression. For example, the if/else statement in the code above could be implemented as a conditional expression as follows:

In [ ]:
for x in [False, None, 0, 0.0, "", [], {}, ()]:
    print('"{}" is {}'.format(x, True if x else False)) 

More on optional parameters in Python functions

Python function definitions can specify default parameter values, indicating the value those parameters will have if no argument is explicitly provided when the function is called. Arguments can also be passed using keyword parameters indicating which parameter will be assigned a specific argument value (which may or may not correspond to the order in which the parameters are defined).

The Python Tutorial page on default parameters includes the following warning:

Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes.

Thus it is generally better to use the Python null object, None, rather than an empty list ([]), dict ({}) or other mutable data structure when specifying default parameter values for any of those data types.
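The pitfall the warning describes is easy to demonstrate (the function names below are made up for the example):

```python
def append_bad(item, items=[]):
    '''The default list is created once and shared across all calls'''
    items.append(item)
    return items

def append_good(item, items=None):
    '''Using None as a sentinel creates a fresh list on each call'''
    if items is None:
        items = []
    items.append(item)
    return items

append_bad(1)
print(append_bad(2))   # → [1, 2] (the default list persisted between calls)
append_good(1)
print(append_good(2))  # → [2] (each call starts with a fresh list)
```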

In [ ]:
def parameter_test(parameter1=None, parameter2=None):
    '''Prints the values of parameter1 and parameter2'''
    print('parameter1: {}; parameter2: {}'.format(parameter1, parameter2))
    
parameter_test()  # no args are required
parameter_test(1)  # if any args are provided, 1st arg gets assigned to parameter1
parameter_test(1, 2)  # 2nd arg gets assigned to parameter2
parameter_test(2)  # remember: if only 1 arg, 1st arg gets assigned to parameter1
parameter_test(parameter2=2)  # can use keyword to provide a value only for parameter2
parameter_test(parameter2=2, parameter1=1) #  can use keywords for either arg, in either order

Exercise 9: define majority_value()

Define a function, majority_value(instances, class_index), that returns the most frequently occurring value of class_index in instances. The class_index parameter should be optional, and have a default value of 0 (zero).

Your function definition should support the use of optional arguments as used in the function calls below.

In [ ]:
# your definition of majority_value(instances) here

# delete 'simple_ml.' in the function calls below to test your function

print('Majority value of index {}: {}'.format(
    0, simple_ml.majority_value(clean_instances))) 

# although there is only one class_index for the dataset, 
# we'll test the function by specifying other indexes using optional / keyword arguments
print('Majority value of index {}: {}'.format(
    1, simple_ml.majority_value(clean_instances, 1)))  # using argument order
print('Majority value of index {}: {}'.format(
    2, simple_ml.majority_value(clean_instances, class_index=2)))  # using keyword argument

Building a Simple Decision Tree

The recursive create_decision_tree() function below uses an optional parameter, class_index, which defaults to 0. This is to accommodate other datasets in which the class label is the last element on each line (which would be most easily specified by using a -1 value). Most data files in the UCI Machine Learning Repository have the class labels as either the first element or the last element.
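As a quick reminder of why -1 works for such datasets, Python's negative indexes count backward from the end of a sequence:

```python
instance = ['x', 's', 'n', 'p']  # hypothetical instance with the class label last
print(instance[0])   # → x (first element)
print(instance[-1])  # → p (last element, i.e., the class label)
```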

To show how the decision tree is being built, an optional trace parameter, when non-zero, will generate some trace information as the tree is constructed. The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator), trace + 1 if trace else 0.

In [ ]:
def create_decision_tree(instances, 
                         candidate_attribute_indexes=None, 
                         class_index=0, 
                         default_class=None, 
                         trace=0):
    '''Returns a new decision tree trained on a list of instances.
    
    The tree is constructed by recursively selecting and splitting instances based on 
    the highest information_gain of the candidate_attribute_indexes.
    
    The class label is found in position class_index.
    
    The default_class is the majority value for the current node's parent in the tree.
    A positive (int) trace value will generate trace information 
        with increasing levels of indentation.
    
    Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python 
        by Christopher Roach,
    http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3
    '''
    
    # if no candidate_attribute_indexes are provided, 
    # assume that we will use all but the target_attribute_index
    # Note that None != [], 
    # as an empty candidate_attribute_indexes list is a recursion stopping condition
    if candidate_attribute_indexes is None:
        candidate_attribute_indexes = [i 
                                       for i in range(len(instances[0])) 
                                       if i != class_index]
        # Note: do not use candidate_attribute_indexes.remove(class_index)
        # as this would destructively modify the argument,
        # causing problems during recursive calls
        
    class_labels_and_counts = Counter([instance[class_index] for instance in instances])

    # If the dataset is empty or the candidate attributes list is empty, 
    # return the default value
    if not instances or not candidate_attribute_indexes:
        if trace:
            print('{}Using default class {}'.format('< ' * trace, default_class))
        return default_class
    
    # If all the instances have the same class label, return that class label
    elif len(class_labels_and_counts) == 1:
        class_label = class_labels_and_counts.most_common(1)[0][0]
        if trace:
            print('{}All {} instances have label {}'.format(
                '< ' * trace, len(instances), class_label))
        return class_label
    else:
        default_class = simple_ml.majority_value(instances, class_index)

        # Choose the next best attribute index to best classify the instances
        best_index = simple_ml.choose_best_attribute_index(
            instances, candidate_attribute_indexes, class_index)        
        if trace:
            print('{}Creating tree node for attribute index {}'.format(
                    '> ' * trace, best_index))

        # Create a new decision tree node with the best attribute index 
        # and an empty dictionary object (for now)
        tree = {best_index:{}}

        # Create a new decision tree sub-node (branch) for each of the values 
        # in the best attribute field
        partitions = simple_ml.split_instances(instances, best_index)

        # Remove that attribute from the set of candidates for further splits
        remaining_candidate_attribute_indexes = [i 
                                                 for i in candidate_attribute_indexes 
                                                 if i != best_index]
        for attribute_value in partitions:
            if trace:
                print('{}Creating subtree for value {} ({}, {}, {}, {})'.format(
                    '> ' * trace,
                    attribute_value, 
                    len(partitions[attribute_value]), 
                    len(remaining_candidate_attribute_indexes), 
                    class_index, 
                    default_class))
                
            # Create a subtree for each value of the best attribute
            subtree = create_decision_tree(
                partitions[attribute_value],
                remaining_candidate_attribute_indexes,
                class_index,
                default_class,
                trace + 1 if trace else 0)

            # Add the new subtree to the empty dictionary object 
            # in the new tree/node we just created
            tree[best_index][attribute_value] = subtree

    return tree

# split instances into separate training and testing sets
training_instances = clean_instances[:-20]
test_instances = clean_instances[-20:]
tree = create_decision_tree(training_instances, trace=1)  # remove trace=1 to turn off tracing
print(tree)

The structure of the tree shown above is rather difficult to discern from the normal printed representation of a dictionary.

The Python pprint module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.

The pprint.pprint(object, stream=None, indent=1, width=80, depth=None) method will print object to a stream (a default value of None will dictate the use of sys.stdout, the same destination as print function output), using indent spaces to differentiate nesting levels, using up to a maximum width columns and up to a maximum nesting level depth (None indicating no maximum).

In [ ]:
from pprint import pprint

pprint(tree)

Classifying Instances with a Simple Decision Tree

Usually, when we construct a decision tree based on a set of training instances, we do so with the intent of using that tree to classify a set of one or more test instances.

We will define a function, classify(tree, instance, default_class=None), to use a decision tree to classify a single instance, where an optional default_class can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.

We will use a design pattern consisting of a series of if statements, each of which returns a value if its condition is true, rather than a nested series of if, elif and/or else clauses; this helps constrain the levels of indentation in the function.

In [ ]:
def classify(tree, instance, default_class=None):
    '''Returns a classification label for instance, given a decision tree'''
    if not tree:  # if the node is empty, return the default class
        return default_class
    if not isinstance(tree, dict):  # if the node is a leaf, return its class label
        return tree
    attribute_index = list(tree.keys())[0]  # using list(dict.keys()) for Python 3 compatibility
    attribute_values = list(tree.values())[0]
    instance_attribute_value = instance[attribute_index]
    if instance_attribute_value not in attribute_values:  # this value was not in training data
        return default_class
    # recursively traverse the subtree (branch) associated with instance_attribute_value
    return classify(attribute_values[instance_attribute_value], instance, default_class)

for instance in test_instances:
    predicted_label = classify(tree, instance)
    actual_label = instance[0]
    print('predicted: {}; actual: {}'.format(predicted_label, actual_label))

Evaluating the Accuracy of a Simple Decision Tree

It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.

The advantage of this training/test split is that visual inspection of the classifications (sometimes called predictions) is relatively straightforward, revealing that all 20 instances were correctly classified.

There are a variety of metrics that can be used to evaluate the performance of a model. Scikit Learn's Model Evaluation library provides an overview and implementation of several possible metrics. For now, we'll simply measure the accuracy of a model, i.e., the percentage of test instances that are correctly classified (true positives and true negatives).

The accuracy of the model above, given the set of 20 test instances, is 100% (20/20).

The function below calculates the classification accuracy of a tree over a set of test_instances (with an optional class_index parameter indicating the position of the class label in each instance).

In [ ]:
def classification_accuracy(tree, test_instances, class_index=0, default_class=None):
    '''Returns the accuracy of classifying test_instances with tree, 
    where the class label is in position class_index'''
    num_correct = 0
    for i in range(len(test_instances)):
        prediction = classify(tree, test_instances[i], default_class)
        actual_value = test_instances[i][class_index]
        if prediction == actual_value:
            num_correct += 1
    return num_correct / len(test_instances)

print(classification_accuracy(tree, test_instances))

In addition to showing the percentage of correctly classified instances, it may be helpful to return the actual counts of correctly and incorrectly classified instances, e.g., if we want to compile a total count of correctly and incorrectly classified instances over a collection of test instances.

In order to do so, we'll use the zip(*iterables) function, which combines 2 or more sequences or iterables; in Python 3, the function returns an iterator of tuples, where the ith tuple contains the ith element from each of the argument sequences or iterables (wrapping the result in list() materializes the tuples for display).

In [ ]:
list(zip([0, 1, 2], ['a', 'b', 'c']))

We can use list comprehensions, the Counter class and the zip() function to modify classification_accuracy() so that it returns a packed tuple with

  • the percentage of instances correctly classified
  • the number of correctly classified instances
  • the number of incorrectly classified instances

We'll also modify the function to use instances rather than test_instances, as we sometimes want to be able to evaluate the accuracy of a model when tested on the training instances used to create it.

In [ ]:
def classification_accuracy(tree, instances, class_index=0, default_class=None):
    '''Returns the accuracy of classifying instances with tree, 
    where the class label is in position class_index'''
    predicted_labels = [classify(tree, instance, default_class) 
                        for instance in instances]
    actual_labels = [x[class_index] 
                     for x in instances]
    counts = Counter([x == y 
                      for x, y in zip(predicted_labels, actual_labels)])
    return counts[True] / len(instances), counts[True], counts[False]

print(classification_accuracy(tree, test_instances))

We sometimes want to partition the instances into subsets of equal sizes to measure performance. One metric this partitioning allows us to compute is a learning curve, i.e., assess how well the model performs based on the size of its training set. Another use of these partitions (aka folds) would be to conduct an n-fold cross validation evaluation.
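To sketch how such folds could drive an n-fold cross validation, the example below uses a stand-in "model" that always predicts the majority class of its training set, so that the cell is self-contained; in the notebook, you would substitute create_decision_tree() and classification_accuracy() for the stand-ins:

```python
from collections import Counter

def partition_instances(instances, num_partitions):
    '''Partitions instances into num_partitions relatively equal sublists'''
    return [[instances[j] for j in range(i, len(instances), num_partitions)]
            for i in range(num_partitions)]

def majority_class_model(training_instances, class_index=0):
    '''Stand-in model: always predicts the majority class of its training set'''
    counts = Counter(instance[class_index] for instance in training_instances)
    return counts.most_common(1)[0][0]

# hypothetical instances: class label in position 0
instances = [['e', 'x'], ['e', 'y'], ['e', 'x'],
             ['e', 'z'], ['p', 'y'], ['e', 'x']]
folds = partition_instances(instances, 3)

accuracies = []
for i, test_fold in enumerate(folds):
    # train on all folds except fold i, then test on fold i
    training = [instance for j, fold in enumerate(folds) if j != i
                for instance in fold]
    predicted_label = majority_class_model(training)
    num_correct = sum(1 for instance in test_fold
                      if instance[0] == predicted_label)
    accuracies.append(num_correct / len(test_fold))

print(accuracies)  # → [1.0, 0.5, 1.0]
```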

The following function, partition_instances(instances, num_partitions), partitions a set of instances into num_partitions relatively equally sized subsets.

We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested for loops.

In [ ]:
def partition_instances(instances, num_partitions):
    '''Returns a list of relatively equally sized disjoint sublists (partitions) 
    of the list of instances'''
    return [[instances[j] 
             for j in range(i, len(instances), num_partitions)]
            for i in range(num_partitions)]

Before testing this function on the 5644 clean_instances from the UCI mushroom dataset, we'll create a small number of simplified instances to verify that the function has the desired behavior.

In [ ]:
instance_length = 3
num_instances = 5

simplified_instances = [[j 
                         for j in range(i, instance_length + i)] 
                        for i in range(num_instances)]

print('Instances:', simplified_instances)
partitions = partition_instances(simplified_instances, 2)
print('Partitions:', partitions)

[The following variations do not use list comprehensions]

In [ ]:
def partition_instances(instances, num_partitions):
    '''Returns a list of relatively equally sized disjoint sublists (partitions) 
    of the list of instances'''
    partitions = []
    for i in range(num_partitions):
        partition = []
        # iterate over instances starting at position i in increments of num_partitions
        for j in range(i, len(instances), num_partitions): 
            partition.append(instances[j])
        partitions.append(partition)
    return partitions

simplified_instances = []
for i in range(num_instances):
    new_instance = []
    for j in range(i, instance_length + i):
        new_instance.append(j)
    simplified_instances.append(new_instance)

print('Instances:', simplified_instances)
partitions = partition_instances(simplified_instances, 2)
print('Partitions:', partitions)

The enumerate(sequence, start=0) function creates an iterator that successively returns the index and value of each element in a sequence, beginning at the start index.

In [ ]:
for i, x in enumerate(['a', 'b', 'c']):
    print(i, x)
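Since enumerate() accepts an optional start parameter, here is a quick illustration of a non-default start index (a minimal sketch using only the built-in function):

```python
# enumerate() with start=1 yields indices beginning at 1 rather than 0
for i, x in enumerate(['a', 'b', 'c'], start=1):
    print(i, x)  # prints: 1 a, 2 b, 3 c
```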

We can use enumerate() to facilitate slightly more rigorous testing of our partition_instances function on our simplified_instances.

Note that since we are printing values rather than accumulating values, we will not use nested list comprehensions for this task.

In [ ]:
for i in range(num_instances):
    print('\n# partitions:', i)
    for j, partition in enumerate(partition_instances(simplified_instances, i)):
        print('partition {}: {}'.format(j, partition))

Returning our attention to the UCI mushroom dataset, the following will partition our clean_instances into 10 relatively equally sized disjoint subsets. We will use a list comprehension to print the length of each partition.

In [ ]:
partitions = partition_instances(clean_instances, 10)
print([len(partition) for partition in partitions])

[The following variation does not use a list comprehension]

In [ ]:
for partition in partitions:
    print(len(partition), end=' ')
print()

The following shows the different trees that are constructed based on partition 0 (first 10th) of clean_instances, partitions 0 and 1 (first 2/10ths) of clean_instances and all clean_instances.

In [ ]:
tree0 = create_decision_tree(partitions[0])
print('Tree trained with {} instances:'.format(len(partitions[0])))
pprint(tree0)
print()

tree1 = create_decision_tree(partitions[0] + partitions[1])
print('Tree trained with {} instances:'.format(len(partitions[0] + partitions[1])))
pprint(tree1)
print()

tree = create_decision_tree(clean_instances)
print('Tree trained with {} instances:'.format(len(clean_instances)))
pprint(tree)

The only difference between the first two trees - tree0 and tree1 - is that in the first tree, instances with no odor (attribute index 5 is 'n') and a spore-print-color of white (attribute 20 = 'w') are classified as edible ('e'). With additional training data in the 2nd partition, an additional distinction is made such that instances with no odor, a white spore-print-color and a clustered population (attribute 21 = 'c') are classified as poisonous ('p'), while all other instances with no odor and a white spore-print-color (and any other value for the population attribute) are classified as edible ('e').

Note that there is no difference between tree1 and tree (the tree trained with all instances). This early convergence on an optimal model is uncommon for most datasets (outside the UCI repository).

Learning curves

Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.

We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.

The list.extend(L) method enables us to extend list by appending all the items in another list, L, to the end of list.

In [ ]:
x = [1, 2, 3]
x.extend([4, 5])
print(x)

We can now define the function, compute_learning_curve(instances, num_partitions=10), which will take a list of instances, partition it into num_partitions relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first num_partitions - 1 partitions and tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on.

The function will return a list of num_partitions - 1 tuples, each representing the size of a training set and the accuracy of a tree trained with that set (and tested on the last partition). This will provide some indication of the relative impact of training set size on model performance.

In [ ]:
def compute_learning_curve(instances, num_partitions=10):
    '''Returns a list of training sizes and scores for incrementally increasing partitions.

    The list contains 2-element tuples, each representing a training size and score.
    The i-th training size is the number of instances in partitions 0 through i.
    The i-th score is the accuracy of a tree trained with instances 
    from partitions 0 through i
    and tested on instances from partition num_partitions - 1 (the last partition).'''
    
    partitions = partition_instances(instances, num_partitions)
    test_instances = partitions[-1][:]
    training_instances = []
    accuracy_list = []
    for i in range(0, num_partitions - 1):
        # for each iteration, the training set is composed of partitions 0 through i
        training_instances.extend(partitions[i][:])
        tree = create_decision_tree(training_instances)
        partition_accuracy = classification_accuracy(tree, test_instances)
        accuracy_list.append((len(training_instances), partition_accuracy))
    return accuracy_list

accuracy_list = compute_learning_curve(clean_instances)
print(accuracy_list)

The UCI mushroom dataset is a particularly clean and simple data set, enabling quick convergence on an optimal decision tree for classifying new instances using relatively few training instances.

We can use a larger number of smaller partitions to see a little more variation in accuracy performance.

In [ ]:
accuracy_list = compute_learning_curve(clean_instances, 100)
print(accuracy_list[:10])

Object-Oriented Programming: Defining a Python Class to Encapsulate a Simple Decision Tree

The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.

Python is an object-oriented programming language, offering simple syntax and semantics for defining classes and instantiating objects of those classes. [It is assumed that the reader is already familiar with the concepts of object-oriented programming]

A Python class starts with the keyword class followed by a class name (identifier), a colon (':'), and then any number of statements, which typically take the form of assignment statements for class or instance variables and/or function definitions for class methods. All statements are indented to reflect their inclusion in the class definition.

The members - methods, class variables and instance variables - of a class are accessed from within the class by prepending self. to each reference. Instance methods always include self as the first parameter.
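As a minimal sketch of this syntax (the class and member names here are hypothetical, not part of the decision tree code):

```python
class Tally:
    '''A minimal class: one instance variable, one method.'''

    def __init__(self, start=0):
        self.count = start  # instance variable, accessed via self
        
    def increment(self):    # self is always the first parameter
        self.count += 1
        return self.count

t = Tally(10)
print(t.increment())  # prints 11
```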

All class members in Python are public (accessible outside the class). There is no mechanism for private class members, but identifiers with leading double underscores (__member_identifier) are 'mangled' (translated into _class_name__member_identifier), and thus not directly accessible outside their class, so Python programmers often use them to approximate private members.

There is also no mechanism for protected identifiers - accessible only within a defining class and its subclasses - in the Python language, and so Python programmers have adopted the convention of using a single underscore (_identifier) at the start of any identifier that is intended to be protected (i.e., not to be accessed outside the class or its subclasses).
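A small illustration of both conventions (the class and attribute names here are hypothetical, chosen only to demonstrate name mangling):

```python
class Example:
    def __init__(self):
        self._protected = 1  # single underscore: protected by convention only
        self.__private = 2   # double underscore: mangled to _Example__private

e = Example()
print(e._protected)          # accessible, but by convention off-limits: 1
print(e._Example__private)   # the mangled name is still reachable: 2
# print(e.__private)         # would raise AttributeError outside the class
```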

Some Python programmers only use the single underscore prefixes and avoid double underscore prefixes due to unintended consequences that can arise when names are mangled. The following warning about single and double underscore prefixes is issued in Code Like a Pythonista:

try to avoid the __private form. I never use it. Trust me. If you use it, you WILL regret it later

We will follow this advice and avoid using the double underscore prefix in user-defined member variables and methods.

Python has a number of pre-defined special method names, all of which are denoted by leading and trailing double underscores. For example, the object.__init__(self[, ...]) method is used to specify instructions that should be executed whenever a new object of a class is instantiated.
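For example, the hypothetical Point class below uses __init__ to initialize each new object, and __repr__ (another special method) to control its printed form:

```python
class Point:
    def __init__(self, x, y):  # runs whenever Point(...) is instantiated
        self.x = x
        self.y = y

    def __repr__(self):        # defines the printable representation
        return 'Point({}, {})'.format(self.x, self.y)

p = Point(2, 3)
print(p)  # prints Point(2, 3)
```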

Note that other machine learning libraries may use different terminology for some of the functions we defined above. For example, in the sklearn.tree.DecisionTreeClassifier class (and in most sklearn classifier classes), the method for constructing a classifier is named fit() - since it "fits" the data to a model - and the method for classifying instances is named predict() - since it is predicting the class label for an instance.

In keeping with this common terminology, the code below defines a class, SimpleDecisionTree, with a single pseudo-protected member variable _tree, three public methods - fit(), predict() and pprint() - and two pseudo-protected auxiliary methods - _create_tree() and _predict() - to augment the fit() and predict() methods, respectively.

The fit() method is identical to the create_decision_tree() function above, with the inclusion of the self parameter (as it is now a class method rather than a function). The predict() method is a similarly modified version of the classify() function, with the added capability to predict the label of either a single instance or a list of instances. The classification_accuracy() method is similar to the function of the same name (with the addition of the self parameter). The pprint() method prints the tree in a human-readable format.

Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in simple_decision_tree.py.

In [ ]:
class SimpleDecisionTree:

    _tree = {}  # this class variable provides a default; fit() assigns an instance variable via self._tree

    def __init__(self):
        # this is where we would initialize any parameters to the SimpleDecisionTree
        pass
            
    def fit(self, 
            instances, 
            candidate_attribute_indexes=None,
            target_attribute_index=0,
            default_class=None):
        if not candidate_attribute_indexes:
            candidate_attribute_indexes = [i 
                                           for i in range(len(instances[0]))
                                           if i != target_attribute_index]
        self._tree = self._create_tree(instances,
                                       candidate_attribute_indexes,
                                       target_attribute_index,
                                       default_class)
        
    def _create_tree(self,
                     instances,
                     candidate_attribute_indexes,
                     target_attribute_index=0,
                     default_class=None):
        class_labels_and_counts = Counter([instance[target_attribute_index] 
                                           for instance in instances])
        if not instances or not candidate_attribute_indexes:
            return default_class
        elif len(class_labels_and_counts) == 1:
            class_label = class_labels_and_counts.most_common(1)[0][0]
            return class_label
        else:
            default_class = simple_ml.majority_value(instances, target_attribute_index)
            best_index = simple_ml.choose_best_attribute_index(instances, 
                                                               candidate_attribute_indexes, 
                                                               target_attribute_index)
            tree = {best_index:{}}
            partitions = simple_ml.split_instances(instances, best_index)
            remaining_candidate_attribute_indexes = [i 
                                                     for i in candidate_attribute_indexes 
                                                     if i != best_index]
            for attribute_value in partitions:
                subtree = self._create_tree(
                    partitions[attribute_value],
                    remaining_candidate_attribute_indexes,
                    target_attribute_index,
                    default_class)
                tree[best_index][attribute_value] = subtree
            return tree
    
    def predict(self, instances, default_class=None):
        if not isinstance(instances, list):
            return self._predict(self._tree, instances, default_class)
        else:
            return [self._predict(self._tree, instance, default_class) 
                    for instance in instances]
    
    def _predict(self, tree, instance, default_class=None):
        if not tree:
            return default_class
        if not isinstance(tree, dict):
            return tree
        attribute_index = list(tree.keys())[0]  # using list(dict.keys()) for Py3 compatibility
        attribute_values = list(tree.values())[0]
        instance_attribute_value = instance[attribute_index]
        if instance_attribute_value not in attribute_values:
            return default_class
        return self._predict(attribute_values[instance_attribute_value],
                             instance,
                             default_class)
    
    def classification_accuracy(self, instances, default_class=None):
        predicted_labels = self.predict(instances, default_class)
        actual_labels = [x[0] for x in instances]
        counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])
        return counts[True] / len(instances), counts[True], counts[False]
    
    def pprint(self):
        pprint(self._tree)

The following statements instantiate a SimpleDecisionTree, fit it using all but the last 20 clean_instances, print out the tree using its pprint() method, and then use the predict() method to print the predicted and actual labels of the last 20 clean_instances.

In [ ]:
simple_decision_tree = SimpleDecisionTree()
simple_decision_tree.fit(training_instances)
simple_decision_tree.pprint()
print()

predicted_labels = simple_decision_tree.predict(test_instances)
actual_labels = [instance[0] for instance in test_instances]
for predicted_label, actual_label in zip(predicted_labels, actual_labels):
    print('Model: {}; truth: {}'.format(predicted_label, actual_label))
print()
print('Classification accuracy:', simple_decision_tree.classification_accuracy(test_instances))
