Decision Trees
- represent discrete-valued functions
- instances described by attribute-value pairs
- can represent disjunctive expressions (unlike version spaces)
- robust to noisy training data (unlike version spaces)
- can handle missing attributes (unlike version spaces)
- one of the most popular and practical methods of inductive learning
- Idea: ask a series of questions about the attributes of an instance in order to arrive at the correct classification
Some successful applications of decision trees:
- classifying patients by disease
- classifying equipment malfunctions by cause
- classifying loan applications by level of risk
- classifying astronomical image data
- learning to fly a Cessna by observing pilots on a simulator
  - data generated by watching three skilled human pilots performing a fixed flight plan 30 times each
  - a new training example was created each time a pilot took an action by setting a control variable such as thrust or flaps
  - 90,000 training examples in all
  - each example described by 20 attributes and labelled by the action taken
  - decision tree learned by Quinlan's C4.5 system and converted to C code
  - the C code was inserted into the flight simulator's control loop so that it could fly the plane by itself
  - the program learns to fly somewhat better than its teachers: the generalization process cleans up mistakes made by the human pilots
  - Reference: Sammut, C., Hurst, S., Kedzier, D., and Michie, D. (1992). Learning to fly. In Proceedings of the Ninth International Conference on Machine Learning, Aberdeen. Morgan Kaufmann.
Example: Decision tree for the concept PlayTennis
<Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong> is classified as No.
In general, decision trees represent disjunctions of conjunctions of constraints on attribute values.
The above tree is equivalent to the following logical expression:
(Outlook=Sunny ^ Humidity=Normal) v (Outlook=Overcast) v (Outlook=Rain ^ Wind=Weak)
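To make the disjunction-of-conjunctions reading concrete, here is a minimal Python sketch (illustrative only; the function name play_tennis is mine) that encodes the expression above and applies it to the instance <Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong>:

    # The PlayTennis tree written directly as a disjunction of conjunctions.
    def play_tennis(x):
        return ((x["Outlook"] == "Sunny" and x["Humidity"] == "Normal")
                or x["Outlook"] == "Overcast"
                or (x["Outlook"] == "Rain" and x["Wind"] == "Weak"))

    instance = {"Outlook": "Sunny", "Temp": "Hot",
                "Humidity": "High", "Wind": "Strong"}
    print("Yes" if play_tennis(instance) else "No")   # prints No, as in the notes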
Constructing Decision Trees
Idea: Determine which attribute of the examples makes the most difference in their classification and begin there.
Inductive bias: Prefer shallower trees that ask fewer questions.
Example: Boolean function: first bit or second bit on.
Instance | Classification
A 000    | -
B 001    | -
C 010    | +
D 011    | +
E 100    | +
F 101    | +
G 110    | +
H 111    | +
We can think of each bit position as an attribute (with value on or off). Which attribute divides the examples best?
Bits 1 and 2 are equally good starting points, but Bit 3 is less desirable.
Let's start with Bit1. The left branch of this tree is complete -- no more work is necessary. We need to complete the right branch.
Consider Bit2 vs. Bit3. Bit2 is better:
This decision tree is equivalent to the following logical expression:
Bit1(on) v (Bit1(off) ^ Bit2(on))
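As a quick sanity check, the following illustrative Python snippet (the names target and tree are mine) verifies that the expression Bit1(on) v (Bit1(off) ^ Bit2(on)) agrees with the target function "first bit or second bit on" on all eight instances:

    # Bit positions are numbered 1..3 from the left of the bit string.
    def target(bits):        # first bit or second bit on
        return bits[0] == "1" or bits[1] == "1"

    def tree(bits):          # Bit1(on) v (Bit1(off) ^ Bit2(on))
        return bits[0] == "1" or (bits[0] == "0" and bits[1] == "1")

    for b in ["000", "001", "010", "011", "100", "101", "110", "111"]:
        assert tree(b) == target(b)
    print("tree agrees with the target function on all 8 instances")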
Example: Boolean function: first and last bits equal.
Instance | Classification
A 000    | +
B 001    | -
C 010    | +
D 011    | -
E 100    | -
F 101    | +
G 110    | -
H 111    | +
Bit1, Bit2, and Bit3 look equally good at the beginning, but it turns out that Bit2 is actually worse in the long run:
These trees are equivalent to the following logical expressions:
(Bit1(on) ^ Bit3(on)) v (Bit1(off) ^ Bit3(off))
(Bit3(on) ^ Bit1(on)) v (Bit3(off) ^ Bit1(off))
(Bit2(on) ^ Bit1(on) ^ Bit3(on)) v (Bit2(on) ^ Bit1(off) ^ Bit3(off)) v (Bit2(off) ^ Bit3(on) ^ Bit1(on)) v (Bit2(off) ^ Bit3(off) ^ Bit1(off))
Constructing a decision tree in this manner is a hill-climbing search through a hypothesis space of decision trees.
The tree with the Bit2 attribute at the root corresponds to a local (suboptimal) maximum.
How do we determine the best attribute in general? Use entropy to compute information gain.
Entropy measures the heterogeneity of a set of examples.
Entropy(S) = - p(+) log2 p(+) - p(-) log2 p(-)
where p(+) and p(-) are the proportions of positive and negative examples in S.
Example:
Entropy([9+, 5-]) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940
Entropy = 0 implies that all examples in S have the same classification.
Entropy = 1 implies equal numbers of positive and negative examples.
Interpretations of Entropy:
- the number of bits needed to encode the classification of an arbitrary member of S chosen at random
- the uncertainty of the classification of an arbitrary member of S chosen at random
General case: n possible values for the classification
Range of possible values: 0 <= Entropy(S) <= log2 n
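A small Python sketch of the entropy formula for the general n-class case; the helper name entropy and the use of collections.Counter are implementation choices, not part of the notes:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy (in bits) of a list of class labels."""
        total = len(labels)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(labels).values())

    # Reproduces the [9+, 5-] example from the notes:
    print(f"{entropy(['+'] * 9 + ['-'] * 5):.3f}")   # 0.940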
Information gain measures the reduction in entropy caused by partitioning the examples according to a particular attribute:
Gain(S, A) = Entropy(S) - SUM over v in Values(A) of (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of possible values for attribute A and Sv is the subset of examples with attribute A = v.
The sum term is just the weighted average of the entropies of the partitioned examples (weighted by relative partition size).
Example: Learning the concept PlayTennis
Values(Wind) = { Weak, Strong }
S: [9+, 5-]   SWeak: [6+, 2-]   SStrong: [3+, 3-]
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(SWeak) - (6/14) Entropy(SStrong)
             = 0.940 - (8/14) 0.811 - (6/14) 1.00
             = 0.048
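The same computation in Python, working directly from the class counts given above (a sketch; the names entropy2 and gain_from_counts are mine):

    from math import log2

    def entropy2(pos, neg):
        """Two-class entropy from positive/negative counts."""
        total = pos + neg
        return -sum((n / total) * log2(n / total) for n in (pos, neg) if n)

    def gain_from_counts(s, partition):
        """Gain(S, A) from counts (pos, neg) for S and for each subset Sv."""
        total = sum(p + n for p, n in partition)
        return entropy2(*s) - sum((p + n) / total * entropy2(p, n)
                                  for p, n in partition)

    # Gain(S, Wind) with S = [9+, 5-], SWeak = [6+, 2-], SStrong = [3+, 3-]:
    print(f"{gain_from_counts((9, 5), [(6, 2), (3, 3)]):.3f}")   # 0.048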
Humidity versus Wind:
Similar computations for Outlook and Temp yield:
- Gain(S, Outlook) = 0.246   <-- chosen attribute
- Gain(S, Humidity) = 0.151
- Gain(S, Wind) = 0.048
- Gain(S, Temp) = 0.029
Resulting partially-constructed tree:
Decision Tree Learning Algorithm
DTL(Examples, TargetAttribute, Attributes)
/* Examples are the training examples. TargetAttribute is the attribute whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Returns a decision tree that correctly classifies the given Examples. */
- create a Root node for the tree
- if all Examples are positive, return the single-node tree Root, with label = Yes
- if all Examples are negative, return the single-node tree Root, with label = No
- if Attributes is empty, return the single-node tree Root, with label = most common value of TargetAttribute in Examples
- else begin
  - A <-- the attribute from Attributes with the highest information gain with respect to Examples
  - make A the decision attribute for Root
  - for each possible value v of A {
    - add a new tree branch below Root, corresponding to the test A = v
    - let Examplesv be the subset of Examples that have value v for attribute A
    - if Examplesv is empty then add a leaf node below this new branch with label = most common value of TargetAttribute in Examples
      else add the subtree DTL(Examplesv, TargetAttribute, Attributes - { A })
  }
  end
- return Root

DTL performs a hill-climbing search using information gain as a heuristic.
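Below is a compact Python sketch of the DTL procedure under the simplifying assumption that examples are dicts of attribute values; the function names, the nested-tuple tree representation, and the toy data are mine, not part of the notes:

    from collections import Counter
    from math import log2

    def entropy(examples, target):
        total = len(examples)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(x[target] for x in examples).values())

    def gain(examples, attr, target):
        remainder = 0.0
        for v in set(x[attr] for x in examples):
            subset = [x for x in examples if x[attr] == v]
            remainder += len(subset) / len(examples) * entropy(subset, target)
        return entropy(examples, target) - remainder

    def most_common(examples, target):
        return Counter(x[target] for x in examples).most_common(1)[0][0]

    def dtl(examples, target, attributes):
        labels = set(x[target] for x in examples)
        if len(labels) == 1:               # all examples share one classification
            return labels.pop()
        if not attributes:                 # no attributes left: use the majority label
            return most_common(examples, target)
        a = max(attributes, key=lambda attr: gain(examples, attr, target))
        # For brevity this branches only on values of a that occur in examples,
        # so the "Examplesv is empty" case of the pseudocode never arises here.
        branches = {v: dtl([x for x in examples if x[a] == v], target,
                           [b for b in attributes if b != a])
                    for v in set(x[a] for x in examples)}
        return (a, branches)

    # Toy data: the "first bit or second bit on" concept from earlier
    data = [{"Bit1": b[0], "Bit2": b[1], "Bit3": b[2],
             "Class": "+" if "1" in b[:2] else "-"}
            for b in ["000", "001", "010", "011", "100", "101", "110", "111"]]
    print(dtl(data, "Class", ["Bit1", "Bit2", "Bit3"]))
    # e.g. ('Bit1', {'1': '+', '0': ('Bit2', {'1': '+', '0': '-'})}) -- branch order may vary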
Characteristics of DTL
- hypothesis space is complete: every finite discrete-valued function can be represented by some decision tree, so the target function is always in H
- maintains only a single hypothesis -- no backtracking possible
- cannot determine how many other trees are consistent with the examples
- cannot pose new queries that maximize information
- uses all training examples at each step to refine the hypothesis => much less sensitivity to noise/errors
- inductive bias: shorter trees preferred; trees with high information gain attributes closer to the root preferred
Overfitting
- Hypothesis h overfits the data if there exists h' with greater error than h over the training examples but less error than h over the entire distribution of instances
- Serious problem for all inductive learning methods
- Trees may grow to include irrelevant attributes (e.g., Date)
- Noisy examples may add spurious nodes to the tree
- Effects of overfitting
Decision Tree Pruning
- Reduced-error pruning (a sketch follows below)
  1. divide the data into a training set and a validation set
  2. build a decision tree using the training set, allowing overfitting to occur
  3. for each internal node in the tree:
     - consider the effect of removing the subtree rooted at that node, making it a leaf and assigning it the majority classification of the examples at that node
     - note the performance of the pruned tree over the validation set
  4. remove the node that most improves accuracy over the validation set
  5. repeat steps 3 and 4 until further pruning is harmful
- Effects of reduced-error pruning
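A minimal Python sketch of reduced-error pruning, assuming a toy tree represented as nested dicts (attribute name, branches map, and the majority label of the training examples that reached the node); the representation, function names, and the tiny validation set are all hypothetical:

    def classify(tree, instance):
        # Walk the tree; a node whose branches have been removed acts as a
        # leaf that returns its majority label.
        while isinstance(tree, dict):
            tree = tree["branches"].get(instance.get(tree["attr"]), tree["majority"])
        return tree

    def accuracy(tree, examples):
        return sum(classify(tree, x) == x["label"] for x in examples) / len(examples)

    def internal_nodes(tree):
        if isinstance(tree, dict) and tree["branches"]:
            yield tree
            for child in tree["branches"].values():
                yield from internal_nodes(child)

    def reduced_error_prune(tree, validation):
        while True:
            base = accuracy(tree, validation)
            best = None
            for node in internal_nodes(tree):
                saved = node["branches"]
                node["branches"] = {}                 # tentatively prune this node
                delta = accuracy(tree, validation) - base
                node["branches"] = saved              # undo the tentative pruning
                if best is None or delta > best[0]:
                    best = (delta, node)
            if best is None or best[0] < 0:           # further pruning is harmful
                return tree
            best[1]["branches"] = {}                  # commit the best pruning step

    # Hypothetical toy tree and validation data (illustrative only):
    tree = {"attr": "Outlook", "majority": "Yes", "branches": {
        "Sunny": {"attr": "Humidity", "majority": "No",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes", "Rain": "Yes"}}
    validation = [{"Outlook": "Sunny", "Humidity": "High", "label": "No"},
                  {"Outlook": "Rain", "label": "Yes"}]
    reduced_error_prune(tree, validation)
    print(tree)   # the Humidity subtree is pruned without hurting validation accuracy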
- Rule post-pruning (a sketch of the precondition-pruning step follows this list)
  - infer the decision tree, allowing overfitting to occur
  - convert the tree to a set of rules (one for each path in the tree)
  - prune (generalize) each rule by removing any preconditions (attribute tests) whose removal improves its accuracy over the validation set
  - sort the pruned rules by accuracy, and consider them in this order when classifying subsequent instances
  - example:
    IF (Outlook = Sunny) ^ (Humidity = High) THEN PlayTennis = No
    Try removing the (Outlook = Sunny) condition or the (Humidity = High) condition from the rule, and select whichever pruning step leads to the biggest improvement in accuracy on the validation set (or neither, if no improvement results).
  - advantages of the rule representation:
    - each path's constraints can be pruned independently of other paths in the tree
    - removes the distinction between attribute tests that occur near the root and those that occur later in the tree
    - converting to rules improves readability
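A minimal Python sketch of the precondition-pruning step for a single rule, assuming rules are lists of (attribute, value) tests plus a conclusion; the function names and the toy validation data are hypothetical:

    def rule_accuracy(preconditions, conclusion, target, examples):
        """Accuracy of a rule over the examples it matches (1.0 if none match)."""
        matched = [x for x in examples
                   if all(x.get(a) == v for a, v in preconditions)]
        if not matched:
            return 1.0
        return sum(x[target] == conclusion for x in matched) / len(matched)

    def prune_rule(preconditions, conclusion, target, validation):
        """Repeatedly drop whichever precondition's removal most improves accuracy."""
        preconditions = list(preconditions)
        while preconditions:
            base = rule_accuracy(preconditions, conclusion, target, validation)
            candidates = [[t for t in preconditions if t != drop]
                          for drop in preconditions]
            best = max(candidates,
                       key=lambda c: rule_accuracy(c, conclusion, target, validation))
            if rule_accuracy(best, conclusion, target, validation) <= base:
                break                      # no single removal improves accuracy
            preconditions = best
        return preconditions

    # Hypothetical toy validation data for the PlayTennis rule above:
    validation = [
        {"Outlook": "Sunny",    "Humidity": "High", "PlayTennis": "Yes"},
        {"Outlook": "Rain",     "Humidity": "High", "PlayTennis": "No"},
        {"Outlook": "Overcast", "Humidity": "High", "PlayTennis": "No"},
    ]
    rule = [("Outlook", "Sunny"), ("Humidity", "High")]
    print(prune_rule(rule, "No", "PlayTennis", validation))
    # on this toy data the (Outlook = Sunny) test is dropped, leaving [('Humidity', 'High')]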
Continuous-Valued Attributes
Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals.
Example: Suppose we have training examples with the following Temp values:

Temp       | 40 | 48 | 60  | 72  | 80  | 90
PlayTennis | No | No | Yes | Yes | Yes | No

Split into two intervals: Temp < val and Temp > val.
This defines a new boolean attribute Temp>val.
How do we choose the threshold val? Consider the boundary cases val = (48+60)/2 = 54 and val = (80+90)/2 = 85.
Possible attributes: Temp>54, Temp>85
Choose the attribute with the largest information gain (as before): Temp>54
This approach can be generalized to multiple thresholds.
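The threshold choice can be checked with a short Python sketch over the Temp table above (the helper names are mine; the candidate thresholds 54 and 85 are the boundary cases from the notes):

    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                    for c in set(labels))

    def gain_for_threshold(values, labels, threshold):
        above = [lab for v, lab in zip(values, labels) if v > threshold]
        below = [lab for v, lab in zip(values, labels) if v <= threshold]
        return (entropy(labels)
                - (len(above) / len(labels)) * entropy(above)
                - (len(below) / len(labels)) * entropy(below))

    temp   = [40, 48, 60, 72, 80, 90]
    labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
    for t in (54, 85):                       # candidate boundary thresholds
        print(f"Gain(S, Temp>{t}) = {gain_for_threshold(temp, labels, t):.3f}")
    # Temp>54 gives the larger gain (about 0.459 vs. 0.191), so it is chosen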
Missing Attributes
- If a training example in S is missing an attribute value at some node during tree construction, just "guess" an appropriate value by filling it in with the most common value among the other examples at that node.
- If a novel instance is missing an attribute value that needs to be tested, determine the possible values for that attribute and ask the decision tree for a classification assuming each possible value. Then return a final classification based on the individual classifications weighted by the probabilities of each attribute value. (A sketch of this case follows below.)
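A small Python sketch of the second case -- probability-weighted classification when a tested attribute is missing -- using the same nested-tuple tree shape as the earlier DTL sketch; the tree, the value probabilities, and the function name are hypothetical:

    from collections import Counter

    def classify_with_missing(tree, instance, value_probs):
        """Classify an instance; when a tested attribute is missing, take a
        probability-weighted vote over its possible values."""
        if not isinstance(tree, tuple):                 # leaf: a class label
            return Counter({tree: 1.0})
        attr, branches = tree
        if attr in instance:
            return classify_with_missing(branches[instance[attr]], instance, value_probs)
        votes = Counter()
        for value, prob in value_probs[attr].items():   # weight each branch by P(value)
            for label, w in classify_with_missing(branches[value], instance,
                                                  value_probs).items():
                votes[label] += prob * w
        return votes

    # Hypothetical tree (nested tuples, as in the DTL sketch) and hypothetical
    # value probabilities estimated from training data:
    tree = ("Outlook", {"Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
                        "Overcast": "Yes", "Rain": "Yes"})
    value_probs = {"Outlook": {"Sunny": 5/14, "Overcast": 4/14, "Rain": 5/14},
                   "Humidity": {"High": 7/14, "Normal": 7/14}}
    votes = classify_with_missing(tree, {"Humidity": "High"}, value_probs)
    print(votes.most_common(1)[0][0], dict(votes))   # most likely label (Yes, weight 9/14)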
Gain Ratio and Split Information
- Variant of the Gain heuristic that penalizes attributes that split the data into many small groups (e.g., Date).
- GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
- SplitInformation(S, A) = entropy of the examples with respect to the values of attribute A (rather than with respect to the target classification):
  SplitInformation(S, A) = - SUM over v in Values(A) of (|Sv| / |S|) log2 (|Sv| / |S|)
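A short Python sketch of the gain-ratio computation from two-class counts, following the formulas above (function names are mine):

    from math import log2

    def entropy2(pos, neg):
        total = pos + neg
        return -sum((n / total) * log2(n / total) for n in (pos, neg) if n)

    def gain_ratio(s, partition):
        """s = (pos, neg) counts for S; partition = (pos, neg) counts for each Sv."""
        total = sum(p + n for p, n in partition)
        weights = [(p + n) / total for p, n in partition]
        gain = entropy2(*s) - sum(w * entropy2(p, n)
                                  for w, (p, n) in zip(weights, partition))
        split_info = -sum(w * log2(w) for w in weights if w)
        return gain / split_info

    # Wind from the PlayTennis example: S = [9+, 5-], SWeak = [6+, 2-], SStrong = [3+, 3-]
    print(f"{gain_ratio((9, 5), [(6, 2), (3, 3)]):.3f}")
    # Gain = 0.048 and SplitInformation (entropy of the 8/6 split) = 0.985, so ratio = 0.049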
Reference: Mitchell, T. M. (1997). Machine Learning, Chapter 3. McGraw-Hill.