
CMPT 882 - Machine Learning

Decision Tree Learning
Lecture Scribe for Week 7, February 20th
By: Mona Vajihollahi ([email protected])

Overview:

  Introduction
  Decision Tree Hypothesis Space
    Parity Function
    Decision Graph
    Algebraic Decision Diagram
  Inductive Bias
    Occam's razor
  Avoiding Overfitting the Data
    Training and Validation
    Reduced Error Pruning
    Rule Post-Pruning
  Other Issues in Decision Tree Learning
    k-ary Attribute Values
    Real Attribute Values
    Attribute Cost
    Missing Values
    Other Splitting Criteria
  References

Introduction

Decision Tree Learning is a method for approximating discrete-valued functions using decision trees. Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances; this disjunction approximates the target function. Decision tree learning is robust to noisy data and is suitable for target functions with discrete output values. Most decision tree learning algorithms use a top-down greedy search through the space of possible decision trees to find the tree that best fits the training data. ID3, a basic decision tree learning algorithm, constructs the tree by answering the question "which attribute is the best classifier?" at each step. The answer is the attribute with the highest Information Gain, which is the expected reduction in entropy¹ caused by partitioning the examples according to that attribute.

Figure 1: The entropy function of a Boolean classifier.

The following example (Exercise 3.2, p. 77, Machine Learning, Tom Mitchell) illustrates the use of entropy and information gain. Consider the following set of training examples:

Instance    Classification    a1    a2
   1              +           T     T
   2              +           T     T
   3              -           T     F
   4              +           F     F
   5              -           F     T
   6              -           F     T

¹ The concept of entropy plays an important role in information theory. Roughly speaking, entropy is a mathematical formulation of the uncertainty and/or the amount of information in a data set. The (Shannon) entropy of a variable X is defined as H(X) = − Σ_x p(x) log₂ p(x).

Figure 1 depicts the form of the entropy function relative to a Boolean classification, as the proportion p⊕ of positive examples varies between 0 and 1. Clearly, when the data set contains an equal number of positive and negative examples (i.e. p⊕ = 1/2), the entropy function takes its largest value, 1. In the above example there are 3 positive instances and 3 negative instances, so by the above discussion we can conclude that Entropy(S) = 1. Alternatively, we can calculate the entropy using the following equation:

Entropy(S) = − p⊕ log₂ p⊕ − p⊖ log₂ p⊖

So:

Entropy(S) = − (1/2) log₂(1/2) − (1/2) log₂(1/2) = 1/2 + 1/2 = 1

Now we can calculate the Information Gain of a2 relative to this training set, using the following equation:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

So:

Gain(S, a2) = Entropy(S) − (|S_{a2=T}| / |S|) Entropy(S_{a2=T}) − (|S_{a2=F}| / |S|) Entropy(S_{a2=F})
            = 1 − (2/3) · 1 − (1/3) · 1
            = 0
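To make the calculation above concrete, here is a minimal Python sketch (my own illustration, not part of the original notes; the helper names entropy and information_gain are assumptions) that reproduces Entropy(S) = 1 and Gain(S, a2) = 0 for the six training examples:

# A minimal sketch that recomputes the entropy and information gain above.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Expected reduction in entropy obtained by splitting on `attribute`."""
    n = len(examples)
    gain = entropy(labels)
    for value in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# The training set from Exercise 3.2 above.
examples = [
    {"a1": "T", "a2": "T"}, {"a1": "T", "a2": "T"}, {"a1": "T", "a2": "F"},
    {"a1": "F", "a2": "F"}, {"a1": "F", "a2": "T"}, {"a1": "F", "a2": "T"},
]
labels = ["+", "+", "-", "+", "-", "-"]

print(entropy(labels))                           # 1.0
print(information_gain(examples, labels, "a2"))  # 0.0 (up to floating point)
print(information_gain(examples, labels, "a1"))  # about 0.082

Running it should also show that Gain(S, a1) is about 0.082, so a1 would be the slightly better attribute to split on for this training set.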

Decision Tree Hypothesis Space

The hypothesis space in decision tree learning is the set of all possible decision trees. Decision tree learning algorithms search through this space to find a decision tree that correctly classifies the training data. In the resulting decision tree, each internal node performs a test on some attribute x and branches according to the result, while each leaf node specifies a class h(x). Classification is therefore straightforward: an instance is passed down the tree according to the outcomes of these tests. The space of all decision trees is a complete space of finite discrete-valued functions. Consequently, every Boolean function can be represented by a decision tree, and there are usually several different decision trees for a single Boolean function. Let us look at the use of decision trees for representing Boolean functions in some examples (Exercise 3.1, p. 77, Machine Learning, Tom Mitchell). In the following examples we start by testing variable A; in actual decision tree building (for learning), however, this may not be a good idea, and we would choose the variable with the highest information gain for the root.

a) A ∧ ¬B

A = T:
    B = T: -
    B = F: +
A = F: -

(+ and - are used instead of T and F at the leaves, to distinguish output results from attribute values.)

b) A ∨ (B ∧ C)

A = T: +
A = F:
    B = T:
        C = T: +
        C = F: -
    B = F: -

In the second branch (A = F) we must test all the remaining variables, which is not desirable. The entropy-based attribute selection tries to address this kind of problem.

c) A XOR B

A = T:
    B = T: -
    B = F: +
A = F:
    B = T: +
    B = F: -

For the XOR function entropy does not tell us anything: the information gain of every single attribute is zero, so we have to test every variable, and we cannot find any nice dependencies between the variables. Another example of such a function is the parity function.

Parity Function

With the parity function we can never conclude whether the result is 1 or 0 until we have tested all the attributes. When representing the parity function with decision trees we therefore have to split on every variable and keep building the tree all the way down. Learners usually rely on decomposing the problem, which is why such functions are difficult for decision trees and Bayesian classifiers to learn. For this reason, the parity function is often used for testing different learners.
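To illustrate this claim concretely for XOR, here is a small check (an illustration of mine; it assumes the entropy and information_gain helpers sketched after the worked example above are available): splitting the XOR truth table on either single attribute gives zero information gain.

# For XOR (2-bit parity), splitting on either attribute alone leaves the labels
# evenly mixed, so the information gain of each single attribute is 0.
xor_examples = [{"A": a, "B": b} for a in "TF" for b in "TF"]
xor_labels = ["+" if (x["A"] == "T") != (x["B"] == "T") else "-" for x in xor_examples]

for attribute in ("A", "B"):
    print(attribute, information_gain(xor_examples, xor_labels, attribute))  # both 0.0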

d) (A ∧ B) ∨ (C ∧ D)

Unlike parity and XOR, we get a break here: we do not need to test all the variables on every path.

A = T:
    B = T: +
    B = F:
        C = T:
            D = T: +
            D = F: -
        C = F: -
A = F:
    C = T:
        D = T: +
        D = F: -
    C = F: -

The two subtrees that test C and then D are identical. To avoid this replication we could use a different representation, such as a graph: a Decision Graph gives a more compact representation.

Decision Graph

Decision Graphs, such as the one depicted in Figure 2, are generalizations of decision trees, having decision nodes and leaves. The feature that distinguishes decision graphs from decision trees is that decision graphs may also contain joins. A join is represented by two nodes having a common child; it specifies that two subsets share some common properties and hence can be treated as one subset. Objects are categorized by decision graphs in the same way as by decision trees. Each decision tree and decision graph defines a categorization (i.e., some partitioning of the object space into disjoint categories). The set of decision functions representable by graphs is exactly the same as the set representable by trees; however, the sets of categorizations that can enter into the definition of a decision function are different. For example, the categorizations for (A ∧ B) ∨ (C ∧ D) given in Figure 2.a and Figure 2.b are different, since the decision tree partitions the object space into 7 categories, while the decision graph partitions it into 2 categories [1].

Figure 2: (a) A decision tree for (A ∧ B) ∨ (C ∧ D). (b) A decision graph for (A ∧ B) ∨ (C ∧ D). [1]
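To make the idea of a join concrete, here is a rough Python sketch (my own illustration, not code from [1]): the two identical C and D subtrees of example (d) are built once and shared by two parent nodes, turning the tree into a graph.

# A minimal sketch of a decision graph for (A and B) or (C and D).
# Nodes test one Boolean attribute; a "join" is simply two parents that point
# at the same child object (here, the shared C-D subgraph).
class Node:
    def __init__(self, attribute, true_branch, false_branch):
        self.attribute = attribute
        self.true_branch = true_branch      # followed when the attribute is True
        self.false_branch = false_branch    # followed when the attribute is False

    def classify(self, x):
        branch = self.true_branch if x[self.attribute] else self.false_branch
        return branch.classify(x) if isinstance(branch, Node) else branch

# The C-D subgraph is built once ...
cd = Node("C", Node("D", "+", "-"), "-")
# ... and shared by both branches that need it (this sharing is the join).
root = Node("A", Node("B", "+", cd), cd)

print(root.classify({"A": True, "B": False, "C": True, "D": True}))   # +
print(root.classify({"A": False, "B": True, "C": False, "D": True}))  # -

With this sharing the graph needs only 4 decision nodes, compared with the 6 decision nodes of the tree in example (d).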

Algebraic Decision Diagram

Another variant of decision trees is the Algebraic Decision Diagram (ADD), which is used for representing target functions. ADDs are a compact, efficiently manipulable data structure for representing real-valued functions over Boolean variables (B^n → R). They generalize a tree-structured representation by allowing nodes to have multiple parents, leading to the recombination of isomorphic subgraphs and hence to a possible reduction in the representation size [2].

Inductive Bias

It was mentioned before that the hypothesis space of ID3 is the power set of the instance space X. This means that there is no restrictive bias for decision tree learning: a restrictive bias, also called a language bias, limits the set of available hypotheses, and ID3 has no such bias. On the other hand, there is a preference bias, or search bias, in decision tree learning. The preference bias orders the available hypotheses, and in ID3 there is a preference for 1) shorter trees, and 2) trees that place high-information-gain attributes near the root. This bias often goes by the name of Occam's razor².

Occam's razor

Occam's razor says: "Prefer the simplest hypothesis that fits the data." In other words, the inductive bias of ID3 says that if there is a shorter tree, it will be preferred. The preference is for shorter trees, but finding the shortest tree is not guaranteed; in fact, ID3 is a greedy algorithm, so it cannot always find the shortest tree. Another question that arises regarding Occam's razor is: what is a short tree, a tree with fewer nodes or a shallow tree? The general idea behind Occam's razor is that we want to avoid splitting as much as possible, so shallower trees are better for us. There are a number of arguments both for and against Occam's razor. There has been a great deal of research on this topic, but we cannot yet clearly answer the question: "What is bias, really?"

² William of Occam was one of the first philosophers to discuss the question of preferring short hypotheses, around the year 1320.

Avoiding Overfitting the Data

As discussed before, in ID3 the tree is built just deeply enough to perfectly classify the training data. This strategy can produce trees that overfit the training examples. Consider the following concept C:

      X1   X2   X3   X4   X5
C:     1    1    0    0    0

Now suppose that the training set contains only the instances X3, X4 and X5. Then the learner will classify X1 and X2 as 0, i.e. H(X1) = 0 and H(X2) = 0, which is not correct: the learner has overfitted the data.

Given a hypothesis space H and a hypothesis h ∈ H, let error_train(h) be the error of h over the training examples and error_D(h) be the error of h over the entire distribution D of the data. Then h is said to overfit the training data if there exists some other hypothesis h' ∈ H such that

    error_train(h) < error_train(h')   and   error_D(h) > error_D(h').

Overfitting usually happens when there are random errors or noise in the training examples, or when there is only a small number of training examples. In the latter case we often get coincidental regularities: some attribute happens to partition the training examples very well but is irrelevant to the real target function. To see the effect of an error in the training data on the learned decision tree, consider the following example. Suppose that in the entire distribution of the data we have:

If SUNNY = T then PLAYTENNIS = T

But in the training examples there is an instance with the following attribute values:

X1    X2    ...    SUNNY    PLAYTENNIS
T     F     ...    T        F

So in reality the tree should not split any further in the SUNNY = T branch (Figure 3.a), but with the above instance in the training set the learned tree has to split on every attribute after SUNNY in the SUNNY = T branch (Figure 3.b). While the resulting tree classifies the training examples well, it fails to classify the real data correctly; we clearly have overfitting here.

Figure 3: (a) The decision tree for the real target function: the SUNNY = T branch ends in a + leaf. (b) The learned decision tree: the SUNNY = T branch continues to split on X1 and further attributes.

In general, it is very hard to avoid overfitting because usually all we have is the training data. One option is to turn to statistics: we can do a statistical analysis and use the gathered information to answer the question "how representative is the sample?" If the sample is not representative enough, we should expect overfitting. For example, in this way statisticians may calculate the error for a given concept and then give a probability measure for all concepts. However, if we have some assumptions about the real target function, it becomes much easier to detect when overfitting occurs. It is worth mentioning that there is a close relation between inductive bias and overfitting: a good inductive bias can help us avoid overfitting.

Training and Validation

Among the different approaches that have been proposed to avoid overfitting, the most common one is the training and validation set approach. In this approach the available training data is separated into two sets: a training set and a validation set. The training set is used to learn the hypothesis, and the validation set is used to evaluate the accuracy of the resulting hypothesis. In this way we get a check against overfitting, and we use the validation set to prune the tree whenever it overfits the data. The size of the validation set is also important, because it should be large enough to provide a good sample of the instances. Although there are statistical formulas for computing the required size of the validation set, one common heuristic, sketched below, is to use 1/3 of the available examples for the validation set and the remaining 2/3 for training. There are different approaches for pruning a tree using the validation set. Two of them are: 1) Reduced Error Pruning, and 2) Rule Post-Pruning.
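A minimal sketch of this split heuristic (the function name, the shuffling, and the fixed seed are my own choices, not prescribed by the notes):

# Hold out roughly one third of the labelled examples for validation.
import random

def train_validation_split(examples, validation_fraction=1/3, seed=0):
    """Shuffle the data and hold out a fraction of it as the validation set."""
    rng = random.Random(seed)
    shuffled = examples[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]   # (training, validation)

data = list(range(12))                  # stand-in for labelled training examples
training_set, validation_set = train_validation_split(data)
print(len(training_set), len(validation_set))   # 8 4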

Reduced Error Pruning

In this approach the data is split into training and validation sets, and the validation set is used to guide pruning. Pruning a decision node means removing the subtree rooted at it, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. In Reduced Error Pruning every node is considered for pruning. The pruning algorithm is as follows:

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (together with the nodes below it).
2. Greedily remove the node whose removal most improves accuracy on the validation set.

In this way each pruning step is tested on the validation set, and the smallest version of the most accurate subtree is produced. To address the problem of limited data, other techniques have been proposed that are based on the same idea; these techniques usually partition the available data several times in different ways and then average the results.
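The following is a rough sketch of this procedure (the tree representation, the helper names, and the tiny example tree are my own; learning the tree itself is not shown). Each trial prune temporarily replaces an internal node by a leaf labelled with its majority class, the prune that most improves validation accuracy is kept, and the loop stops when every remaining prune would reduce accuracy on the validation set.

# A rough sketch of reduced-error pruning. Internal nodes test one attribute and
# map each attribute value to a child; leaves carry a class label. Every node also
# remembers the majority class of the training examples that reached it, which is
# the label it receives if it gets pruned.
class TreeNode:
    def __init__(self, attribute=None, children=None, label=None, majority=None):
        self.attribute = attribute        # attribute tested here (None for a leaf)
        self.children = children or {}    # attribute value -> subtree
        self.label = label                # class label if this node is a leaf
        self.majority = majority          # most common class at this node

    def is_leaf(self):
        return self.label is not None

    def predict(self, x):
        if self.is_leaf():
            return self.label
        child = self.children.get(x[self.attribute])
        return child.predict(x) if child else self.majority

def accuracy(tree, examples):
    return sum(tree.predict(x) == y for x, y in examples) / len(examples)

def internal_nodes(tree):
    if tree.is_leaf():
        return []
    nodes = [tree]
    for child in tree.children.values():
        nodes.extend(internal_nodes(child))
    return nodes

def reduced_error_prune(tree, validation):
    """Greedily replace subtrees by majority-class leaves while validation accuracy
    does not get worse; stop when every remaining prune would be harmful."""
    while True:
        best, best_acc = None, accuracy(tree, validation)
        for node in internal_nodes(tree):
            saved = (node.attribute, node.children, node.label)
            node.attribute, node.children, node.label = None, {}, node.majority  # trial prune
            acc = accuracy(tree, validation)
            node.attribute, node.children, node.label = saved                    # undo the trial
            if acc >= best_acc:
                best, best_acc = node, acc
        if best is None:
            return tree
        best.attribute, best.children, best.label = None, {}, best.majority      # keep best prune

# A tiny hand-built tree and validation set, just to exercise the routine: the
# split on X1 under SUNNY = T only exists because of a noisy training example.
noisy = TreeNode(attribute="X1", children={"T": TreeNode(label="-", majority="-"),
                                           "F": TreeNode(label="+", majority="+")},
                 majority="+")
tree = TreeNode(attribute="SUNNY", children={"T": noisy,
                                             "F": TreeNode(label="-", majority="-")},
                majority="+")
validation = [({"SUNNY": "T", "X1": "T"}, "+"),
              ({"SUNNY": "T", "X1": "F"}, "+"),
              ({"SUNNY": "F", "X1": "T"}, "-")]

print(accuracy(tree, validation))     # about 0.67 before pruning
reduced_error_prune(tree, validation)
print(accuracy(tree, validation))     # 1.0 after the X1 node is pruned away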

Rule Post-Pruning

Another approach for pruning the learned tree is Rule Post-Pruning. In this approach we convert the decision tree into an equivalent set of logical rules and then prune each rule separately. In this logical framework, pruning a rule amounts to generalizing it. The generalized rules are then sorted according to their accuracy and applied in that order. Here the validation set is used to estimate the accuracy of the resulting rules.
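As a small illustration of the first step (the tree below is a hypothetical, hand-written PlayTennis-style tree, not one learned in these notes), every root-to-leaf path becomes one IF-THEN rule, and each rule can then be pruned and evaluated independently:

# Convert a decision tree into one IF-THEN rule per root-to-leaf path.
# The tree is written as ("Attribute", {value: subtree_or_leaf_label, ...}).
tree = ("Outlook", {"Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
                    "Overcast": "Yes",
                    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"})})

def tree_to_rules(node, preconditions=()):
    if isinstance(node, str):                       # a leaf: emit one finished rule
        return [(preconditions, node)]
    attribute, children = node
    rules = []
    for value, child in children.items():
        rules.extend(tree_to_rules(child, preconditions + ((attribute, value),)))
    return rules

for conditions, label in tree_to_rules(tree):
    print("IF " + " AND ".join(f"({a} = {v})" for a, v in conditions)
          + f" THEN PlayTennis = {label}")

Pruning the rule "IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No" would then mean dropping one of its preconditions whenever the rule's estimated accuracy does not get worse.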

Other Issues in Decision Tree Learning

There are more topics that have been addressed in decision tree learning. Some of them are:

k-ary Attribute Values
Problem: When an attribute has many values, it will tend to be selected because of its high information gain.
Solution: Use the gain ratio.
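For reference, the gain ratio (from Mitchell [3], not spelled out in these notes) divides the information gain by the split information, which penalizes attributes with many values:

    SplitInformation(S, A) = − Σ_{i=1..c} (|S_i| / |S|) log₂ (|S_i| / |S|)

    GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

where S_1, ..., S_c are the subsets of S produced by partitioning S on the c values of A.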

Real Attribute Values
Problem: The attributes tested in the decision nodes must be discrete-valued.
Solution: Define new discrete-valued attributes by partitioning the real attribute values into discrete intervals.
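A tiny sketch of the usual discretization trick (the temperature numbers below are made up for illustration): sort the examples by the real-valued attribute and place a candidate threshold midway between adjacent values whose classifications differ; each candidate Boolean test "value > t" is then scored by information gain like any other attribute.

# Candidate thresholds for discretizing a real-valued attribute.
temperatures = [(40, "No"), (48, "No"), (60, "Yes"), (72, "Yes"), (80, "Yes"), (90, "No")]

candidates = [(a + b) / 2
              for (a, label_a), (b, label_b) in zip(temperatures, temperatures[1:])
              if label_a != label_b]
print(candidates)   # [54.0, 85.0]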

Attribute Cost
Problem: Different attributes sometimes have different costs. How can we find the tree with the lowest cost?
Solution: Include the cost of each attribute in the gain function.

Missing Values
Problem: In some examples the value of an attribute is missing.
Solution: There are different strategies that use the training examples sorted to the relevant node of the tree (e.g. if node n tests attribute A and A's value is missing, assign the most common value of A among the training examples at node n).

Other Splitting Criteria
We can use different measures for selecting attributes. For example, we can consider not only the entropy but also the size of a set when selecting an attribute.

References
[1] Jonathan J. Oliver. Decision Graphs: An Extension of Decision Trees, 1992.
[2] Robert St-Aubin, Jesse Hoey and Craig Boutilier. APRICODD: Approximate Policy Construction using Decision Diagrams.
[3] Tom M. Mitchell. Machine Learning, 1997.
[4] Entropy on the World Wide Web, http://www.math.psu.edu/gunesch/entropy.html
