
**Updated**: 2018-03-02 · **Source**: original or reposted · **Views**: 29 · **Word count**: 22749

Trees are great. They provide food, air, shade, and all the other
good stuff we enjoy in life. Decision trees, however, are even cooler.
True to their name, decision trees allow us to figure out *what* to do with all the great data we have in life.

Like it or not, you have been working with decision trees your entire life. When you say, “If it’s raining, I will bring an umbrella,” you’ve just constructed a simple decision tree.

It’s a pretty small tree, and doesn’t account for all situations. Likewise, this simplistic decision making process won’t work very well in the real world. What if it’s windy? “If it’s raining and isn’t too windy,” you’ll say. “I will bring an umbrella. Otherwise, I will bring a rain jacket.”

Better. We’ve added some *leaves* that represent the different choices we can make, and the *branches* of the tree represent “yes” and “no”. Although the decision tree represented above resembles an upside-down tree, it’s starting to look large enough to handle more situations. But what if the wind is extremely strong, like a hurricane? A rain jacket won’t do much good then.

Probably best to stay inside then.

Our tree’s starting to grow bigger, but what if it’s snowing? Or hailing? Our decision tree will need to grow a lot more in order to flourish in our complicated, rainy planet.

Easy to understand? Good, because simplicity is one of the biggest
advantages of decision trees. Decision trees are very interpretable and
understandable—they allow people to see *exactly* how the
computer arrived at its current conclusion. To make a prediction on a
new observation, we first find out which region that observation belongs
to and then return either the mean of the data in that region if we’re
predicting a number, or the most common labeled value in that region if
we’re classifying things.
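As a quick sketch, here’s what that prediction rule looks like in Python (the region’s labels and values below are made up for illustration):

```python
from collections import Counter
from statistics import mean

# Hypothetical training points that fall in the same region as a new
# observation (both the labels and the values are invented).
region_classes = ["setosa", "setosa", "versicolor", "setosa"]
region_values = [3.1, 2.9, 3.4]

# Classification: predict the most common label in the region.
predicted_class = Counter(region_classes).most_common(1)[0][0]

# Regression: predict the mean of the values in the region.
predicted_value = mean(region_values)

print(predicted_class, round(predicted_value, 2))  # setosa 3.13
```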

But decision trees have some major disadvantages. Although they are intuitive and interpretable, decision trees by themselves do not match the predictive accuracy of other popular machine learning algorithms. This is because classification and regression trees tend to draw overly complicated decision boundaries, which increases model variance and leads to overfitting: more erratic predictions and higher error on unfamiliar data. In practice, however, we can counteract this with pruning algorithms, which reduce the depth and complexity of the tree by removing nodes (i.e., questions) beyond a certain depth so that the tree does not overfit the training data.

Let’s say we wanted to classify different species of iris flowers (Iris Setosa, Iris Versicolour, Iris Virginica) given 4 quantities: the width and length of the flower’s sepals (the leaf-like parts that enclose the flower bud), as well as the width and length of the flower’s petals. What kinds of questions should we ask in order to construct a good decision tree?

Say we pick a particular variable, such as flower petal length, and a corresponding value of that variable, such as 2.45 cm, to narrow down the possible species of a particular iris. Now we can divide all the training data into two groups: the data points whose petal length is smaller than or equal to 2.45 cm, and the data points whose petal length is greater than 2.45 cm. By choosing more variables and values, we can divide the data into finer categories, then find out which species each category corresponds to.
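Here’s a minimal Python sketch of that first split; the (petal length, species) pairs are invented for illustration, not real iris measurements:

```python
# A minimal sketch of a single split; the (petal_length, species)
# pairs are made-up illustrative values, not the real iris data.
data = [(1.4, "setosa"), (1.7, "setosa"), (4.5, "versicolor"),
        (5.9, "virginica"), (1.3, "setosa"), (4.8, "virginica")]

threshold = 2.45  # petal length in cm

left = [species for length, species in data if length <= threshold]
right = [species for length, species in data if length > threshold]

print(left)   # ['setosa', 'setosa', 'setosa'] -- this side is pure
print(right)
```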

How a decision tree partitions the data into different classes. The decision boundaries create regions that can be associated with classes. Non-contiguous regions can also share the same class. Notice how the decision boundaries are always exactly horizontal or vertical; decision trees are great for creating "boxy" decision boundaries.

By varying the height (i.e. number of levels) of the decision tree, we can control the number of regions we split the data into. As always, trees that are too short can underfit, and trees that are too tall can overfit.

Regression and classification work similarly for decision trees: in both, we choose variables and values to partition the data points. However, instead of assigning a class to a region as in classification, a regression decision tree returns the average of all the data points in that region. Why an average? Because it minimizes the squared error of the decision tree’s predictions.
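We can check that claim numerically; the region values below are hypothetical:

```python
# Why predict the mean? For a fixed region, the constant prediction
# that minimizes the sum of squared errors is the mean of the values.
values = [75.0, 100.0, 90.0, 85.0]  # hypothetical prices in one region
mean_val = sum(values) / len(values)  # 87.5

def sse(prediction, values):
    # Sum of squared errors of one constant prediction over the region.
    return sum((v - prediction) ** 2 for v in values)

# The mean beats nearby candidate predictions.
assert sse(mean_val, values) < sse(80.0, values)
assert sse(mean_val, values) < sse(95.0, values)
print(mean_val)  # 87.5
```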

There’s also a variation where the decision tree fits a regression line to the data points of each region, creating a jagged piecewise line. However, trees constructed this way are more prone to overfitting, especially in regions with fewer data points, because noise is weighted more than it should be.

How do we tell if the variable and value we’ve chosen are good ones? It helps to think of the decision tree as an “organizer”. If we were to sort a hundred blue and red socks into several drawers, would it be better if each drawer had a mix of both colors, or if each drawer held socks of only one color? Contrary to the wishes of lazy and disorganized children, drawers with socks of one color are more organized and easier to navigate.

Similarly, we’d like our decision tree to organize data points so that it separates data points (i.e., socks) into regions (i.e., drawers) that are as “pure” as possible. This means that as we build the decision tree, we *always* choose the split that yields the greatest information gain. More concretely, we choose a value such that each resulting region is largely made up of data points from one category. By measuring how “pure” each region is, we can tell whether our chosen value is a good one.
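One common purity measure is Gini impurity; here’s a small Python sketch using the sock analogy (the colors and counts are made up):

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 0 for a pure region, higher for mixed regions.
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = ["blue"] * 10                # one-color drawer
mixed = ["blue"] * 5 + ["red"] * 5  # half-and-half drawer

print(gini(pure))   # 0.0
print(gini(mixed))  # 0.5
```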

There’s a story of a fair where visitors would guess the weight of a bull, and whoever came closest to the actual weight won a prize. The organizer noticed that although none of the individual guesses were exactly correct, the *average* of the guesses was surprisingly accurate. Why is this? Intuitively, think about *who* is doing the guessing. Some people are experts on certain things. A farmer is more likely to know the weight of the bull than someone visiting from the city, so he can make a reasonably accurate guess. City dwellers are more likely to be far off the mark (high variance). However, their lack of knowledge produces an interesting behavior: they are just as likely to overestimate the bull’s weight as to underestimate it. Thus the average of their estimates yields a better estimate than any individual city dweller’s guess.
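We can simulate this story in a few lines of Python; the bull’s weight and the guessers’ spread are invented numbers:

```python
import random

random.seed(0)
true_weight = 1200  # hypothetical bull weight in pounds

# City dwellers: unbiased but high-variance guesses (sigma = 300).
guesses = [true_weight + random.gauss(0, 300) for _ in range(1000)]

average_guess = sum(guesses) / len(guesses)
typical_error = sum(abs(g - true_weight) for g in guesses) / len(guesses)

print(round(abs(average_guess - true_weight), 1))  # error of the average
print(round(typical_error, 1))                     # error of a typical guess
```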

Although the bull story concerns regression, we can apply the same idea of aggregating a bunch of guesses to classification as well. If we have a bunch of classifiers, such as decision trees, we can aggregate them to create a much better classifier. Classifiers are simply models of any kind (decision trees, SVMs, neural networks, etc.) that can classify data better than random guessing. That means even though decision trees, like city dwellers, are individually not very accurate classifiers, combining their predictions yields a much more accurate result with lower variance, because it’s less likely for the majority of decision trees to guess incorrectly.

However, having multiple classifiers won’t be useful if they are all identical, so we must train them in ways that make each specialize in some aspect of the problem. Two examples of this are bagging and boosting, which are covered in the next section.

In bagging, we throw our training data into a proverbial “bag” and
repeatedly sample (take individual data points) from that bag, *putting the data back into the bag every time*.
Then we use those data points to train our classifier. This is called
sampling with replacement, or bootstrapping. By repeating this process
with many classifiers, we can create multiple classifiers, all slightly
different from one another.
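A minimal sketch of bootstrapping in Python (the ten data points are placeholders for real training examples):

```python
import random

random.seed(1)
training_data = list(range(10))  # ten hypothetical data points

# Each bootstrap sample is the same size as the original, drawn with
# replacement: some points repeat, others are left out entirely.
samples = [random.choices(training_data, k=len(training_data))
           for _ in range(3)]

for s in samples:
    print(sorted(s))
```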

Why put the data back into the bag every time? Remember that we don’t want to train our classifiers on exactly the same data, but if we don’t have much data to begin with, the amount available to train each classifier is quite limited. Sampling with replacement lets each training subset remain statistically representative of the full dataset, which keeps the distribution of our data intact and reduces how much the final classifiers’ predictions vary from input to input.

When actually using the ensemble to classify data, we have every classifier make a decision. The ultimate decision of the ensemble is decided by majority vote among the classifiers (in the case of classification) or by an average of all the classifiers’ individual predictions (in the case of regression).
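Both decision rules fit in a few lines; the per-model outputs below are hypothetical:

```python
from collections import Counter
from statistics import mean

# Hypothetical outputs from five classifiers for one test point.
votes = ["cat", "dog", "cat", "cat", "dog"]
majority = Counter(votes).most_common(1)[0][0]
print(majority)  # cat -- 3 votes to 2

# Hypothetical outputs from three regressors for the same point.
predictions = [310.0, 295.0, 305.0]
print(mean(predictions))
```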

Why does bagging work?

Why does bagging reduce the variance of the final ensemble model’s predictions? From basic probability theory, we can represent the decision trees as N i.i.d. (independent and identically distributed) models (Z1, Z2, Z3, …, ZN), each of whose predictions has variance σ². These N decision tree models are trained on N different samples of the original data generated via a statistical technique called bootstrapping, where we sample observations with replacement from the original dataset to create several different datasets.

But if we combine the N predictions into one prediction by averaging them, the variance of the combined prediction will be σ²/N. Hence, averaging the predictions of multiple classifiers drastically reduces the variance of the bagging ensemble classifier that combines these N decision trees.
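We can sanity-check the σ²/N claim with a quick simulation (N and σ are arbitrary choices, and the simulation assumes the idealized case of fully independent trees):

```python
import random
from statistics import pvariance

random.seed(42)
N = 25          # number of (assumed independent) trees
sigma = 10.0    # per-tree prediction standard deviation

# Each ensemble prediction is the average of N independent per-tree
# predictions, each with variance sigma^2; simulate many of them.
ensemble_preds = [sum(random.gauss(0.0, sigma) for _ in range(N)) / N
                  for _ in range(5000)]

print(round(pvariance(ensemble_preds), 2))  # close to sigma**2 / N = 4.0
```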
We can train many decision trees and use them as classifiers, creating a *random forest*.
However, there remains a major problem with bagging decision trees:
What happens if the individual decision trees are too similar? In other
words, what if they all ask the same questions? Then we lose the benefit
of having multiple decision trees, because they will all behave exactly
the same; if one decision tree misclassifies something, chances are
that the other decision trees will also make the same mistake. Can we
modify our original approach to bagging so that we can create even more
diverse sets of decision trees?

Let us describe the random forests approach more formally. Say we are trying to predict home prices with p = 9 variables, such as size, location, square footage, proximity to schools, neighborhood characteristics, and the number of bedrooms. We take our original data and generate several samples from it using bootstrapping, then train N decision trees on the N different samples (same as before), but with one major caveat: whenever we make a new split in a tree, we select among a randomly chosen subset of size m of the p variables as candidate variables to split on. At each new split point, we randomly choose m < p variables to be in our subset of candidates, and consider only those m variables. So, for our first decision tree’s first split point, we might consider only location, square footage, and number of bedrooms (m = 3) while ignoring the remaining six features (p = 9). Typically, m is chosen on the order of the square root of p so that the random forests procedure produces reasonably decorrelated decision trees. Once we have N different decision tree models, we average all of their predictions, just as in bagging, to get the ensemble model’s final prediction.
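A sketch of the per-split feature subsampling; only six of the nine features are named in the text, so the last three names here are made up to reach p = 9:

```python
import math
import random

random.seed(0)
# The six features named in the text, padded with three invented ones
# ("age", "lot_size", "garage") to reach p = 9.
features = ["size", "location", "square_footage", "school_proximity",
            "neighborhood", "bedrooms", "age", "lot_size", "garage"]

m = round(math.sqrt(len(features)))  # m = sqrt(9) = 3

# At every split point, draw a fresh random subset of m candidates.
candidates = random.sample(features, m)
print(candidates)
```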

Imagine a bunch of people sitting around a table making an important business decision for their company, such as how much to pay to acquire a startup. Everyone wants the company to do something slightly different, so how do you, the leader, ultimately decide what to do? One option is to use past experience to decide how much to trust each person’s judgment. If someone has historically made bad decisions, it makes sense to trust them less than someone who has never led the company astray. For example, suppose an unreliable person suggests buying the startup for 20 million, an experienced veteran values it at 50 million, and you privately think it’s worth around 30 million. You most likely trust yourself, and you trust the veteran more than the unreliable person, so it makes sense to set the valuation somewhere between your own estimate (30 million) and the veteran’s (50 million).

AdaBoost, a widely used boosting algorithm, functions on a similar principle. Given a bunch of regression or classification models, it judges the credibility of each model by testing it on held-out data. If a particular model has high accuracy, we trust it more; if it has low accuracy, we trust it less. In this case, by “trust” we mean how much influence, or *weight*, the model’s decision has over the overall result. So instead of taking a simple average of all the models’ outputs, we take the *weighted sum* of their outputs, where the weights are directly proportional to the accuracy of each model. In fact, you can think of the average as a weighted sum where all models are weighted equally.
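Using the startup-valuation numbers from the story, with invented accuracy scores standing in for each person’s track record:

```python
# Hypothetical outputs and accuracies for the three "advisors" from
# the story: the unreliable person, the veteran, and you.
outputs = [20.0, 50.0, 30.0]     # valuations in millions
accuracies = [0.5, 0.9, 0.8]     # assumed historical accuracy of each

# Turn accuracies into weights that sum to 1, then take a weighted sum.
total = sum(accuracies)
weights = [a / total for a in accuracies]
weighted_prediction = sum(w * o for w, o in zip(weights, outputs))

simple_average = sum(outputs) / len(outputs)
print(round(simple_average, 2), round(weighted_prediction, 2))  # 33.33 35.91
```

Notice how the weighted prediction is pulled toward the more trusted models, away from the plain average.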

In the case of regression, the output of the ensemble model is simply the weighted sum. In the case of classification, however, the weighted sum has to be mapped back to a class. For example, if the output is “yes” or “no”, we could make values > 0.5 mean “yes” and values ≤ 0.5 mean “no”. If there are more classes, we can represent each class as a vector, such as <1, 0, 0> for class 0, <0, 1, 0> for class 1, and <0, 0, 1> for class 2; this representation is called one-hot encoding. Taking the weighted sum of all the outputs then gives a vector such as <0.24, 0.8, 0.7>. Since 0.8 is the largest number and its position corresponds to class 1, the overall output of the ensemble model is class 1.
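The argmax step looks like this in Python, using the vote vector from the text:

```python
# The weighted vote vector from the text: class 1 has the most weight.
weighted_votes = [0.24, 0.8, 0.7]

# The ensemble's class is the index of the largest entry (argmax).
predicted_class = max(range(len(weighted_votes)),
                      key=lambda i: weighted_votes[i])
print(predicted_class)  # 1
```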

We can also use another method for boosting regression models. Let’s use the previous scenario of predicting the price of a house.

First, we train a decision tree on all the inputs in the training data; call the resulting model f1(x). This decision tree is bound to have predictions that are slightly off from the actual values, so we compute the error of each prediction. For example, if our decision tree guessed 100k for the price of a house when the actual price was 75k, then our error is

error = actual value − predicted value = 75k − 100k = −25k.

This tells us that if we had subtracted 25k (or equivalently, added −25k) to the prediction, we would have obtained the correct value. More generally,

actual value = predicted value + error

How do we improve our prediction? One way is to create a second decision tree f2(x) that predicts the *error* of the first tree, because as observed above, if we can add the error to the predicted value of the first tree, we will obtain the actual value!

In other words, y ≈ f1(x) + f2(x)

Nevertheless, our second tree will still have errors in its prediction. As a result, we train a third decision tree to predict the errors of the second, and so on until we have trained some prespecified number of decision trees. Then, we simply add the predictions of all the decision trees to obtain a more accurate result.
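Here’s a toy sketch of this residual-fitting loop, using one-threshold “stump” regressors on made-up house-price data (real gradient boosting uses deeper trees and a learning rate, so this only illustrates the core idea):

```python
from statistics import mean

# Made-up 1-D training data: feature xs, house prices ys (in thousands).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [75.0, 80.0, 90.0, 150.0, 160.0, 155.0]

def fit_stump(xs, ys):
    # Try every midpoint threshold; keep the one with the lowest SSE.
    best = None
    for t in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

# Each new stump is fit to the errors left over by the running prediction.
stumps = []
residuals = ys[:]
for _ in range(3):
    stump = fit_stump(xs, residuals)
    stumps.append(stump)
    residuals = [r - stump(x) for x, r in zip(xs, residuals)]

# The ensemble's prediction is the sum of all the stumps' predictions.
predict = lambda x: sum(s(x) for s in stumps)
print([round(predict(x), 1) for x in xs])
```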

However, putting so much effort into correcting errors tends to leave the ensemble, or forest, of decision trees prone to overfitting, because as you build more and more decision trees to correct errors made by the previous decision trees, even small errors and noise in the data will be “predicted” and corrected, resulting in an ensemble that is very accurate on training data, but not as accurate on testing data.

Before we try applying novel forms of ensemble learning to decision trees, let’s recap the basic strategies that bagging and boosting use to create a diverse set of classifiers. In bagging, we create multiple copies of the original training data set using bootstrapping, fit a decision tree to each copy, and average the predictions of all the trees to make the final ensemble model’s prediction. In boosting, we iteratively train each decision tree on the errors of the previous decision trees, slowly building a final ensemble model that is the sum of the individual decision tree models.

Before neural networks became popular, decision trees were the state of the art in machine learning. Although current models based on neural networks often outperform decision trees and random forests, there is much to gain from the ensemble techniques outlined in this post. With ensemble models, you can leverage the power of multiple models, including decision trees and neural networks, to compensate for the individual irregularities or weaknesses of each model.

