
**Updated**: 2018-03-02 | **Source**: original or from the web

Here’s a riddle:

So what does this have to do with machine learning? Well, it turns
out that machine learning algorithms are not that much different from
our friend Doge: *they often run the risk of over-extrapolating or over-interpolating from the data that they are trained on.*

There is a very delicate balancing act when machine learning
algorithms try to predict things. On the one hand, we want our algorithm
to model the training data very closely, otherwise we’ll miss relevant
features and interesting trends. However, on the other hand we don’t
want our model to fit *too* closely, and risk over-interpreting every outlier and irregularity.

The Fukushima power plant disaster is a devastating example of overfitting. When designing the power
plant, engineers had to determine how often earthquakes would occur.
They used a well-known law called the Gutenberg-Richter Law, which gives
a way of predicting the probability of a very strong earthquake from
the frequency at which very weak earthquakes occur. This is useful because
weak earthquakes (ones so weak that you can’t even feel
them) happen almost all the time and are recorded by geologists, so the
engineers had quite a large dataset to work with. Perhaps the most
important result of this law is that the relationship between the
magnitude of an earthquake and the logarithm of the probability that it
happens is *linear*.
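To make the "linear in the logarithm" idea concrete, here is a minimal sketch. The law is usually written as log10(N) = a - b*M, where N is the rate of earthquakes of magnitude at least M. The catalogue below is made up (we simply chose a = 6.0 and b = 1.0 for illustration), so the numbers it prints say nothing about Fukushima; the point is only that a straight-line fit on weak quakes can be extrapolated to strong ones.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical catalogue: yearly rate of quakes with magnitude >= M,
# generated from log10(rate) = a - b*M with made-up a = 6.0 and b = 1.0.
magnitudes = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0]
annual_rates = [10 ** (6.0 - 1.0 * m) for m in magnitudes]

# Fit a straight line to log10(rate) versus magnitude.
intercept, slope = fit_line(magnitudes, [math.log10(r) for r in annual_rates])

def return_period_years(magnitude):
    """Average number of years between quakes of at least this magnitude."""
    return 1.0 / 10 ** (intercept + slope * magnitude)

print(f"slope = {slope:.3f}")
print(f"return period for M >= 9: {return_period_years(9.0):.0f} years")
```

Because the fit is a straight line, the extrapolation to magnitude 9 depends entirely on the slope, which is why bending the line even slightly changes the predicted return period so much.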

The engineers of the nuclear power plant used earthquake data from the past 400 years to train a regression model. Their prediction looked something like this:

Source: Brian Stacey, Fukushima: The Failure of Predictive Models

The diamonds represent actual data while the thin line shows the engineers’ regression. Notice how their model hugs the data points very closely. In fact, their model makes a kink at around a magnitude of 7.3 – decidedly not linear.

In machine learning jargon, we call this *overfitting*. As the
name implies, overfitting is when we train a predictive model that
“hugs” the training data too closely. In this case, the engineers knew
the relationship should have been a straight line but they used a more
complex model than they needed to.

If the engineers had used the correct linear model, their results would have looked something like this:

Notice there’s no kink this time, so the line isn’t as steeply sloped on the right.

The difference between these two models? The overfitted model predicted one earthquake of at least magnitude 9 about every 13,000 years, while the correct model predicted one about every 300 years. Because of this, the Fukushima Nuclear Power Plant was built to withstand an earthquake of only magnitude 8.6. The 2011 earthquake that devastated the plant was of magnitude 9, which corresponds to about 2.5 times the ground-motion amplitude (and about 4 times the energy) of a magnitude 8.6 earthquake.
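The gap between magnitude 8.6 and magnitude 9 looks small because the scale is logarithmic. A quick sketch of the arithmetic, using the standard rules of thumb that one whole magnitude unit means 10 times the ground-motion amplitude and about 31.6 times (10 to the power 1.5) the released energy; the function names are ours:

```python
def amplitude_ratio(m_big, m_small):
    """Ratio of ground-motion amplitudes: 10x per whole magnitude unit."""
    return 10 ** (m_big - m_small)

def energy_ratio(m_big, m_small):
    """Ratio of released energy: 10**1.5 (about 31.6x) per whole unit."""
    return 10 ** (1.5 * (m_big - m_small))

print(f"amplitude: {amplitude_ratio(9.0, 8.6):.2f}x")
print(f"energy:    {energy_ratio(9.0, 8.6):.2f}x")
```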

There is actually a dual problem to overfitting, which is called
underfitting. In our attempt to reduce overfitting, we might actually
begin to head to the other extreme and our model can start to *ignore* important features of our data set. This happens when we choose a model
that is not complex enough to capture these important features, such as
using a linear model when a quadratic is necessary.

While the linear model (the blue line) follows the data (the purple X's), it misses the underlying curving trend of the data. A quadratic model would have been better.
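One way to see underfitting in numbers is to look at the residuals (data minus prediction). In this small sketch we fit a straight line to noise-free toy data that follows y = x squared (our own made-up data, not the data in the figure). The line is the best a linear model can do, yet the residuals are systematically positive at both ends and negative in the middle, which is the signature of a missed curve.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy data (assumed for illustration): 21 points on the parabola y = x**2.
xs = [i / 10 for i in range(-10, 11)]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Left end, middle, and right end of the data.
print(round(residuals[0], 3), round(residuals[10], 3), round(residuals[-1], 3))
```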

So we want to avoid overfitting because it gives too much predictive power to quirks in our training data. But in our attempt to reduce overfitting we can also begin to underfit, ignoring important features in our training data. So how do we balance the two?

In the field of machine learning this incredibly important problem is known as the bias-variance dilemma. It’s entirely possible to have state-of-the-art algorithms, the fastest computers, and the most recent GPUs, but if your model overfits or underfits to the training data, its predictive powers are going to be terrible no matter how much money or technology you throw at it.

The name bias-variance dilemma comes from two terms in statistics: **bias**, which corresponds to underfitting, and **variance**, which corresponds to overfitting.

Our example of underfitting from above. The blue line is our model, and the purple X's are the data that we are trying to predict.

The drawing above depicts an example of high **bias**. In other words, the model is *underfitting*.
The data points obviously follow some sort of curve, but our predictor
isn’t complex enough to capture that information. Our model is *biased* in that it assumes that the data will behave in a certain fashion
(linear, quadratic, etc.) even though that assumption may not be true. A
key point is that there’s nothing wrong with our training—this is the *best* possible fit that a linear model can achieve. There is, however, something wrong with the *model* itself in that it’s not complex enough to model our data.

Approximating data using a very complex model. Notice how the line tends to over-interpolate between points. Even though there is a general upward trend in the data, the model predicts huge oscillations.

In this drawing, we see an example of a model with very high **variance**. In other words, a model that is *overfitting*.
Again, the data points suggest a sort of graceful curve. However, our
model uses a very complex curve to get as close to every data point as
possible. Consequently, a model with high variance has very low bias
because it makes little to no assumption about the data. In fact, it
adapts *too* much to the data.

Again, there’s nothing wrong with our training. In fact, our predictor hits *every* single data point and is by most metrics perfect. It is actually our *model itself* that’s the problem. It wants to account for every single piece of data
perfectly and thus over-generalizes. A model that over-generalizes has
high variance because it varies too much based on insignificant details
about the data.

An example of high “variance” in everyday life from xkcd. ‘Girls sucking at math’ in this case is an overgeneralization based on only a single data point.

So, why is there a trade-off between bias and variance anyways? How come we can’t have the best of both worlds and have a model that has both low bias and low variance? It turns out that bias and variance are actually side effects of one factor: the complexity of our model.

For the case of high bias, we have a very simple model. In our example above we used a linear model, possibly the simplest model there is. And for the case of high variance, the model we used was super complex (think squiggly).

The image above should make it a bit clearer why the bias-variance trade-off exists. Whenever we choose a model with low complexity (like a linear model) and thus low variance, we’re also choosing a model with high bias. If we increase the complexity of our model, say by moving to a quadratic model, we get lower bias at the cost of higher variance. The best we can do is settle somewhere in the middle of the spectrum, where the purple pointer is.
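The trade-off can also be measured directly. The sketch below is our own illustration, not the code behind the figure: it uses a k-nearest-neighbours regressor as a stand-in model, where a small k is a very flexible (complex) model and a large k is a very rigid (simple) one. We draw many noisy training sets from a sine wave, refit each time, and look at the predictions at a single point: how far their average is from the truth (bias) and how much they scatter (variance). The sine wave, noise level, and all names are assumptions.

```python
import math
import random

def knn_predict(train_x, train_y, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k

def bias_and_variance(k, x0, n_points=50, n_datasets=300, noise=0.3, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    xs = [2 * math.pi * i / (n_points - 1) for i in range(n_points)]
    preds = []
    for _ in range(n_datasets):
        # A fresh noisy training set drawn from the same underlying sine wave.
        ys = [math.sin(x) + rng.gauss(0, noise) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = sum(preds) / len(preds)
    bias_sq = (mean_pred - math.sin(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    return bias_sq, variance

x0 = math.pi / 2  # the peak of the sine wave
for k in (1, 5, 50):  # from most complex to simplest
    b2, var = bias_and_variance(k, x0)
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}")
```

With k=50 the model averages every training point, so it barely changes between training sets but sits far from the peak; with k=1 it tracks a single noisy point, so it is right on average but jumps around.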

*In this visualization we fit a model (the blue line) to some
noisy sine wave data (the red crosses). By moving the slider you can
manually adjust the complexity of the model and watch as it overfits and
underfits.*

*The bottom graph shows the training error and the test error in
red and blue respectively. Notice how training error always decreases
with increasing complexity, but test error will hit a minimum at some
point, and then increase after that. You can always tell whether you are
underfitting or overfitting by looking at the training and test errors.*

The top image shows models of different complexities fitting to the data. The bottom image shows the error of the model against a training set (red) and a test set (blue).

We are training a locally-weighted linear regression model (LOWESS) on a sine wave with added Gaussian noise. The smoothing parameter is used as a proxy for complexity, and the error is the average L1 distance. Credit to agramfort for the LOWESS implementation. You can find the code that generated this visualization here.

Low complexity will result in poor accuracy (and thus high error) for both training and test data. You can see this on the left of the bottom graph where both the red line (training error) and blue line (test error) are very high. This is because the model inherently lacks enough complexity to describe the data at all.

On the other hand, high complexity models will result in a low
training error and a high test error. You can see this happening on the
right side of the bottom graph, where the red data points decrease but
the blue data points increase. This is because a complex model will be
able to model the training data a bit *too* well, and thus can’t generalize to the test data.

Notice that the best complexity lies where the test error reaches a minimum, that is, somewhere in between a very simple and a very complex model.
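Here is a rough, self-contained way to reproduce the shape of those two error curves. It is not the LOWESS code behind the visualization: we swap in a k-nearest-neighbours regressor, where a smaller k plays the role of higher complexity, and we measure the average L1 distance on a noisy sine wave. The data sizes, noise level, and names are all assumptions.

```python
import math
import random

def knn_predict(train_x, train_y, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_abs_error(train_x, train_y, xs, ys, k):
    """Average L1 distance between the targets ys and the predictions at xs."""
    total = sum(abs(y - knn_predict(train_x, train_y, x, k)) for x, y in zip(xs, ys))
    return total / len(xs)

rng = random.Random(0)
noise = 0.3
train_x = [2 * math.pi * i / 99 for i in range(100)]
train_y = [math.sin(x) + rng.gauss(0, noise) for x in train_x]
test_x = [rng.uniform(0, 2 * math.pi) for _ in range(500)]
test_y = [math.sin(x) + rng.gauss(0, noise) for x in test_x]

results = {}
for k in (100, 30, 10, 5, 1):  # from simplest to most complex
    train_err = mean_abs_error(train_x, train_y, train_x, train_y, k)
    test_err = mean_abs_error(train_x, train_y, test_x, test_y, k)
    results[k] = (train_err, test_err)
    print(f"k={k:3d}  train={train_err:.3f}  test={test_err:.3f}")
```

At k=1 the model simply memorizes the training set, so its training error is zero, while the lowest test error should land at one of the intermediate values of k.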

Here’s another (more mathematical) insight into what bias and variance are all about. In machine learning and data science, we often have a function that we want to model, whether it be a function that maps square footage to house prices or magnitudes of an earthquake to the frequency at which they occur. We assume that there is some perfect, god-given, ideal function that models exactly what we want. But the world isn’t perfect, so when we get real data to train our model it inevitably has noise – random fluctuations from things such as human error and measurement uncertainty.

In math terms, we say that $y = f(x) + \epsilon$, where $y$ is the training data we end up measuring, $f(x)$ is that perfect function, and $\epsilon$ is the random error that we can’t avoid.

Our “perfect” function added to noise is what we end up measuring in the real world.

This noise manifests as disorganized data points. What we want to do is recover the perfect function (in green) from these data points.
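In code, "perfect function plus noise" looks something like the sketch below. We assume a sine wave for the perfect function and Gaussian noise for the error; in real life we know neither, which is the whole problem.

```python
import math
import random

def true_function(x):
    """Stand-in for the unknown 'perfect' function f (assumed: a sine wave)."""
    return math.sin(x)

def measure(x, noise_std, rng):
    """One noisy observation y = f(x) + epsilon, with Gaussian epsilon."""
    return true_function(x) + rng.gauss(0.0, noise_std)

rng = random.Random(42)
xs = [i * 0.5 for i in range(10)]
ys = [measure(x, noise_std=0.2, rng=rng) for x in xs]

# The noise is what separates each measurement from the perfect function.
noise = [y - true_function(x) for x, y in zip(xs, ys)]
print([round(e, 2) for e in noise])
```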

For the more mathematically inclined, this sets the stage for us to derive the *Bias-Variance Decomposition*. That is, we will prove that the error of a model is composed of three terms: a bias term, a variance term, and an unavoidable *irreducible error* term. Two things though. Firstly, knowing a bit of probability theory
will probably make this proof a bit easier to swallow (although we’ll
certainly try to explain everything). And secondly, this proof is
straight from Wikipedia (there are only so many ways to prove things!), but we hope our version of the proof is more illuminating.
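For reference, this is the statement in its standard form for squared error. The symbols $\hat{f}$ (the model we trained) and $\sigma^2$ (the variance of the noise $\epsilon$) are our notation, and we assume $\epsilon$ has mean zero and is independent of the training set. The expectation is taken over both the noise and the random choice of training set.

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first term is how far the average model is from the perfect function, the second is how much the model moves around from one training set to the next, and the third is the noise that no model can remove.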

So how should we detect overfitting and underfitting, and what should we do if we *do* detect them?

As we’ve seen above, overfitting and underfitting have very clear signatures in training and test data. Overfitting results in low training error and high test error, while underfitting results in high errors in both the training and test set.

However, measuring training and test errors is hard when we have
relatively few data points and our algorithms require a fair amount of
data (which unfortunately happens quite often). In this case we can use a
technique called **cross-validation**.

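This is where we take our entire dataset and split it into $k$ groups. For each of the $k$ groups, we train on the remaining $k-1$ groups and validate on the group we held out. This way we make the most of our data, essentially taking one dataset and training $k$ times on it.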

Cross validation with k=3. We divide the entire training set into three parts, and then train our model three times using each of the three parts as the validation set with the remaining two parts making up our actual training set.
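A minimal sketch of that procedure, in plain Python. The helper names and the toy "predict the mean" model are ours, purely for illustration; any model with a fit step and a predict step could be plugged in. In practice you would usually shuffle the data before splitting, which we skip here to keep the example deterministic.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(xs, ys, k, fit, predict):
    """Average validation error (mean absolute error) over the k folds."""
    fold_errors = []
    for train_idx, val_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [abs(ys[i] - predict(model, xs[i])) for i in val_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Toy model (hypothetical): always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

xs = list(range(9))
ys = [2.0 * x for x in xs]
print(cross_validate(xs, ys, k=3, fit=fit_mean, predict=predict_mean))
```

Every data point is used for validation exactly once and for training k-1 times, which is what lets a small dataset do double duty.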

As for what to do after you detect a problem? Well, having high bias is symptomatic of a model that is not complex enough. In that case, your best bet would be to just pick a more complex model.

The problem of high variance is a bit more interesting. One naive approach to reducing high variance is to use more data. Theoretically, with a complex enough model, as the number of samples tends toward infinity the variance tends toward zero. However, this approach is naive because the rate at which the variance decreases is typically fairly slow, and (the larger problem) data is almost always very hard to come by. Unless you’re working at Google Brain (and even then), getting more data takes time, energy, and money.

A better approach to reducing variance is to use regularization. That is, in addition to rewarding your model as it models the training data well, penalize it for growing too complex. Essentially, regularization injects “bias” into the model by telling it not to become too complex. Common regularization techniques include lasso or ridge regression, dropout for neural networks, and soft margin SVMs.
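To see the penalty at work, here is a sketch of ridge regression in the simplest possible setting: one input feature, with the penalty applied to the slope only. In that special case the penalized least-squares solution has a closed form, slope = Sxy / (Sxx + penalty). The data points are made up, and real ridge regression with many features needs a linear solver rather than this one-liner.

```python
def fit_ridge_line(xs, ys, penalty):
    """Fit y = intercept + slope * x, minimizing squared error + penalty * slope**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + penalty)  # penalty = 0 recovers ordinary least squares
    return mean_y - slope * mean_x, slope

# Made-up data that roughly follows y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

for penalty in (0.0, 1.0, 10.0, 100.0):
    intercept, slope = fit_ridge_line(xs, ys, penalty)
    print(f"penalty={penalty:6.1f}  slope={slope:.3f}  intercept={intercept:.3f}")
```

As the penalty grows, the slope is pulled toward zero: the model is being told to stay simple even if that costs it some accuracy on the training data, which is exactly the injected "bias" described above.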

Finally, **ensemble learning** (which we’ll be talking
about next time!) offers a way to reduce the variance of a model without
sacrificing bias. The idea is to have an *ensemble* of multiple
classifiers (typically decision trees) trained on random subsets of the
training data. To actually classify a data point, the ensemble of
classifiers all “vote” on a classification.
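As a taste of the idea, here is a toy sketch of one such scheme (often called bagging). Each classifier is a "decision stump", a decision tree with a single split, trained on a random sample of the data drawn with replacement. The data, the 10% label noise, and all the names are our own assumptions for illustration.

```python
import random

def fit_stump(xs, labels):
    """Pick the threshold t with the fewest training errors for 'predict 1 if x > t'."""
    best_t, best_errors = None, None
    for t in sorted(set(xs)):
        errors = sum((x > t) != bool(y) for x, y in zip(xs, labels))
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def fit_bagged_stumps(xs, labels, n_models, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    n = len(xs)
    stumps = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in sample], [labels[i] for i in sample]))
    return stumps

def vote(stumps, x):
    """Majority vote of the ensemble for a single point."""
    votes_for_one = sum(x > t for t in stumps)
    return 1 if 2 * votes_for_one > len(stumps) else 0

rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
# Hypothetical labels: 1 when x > 0.5, with about 10% of labels flipped as noise.
labels = [int(x > 0.5) ^ int(rng.random() < 0.1) for x in xs]

stumps = fit_bagged_stumps(xs, labels, n_models=25, rng=rng)
print(vote(stumps, 0.1), vote(stumps, 0.9))
```

Each stump lands on a slightly different threshold because it saw a slightly different sample, and the vote averages those differences away.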

As you may be starting to see, machine learning is just as much an art as it is a science. Choosing a model and adjusting each of its parameters to reduce overfitting and underfitting is a task that requires skill and, more importantly, experience. And while there is probably a ‘sweet’ spot out there, we mere mortals will have to rely on our intuition and our cunning to find the best model for the job.
