
**Updated**: 2018-03-01

In this post we’ll talk about one of the most fundamental machine learning algorithms: the perceptron algorithm. This algorithm forms the basis for many modern day ML algorithms, most notably neural networks. In addition, we’ll discuss the perceptron algorithm’s cousin, logistic regression. And then we’ll conclude with an introduction to SVMs, or support vector machines, which are perhaps one of the most flexible algorithms used today.

In machine learning, there are two general classes of algorithms. You’ll remember that in our last post we discussed regression and classification. These two methods fall under the larger umbrella of supervised learning algorithms, one of two classes of machine learning algorithms. The other class of algorithms is called unsupervised algorithms.

Supervised algorithms learn from labeled training data. The algorithms are “supervised” because we know what the correct answer is. If the algorithm receives a bunch of images labeled as apples or oranges it can first guess the object in the image, then use the label to check if its guess is correct.

Unsupervised learning is a bit different in that it finds patterns in data. It works similarly to the way we humans observe patterns (or objects) in random phenomena. Unsupervised learning algorithms do the same thing by looking at unlabeled data. Just like we don’t have a particular goal when looking at an object (other than identifying it), the algorithm doesn’t have a particular goal other than inferring patterns from the data itself.

We’ll talk about unsupervised algorithms in a later blog post. For
now, let’s look at a very simple supervised algorithm, called the **perceptron** algorithm.

One of the goals of machine learning and AI is to do what humans do and even surpass them. Thus, it makes sense to try and copy what makes humans so great at what they do–their brains.

The brain is composed of billions of interconnected neurons that are constantly firing, passing signals from one neuron to another. Together, they allow us to do incredible things such as recognize faces, patterns, and most importantly, think.

The job of an individual neuron is simple: if its inputs match certain criteria, it fires, sending a signal to other neurons connected to it. It’s all very black and white. Of course, the actual explanation is much more complicated than this, but since we’re using computers to simulate a brain, we only need to copy the idea of how a brain works.

The perceptron mimics the function of neurons in machine learning algorithms. The perceptron is one of the oldest machine learning algorithms in existence. When it was first used in 1957 to perform rudimentary image recognition, the New York Times called it:

> the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

We’re still a little far off from that, but the Times did recognize the potential of a perceptron. Perceptrons now form the basis for more complicated neural networks, which we’ll be talking about in the next post.

Like neurons, perceptrons take in several inputs and spit out an output. For example, a perceptron might take as input temperature and try to answer the question, “Should I wear a sweater today?” If the input temperature is below a certain threshold (say, 70˚F), the perceptron will output 1 (yes). If the temperature is above the threshold, the perceptron will output a 0 (no).

Of course, we could consider more variables than just temperature when deciding whether or not to wear a sweater. Just as a biological neuron typically receives more than one input impulse, we can give our perceptrons multiple inputs. In that case, we'll also have to weight each input by a certain amount. If we used temperature, wind speed, and something completely random like the number of people showering in Hong Kong as our inputs, we would want a different weight for each. Temperature would probably have a negative weight, because the lower the temperature, the more likely you are to wear a sweater. Wind speed should have a positive weight, because the higher the wind speed, the more likely you are to need a sweater. And the weight for the number of people showering in Hong Kong should probably be zero (unless you normally factor the number of people taking showers in Hong Kong into your decision making).

But someone from Canada might be very used to the cold, so their threshold for wearing a sweater might be a lower temperature than someone from, say, Australia. To express this, we use a bias to specify the threshold for the Canadian and the Australian. You can think of the bias as a measure of how difficult it is for the perceptron to say ‘yes’ or ‘no’.

*The Canadian is fairly insensitive to cold temperatures, and thus has a low bias.*

*The Australian is biased towards hotter temperatures, and thus has a higher threshold temperature.*
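To make this concrete, here's a minimal sketch of the sweater perceptron described above. The weight and bias values are purely illustrative, not taken from any fitted model:

```python
def perceptron(inputs, weights, bias):
    """Fire (output 1) if the weighted sum of the inputs plus the bias is positive."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0

# Inputs: [temperature in ˚F, wind speed in mph].
# Temperature gets a negative weight (the colder it is, the more likely a sweater);
# wind speed gets a positive weight. The bias encodes each person's threshold.
weights = [-1.0, 2.0]
canadian_bias = 45.0    # low threshold: insensitive to cold
australian_bias = 65.0  # high threshold: used to warmer weather

print(perceptron([50, 0], weights, canadian_bias))    # 0: no sweater at 50˚F
print(perceptron([50, 0], weights, australian_bias))  # 1: sweater at 50˚F
```

Note how the same input produces different outputs purely because of the bias, just like the Canadian and the Australian above.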

However, life is not as black and white as perceptrons suggest. There is uncertainty in just about everything, even in choosing whether or not to put on a sweater. In reality, we don't immediately put on a sweater the moment the temperature drops below some predefined threshold. It's more as if at any temperature we have a certain "chance" of putting on a sweater. Maybe at 45˚F somebody has a 95% chance of putting on a sweater, and at 60˚F the same person has a 30% chance of putting on a sweater.

To better model life’s complexities, we use logistic regression to find these probabilities. This involves fitting a logistic curve (like the one below) to our data. To do this, we again use gradient descent to choose the best parameters for the model. That is, we find parameters that minimize some cost function.

The general form of the logistic model is

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

where θ is a vector of parameters, x is the vector of input variables, and h_θ(x) is the probability the model predicts. For more information, we suggest you check out Andrew Ng's notes on logistic regression.
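A sketch of the logistic model in code, with illustrative (hand-picked, not fitted) parameters chosen so the sweater probabilities roughly match the numbers above:

```python
import math

def logistic(theta, x):
    """The logistic model: h_theta(x) = 1 / (1 + exp(-theta . x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# x = [1 (intercept term), temperature in ˚F]; theta values are hand-picked.
theta = [14.3, -0.2525]
print(round(logistic(theta, [1, 45]), 2))  # ~0.95 chance of a sweater at 45˚F
print(round(logistic(theta, [1, 60]), 2))  # ~0.30 chance at 60˚F
```

In practice, gradient descent would choose θ for us by minimizing the cost over the training data.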

Logistic regression and the perceptron algorithm are very similar to each other. It's common to think of logistic regression as a perceptron algorithm on steroids: a logistic model can predict probabilities, while a perceptron can only predict yes or no. In fact, taking a logistic model and setting all values below 0.5 to zero and all values above 0.5 to one gives results very similar to the perceptron algorithm.

Support vector machines, or SVMs for short, are a class of machine learning algorithms that have become incredibly popular in the past few years. They are based on a very intuitive idea. Here, we’ll introduce SVMs and go through the key ideas in the algorithm.

If you remember the section on classification from our last post, we classify data by drawing a line, called a decision boundary, that separates the classes.

Once we’ve drawn a decision boundary, we can find the **margin** for each datapoint. The margin for a single data point is the distance from that data point to the decision boundary.

In a way, the margin measures how confident we are in our classification. If a data point has a large margin, and hence is very far from the decision boundary, we can be pretty confident in our prediction. However, if a data point has a very small margin, and is hence very close to the decision boundary, we probably aren't as sure of our classification.
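For a decision boundary written as the line ax + by + c = 0, a point's margin is just its perpendicular distance to that line. A small helper, with made-up points:

```python
import math

def margin(point, a, b, c):
    """Perpendicular distance from `point` to the line a*x + b*y + c = 0."""
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)

# Decision boundary y = x, written as x - y = 0 (a=1, b=-1, c=0).
print(margin((3.0, 0.0), 1, -1, 0))  # large margin: confident classification
print(margin((1.0, 0.9), 1, -1, 0))  # tiny margin: much less confident
```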

Now that we’ve defined margins, we can talk about **support vectors**.
A support vector is a vector from the data point with the smallest
margin to the decision boundary. In other words, it’s a vector between
that data point that is closest to the decision boundary and the
boundary itself. This vector, in fact, any margin, will be perpendicular
to the decision boundary because the smallest distance between a point
and a line is a perpendicular line.

The idea behind a support vector machine is to classify data by drawing a decision boundary that *maximizes* the length of the support vectors, that is, the smallest margin in the data set. By maximizing this minimum margin, the decision boundary ends up as far away as possible from the data points.

If you’ve played around with the simulation enough, you’ll notice
that sometimes the algorithm fails completely. This happens only when
the data points aren’t **linearly separable**. Think of it
this way: when you have a set of data points which you can’t draw a
straight line to separate them with, then you have a **linearly inseparable** dataset, and since it’s impossible to draw a line to split them, then the SVM algorithm fails.

So how do we deal with linear inseparability? It turns out we can reformulate our optimization problem. Before, we wanted *every single* data point to be as far away (on the correct side) from the decision
boundary as possible. Now, we’ll allow a data point to stray toward the
wrong side of the boundary, but we’ll add a “cost” to that happening
(remember cost functions from the last post?). This is something that
happens very often in machine learning and is called **regularization**.
It basically allows our model to be a bit more flexible when trying to
classify the data. The cost of violating the decision boundaries can be
as high or as low as we want it to be, and is controlled by something
called the **regularization parameter**.

Mathematically, we implement regularization by adding a term to our cost function. For instance,

$$C(\theta, X) = C_0(\theta, X) + c \sum_i R(\theta, x_i)$$

could be the regularized cost function, where C (the total cost) is a function of all the parameters θ and all the training data X, C₀ is the original unregularized cost, R is the regularization (or penalty) for each data point, and c is the regularization parameter. A large c means the penalty for violating the decision boundary is very high, and vice versa.
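As a toy instance of such a regularized cost, here is a sketch using the hinge loss, a common penalty choice for SVMs. The data points, labels, and parameter values are all made up for illustration:

```python
def hinge_penalty(w, b, x, y):
    """Penalty for one labeled point (x, y): zero if it sits on the correct
    side of the boundary with margin at least 1, positive otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def regularized_cost(w, b, data, c):
    """A margin-width term plus c times the total boundary-violation penalty."""
    flatness = sum(wi * wi for wi in w)  # small weights <=> wide margin
    violations = sum(hinge_penalty(w, b, x, y) for x, y in data)
    return flatness + c * violations

# Two well-separated points plus one stray point on the wrong side of the margin.
data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.5, 0.0], -1)]
w, b = [1.0, 0.0], 0.0
print(regularized_cost(w, b, data, c=0.1))   # small c: the violation is cheap
print(regularized_cost(w, b, data, c=10.0))  # large c: the violation dominates
```

Tuning c trades off a wide margin against tolerance for misclassified points.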

This part is a bit mathy, but it's also very fascinating. The SVMs shown so far can only make straight decision boundaries. But what if we wanted to classify data using a different kind of boundary? Well, that's where kernels come in. Kernels are (loosely speaking) a fancy term for an inner product.

So why do we need kernels? It turns out that the SVM algorithm never depends on a data point itself; it depends only on the inner products *between* data points. For that reason, we never need to know the coordinates of the data points, only the inner products between all of them.

It turns out that for certain representations of data (or certain coordinate systems), it's actually much easier to calculate the kernel than it is to calculate the actual data points. For example, we can even use data points in an infinite-dimensional space, as long as we have a way of computing the inner product, or the kernel, quickly. And luckily, there are ways to do this! One example is the **Radial Basis Function kernel**, which calculates the inner product for a certain infinite-dimensional space. The form of the RBF kernel is

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

where x and x' are the two data points in the original coordinate system, ||x − x'|| is the distance between them, and σ is a parameter that controls the model's behavior. The feature space for the RBF kernel is actually infinite-dimensional and would be impossible to represent in a computer, but using the kernel we can still run SVMs in that feature space.
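The RBF kernel is simple to compute directly, even though the feature space it corresponds to is infinite-dimensional:

```python
import math

def rbf_kernel(x, x_prime, sigma=1.0):
    """exp(-||x - x'||^2 / (2 * sigma^2)): the inner product of x and x'
    in the RBF kernel's infinite-dimensional feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

print(rbf_kernel([0, 0], [0, 0]))  # identical points: kernel value 1.0
print(rbf_kernel([0, 0], [3, 4]))  # distant points: kernel value near 0
```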

Another example of a kernel is the polynomial kernel, which takes the form

$$K(x, x') = (x^T x' + c)^d$$

where x and x' are the two data points, and c and d are parameters of the kernel. In this case the feature space is not actually infinite, but the kernel is still helpful because it allows us to change our representation of the data.
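The polynomial kernel is equally easy to sketch:

```python
def polynomial_kernel(x, x_prime, c=1.0, d=2):
    """(x . x' + c)^d: the inner product in a higher- (but finite-) dimensional feature space."""
    dot = sum(a * b for a, b in zip(x, x_prime))
    return (dot + c) ** d

print(polynomial_kernel([1, 2], [3, 4]))  # (1*3 + 2*4 + 1)^2 = 144
```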

Intuitively, you can think of a kernel as mapping data from one coordinate system to another. In the original coordinate system the data may not be linearly separable at all, whereas in the new coordinate system, if we choose the right kernel, we get a data set that is easily linearly separable.
