Logistic Regression in Machine Learning

Most of us have played a computer game at some point in our lives. No wonder India's gaming industry is valued at ₹6,200 crores, with an estimated 300 million gamers! Speed is one of the biggest advantages of logistic regression, and speed is quintessential for the gaming industry. Tencent, the world's largest gaming company, actually uses this type of regression to provide information about specific equipment a user may want to possess. But what exactly is logistic regression? Let's dive in.

Logistic Regression

Linear regression is used for a regression problem, but it cannot be used directly for a classification problem. The aim of this blog is to find the simplest way in which one can handle the classification problem. In linear regression, the hypothesis function h(x) was defined as

h(x) = \sum_{i=0}^{N} \theta_i x_i ….(1)

where N represents the number of predictors or, in other words, the number of variables. In the above equation, the values θ_i are learnt in order to optimise a certain function, for example minimising the sum of squared errors.

Suppose there is a classification problem: the training points have either positive or negative labels. The objective here is to classify these points, that is, to find out which points are positive and which are negative. The formula of linear regression gives a real value and is thus not appropriate for classification.

Sigmoid Function

The way out is to apply another function on top of the linear function so that the result can be used for classification. This can be achieved with the help of logistic regression. In this type of regression, the logistic function, or sigmoid function, is used for this task. The sigmoid function is:

g(z) = \frac{1}{1 + e^{-z}} ….(2)

This function has the profile given below:

[Figure: the S-shaped curve of the sigmoid function. Source: Wikipedia]

The value of g(z) lies between 0 and 1.

At z = 0, g(z) = 0.5.

Also, 

\lim_{z \to \infty} g(z) = \lim_{z \to \infty} \frac{1}{1 + e^{-z}} = 1

It can also be observed that,

\lim_{z \to -\infty} g(z) = \lim_{z \to -\infty} \frac{1}{1 + e^{-z}} = 0
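These properties of the sigmoid can be checked numerically with a short Python sketch (standard library only; the function name is our own choice):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))     # 0.5, the midpoint of the curve
print(sigmoid(10))    # close to 1, matching the z -> +infinity limit
print(sigmoid(-10))   # close to 0, matching the z -> -infinity limit
```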

The sigmoid function can be used for classification because its output can be thresholded: if the output is greater than 0.5, the data point belongs to the positive class, and if it is less than 0.5, it belongs to the negative class. Just as formula (1) is used in regression, for classification with the help of logistic regression the function used will be

h(x) = g\left( \sum_{i=0}^{N} \theta_i x_i \right)

More compactly, using matrix notation, it can be rewritten in the following way:

h(x) = g(\theta^T X)

h(x) = \frac{1}{1 + e^{-\theta^T X}} ….(3)

Thus a linear function of θ, if passed through the sigmoid function, can be used as a classification function. There are some properties of the sigmoid function which attract a lot of machine learning enthusiasts to use it.
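A minimal sketch of equation (3) in Python, combined with the 0.5 decision rule. The parameter vector and data point below are illustrative values chosen for demonstration, not taken from the post:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h(x) = g(theta^T x), with the dot product written out explicitly."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Illustrative values; x[0] = 1 acts as the intercept term.
theta = [0.5, -1.2, 2.0]
x = [1.0, 0.8, 0.3]
h = hypothesis(theta, x)
print(h)
print("positive" if h > 0.5 else "negative")  # the 0.5 decision rule
```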

Properties of the sigmoid function

A very useful feature of the function is observed when its derivative is taken. Since the sigmoid function is smooth (no breaks or jumps in its graph), its derivative can be computed using the chain rule as follows:

g'(z) = \frac{d}{dz} g(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2}

With some manipulation, this can be further written as

g'(z) = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right)

g'(z) = g(z)\left(1 - g(z)\right) ….(4)

Thus the derivative can be calculated very easily, and this is the property which makes the sigmoid (logistic) function so attractive. Using it, the conditional distribution that generates the data can be written down.
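Identity (4) can be cross-checked numerically against a finite-difference approximation. This is a small verification sketch, with the function names chosen for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative via identity (4): g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Compare against a central finite difference at a few points.
h = 1e-6
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    print(z, sigmoid_grad(z), abs(sigmoid_grad(z) - numeric))
```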

Consider the input X. The probability P(Y | X) is to be determined. It is given by

P(Y \mid X) = h_\theta(X)^{y} \left(1 - h_\theta(X)\right)^{1-y} ….(5)

If y = 1, the probability is

P(Y \mid X) = h_\theta(X)

If y = 0, the probability is

P(Y \mid X) = 1 - h_\theta(X)
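Equation (5) is a single-line computation; here is a hedged sketch of it in Python, with an illustrative value of h_θ(x) = 0.8:

```python
def bernoulli_prob(h, y):
    """P(y | x) = h^y * (1 - h)^(1 - y) for y in {0, 1} (equation 5)."""
    return (h ** y) * ((1.0 - h) ** (1 - y))

# With h = h_theta(x) = 0.8: probability 0.8 when y = 1, and 0.2 when y = 0.
print(bernoulli_prob(0.8, 1))
print(bernoulli_prob(0.8, 0))
```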

Using The Gradient Ascent Or Descent Method

The value of h_θ(X) is the same as the one given by equation (3). The function P(Y | X) can be learnt using the gradient ascent or descent method. In other words, the conditional probability distribution needs to be learnt.

Let p_y(X, θ) be the estimate of the probability P(Y | X), where θ is the vector whose value is to be learnt. To achieve this, stochastic gradient descent will be used: a single training example is considered at a time, and a gradient step is taken with respect to that example.

Thus, to begin with, the likelihood of the data is defined. The approach of maximum likelihood is used to learn the optimal values of θ. The likelihood of θ is the probability of witnessing the data given that θ were the actual parameters. It can be stated as

L(\theta) = P(Y \mid X; \theta)

Hence, for each training example, the probability of y_i given x_i is found. Since there are m training examples,

L(\theta) = \prod_{i=1}^{m} P(y_i \mid x_i; \theta)

L(\theta) = \prod_{i=1}^{m} h_\theta(x_i)^{y_i} \left(1 - h_\theta(x_i)\right)^{1 - y_i} ….(6)

Log Likelihood

Thus θ has to be determined such that L(θ) is maximised. The probabilities are positive, and since L(θ) is a product of probabilities, maximising the logarithm of this expression is equivalent to maximising L(θ) itself (equation 6). The reason for taking the logarithm is that the product Π gets converted to a summation Σ, which makes the expression simpler to compute. Therefore l(θ), the log likelihood of θ, is given by the following:

l(\theta) = \sum_{i=1}^{m} y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) ….(7)
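Equation (7) can be sketched directly in Python. The tiny dataset below is an illustrative assumption, not data from the post; the first column of each x_i is the intercept term:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i y_i*log(h(x_i)) + (1 - y_i)*log(1 - h(x_i))  (equation 7)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total

# Tiny illustrative dataset.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
print(log_likelihood([0.0, 1.0], X, y))
```

Since every h lies strictly between 0 and 1, each log term is negative, so l(θ) is always below zero; a better-fitting θ brings it closer to zero.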

To maximise this likelihood, the method of gradient ascent is carried out. In gradient ascent, the derivative with respect to θ is computed, and θ is then updated iteratively (with learning rate α) as follows:

\theta := \theta + \alpha \nabla_\theta l(\theta) ….(8)

Stochastic Gradient Ascent

If stochastic gradient ascent is being carried out, this iteration can be performed one example at a time. Let there be a single training example (x, y). The aim is to determine the next value of θ based on this training example and the present θ. Thus the derivative will be computed and a step in the direction of the derivative will be taken. So,

\frac{\partial}{\partial \theta_j} l(\theta) = \left( y \, \frac{1}{g(\theta^T x)} - (1 - y) \, \frac{1}{1 - g(\theta^T x)} \right) \frac{\partial}{\partial \theta_j} g(\theta^T x)

\frac{\partial}{\partial \theta_j} l(\theta) = \left( y \left(1 - g(\theta^T x)\right) - (1 - y) \, g(\theta^T x) \right) x_j

On further simplification, the result obtained is

\frac{\partial}{\partial \theta_j} l(\theta) = \left( y - h_\theta(x) \right) x_j ….(9)

So, plugging equation (9) into equation (8), the update for the j-th component of θ is determined, which is given by

\theta_j := \theta_j + \alpha \left( y - h_\theta(x) \right) x_j ….(10)
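Putting the pieces together, update rule (10) can be sketched as a full stochastic gradient ascent loop. The learning rate, epoch count, and the small separable dataset below are all illustrative assumptions:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def sgd_ascent(X, y, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent: theta_j += alpha * (y - h(x)) * x_j  (equation 10)."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)          # visit examples in a random order
        for i in indices:             # one update per training example
            error = y[i] - hypothesis(theta, X[i])
            theta = [t + alpha * error * xij for t, xij in zip(theta, X[i])]
    return theta

# Illustrative linearly separable data; the first column is the intercept term.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
theta = sgd_ascent(X, y)
print([round(hypothesis(theta, xi), 2) for xi in X])
```

After training, the positive examples score above 0.5 and the negative ones below it, matching the decision rule described earlier.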

Conclusion

The above represents the change that is made for a single training example (x, y), and shows the way in which θ_j can be updated. The formula describes the method of carrying out stochastic gradient ascent, and with its help the appropriate values of θ, which are used very frequently in logistic regression, can be learnt.

written by: Aryaman Dubey

Reviewed By: Krishna Heroor

