Logistic Regression Introduction
If you are learning analytics or modelling, you have very likely come across the term Logistic Regression. A quick Google search will tell you that it is a regression technique for a categorical dependent variable. We will take that as our definition, which already separates it from Linear Regression, where the dependent variable is numerical.
Now that we understand the basic difference between logistic regression and linear regression, we first need to understand why we can't apply linear regression to categorical variables.
Linear regression follows the principle of Ordinary Least Squares, which means it finds the combination of coefficients for which the sum of squared errors between actual and predicted values is as small as possible.
In linear regression modelling the dependent variable is numerical, so it is easy for the model to perform mathematical operations on it such as addition, subtraction and squaring. But what if the dependent variable can take only 2 values, yes or no (1 or 0)? In most real-life situations we face questions like: should we take this or not, is it good or bad, will my customer come back or not. In business you often need to predict in terms of yes or no. A straight line fitted to such data can predict values below 0 or above 1, which cannot be read as yes or no, so linear regression is not well suited to this type of situation.
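To make the problem concrete, here is a minimal sketch that fits a straight line by least squares to a yes/no outcome. The data (number of previous visits vs. whether the customer returned) and the helper name fit_line are made up for illustration only.

```python
# Hypothetical toy data: x = number of previous visits,
# y = 1 if the customer came back, 0 otherwise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = fit_line(xs, ys)

# Predict for a small, a middle and a large x.
for x in (0, 4, 12):
    print(x, round(b0 + b1 * x, 3))
```

On this toy data the fitted line predicts a value below 0 at x = 0 and a value above 1 at x = 12, neither of which can be interpreted as a yes/no answer or as a probability.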
To deal with this situation you need a function that can help you predict yes or no while still using a regression equation of the form Y=b0+b1x1+b2x2+b3x3+…+bnxn+e.
If we think about the dependent variable, the best we can do is count the number of yes and no outcomes, or, thinking statistically, find the probability of yes or no. This is where we move towards logistic modelling.
Logistic regression models the log of odds, or simply put, the log of the ratio of the probability of occurrence to the probability of non-occurrence.
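So now the equation will be
Log (odds) = log (p/(1-p)) = b0+b1x1+b2x2+b3x3+…+bnxn
There is no separate error term e here, because the randomness sits in the yes/no outcome itself.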
Odds have a range of 0 to infinity.
But the range of log (odds) is -infinity to +infinity, which matches the range of the right-hand side of a linear regression equation.
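A quick sketch of these two ranges, using only Python's math module. The function names odds and log_odds are just illustrative helpers.

```python
import math


def odds(p):
    """Odds of occurrence: probability of yes divided by probability of no."""
    return p / (1 - p)


def log_odds(p):
    """Log of odds (the logit) for a probability strictly between 0 and 1."""
    return math.log(odds(p))


# Odds stay positive, while log of odds is negative below p = 0.5
# and positive above it.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
```

As p approaches 0 the log of odds heads towards -infinity, and as p approaches 1 it heads towards +infinity, with p = 0.5 sitting exactly at 0.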
So to expand further, solving for p, we will get
p = 1/(1+exp(-(b0+b1x1+b2x2+b3x3+…+bnxn)))
So in simple terms, changing x changes the probability of the dependent variable being 1.
So now the independent variables will affect the probability of occurrence instead of a numerical value. This is where our hypothesis differs from the linear model.
Why is it called logistic? Because the probability follows the logistic function, an S-shaped curve that is the inverse of the logit (log of odds), so it was named logistic regression.
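The effect of x on the probability can be sketched by turning the log of odds back into a probability. The coefficients b0 = -3.0 and b1 = 0.8 below are made-up values for a single predictor, not estimates from any real model.

```python
import math


def predicted_probability(x, b0=-3.0, b1=0.8):
    """Convert the linear predictor b0 + b1*x (log of odds) into a probability.

    b0 and b1 are hypothetical coefficients chosen for illustration.
    """
    linear_predictor = b0 + b1 * x
    return 1 / (1 + math.exp(-linear_predictor))


# The probability rises with x but always stays between 0 and 1.
for x in (0, 2, 4, 6, 8):
    print(x, round(predicted_probability(x), 3))
```

Unlike the straight line of linear regression, the output here never leaves the 0 to 1 range, however large or small x becomes.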
OK, now we have some idea of logistic regression.
We need to discuss some more terms related to logistic modelling.
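AIC Value: Akaike Information Criterion. It is used to compare 2 logistic models built on the same data with the same dependent variable and almost the same independent variables. It rewards goodness of fit and penalises extra parameters. The lower the AIC, the better the model.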
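As a rough sketch, AIC can be computed from a model's predicted probabilities with the standard formula AIC = 2k - 2 log(L), where k is the number of estimated parameters and L is the likelihood. The outcomes, predicted probabilities and parameter counts below are hypothetical numbers for two imaginary models.

```python
import math


def log_likelihood(y_true, p_pred):
    """Log-likelihood of 0/1 outcomes under the predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    )


def aic(y_true, p_pred, n_params):
    """AIC = 2k - 2*log(L), with k = number of estimated parameters."""
    return 2 * n_params - 2 * log_likelihood(y_true, p_pred)


# Hypothetical outcomes and predictions from two imaginary models.
y = [1, 0, 1, 1, 0, 0]
model_a_probs = [0.8, 0.3, 0.7, 0.9, 0.2, 0.40]   # 2 parameters
model_b_probs = [0.8, 0.3, 0.7, 0.9, 0.2, 0.35]   # 3 parameters

aic_a = aic(y, model_a_probs, n_params=2)
aic_b = aic(y, model_b_probs, n_params=3)
print(round(aic_a, 3), round(aic_b, 3))
```

In this made-up case model B fits the data slightly better, but model A still gets the lower AIC because the extra parameter in B does not buy enough improvement.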
K-S Statistic (Kolmogorov-Smirnov): It is a measure of separation between the cumulative good and cumulative bad distributions (or positive and negative). In other words, it is the maximum difference between the cumulative event % and the cumulative non-event %.
Major guidelines for the K-S Statistic
- As a rule of thumb, it should lie between 40 and 70%.
- It should occur in the top 3 or 4 deciles.
One major point is that after the K-S point the gap between the cumulative good and cumulative bad proportions starts to shrink. It means the model achieves its maximum separation at the K-S point, and adding more population beyond it gives less significant output.
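The K-S calculation described above can be sketched as follows. This is a simplified row-by-row version that assumes no two records share the same score (in practice it is usually done by decile), and the sample outcomes and scores are invented for illustration.

```python
def ks_statistic(y_true, scores):
    """Maximum gap between cumulative event % and cumulative non-event %.

    Records are ranked from highest to lowest score.
    Assumes all scores are distinct.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_events = sum(y_true)
    total_non_events = len(y_true) - total_events

    cum_events = 0
    cum_non_events = 0
    best_gap = 0.0
    for _, y in ranked:
        if y == 1:
            cum_events += 1
        else:
            cum_non_events += 1
        gap = abs(cum_events / total_events - cum_non_events / total_non_events)
        best_gap = max(best_gap, gap)
    return best_gap


# Hypothetical actual outcomes and model scores.
y = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
print(round(100 * ks_statistic(y, scores), 1), "%")
```

A model that ranks every event above every non-event would reach a K-S of 100%, while a model that ranks at random would stay close to 0%.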
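Gini Coefficient: It is a measure of inequality, or simply put, how dispersed the values are. For a logistic model it indicates how unequally events and non-events are spread across the predicted scores. As a rule of thumb, for a good logistic model it should be between 40 and 60%.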
Concordance: It plays a role similar to the R-Squared value of linear regression. It is used to validate the model. The higher the concordance, the better the model. It is the proportion of (1, 0) pairs in which the actual 1 received a higher predicted probability than the actual 0.
Concordance = (pairs in which the 1 has the higher predicted probability / total pairs)
To calculate concordance, we pair every actual 1 with every actual 0, count the pairs in which the 1 has a higher predicted probability than the 0, and divide by the total number of pairs.
Discordance counts those pairs in which the 0 has a higher predicted probability than the 1.
And if the 1 and the 0 in a pair have the same predicted probability, the pair is called tied.
So the complete set of pairs is
Total = concordant + discordant + tied
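The pair counting described above can be sketched in a few lines. The function name pair_summary and the sample outcomes and probabilities are hypothetical, chosen only so that each type of pair shows up at least once.

```python
def pair_summary(y_true, p_pred):
    """Count concordant, discordant and tied pairs of (actual 1, actual 0)."""
    event_probs = [p for y, p in zip(y_true, p_pred) if y == 1]
    non_event_probs = [p for y, p in zip(y_true, p_pred) if y == 0]

    concordant = discordant = tied = 0
    for p1 in event_probs:
        for p0 in non_event_probs:
            if p1 > p0:
                concordant += 1      # the actual 1 got the higher probability
            elif p1 < p0:
                discordant += 1      # the actual 0 got the higher probability
            else:
                tied += 1            # both got the same probability

    total = len(event_probs) * len(non_event_probs)
    return {
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "total": total,
        "concordance": concordant / total,
    }


# Hypothetical actual outcomes and predicted probabilities.
y = [1, 1, 1, 0, 0]
p = [0.9, 0.6, 0.4, 0.6, 0.2]
summary = pair_summary(y, p)
print(summary)
```

With 3 actual 1's and 2 actual 0's there are 6 pairs in total; here 4 are concordant, 1 is discordant and 1 is tied, so the three counts add up to the total.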
Well, these were some basics of logistic regression. In the next post we will create models of linear regression & logistic regression.