Logistic Regression In R

By | February 19, 2016
Well in our last discussion(Logistic Regression Introduction) we discussed about some basic of logistics regression, now we will see all those concepts with the help of one model.
To introduce data for understanding I am providing a little brief about data.
The data is related with direct marketing campaigns of a Portuguese banking institution. Most of the marketing campaigns were based on phone calls.Now the question that we need to answer whether customer would be go for term deposit.So our target variable is categorical and will take values Yes(1) or No(0). So this is clear case of logistic regression as our target variable is categorical and we can prodict the odds of yes or no or simply saying probability of yes or no based on independent variables.
Independent variables include age, job type, educational background,any persona or housing loan, when was contacted last, employment status, consumer price index,uribor 3 month rate etc. If you want to have complete information then please refer to link (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).
The tool that we are going to use is “R” to build our model.
Ok let’s start with our model. first we need to import data in R and clean it as per requirement.
data_bank_Orig=read.table("bank-additional.csv",sep=";",header = TRUE)
summary(complete.cases(data_bank_Orig))
##    Mode    TRUE    NA's 
## logical    4119       0
The above lines will import the data into R and will store in variable named “data_bank_Orig”. complete.cases will check for any missing values in data
data_bank_audit=subset(data_bank_Orig,select=-c(duration))
data_bank_audit$termdeposit=ifelse(data_bank_audit$y=="no",0,1)
In above lines of code we have removed one column named duration from data as it was of least use in analysis. Similary we coded our target variable in 1’s and 0’s . Means if customer said yes for termdeposit we coded 1 and if customer said no we coded ‘0’.
I would like to mention one point that please don’t put much focus in syntax of R statements as it will somewhat differnet in other tool but analysis and methodology will almost remain same.
You also need to run bivariate analysis and correlation matrix between dependent and independent variable. In this post I am focussing on basix logistic model so I am skipping bivariate analysis and other thing.
After data editing and bivariarte analysis, next step is to check for multicolinearity.It is very important to check for multicolinearity when you run any multivariate analysis. Multicolinearity basically checks whether indepenent variables are dependent on each other. If you have some independent variables which are dependent on other independent variables then your model need to consider whether to include those variable or not.
We will check for multicolinearity using ViF values. VIF value is calcuated as
ViF=1/1-Rsqaured.With Rsquared you must have guess that it must be related to linear regression. yes it is baiscally to calculate VIF values, we regress each independent numerical variable with other independent variable. And we get Rsquared value. So higher the Rsquared value ,it means more those independent variable are related. higher R squared will give higher Vif value eventually. So all those independent variables greated than 2 need to be reconsider for model.
One important point here is that we can check colinearity for numerical variables only. For categorical colinearity by linear regression is not possible.
library(car)
model_vif=lm(termdeposit~age+campaign+pdays+previous+emp.var.rate+cons.price.idx+cons.conf.idx+euribor3m+nr.employed,data=data_bank_audit_vif)
summary(model_vif)
## 
## Call:
## lm(formula = termdeposit ~ age + campaign + pdays + previous + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed, data = data_bank_audit_vif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72875 -0.11186 -0.05877 -0.01508  1.01655 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.182e+00  3.176e+00  -1.317   0.1879    
## age             9.970e-04  4.314e-04   2.311   0.0209 *  
## campaign       -2.971e-03  1.755e-03  -1.692   0.0906 .  
## pdays          -3.672e-04  2.923e-05 -12.560  < 2e-16 ***
## previous       -1.215e-02  1.101e-02  -1.104   0.2697    
## emp.var.rate   -3.468e-02  1.597e-02  -2.172   0.0299 *  
## cons.price.idx  8.067e-02  1.918e-02   4.206 2.66e-05 ***
## cons.conf.idx   6.173e-03  1.531e-03   4.031 5.66e-05 ***
## euribor3m      -1.399e-02  2.020e-02  -0.693   0.4886    
## nr.employed    -5.092e-04  3.345e-04  -1.522   0.1280    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2835 on 4109 degrees of freedom
## Multiple R-squared:  0.1779,	Adjusted R-squared:  0.1761 
## F-statistic: 98.79 on 9 and 4109 DF,  p-value: < 2.2e-16
vif(model_vif)
##            age       campaign          pdays       previous   emp.var.rate 
##       1.014261       1.041609       1.613374       1.822222      31.924471 
## cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##       6.328363       2.537052      62.833530      31.117315
#removd  euribor3m
model_vif=lm(termdeposit~age+campaign+pdays+previous+emp.var.rate+cons.price.idx+cons.conf.idx+nr.employed,data=data_bank_audit_vif)
vif(model_vif)
##            age       campaign          pdays       previous   emp.var.rate 
##       1.014211       1.033898       1.613285       1.822216      23.094715 
## cons.price.idx  cons.conf.idx    nr.employed 
##       5.500287       1.293035      12.919532
#removing emp.var.rate
model_vif=lm(termdeposit~age+campaign+pdays+previous+cons.price.idx+cons.conf.idx+nr.employed,data=data_bank_audit_vif)
vif(model_vif)
##            age       campaign          pdays       previous cons.price.idx 
##       1.014197       1.033614       1.613277       1.814689       1.328530 
##  cons.conf.idx    nr.employed 
##       1.051378       1.801063
In above line of code we used R library car to calculate vif values. for 2 valirables we found the vif value greater than 2 , so we removed those 2 variable from our analysis. Ideal approach should be to check for business point of view before removing any variable. But as this is just a example so we are fine with removing any variable.
The function that we are going to use for logistic regression is glm which is nothing but modelling function for general linear model with family attribute as binomial(‘logit’). It will specify that dependent variable is categorical and will use logit function . So lets use it with our variable.
model_logistic_bank=glm(termdeposit~age+job+marital+education+default+
                          housing+loan+contact+month+day_of_week+
                          campaign+pdays+previous+poutcome+
                          cons.price.idx+cons.conf.idx+nr.employed , family = binomial("logit"),data=data_bank_audit)
summary(model_logistic_bank)
## 
## Call:
## glm(formula = termdeposit ~ age + job + marital + education + 
##     default + housing + loan + contact + month + day_of_week + 
##     campaign + pdays + previous + poutcome + cons.price.idx + 
##     cons.conf.idx + nr.employed, family = binomial("logit"), 
##     data = data_bank_audit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0136  -0.3971  -0.3218  -0.2522   2.8617  
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   3.956e+01  1.238e+01   3.196 0.001391 ** 
## age                           1.660e-02  6.853e-03   2.422 0.015434 *  
## jobblue-collar               -3.211e-01  2.251e-01  -1.426 0.153805    
## jobentrepreneur              -5.301e-01  3.975e-01  -1.334 0.182302    
## jobhousemaid                 -2.040e-01  3.977e-01  -0.513 0.607977    
## jobmanagement                -4.594e-01  2.486e-01  -1.848 0.064667 .  
## jobretired                   -2.674e-01  3.026e-01  -0.883 0.377008    
## jobself-employed             -5.911e-01  3.489e-01  -1.694 0.090264 .  
## jobservices                  -1.916e-01  2.390e-01  -0.802 0.422638    
## jobstudent                    2.953e-02  3.491e-01   0.085 0.932582    
## jobtechnician                -5.521e-02  1.916e-01  -0.288 0.773260    
## jobunemployed                 1.098e-01  3.332e-01   0.330 0.741776    
## jobunknown                   -4.857e-01  6.359e-01  -0.764 0.444994    
## maritalmarried                1.568e-01  2.012e-01   0.779 0.435797    
## maritalsingle                 2.720e-01  2.305e-01   1.180 0.237943    
## maritalunknown                8.078e-02  1.173e+00   0.069 0.945095    
## educationbasic.6y             2.698e-01  3.392e-01   0.796 0.426268    
## educationbasic.9y             1.471e-01  2.724e-01   0.540 0.589213    
## educationhigh.school          1.149e-01  2.616e-01   0.439 0.660580    
## educationilliterate          -1.190e+01  5.354e+02  -0.022 0.982269    
## educationprofessional.course  2.198e-01  2.848e-01   0.772 0.440397    
## educationuniversity.degree    2.009e-01  2.628e-01   0.765 0.444473    
## educationunknown              2.272e-01  3.441e-01   0.660 0.508960    
## defaultunknown               -2.953e-02  1.780e-01  -0.166 0.868219    
## defaultyes                   -1.030e+01  5.354e+02  -0.019 0.984652    
## housingunknown               -3.679e-01  4.257e-01  -0.864 0.387466    
## housingyes                   -1.073e-01  1.171e-01  -0.916 0.359734    
## loanunknown                          NA         NA      NA       NA    
## loanyes                      -7.931e-02  1.592e-01  -0.498 0.618370    
## contacttelephone             -8.291e-01  2.151e-01  -3.855 0.000116 ***
## monthaug                     -4.856e-01  3.093e-01  -1.570 0.116394    
## monthdec                      5.808e-01  5.462e-01   1.063 0.287678    
## monthjul                     -1.475e-02  2.950e-01  -0.050 0.960132    
## monthjun                      6.073e-01  2.725e-01   2.229 0.025845 *  
## monthmar                      1.334e+00  3.863e-01   3.452 0.000556 ***
## monthmay                     -5.189e-01  2.341e-01  -2.216 0.026663 *  
## monthnov                     -5.432e-01  2.886e-01  -1.882 0.059824 .  
## monthoct                     -4.661e-01  3.783e-01  -1.232 0.217919    
## monthsep                     -7.412e-01  3.950e-01  -1.876 0.060600 .  
## day_of_weekmon               -5.000e-02  1.820e-01  -0.275 0.783503    
## day_of_weekthu                1.182e-02  1.823e-01   0.065 0.948310    
## day_of_weektue               -3.167e-02  1.867e-01  -0.170 0.865338    
## day_of_weekwed                1.378e-01  1.876e-01   0.735 0.462567    
## campaign                     -7.912e-02  3.423e-02  -2.312 0.020804 *  
## pdays                        -4.137e-04  5.879e-04  -0.704 0.481663    
## previous                      1.689e-01  1.640e-01   1.030 0.303097    
## poutcomenonexistent           6.519e-01  2.706e-01   2.409 0.016005 *  
## poutcomesuccess               1.363e+00  5.831e-01   2.337 0.019438 *  
## cons.price.idx                1.130e-01  1.263e-01   0.895 0.370655    
## cons.conf.idx                 4.155e-02  1.544e-02   2.691 0.007130 ** 
## nr.employed                  -9.919e-03  9.725e-04 -10.200  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2845.8  on 4118  degrees of freedom
## Residual deviance: 2212.0  on 4069  degrees of freedom
## AIC: 2312
## 
## Number of Fisher Scoring iterations: 12
predicted_banktermdeposit=predict(model_logistic_bank,data_bank_audit,type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
final_data_bankdeposit=cbind(data_bank_audit,predicted_banktermdeposit)

write.csv(final_data_bankdeposit,"outputbanknew.csv")
After running this model we got our predicted values using predict function bassed on model. Well again to iterate my previous point predicted values will be probabilities of 1’s and 0’s.
Now we need to validate this model with the use of metrices that we discussed in last post
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
pred_bank_rocr=prediction(predicted_banktermdeposit,data_bank_audit$termdeposit)
performance_bankmodel= performance(pred_bank_rocr,"tpr","fpr")
KS_bank_model=max(attr(performance_bankmodel,"y.values")[[1]]-attr(performance_bankmodel,"x.values")[[1]])
KS_bank_model
## [1] 0.4893457
plot(performance_bankmodel, main="ROC curve", colorize=T)
abline(0,1, lty = 8, col = "red")
plot of chunk unnamed-chunk-5
With the help of ROCR library in R we calculate K-S statistics which come out to be 47% in this model which is well within range .We have also plotted the predicted values curve .
library(gains)
gains_data_bank=gains(data_bank_audit$termdeposit,predicted_banktermdeposit,groups=100,percents = FALSE)
plot(gains_data_bank$cume.pct.of.total,type="s")
plot of chunk unnamed-chunk-6
library(optiRum)
ginichart_bank=giniChart(predicted_banktermdeposit,data_bank_audit$termdeposit)

ginichart_bank
plot of chunk unnamed-chunk-6
giniCoef(predicted_banktermdeposit,data_bank_audit$termdeposit)
## [1] 0.5851077
In above lines of code we plotted gain chart and calculated gini coefficient which comes out to be well with in range
Well with this we get a model which can predict for customer whether he/she will go for term deposit
This post may seem to be focused more on R statement but we need to focus on methodology and metrics used to evaluate any model. Tool can be different but methodology will be almost same.
Well that’s it for now. I will come up with my next post related to linear regression. Well you must be thinking why I am writing about linear regression after covering logistic regression. well I ll answer this as well in my next post. Till then Good Bye. Happy Reading .

Leave a Reply

Your email address will not be published. Required fields are marked *