**Well in our last discussion(Logistic Regression Introduction) we discussed about some basic of logistics regression, now we will see all those concepts with the help of one model.**

To introduce data for understanding I am providing a little brief about data.

The data is related with direct marketing campaigns of a Portuguese banking institution. Most of the marketing campaigns were based on phone calls.Now the question that we need to answer whether customer would be go for term deposit.So our target variable is categorical and will take values Yes(1) or No(0). So this is clear case of logistic regression as our target variable is categorical and we can prodict the odds of yes or no or simply saying probability of yes or no based on independent variables.

Independent variables include age, job type, educational background,any persona or housing loan, when was contacted last, employment status, consumer price index,uribor 3 month rate etc. If you want to have complete information then please refer to link (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).

The tool that we are going to use is “R” to build our model.

Ok let’s start with our model. first we need to import data in R and clean it as per requirement.

Ok let’s start with our model. first we need to import data in R and clean it as per requirement.

data_bank_Orig=read.table("bank-additional.csv",sep=";",header = TRUE) summary(complete.cases(data_bank_Orig))

## Mode TRUE NA's ## logical 4119 0

The above lines will import the data into R and will store in variable named “data_bank_Orig”. complete.cases will check for any missing values in data

In above lines of code we have removed one column named duration from data as it was of least use in analysis. Similary we coded our target variable in 1’s and 0’s . Means if customer said yes for termdeposit we coded 1 and if customer said no we coded ‘0’.

I would like to mention one point that please don’t put much focus in syntax of R statements as it will somewhat differnet in other tool but analysis and methodology will almost remain same.

You also need to run bivariate analysis and correlation matrix between dependent and independent variable. In this post I am focussing on basix logistic model so I am skipping bivariate analysis and other thing.

After data editing and bivariarte analysis, next step is to check for multicolinearity.It is very important to check for multicolinearity when you run any multivariate analysis. Multicolinearity basically checks whether indepenent variables are dependent on each other. If you have some independent variables which are dependent on other independent variables then your model need to consider whether to include those variable or not.

We will check for multicolinearity using ViF values. VIF value is calcuated as

ViF=1/1-Rsqaured.With Rsquared you must have guess that it must be related to linear regression. yes it is baiscally to calculate VIF values, we regress each independent numerical variable with other independent variable. And we get Rsquared value. So higher the Rsquared value ,it means more those independent variable are related. higher R squared will give higher Vif value eventually. So all those independent variables greated than 2 need to be reconsider for model.

One important point here is that we can check colinearity for numerical variables only. For categorical colinearity by linear regression is not possible.

We will check for multicolinearity using ViF values. VIF value is calcuated as

ViF=1/1-Rsqaured.With Rsquared you must have guess that it must be related to linear regression. yes it is baiscally to calculate VIF values, we regress each independent numerical variable with other independent variable. And we get Rsquared value. So higher the Rsquared value ,it means more those independent variable are related. higher R squared will give higher Vif value eventually. So all those independent variables greated than 2 need to be reconsider for model.

One important point here is that we can check colinearity for numerical variables only. For categorical colinearity by linear regression is not possible.

library(car) model_vif=lm(termdeposit~age+campaign+pdays+previous+emp.var.rate+cons.price.idx+cons.conf.idx+euribor3m+nr.employed,data=data_bank_audit_vif) summary(model_vif)

## ## Call: ## lm(formula = termdeposit ~ age + campaign + pdays + previous + ## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + ## nr.employed, data = data_bank_audit_vif) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.72875 -0.11186 -0.05877 -0.01508 1.01655 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -4.182e+00 3.176e+00 -1.317 0.1879 ## age 9.970e-04 4.314e-04 2.311 0.0209 * ## campaign -2.971e-03 1.755e-03 -1.692 0.0906 . ## pdays -3.672e-04 2.923e-05 -12.560 < 2e-16 *** ## previous -1.215e-02 1.101e-02 -1.104 0.2697 ## emp.var.rate -3.468e-02 1.597e-02 -2.172 0.0299 * ## cons.price.idx 8.067e-02 1.918e-02 4.206 2.66e-05 *** ## cons.conf.idx 6.173e-03 1.531e-03 4.031 5.66e-05 *** ## euribor3m -1.399e-02 2.020e-02 -0.693 0.4886 ## nr.employed -5.092e-04 3.345e-04 -1.522 0.1280 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2835 on 4109 degrees of freedom ## Multiple R-squared: 0.1779, Adjusted R-squared: 0.1761 ## F-statistic: 98.79 on 9 and 4109 DF, p-value: < 2.2e-16

vif(model_vif)

## age campaign pdays previous emp.var.rate ## 1.014261 1.041609 1.613374 1.822222 31.924471 ## cons.price.idx cons.conf.idx euribor3m nr.employed ## 6.328363 2.537052 62.833530 31.117315

#removd euribor3m model_vif=lm(termdeposit~age+campaign+pdays+previous+emp.var.rate+cons.price.idx+cons.conf.idx+nr.employed,data=data_bank_audit_vif) vif(model_vif)

## age campaign pdays previous emp.var.rate ## 1.014211 1.033898 1.613285 1.822216 23.094715 ## cons.price.idx cons.conf.idx nr.employed ## 5.500287 1.293035 12.919532

#removing emp.var.rate model_vif=lm(termdeposit~age+campaign+pdays+previous+cons.price.idx+cons.conf.idx+nr.employed,data=data_bank_audit_vif) vif(model_vif)

## age campaign pdays previous cons.price.idx ## 1.014197 1.033614 1.613277 1.814689 1.328530 ## cons.conf.idx nr.employed ## 1.051378 1.801063

In above line of code we used R library car to calculate vif values. for 2 valirables we found the vif value greater than 2 , so we removed those 2 variable from our analysis. Ideal approach should be to check for business point of view before removing any variable. But as this is just a example so we are fine with removing any variable.

The function that we are going to use for logistic regression is glm which is nothing but modelling function for general linear model with family attribute as binomial(‘logit’). It will specify that dependent variable is categorical and will use logit function . So lets use it with our variable.

model_logistic_bank=glm(termdeposit~age+job+marital+education+default+ housing+loan+contact+month+day_of_week+ campaign+pdays+previous+poutcome+ cons.price.idx+cons.conf.idx+nr.employed , family = binomial("logit"),data=data_bank_audit) summary(model_logistic_bank)

## ## Call: ## glm(formula = termdeposit ~ age + job + marital + education + ## default + housing + loan + contact + month + day_of_week + ## campaign + pdays + previous + poutcome + cons.price.idx + ## cons.conf.idx + nr.employed, family = binomial("logit"), ## data = data_bank_audit) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.0136 -0.3971 -0.3218 -0.2522 2.8617 ## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.956e+01 1.238e+01 3.196 0.001391 ** ## age 1.660e-02 6.853e-03 2.422 0.015434 * ## jobblue-collar -3.211e-01 2.251e-01 -1.426 0.153805 ## jobentrepreneur -5.301e-01 3.975e-01 -1.334 0.182302 ## jobhousemaid -2.040e-01 3.977e-01 -0.513 0.607977 ## jobmanagement -4.594e-01 2.486e-01 -1.848 0.064667 . ## jobretired -2.674e-01 3.026e-01 -0.883 0.377008 ## jobself-employed -5.911e-01 3.489e-01 -1.694 0.090264 . ## jobservices -1.916e-01 2.390e-01 -0.802 0.422638 ## jobstudent 2.953e-02 3.491e-01 0.085 0.932582 ## jobtechnician -5.521e-02 1.916e-01 -0.288 0.773260 ## jobunemployed 1.098e-01 3.332e-01 0.330 0.741776 ## jobunknown -4.857e-01 6.359e-01 -0.764 0.444994 ## maritalmarried 1.568e-01 2.012e-01 0.779 0.435797 ## maritalsingle 2.720e-01 2.305e-01 1.180 0.237943 ## maritalunknown 8.078e-02 1.173e+00 0.069 0.945095 ## educationbasic.6y 2.698e-01 3.392e-01 0.796 0.426268 ## educationbasic.9y 1.471e-01 2.724e-01 0.540 0.589213 ## educationhigh.school 1.149e-01 2.616e-01 0.439 0.660580 ## educationilliterate -1.190e+01 5.354e+02 -0.022 0.982269 ## educationprofessional.course 2.198e-01 2.848e-01 0.772 0.440397 ## educationuniversity.degree 2.009e-01 2.628e-01 0.765 0.444473 ## educationunknown 2.272e-01 3.441e-01 0.660 0.508960 ## defaultunknown -2.953e-02 1.780e-01 -0.166 0.868219 ## defaultyes -1.030e+01 5.354e+02 -0.019 0.984652 ## housingunknown -3.679e-01 4.257e-01 -0.864 0.387466 ## housingyes -1.073e-01 1.171e-01 -0.916 0.359734 ## loanunknown NA NA NA NA ## loanyes -7.931e-02 1.592e-01 -0.498 0.618370 ## contacttelephone -8.291e-01 2.151e-01 -3.855 0.000116 *** ## monthaug -4.856e-01 3.093e-01 -1.570 0.116394 ## monthdec 5.808e-01 5.462e-01 1.063 0.287678 ## monthjul -1.475e-02 2.950e-01 -0.050 0.960132 ## monthjun 6.073e-01 2.725e-01 2.229 0.025845 * ## monthmar 1.334e+00 3.863e-01 3.452 0.000556 *** ## monthmay -5.189e-01 2.341e-01 -2.216 0.026663 * ## monthnov -5.432e-01 2.886e-01 -1.882 0.059824 . ## monthoct -4.661e-01 3.783e-01 -1.232 0.217919 ## monthsep -7.412e-01 3.950e-01 -1.876 0.060600 . ## day_of_weekmon -5.000e-02 1.820e-01 -0.275 0.783503 ## day_of_weekthu 1.182e-02 1.823e-01 0.065 0.948310 ## day_of_weektue -3.167e-02 1.867e-01 -0.170 0.865338 ## day_of_weekwed 1.378e-01 1.876e-01 0.735 0.462567 ## campaign -7.912e-02 3.423e-02 -2.312 0.020804 * ## pdays -4.137e-04 5.879e-04 -0.704 0.481663 ## previous 1.689e-01 1.640e-01 1.030 0.303097 ## poutcomenonexistent 6.519e-01 2.706e-01 2.409 0.016005 * ## poutcomesuccess 1.363e+00 5.831e-01 2.337 0.019438 * ## cons.price.idx 1.130e-01 1.263e-01 0.895 0.370655 ## cons.conf.idx 4.155e-02 1.544e-02 2.691 0.007130 ** ## nr.employed -9.919e-03 9.725e-04 -10.200 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 2845.8 on 4118 degrees of freedom ## Residual deviance: 2212.0 on 4069 degrees of freedom ## AIC: 2312 ## ## Number of Fisher Scoring iterations: 12

predicted_banktermdeposit=predict(model_logistic_bank,data_bank_audit,type="response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = ## ifelse(type == : prediction from a rank-deficient fit may be misleading

final_data_bankdeposit=cbind(data_bank_audit,predicted_banktermdeposit) write.csv(final_data_bankdeposit,"outputbanknew.csv")

After running this model we got our predicted values using predict function bassed on model. Well again to iterate my previous point predicted values will be probabilities of 1’s and 0’s.

Now we need to validate this model with the use of metrices that we discussed in last post

library(ROCR)

pred_bank_rocr=prediction(predicted_banktermdeposit,data_bank_audit$termdeposit) performance_bankmodel= performance(pred_bank_rocr,"tpr","fpr") KS_bank_model=max(attr(performance_bankmodel,"y.values")[[1]]-attr(performance_bankmodel,"x.values")[[1]]) KS_bank_model

## [1] 0.4893457

plot(performance_bankmodel, main="ROC curve", colorize=T) abline(0,1, lty = 8, col = "red")

With the help of ROCR library in R we calculate K-S statistics which come out to be 47% in this model which is well within range .We have also plotted the predicted values curve .

library(gains) gains_data_bank=gains(data_bank_audit$termdeposit,predicted_banktermdeposit,groups=100,percents = FALSE) plot(gains_data_bank$cume.pct.of.total,type="s")

library(optiRum) ginichart_bank=giniChart(predicted_banktermdeposit,data_bank_audit$termdeposit) ginichart_bank

giniCoef(predicted_banktermdeposit,data_bank_audit$termdeposit)

## [1] 0.5851077

In above lines of code we plotted gain chart and calculated gini coefficient which comes out to be well with in range

Well with this we get a model which can predict for customer whether he/she will go for term deposit

This post may seem to be focused more on R statement but we need to focus on methodology and metrics used to evaluate any model. Tool can be different but methodology will be almost same.

**Well that’s it for now. I will come up with my next post related to linear regression. Well you must be thinking why I am writing about linear regression after covering logistic regression. well I ll answer this as well in my next post. Till then Good Bye. Happy Reading**.