Author Archives: Vikas Kashyap

Hooked Model

Habits are defined as “behaviors done with little or no conscious thought”.

The convergence of access, data, and speed is making the world a more habit-forming place.

Businesses that create customer habits gain a significant competitive advantage

The Hook Model describes an experience designed to connect the user’s problem to a solution frequently enough to form a habit

The Hook Model has four phases: Trigger, Action, Variable Reward and Investment.

For some businesses, forming habits is a critical component to success, but not every business requires habitual user engagement.

When successful, forming strong user habits can have several business benefits, including higher customer lifetime value (CLTV), greater pricing flexibility, supercharged growth, and a sharper competitive edge.

Habits cannot form outside the Habit Zone, where the behavior occurs with enough frequency and perceived utility.

Habit-forming products often start as nice-to-haves (vitamins), but once the habit is formed they become must-haves (painkillers).

Habit-forming products alleviate users' pain by relieving a pronounced itch.

Designing habit-forming products is a form of manipulation. Product builders would benefit from a bit of introspection before attempting to hook users, to make sure they are building healthy habits, not unhealthy addictions.

Interview Questions

Ques 1. Assumptions of Logistic Regression

  1. Logistic regression models the relationship between a categorical target variable and independent variables, which may be continuous or categorical.
  2. Logistic regression does not assume normality or homogeneity of variance for the independent variables, and it does not require a linear relationship between them and the target itself. It does assume that the log-odds of the target are linear in the independent variables.
  3. The model equation is log(p / (1 - p)) = a + b1 x1 + b2 x2 + … + bn xn + e
  4. The value produced by logistic regression is a probability that varies between 0 and 1.
  5. The numeric independent variables should not be multicollinear.
  6. All categorical independent variables should be dummy coded.

Ques 2. Assumptions of Multivariate Linear Regression

  1. There should be a linear relationship between the target variable and the independent variables.
  2. The expected value of the error term, conditional on the independent variables, is zero.
  3. The error terms are homoskedastic, i.e. the variance of the error terms should be constant for all observations.
  4. Error terms should be uncorrelated with each other. We can check this through the expected value of the product of two different error terms, which should be zero.
  5. The error terms should be normally distributed.
  6. The independent variables should not have linear relationships with each other; that is, they should not be multicollinear.
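
Several of these assumptions can be checked roughly from the residuals of a fitted model. The following is a minimal sketch in R, assuming the built-in mtcars dataset as a stand-in; the dataset, the model formula, and all object names are illustrative assumptions, not part of the original questions.

```r
# Illustrative only: mtcars and the formula below are stand-ins
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Assumption 2: with an intercept, residuals average to (numerically) zero
mean_res <- mean(res)

# Assumption 4: lag-1 correlation of residuals (meaningful only for ordered data)
lag1_cor <- cor(res[-1], res[-length(res)])

# Assumption 5: Shapiro-Wilk test of residual normality
normality <- shapiro.test(res)

# Assumption 6: correlation between the two predictors
predictor_cor <- cor(mtcars$wt, mtcars$hp)

# Assumptions 1 and 3: residuals vs fitted values should show no curve or funnel
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

A small p-value from the Shapiro-Wilk test, a visible pattern in the plot, or a predictor correlation close to 1 or -1 would each suggest that the corresponding assumption deserves a closer look.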

Introduction to Chaid

Hello friends. In this post we will discuss a very important analytical technique called CHAID (Chi-square Automatic Interaction Detector). It is a type of decision tree technique that segments the data into different categories. For example, if you want to run a marketing campaign on a database of 100,000 users, you would like to target the set of users from whom you benefit most, in other words the users who give you the maximum response rate. So you need to classify users based on response rate.

Let’s understand some basic terms first.

Decision Tree Analysis: In short, a decision tree is a predictive modelling approach that partitions the dataset on the basis of a target variable. It is used in fields such as data mining and machine learning. It builds either classification trees (if the target variable is categorical) or regression trees (if the target variable is continuous).

A decision tree starts with the target variable, meaning the full population acts as the initial node, and splitting happens based on the statistical criterion used by the particular decision tree algorithm.

Parametric & Non-Parametric Techniques: The main difference lies in the assumptions about the distribution of the underlying variables. With a parametric technique we assume that our variables follow some form of distribution (linear, log-linear etc.), whereas with non-parametric techniques we make no assumption about the distribution of the variables.

Supervised & Unsupervised: In supervised learning there is a target variable that we need to estimate or predict based on the values of the independent variables. In unsupervised learning there is no target variable, and the objective is to find structure in the data.

Examples of supervised learning algorithms are linear regression, logistic regression, and decision trees, while an example of unsupervised learning is clustering.

Basic properties of CHAID

  1. It is a supervised learning algorithm, which means it works on a target variable. The target variable is supposed to be categorical, but with tools like R and SPSS it can also be performed on a continuous variable.
  2. It is a non-parametric technique, which means there are no assumptions about distribution.
  3. No outlier or missing value treatment is required, unlike techniques such as linear and logistic regression.
  4. It can be implemented very quickly compared to logistic and linear regression modelling.

Why CHAID if logistic regression is available

  1. CHAID provides a good visual representation of the data in the form of a tree, which can be easily understood by many.
  2. It can be implemented very quickly, as we are not doing much data correction or checking of distributions.
  3. There is no requirement for the data to follow any particular distribution.

Statistical Criteria for CHAID

CHAID uses the chi-square value of the target variable vs. each independent variable for splitting. The variable with the highest chi-square is used for the split. After splitting, the same process is repeated until the whole population is covered. The minimum number of observations required for a node is 5% of the total observations; if this criterion is not met, no further splitting occurs.
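
The split selection described above can be sketched in a few lines of R. This is only a simplified illustration: the data frame `toy`, its columns, and its counts are hypothetical. A full CHAID implementation does more, such as merging similar categories, which this sketch skips.

```r
# Hypothetical toy data: 200 customers, binary response, two categorical predictors
toy <- data.frame(
  response = rep(c(1, 0), times = c(40, 160)),
  gender   = c(rep("male", 30), rep("female", 10),
               rep("male", 70), rep("female", 90)),
  pets     = c(rep("yes", 20), rep("no", 20),
               rep("yes", 80), rep("no", 80)),
  stringsAsFactors = FALSE
)

# Chi-square statistic of each predictor against the target
predictors <- c("gender", "pets")
chi_sq <- sapply(predictors, function(v) {
  unname(chisq.test(table(toy[[v]], toy$response), correct = FALSE)$statistic)
})

# The predictor with the highest chi-square is the candidate for the first split
best_split <- names(which.max(chi_sq))

# Minimum node size rule from the text: 5% of all observations
min_node_size <- 0.05 * nrow(toy)
```

In this toy data `pets` is unrelated to the response by construction, so `gender` is chosen for the split.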

Terminal node: the last node of the tree. It should contain a minimum of 5% of the observations.

Parent node: generally a parent node should have at least 2.5 times as many observations as a child node.

Selection criteria for choosing nodes for a campaign: nodes whose event rate is greater than the starting event rate in the first node are selected for the campaign.

For example: A pizza outlet carried out a promotional campaign and reached out to 1000 individuals in a locality. Of the 1000 individuals approached, 200 responded by visiting the outlet. The outlet also collected the personal details of each customer. After the campaign period was over, the outlet carried out an analysis in order to classify the customers into various classes.

In the tree for this example, the response rate was 20% when we started building the tree. After building the tree using CHAID we found a response rate of 40% in 2 segments, so we will select these nodes:

The married male segment has a response rate of 40%.

The divorced with no pets segment has a response rate of 40%.
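
The node selection rule can be written out as a small R sketch. The 20% starting rate and the two 40% segments come from the example; the node sizes and the "All other customers" row are hypothetical numbers chosen only so that the totals add up to 1000 customers and 200 responders.

```r
# Node sizes below are hypothetical; only the rates match the example
nodes <- data.frame(
  segment    = c("Married male", "Divorced, no pets", "All other customers"),
  customers  = c(150, 100, 750),
  responders = c(60, 40, 100),
  stringsAsFactors = FALSE
)

# Starting event rate in the first node: 200 responders out of 1000 customers
baseline_rate <- sum(nodes$responders) / sum(nodes$customers)

# Event rate within each terminal node
nodes$rate <- nodes$responders / nodes$customers

# Select nodes whose event rate beats the starting event rate
selected <- nodes$segment[nodes$rate > baseline_rate]
```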

This is a brief introduction to the CHAID algorithm. There are many more algorithms in decision tree analysis, but CHAID is one of the most popular. Others include CART, bagging, and random forests.

Happy reading.


Basic Methodology Of An Analytical Model

Hello Friends,

In this post my focus will be on the approach to building a model. I will present my understanding of that approach. The steps covered here are basic steps that will help in building any model.

Step 1: Define Business Problem

Whenever we build a model we are actually trying to solve a business problem, so the first input for any model is simply the business problem for which we need to provide a solution. Simple examples of business problems are increasing sales, measuring the effectiveness of a marketing campaign, or predicting customer response to a company product. So defining the business problem is the first step, or we can say the input, to any model.

Step 2: Convert the Business Problem into an Analytical Problem

In the next step we need to analyse this problem and convert it into an analytical problem, in discussion with the important stakeholders. The main points to consider when analysing any business problem are:

  1. The variable for which we need to provide a solution: is it revenue, net profit, number of customers etc.? Basically we are trying to find our target variable. The most important point is not to think about any independent or predictor variable at this stage. Just focus on what we need to solve for.
  2. The granularity level at which the problem needs to be solved. For example, if we are talking about a retail sales problem, should we consider one retail point, the retail points of a particular region, or all retail points?
  3. The historical time period to consider for the target variable. It can be, say, the last 3 months, 1 year, or the last 3 years. It depends on how much historical data the modeller wants to consider.

Step 3: Generate Hypotheses for All Factors Related to Your Problem

Think of all possible hypotheses about factors that could affect our target variable. Hypotheses can be drawn from the following categories.

  1. Demographic variables: age, gender, social status etc. For example, which gender purchases our products more, or which age group generates more revenue.
  2. Macroeconomic factors. For example, is inflation or the rate of interest affecting our sales?
  3. Industry-specific factors, such as number of transactions or value of each transaction.

There can be many more categories or subcategories that we need to consider, based on our target variable and business problem.

Step 4: Prepare Dataset

Generate independent variables based on each hypothesis framed in the previous step. Each hypothesis will give one independent variable. For some independent variables, such as age or gender, data will be readily available. For others you will need to ask business owners for the data specifically, because not all data will be present in one place. Sometimes you may need to generate data by interviewing process owners or other concerned authorities.

Step 5: Data Audit

In this step you need to audit your dataset. This includes treatments such as missing value treatment, outlier treatment, and type checks on the data variables.
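
A data audit of this kind might look like the following in R. This is only a sketch: the vector `x` is hypothetical, and median imputation and percentile capping are just two common treatments among many, not a prescribed method.

```r
# Hypothetical column with one missing value and one extreme value
x <- c(12, 15, NA, 14, 13, 200, 16)

# Type check and missing value count
x_class   <- class(x)
n_missing <- sum(is.na(x))

# Missing value treatment: replace NA with the median of the observed values
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Outlier treatment: cap values at the 1st and 99th percentiles
caps     <- quantile(x_imputed, c(0.01, 0.99))
x_capped <- pmin(pmax(x_imputed, caps[1]), caps[2])
```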

Step 6: Finding out Relationships between variables

In this step we need to explore the relationships among the independent variables, and between the dependent and independent variables.

Methods for examining these relationships include:

  1. Between independent variables only: through a correlation matrix, multicollinearity values etc.
  2. Between dependent & independent variables: through bivariate analysis, chi-square testing, entropy value or information value.
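
Both kinds of check can be tried quickly in R. The sketch below assumes the built-in mtcars dataset as a stand-in, treating `am` (0/1) as the target; the variable choices are purely illustrative.

```r
# Illustrative only: mtcars stands in for a modelling dataset, `am` is the target
num_vars   <- c("wt", "hp", "disp")
cor_matrix <- cor(mtcars[, num_vars])    # between independent variables

# Bivariate view: event rate of the target within each level of a predictor
event_rate_by_vs <- tapply(mtcars$am, mtcars$vs, mean)

# Chi-square test of association between a categorical predictor and the target
chi <- chisq.test(table(mtcars$vs, mtcars$am))
```

Large off-diagonal values in `cor_matrix` point to related predictors, while a small `chi$p.value` suggests the predictor is associated with the target.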

Step 7: Model Building & Validation

In this step we simply build our model. It could be linear regression, logistic regression, a decision tree etc., depending on the type of dependent variable and the type of problem. You also need to validate your model using the metrics appropriate to that model. For example, for a linear regression model the R-squared and adjusted R-squared values will do, or you can do residual analysis as well. For logistic regression you need to consider the K-S statistic, ROC curve, and gain/lift values. If there are multiple logistic models with the same dependent variable and almost the same independent variables, you need to compare their AIC values.

Step 8: Production

After validating the model you need to run your model on new data, or on the data for which you need to predict values. If required, make some changes and then rerun the model.

In my view, the steps above are the basic general steps required for most models.

Thanks friends for reading this post. Please share your feedback regarding this post.

Happy reading.


Logistic Regression In R

In our last discussion (Logistic Regression Introduction) we discussed some basics of logistic regression; now we will see all those concepts with the help of a model.
To set the context, here is a brief description of the data.
The data relates to direct marketing campaigns of a Portuguese banking institution. Most of the marketing campaigns were based on phone calls. The question we need to answer is whether a customer will go for a term deposit. So our target variable is categorical and takes the values Yes (1) or No (0). This is a clear case for logistic regression, as our target variable is categorical and we can predict the odds of yes or no, or simply put the probability of yes or no, based on the independent variables.
Independent variables include age, job type, educational background, any personal or housing loan, when the customer was last contacted, employment status, consumer price index, Euribor 3-month rate etc. If you want complete information, please refer to the link (
The tool that we are going to use to build our model is “R”.
OK, let’s start with our model. First we need to import the data into R and clean it as required.
data_bank_Orig=read.table("bank-additional.csv",sep=";",header = TRUE)
summary(complete.cases(data_bank_Orig))
##    Mode    TRUE    NA's 
## logical    4119       0
The above lines import the data into R and store it in a variable named “data_bank_Orig”. complete.cases checks for any missing values in the data; here all 4119 rows are complete.
In the data preparation code we removed one column named duration from the data, as it was of least use in the analysis. Similarly, we coded our target variable as 1s and 0s: if the customer said yes to the term deposit we coded 1, and if the customer said no we coded 0.
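
As a rough sketch of that preparation step, the snippet below drops a column and recodes a yes/no target on a tiny made-up data frame. The data frame `toy_bank`, its values, and the raw target column name `y` are assumptions for illustration; only the names duration and termdeposit come from this post.

```r
# Toy stand-in for the bank data; the raw target column name `y` is an assumption
toy_bank <- data.frame(
  age      = c(30, 45, 52),
  duration = c(120, 300, 45),
  y        = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Drop the duration column
toy_audit <- toy_bank[, names(toy_bank) != "duration"]

# Code the target as 1 for "yes" and 0 for "no", then drop the raw column
toy_audit$termdeposit <- ifelse(toy_audit$y == "yes", 1, 0)
toy_audit$y <- NULL
```
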
I would like to mention one point: please don’t focus too much on the syntax of the R statements, as it will be somewhat different in other tools, but the analysis and methodology will remain almost the same.
You also need to run bivariate analysis and a correlation matrix between the dependent and independent variables. In this post I am focusing on a basic logistic model, so I am skipping bivariate analysis and the like.
After data editing and bivariate analysis, the next step is to check for multicollinearity. It is very important to check for multicollinearity when you run any multivariate analysis. A multicollinearity check tests whether independent variables are dependent on each other. If some independent variables depend on other independent variables, you need to consider whether to include those variables in your model or not.
We will check for multicollinearity using VIF (variance inflation factor) values. The VIF value is calculated as
VIF = 1/(1 - R-squared). From the R-squared you may have guessed that it is related to linear regression. Yes: to calculate VIF values, we regress each numerical independent variable on the other independent variables and take the R-squared value. The higher the R-squared value, the more strongly that variable is related to the other independent variables, and a higher R-squared eventually gives a higher VIF value. So all independent variables with a VIF greater than 2 need to be reconsidered for the model.
One important point here is that we can check collinearity this way for numerical variables only. For categorical variables, checking collinearity by linear regression is not possible.
## Call:
## lm(formula = termdeposit ~ age + campaign + pdays + previous + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed, data = data_bank_audit_vif)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72875 -0.11186 -0.05877 -0.01508  1.01655 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.182e+00  3.176e+00  -1.317   0.1879    
## age             9.970e-04  4.314e-04   2.311   0.0209 *  
## campaign       -2.971e-03  1.755e-03  -1.692   0.0906 .  
## pdays          -3.672e-04  2.923e-05 -12.560  < 2e-16 ***
## previous       -1.215e-02  1.101e-02  -1.104   0.2697    
## emp.var.rate   -3.468e-02  1.597e-02  -2.172   0.0299 *  
## cons.price.idx  8.067e-02  1.918e-02   4.206 2.66e-05 ***
## cons.conf.idx   6.173e-03  1.531e-03   4.031 5.66e-05 ***
## euribor3m      -1.399e-02  2.020e-02  -0.693   0.4886    
## nr.employed    -5.092e-04  3.345e-04  -1.522   0.1280    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.2835 on 4109 degrees of freedom
## Multiple R-squared:  0.1779,	Adjusted R-squared:  0.1761 
## F-statistic: 98.79 on 9 and 4109 DF,  p-value: < 2.2e-16
##            age       campaign          pdays       previous   emp.var.rate 
##       1.014261       1.041609       1.613374       1.822222      31.924471 
## cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##       6.328363       2.537052      62.833530      31.117315
#removed euribor3m
##            age       campaign          pdays       previous   emp.var.rate 
##       1.014211       1.033898       1.613285       1.822216      23.094715 
## cons.price.idx  cons.conf.idx    nr.employed 
##       5.500287       1.293035      12.919532
#removing emp.var.rate
##            age       campaign          pdays       previous cons.price.idx 
##       1.014197       1.033614       1.613277       1.814689       1.328530 
##  cons.conf.idx    nr.employed 
##       1.051378       1.801063
In the above code we used the R library car to calculate VIF values. Several variables had a VIF greater than 2, so we removed the two with the highest values (euribor3m, then emp.var.rate), after which every remaining VIF was below 2. The ideal approach would be to check from a business point of view before removing any variable, but as this is just an example we are fine with removing them.
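
To make the VIF = 1/(1 - R-squared) formula concrete, here is a minimal base-R sketch that computes it by hand. The helper `manual_vif` is hypothetical, and mtcars with the chosen predictors is only a stand-in for the bank data.

```r
# Manual VIF: regress each numeric predictor on the others, then 1 / (1 - R^2).
# manual_vif is a hypothetical helper; mtcars is a stand-in dataset.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- manual_vif(mtcars, c("wt", "hp", "disp", "qsec"))

# Variables above the cutoff of 2 used in this post
high_vif <- names(vif_values)[vif_values > 2]
```

A typical workflow removes the variable with the highest VIF, recomputes, and repeats until all values fall under the chosen cutoff, which is the sequence shown in the output above.
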
The function that we are going to use for logistic regression is glm, the modelling function for generalized linear models, with the family attribute set to binomial(‘logit’). This specifies that the dependent variable is binary and that the logit link function is used. So let’s use it with our variables.
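# fit the logistic model and print its summary
summary(glm(termdeposit ~ age+job+marital+education+default+housing+loan+contact+month+day_of_week+
                          campaign+pdays+previous+poutcome+
                          cons.price.idx+cons.conf.idx+nr.employed , family = binomial("logit"),data=data_bank_audit))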
## Call:
## glm(formula = termdeposit ~ age + job + marital + education + 
##     default + housing + loan + contact + month + day_of_week + 
##     campaign + pdays + previous + poutcome + cons.price.idx + 
##     cons.conf.idx + nr.employed, family = binomial("logit"), 
##     data = data_bank_audit)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0136  -0.3971  -0.3218  -0.2522   2.8617  
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   3.956e+01  1.238e+01   3.196 0.001391 ** 
## age                           1.660e-02  6.853e-03   2.422 0.015434 *  
## jobblue-collar               -3.211e-01  2.251e-01  -1.426 0.153805    
## jobentrepreneur              -5.301e-01  3.975e-01  -1.334 0.182302    
## jobhousemaid                 -2.040e-01  3.977e-01  -0.513 0.607977    
## jobmanagement                -4.594e-01  2.486e-01  -1.848 0.064667 .  
## jobretired                   -2.674e-01  3.026e-01  -0.883 0.377008    
## jobself-employed             -5.911e-01  3.489e-01  -1.694 0.090264 .  
## jobservices                  -1.916e-01  2.390e-01  -0.802 0.422638    
## jobstudent                    2.953e-02  3.491e-01   0.085 0.932582    
## jobtechnician                -5.521e-02  1.916e-01  -0.288 0.773260    
## jobunemployed                 1.098e-01  3.332e-01   0.330 0.741776    
## jobunknown                   -4.857e-01  6.359e-01  -0.764 0.444994    
## maritalmarried                1.568e-01  2.012e-01   0.779 0.435797    
## maritalsingle                 2.720e-01  2.305e-01   1.180 0.237943    
## maritalunknown                8.078e-02  1.173e+00   0.069 0.945095    
## educationbasic.6y             2.698e-01  3.392e-01   0.796 0.426268    
## educationbasic.9y             1.471e-01  2.724e-01   0.540 0.589213    
## educationhigh.school          1.149e-01  2.616e-01   0.439 0.660580    
## educationilliterate          -1.190e+01  5.354e+02  -0.022 0.982269    
## educationprofessional.course  2.198e-01  2.848e-01   0.772 0.440397    
## educationuniversity.degree    2.009e-01  2.628e-01   0.765 0.444473    
## educationunknown              2.272e-01  3.441e-01   0.660 0.508960    
## defaultunknown               -2.953e-02  1.780e-01  -0.166 0.868219    
## defaultyes                   -1.030e+01  5.354e+02  -0.019 0.984652    
## housingunknown               -3.679e-01  4.257e-01  -0.864 0.387466    
## housingyes                   -1.073e-01  1.171e-01  -0.916 0.359734    
## loanunknown                          NA         NA      NA       NA    
## loanyes                      -7.931e-02  1.592e-01  -0.498 0.618370    
## contacttelephone             -8.291e-01  2.151e-01  -3.855 0.000116 ***
## monthaug                     -4.856e-01  3.093e-01  -1.570 0.116394    
## monthdec                      5.808e-01  5.462e-01   1.063 0.287678    
## monthjul                     -1.475e-02  2.950e-01  -0.050 0.960132    
## monthjun                      6.073e-01  2.725e-01   2.229 0.025845 *  
## monthmar                      1.334e+00  3.863e-01   3.452 0.000556 ***
## monthmay                     -5.189e-01  2.341e-01  -2.216 0.026663 *  
## monthnov                     -5.432e-01  2.886e-01  -1.882 0.059824 .  
## monthoct                     -4.661e-01  3.783e-01  -1.232 0.217919    
## monthsep                     -7.412e-01  3.950e-01  -1.876 0.060600 .  
## day_of_weekmon               -5.000e-02  1.820e-01  -0.275 0.783503    
## day_of_weekthu                1.182e-02  1.823e-01   0.065 0.948310    
## day_of_weektue               -3.167e-02  1.867e-01  -0.170 0.865338    
## day_of_weekwed                1.378e-01  1.876e-01   0.735 0.462567    
## campaign                     -7.912e-02  3.423e-02  -2.312 0.020804 *  
## pdays                        -4.137e-04  5.879e-04  -0.704 0.481663    
## previous                      1.689e-01  1.640e-01   1.030 0.303097    
## poutcomenonexistent           6.519e-01  2.706e-01   2.409 0.016005 *  
## poutcomesuccess               1.363e+00  5.831e-01   2.337 0.019438 *  
## cons.price.idx                1.130e-01  1.263e-01   0.895 0.370655    
## cons.conf.idx                 4.155e-02  1.544e-02   2.691 0.007130 ** 
## nr.employed                  -9.919e-03  9.725e-04 -10.200  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 2845.8  on 4118  degrees of freedom
## Residual deviance: 2212.0  on 4069  degrees of freedom
## AIC: 2312
## Number of Fisher Scoring iterations: 12
## Warning in predict.lm(object, newdata,, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

After running this model we obtained our predicted values using the predict function, based on the model. To reiterate my previous point, the predicted values are probabilities of the outcome being 1, so they lie between 0 and 1.
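
The fit-then-predict pattern can be tried on any small dataset. The sketch below assumes the built-in mtcars data with `am` (0/1) as a stand-in binary target; the object names are illustrative and not the ones used for the bank model.

```r
# Illustrative only: mtcars with `am` (0/1) as a stand-in binary target
toy_model <- glm(am ~ wt + hp, family = binomial("logit"), data = mtcars)

# type = "response" returns probabilities of the outcome being 1
pred_prob <- predict(toy_model, type = "response")

# The default, type = "link", returns the log-odds instead
pred_logit <- predict(toy_model, type = "link")
```

Applying the inverse logit to `pred_logit` (for example with `plogis`) gives back `pred_prob`, which is why the two scales carry the same information.
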
Now we need to validate this model using the metrics that we discussed in the last post.
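library(ROCR)
## Loading required package: gplots
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##     lowess
# prediction object built from the predicted probabilities and the actual target (assumed call)
pred_bank_rocr = prediction(predicted_banktermdeposit, data_bank_audit$termdeposit)
performance_bankmodel = performance(pred_bank_rocr, "tpr", "fpr")
# K-S statistic: maximum gap between true positive rate and false positive rate (assumed call)
max(performance_bankmodel@y.values[[1]] - performance_bankmodel@x.values[[1]])
## [1] 0.4893457
plot(performance_bankmodel, main = "ROC curve", colorize = TRUE)
abline(0, 1, lty = 8, col = "red")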
[Figure: ROC curve]
With the help of the ROCR library in R we calculated the K-S statistic, which comes out to about 49% (0.489) for this model, well within range. We have also plotted the ROC curve.
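
For readers without ROCR, the K-S statistic can also be sketched in base R. The helper `ks_statistic` and the scores below are hypothetical, and the sketch assumes no tied scores.

```r
# Base-R K-S sketch: maximum gap between cumulative event and non-event rates
# when observations are sorted from highest to lowest predicted score.
ks_statistic <- function(actual, score) {
  ord    <- order(score, decreasing = TRUE)
  actual <- actual[ord]
  cum_event     <- cumsum(actual == 1) / sum(actual == 1)
  cum_non_event <- cumsum(actual == 0) / sum(actual == 0)
  max(abs(cum_event - cum_non_event))
}

# Hypothetical actual outcomes and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
ks <- ks_statistic(actual, score)
```

This is the same quantity as the largest vertical distance between the ROC curve and the diagonal, since the cumulative event rate is the true positive rate and the cumulative non-event rate is the false positive rate.
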
library(gains)
gains_data_bank = gains(data_bank_audit$termdeposit, predicted_banktermdeposit, groups = 100, percents = FALSE)
[Figure: gain chart]
## [1] 0.5851077
In the above lines of code we plotted the gain chart and calculated the Gini coefficient, which comes out to about 0.585, well within range.
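
One common way to obtain a Gini coefficient for a scoring model is Gini = 2 * AUC - 1. Whether the value above was computed this way is an assumption; the helper `gini_coefficient` and the toy scores are hypothetical.

```r
# Gini = 2 * AUC - 1, with AUC from the rank-sum (Mann-Whitney) identity
gini_coefficient <- function(actual, score) {
  r   <- rank(score)                 # average ranks handle tied scores
  n1  <- sum(actual == 1)
  n0  <- sum(actual == 0)
  auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  2 * auc - 1
}

# Hypothetical actual outcomes and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
gini <- gini_coefficient(actual, score)
```

A model that ranks every 1 above every 0 gets a Gini of 1, and a model no better than random ranking gets a value near 0.
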
With this we get a model that can predict whether a customer will go for a term deposit.
This post may seem to focus more on R statements, but what we need to focus on is the methodology and the metrics used to evaluate any model. The tool can be different, but the methodology will be almost the same.
That’s it for now. My next post will be about linear regression. You must be wondering why I am writing about linear regression after covering logistic regression; I’ll answer this as well in my next post. Till then, goodbye. Happy reading.

Introduction Of Logistic Regression

Logistic Regression Introduction 

If you want to learn about analytics or modelling, you will often have come across the term logistic regression. You may have searched on Google and found that it is a regression technique that deals with a categorical dependent variable. We will take this as our definition, and with it we can differentiate it from linear regression, where the dependent variable is numerical.

Now we have understood the basic difference between logistic regression and linear regression, but first we need to understand why we can’t apply linear regression to categorical variables.

Linear regression follows the principle of Ordinary Least Squares, which means it finds the combination of coefficients for which the sum of squared errors of the predicted values is as small as possible.


In linear regression modelling the dependent values are numerical, so it is easy for the model to perform mathematical operations such as addition, subtraction, squaring etc. But what if the dependent variable can take only 2 values, yes or no (1 or 0)? In most real-life situations we face questions like should we take this or not, is it good or bad, will my customer come to me or not. In business you often need to predict in terms of yes or no, so linear regression is not well suited to this type of situation.

To deal with this situation you need a function that can help you predict yes or no with a regression equation of the form Y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn + e.

For such a dependent variable, the best we can do is count the number of yes and no outcomes to predict something from them, or, thinking statistically, find the probability of yes or no. This is where we move towards logistic modelling.

Logistic regression models the log of the odds, that is, the log of the ratio of the probability of occurrence to the probability of non-occurrence.

So now the equation will be

log(odds) = log(p / (1 - p)) = b0 + b1x1 + b2x2 + b3x3 + … + bnxn + e

Odds have a range of 0 to infinity.

But the range of log(odds) is -infinity to +infinity, which is similar to the linear regression equation.

So, solving for p, we get

p = 1 / (1 + e^(-z))

where z = b0 + b1x1 + b2x2 + … + bnxn + e

So in simple terms, changing x changes the probability of the dependent variable.
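
The relationship between probability, odds, and log-odds can be checked numerically. This is a small sketch in R with arbitrary example probabilities; `qlogis` and `plogis` are base R's built-in logit and inverse logit functions.

```r
# Arbitrary example probabilities
p <- c(0.1, 0.5, 0.9)

odds     <- p / (1 - p)   # ranges from 0 to infinity
log_odds <- log(odds)     # ranges from -infinity to +infinity; same as qlogis(p)

# Inverse logit: recover the probability from z = log(odds)
p_back <- 1 / (1 + exp(-log_odds))   # same as plogis(log_odds)
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0, which is why a z of zero sits at the midpoint of the S-shaped curve.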

So now the independent variables affect the probability of occurrence instead of a numerical value. This is where our hypothesis differs from the linear model.

It is called logistic regression because the probability follows the logistic function, which has an S-shaped curve.

Now we have some idea of logistic regression.

We need to discuss some more terms related to logistic modelling

AIC Value: Akaike Information Criterion. It is used to compare two logistic models with the same dependent variable and almost the same independent variables. The lower the AIC, the better the model.
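
Comparing two candidate models by AIC takes one call in R. The sketch below assumes the built-in mtcars data with `am` (0/1) as a stand-in target; both formulas and the object names are illustrative.

```r
# Illustrative only: two logistic models on the same stand-in target
m1 <- glm(am ~ wt, family = binomial("logit"), data = mtcars)
m2 <- glm(am ~ wt + hp, family = binomial("logit"), data = mtcars)

# AIC() returns one row per model, with columns df and AIC
aic_table <- AIC(m1, m2)

# The model with the lower AIC is preferred
better <- rownames(aic_table)[which.min(aic_table$AIC)]
```

AIC equals -2 times the log-likelihood plus 2 times the number of estimated parameters, so an extra variable only lowers it when the improvement in fit outweighs the penalty.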

K-S Statistic (Kolmogorov-Smirnov): It is a measure of the separation between the cumulative good and cumulative bad distributions (or positive and negative). In other words, it is the maximum difference between the cumulative event rate and the cumulative non-event rate.

Major guidelines for the K-S statistic

  1. It should lie between 40% and 70%.
  2. The maximum should occur within the top 3 or 4 deciles.

One major point is that the separation is greatest at the K-S point. Beyond it, the gain in cumulative good proportion falls off, so adding more of the population after the K-S point yields little additional benefit.

Gini Coefficient: It is a measure of inequality, or simply put, of how dispersed the values are. For a good logistic model it should be between 40% and 60%.


Concordance: It plays a role similar to the R-squared value of linear regression, in that it is used to validate the model. The higher the concordance, the better the model. It is the proportion of (1, 0) pairs in which the actual 1 receives a higher predicted probability than the actual 0.

Concordance = (pairs in which the 1 has the greater predicted probability) / (total pairs)

To calculate concordance, we pair every actual 1 with every actual 0, select the pairs in which the 1 has a higher predicted probability than the 0, and divide by the total number of pairs.

Discordance counts the pairs in which the 0 has a higher predicted probability than the 1.

And if the 1 and the 0 in a pair have the same predicted probability, the pair is called tied.

So the complete set of pairs is made up as

Total = concordance + discordance + tied
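
The pair counting described above can be sketched directly in R. The helper `concordance_summary` and the toy scores are hypothetical; `outer` builds every (1, 0) pair at once, which is fine for small data but memory-heavy for large datasets.

```r
# Share of concordant, discordant and tied pairs among all (1, 0) pairs
concordance_summary <- function(actual, score) {
  ones  <- score[actual == 1]
  zeros <- score[actual == 0]
  diff  <- outer(ones, zeros, "-")   # one entry per (1, 0) pair
  total <- length(diff)
  c(concordant = sum(diff > 0) / total,
    discordant = sum(diff < 0) / total,
    tied       = sum(diff == 0) / total)
}

# Hypothetical actual outcomes and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 0)
score  <- c(0.9, 0.6, 0.6, 0.4, 0.5, 0.2)
res <- concordance_summary(actual, score)
```

Here there are 3 x 3 = 9 pairs: 6 concordant, 2 discordant and 1 tied, so the three shares add up to 1.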

These were some basics of logistic regression. In the next post we will create models of linear regression and logistic regression.