Introduction to CHAID
Hello friends, in this post we will discuss an important analytical technique called CHAID (Chi-squared Automatic Interaction Detection). It is a decision tree technique that splits the data into segments based on a target variable. For example, if you want to run a marketing campaign on a database of 100,000 users, you would like to target the users from whom you benefit most, in other words those who give you the highest response rate. To do that, you need to segment the users by response rate.
Let’s understand some basic terms first.
Decision Tree Analysis: In short, a decision tree is a predictive modelling approach that repeatedly splits a dataset according to how the predictors relate to a target variable. It is used in fields such as data mining and machine learning. It builds either classification trees (if the target variable is categorical) or regression trees (if the target variable is continuous).
A decision tree starts from a root node that contains all observations and summarises the target variable. The nodes are then split using a statistical criterion, which differs from one decision tree algorithm to another.
Parametric & Non-Parametric Techniques: The main difference is the assumptions made about the distribution of the underlying variables. With a parametric technique we assume that the variables follow some known form (linear, log-linear, etc.), whereas with a non-parametric technique we make no assumption about the distribution of the variables.
Supervised & Unsupervised: In supervised learning there is a target variable that we need to estimate or predict from the values of the independent variables. In unsupervised learning there is no target variable, and the objective is to find structure in the data.
Examples of supervised learning algorithms are linear regression, logistic regression and decision trees, while an example of unsupervised learning is clustering.
Basic properties of CHAID
- It is a supervised learning algorithm, which means it works on a target variable. The target variable is supposed to be categorical, but tools like R and SPSS can also run it on a continuous target.
- It is a non-parametric technique, which means it makes no assumptions about the distribution of the data.
- It does not require the outlier and missing value treatment that techniques like linear and logistic regression need.
- It can be implemented very quickly compared to logistic and linear regression modelling.
Why CHAID if logistic regression is available
- CHAID provides a good visual representation of the data in the form of a tree, which many people find easy to understand.
- It can be implemented very quickly, as we do not spend much time on data correction or on checking distributions.
- It does not require the data to follow any particular distribution.
Statistical criteria for CHAID
CHAID uses the chi-square statistic of the target variable against each independent variable to decide splits. The variable with the highest chi-square value, that is, the one most strongly associated with the target, is used for the split. After a split, the same process is repeated within each child node until no further split is possible. As a rule of thumb, a node should contain at least 5% of the total observations; if a split would not meet this criterion, no further splitting occurs.
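To make the splitting rule concrete, here is a minimal sketch in plain Python that computes the Pearson chi-square statistic of each candidate predictor against the target and picks the largest one. The function names and the toy data are made up for illustration. It also leaves out parts of a full CHAID implementation, which typically merges similar categories and compares adjusted p-values rather than raw chi-square values.

```python
from collections import Counter


def chi_square(predictor, target):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(target)
    observed = Counter(zip(predictor, target))  # cell counts of the cross-tab
    row_totals = Counter(predictor)
    col_totals = Counter(target)
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n  # count expected under independence
            stat += (observed[(row, col)] - expected) ** 2 / expected
    return stat


def best_split(predictors, target):
    """Return the predictor name with the highest chi-square, plus all scores."""
    scores = {name: chi_square(values, target) for name, values in predictors.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy data: 8 customers, whether they responded, and two predictors.
target = ["yes", "yes", "no", "no", "yes", "no", "no", "no"]
predictors = {
    "gender": ["M", "M", "F", "F", "M", "F", "F", "M"],
    "has_pet": ["y", "n", "y", "n", "y", "n", "y", "n"],
}

best, scores = best_split(predictors, target)
print(best, scores)
```

In this toy data "gender" separates responders from non-responders much better than "has_pet", so it gets the higher chi-square value and would be chosen for the first split.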
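Terminal node: the last node of a branch, which is not split any further. It should contain at least 5% of the observations.
Parent node: a node that is split into child nodes. Generally, a parent node has at least 2.5 times as many observations as a child node.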
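The 5% minimum node size can be sketched as a simple check. The function names below are hypothetical and the 5% threshold is the rule of thumb from this post, not a fixed constant of the algorithm; the 2.5 times parent/child guideline is not enforced in this sketch.

```python
import math


def min_node_size(total_observations, min_percent=5):
    """Smallest node size allowed under the 5% rule of thumb."""
    return math.ceil(total_observations * min_percent / 100)


def split_allowed(child_sizes, total_observations, min_percent=5):
    """Keep a split only if every resulting child node is large enough."""
    threshold = min_node_size(total_observations, min_percent)
    return all(size >= threshold for size in child_sizes)


total = 1000
print(min_node_size(total))              # 50
print(split_allowed([600, 400], total))  # True
print(split_allowed([960, 40], total))   # False, one child is below 50
```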
Selection criteria for choosing nodes for a campaign: nodes whose event rate is greater than the starting event rate in the first (root) node are selected for the campaign.
For example: a pizza outlet carried out a promotional campaign and reached out to 1000 individuals in a locality. Out of the 1000 individuals approached, 200 responded by visiting the outlet. The outlet also collected the personal details of the customers. After the campaign period was over, the outlet carried out an analysis to classify the customers into various segments.
In the tree built for this example, the response rate at the start was 20%. After building the tree using CHAID, we found a response rate of 40% in two segments, so we will select these nodes:
- The married male segment has a response rate of 40%
- The divorced with no pets segment has a response rate of 40%
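The selection rule can be sketched in a few lines of Python. The 1000 customers, the 200 responders and the two 40% segments come from the pizza example; the segment sizes and the "everyone else" node are hypothetical numbers chosen only so that the totals add up.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    size: int        # number of customers in the node
    responders: int  # number of customers who responded

    @property
    def response_rate(self):
        return self.responders / self.size


def select_nodes(root, terminal_nodes):
    """Keep terminal nodes whose response rate beats the root node's rate."""
    return [n for n in terminal_nodes if n.response_rate > root.response_rate]


root = Node("all customers", 1000, 200)  # 20% starting response rate

# Segment sizes below are assumptions for illustration, not from the example.
terminal_nodes = [
    Node("married male", 150, 60),       # 40%
    Node("divorced, no pets", 100, 40),  # 40%
    Node("everyone else", 750, 100),     # about 13%
]

selected = select_nodes(root, terminal_nodes)
for node in selected:
    print(f"{node.name}: {node.response_rate:.0%}")
```

Only the two 40% segments beat the 20% starting rate, so only those two nodes would be targeted in the next campaign.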
This was a brief introduction to the CHAID algorithm. There are many other tree-based methods, but CHAID is one of the most popular. Others include CART and ensemble methods such as bagging and random forests.