Predicting Attrited Bank Customers Using Machine Learning
Customer attrition, also known as customer churn or customer turnover, is the loss of customers by a business. It is one of the biggest concerns in banking in particular, since customers are the most valuable part of a bank.
We use the bank’s customer data and Machine Learning to predict which customers are likely to become attrited, so that the bank can act before the attrition happens.
Dataset
This data was obtained from Kaggle. It is a sample of credit card customer accounts starting from March 2013, with attrition defined over the following 6 months (April 2013 to October 2013).
The data contains a sample of 10,127 customers and includes features about their demographic profile, such as gender, age, and education level, as well as their transaction history.
Exploratory Data Analysis (EDA)
Our goal here is to understand the bank customers by drawing insights through visualization and identifying the possible reasons for attrition.
Class Distribution
The data is split into 2 classes that identify the customer’s attrition status: Existing Customer and Attrited Customer. Existing Customers represent 83.9% of our sample and Attrited Customers represent 16.1%. The classes are clearly imbalanced, which might make it difficult for the model to identify attrited customers.
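The class shares above can be checked with a couple of lines of pandas. This is a minimal sketch on synthetic data: the 1,000 rows are built only to mirror the reported split, and the column name `Attrition_Flag` is an assumption, not something confirmed by the text.

```python
import pandas as pd

# Synthetic stand-in for the real data: 1,000 rows built to mirror the
# 83.9% / 16.1% split reported above. "Attrition_Flag" is an assumed column name.
df = pd.DataFrame(
    {"Attrition_Flag": ["Existing Customer"] * 839 + ["Attrited Customer"] * 161}
)

# normalize=True returns shares instead of raw counts.
class_share = df["Attrition_Flag"].value_counts(normalize=True)
print(class_share.round(3))

# Imbalance ratio: how many existing customers there are per attrited customer.
imbalance_ratio = class_share["Existing Customer"] / class_share["Attrited Customer"]
print(f"Existing customers per attrited customer: {imbalance_ratio:.1f}")
```

A ratio of roughly five to one is why a model can reach a high accuracy while still missing many attrited customers.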
Age Range
The highest number of Attrited customers are between 40 and 54 years old, and the lowest number are between 65 and 69 years old.
The highest number of Existing customers are between 40 and 54 years old, and the lowest number are between 75 and 79 years old.
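Age ranges like these can be produced by binning the raw age into 5-year groups and counting customers per group and class. The sketch below uses a handful of made-up rows; the column names `Customer_Age` and `Attrition_Flag` and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed.
df = pd.DataFrame(
    {
        "Customer_Age": [26, 41, 44, 47, 52, 53, 58, 66, 45, 49],
        "Attrition_Flag": [
            "Existing Customer", "Attrited Customer", "Existing Customer",
            "Attrited Customer", "Existing Customer", "Attrited Customer",
            "Existing Customer", "Existing Customer", "Existing Customer",
            "Existing Customer",
        ],
    }
)

# 5-year bins: [25, 30), [30, 35), ..., [75, 80), labeled "25-29", "30-34", ...
bin_edges = list(range(25, 85, 5))
bin_labels = [f"{low}-{low + 4}" for low in bin_edges[:-1]]
df["Age_Range"] = pd.cut(
    df["Customer_Age"], bins=bin_edges, labels=bin_labels, right=False
)

# Count customers per age range and attrition status.
age_counts = pd.crosstab(df["Age_Range"], df["Attrition_Flag"])
print(age_counts)
```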
Gender
The number of Female customers is higher in both Attrited and Existing Customers.
Credit Card Type
The highest number of attrited and existing customers are blue cardholders, and the lowest number of attrited and existing customers are platinum cardholders.
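Statements of the form "highest and lowest number per class" can be read directly off a cross-tabulation. Here is a small sketch with invented counts; the column names `Card_Category` and `Attrition_Flag` are assumptions, and the same pattern would apply to gender or income category.

```python
import pandas as pd

# Invented counts for illustration; column names are assumed.
rows = (
    [("Blue", "Existing Customer")] * 8
    + [("Silver", "Existing Customer")] * 4
    + [("Gold", "Existing Customer")] * 3
    + [("Platinum", "Existing Customer")] * 2
    + [("Blue", "Attrited Customer")] * 4
    + [("Silver", "Attrited Customer")] * 3
    + [("Gold", "Attrited Customer")] * 2
    + [("Platinum", "Attrited Customer")] * 1
)
df = pd.DataFrame(rows, columns=["Card_Category", "Attrition_Flag"])

# Rows are card types, columns are attrition classes, cells are customer counts.
card_counts = pd.crosstab(df["Card_Category"], df["Attrition_Flag"])

# For each class, the card type with the most and the fewest customers.
most_common = card_counts.idxmax()
least_common = card_counts.idxmin()
print(most_common)
print(least_common)
```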
Annual Income
The highest number of attrited and existing customers have an annual income of less than 40K dollars, and the lowest number of attrited and existing customers have an annual income of more than 120K dollars.
Period Of Relationship With The Bank
The longest period that existing and attrited customers spent with the bank is between 35 and 39 months, and the shortest period they spent is between 10 and 14 months.
Preprocessing The Data
To prepare the data for modeling, we have to encode its categorical features into numbers. The categorical features were encoded ordinally, since most of them are ordinal, such as the education level, credit card type, and income category.
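One way to encode an ordinal feature is scikit-learn's `OrdinalEncoder` with an explicit category order. This is only a sketch: the column name `Card_Category`, the Silver and Gold tiers, and the order Blue < Silver < Gold < Platinum are assumptions for illustration, and the original work may have encoded the features differently.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column; the name and the category order below are assumptions.
df = pd.DataFrame({"Card_Category": ["Blue", "Gold", "Silver", "Platinum", "Blue"]})

# Listing the categories explicitly makes the integer codes follow the
# real-world order instead of alphabetical order.
card_order = ["Blue", "Silver", "Gold", "Platinum"]
encoder = OrdinalEncoder(categories=[card_order])

# fit_transform expects a 2D input and returns a 2D float array.
df["Card_Category_Encoded"] = encoder.fit_transform(df[["Card_Category"]]).ravel()
print(df)
```

Without the explicit order, the codes would be assigned alphabetically and "Gold" would rank below "Platinum" only by accident.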
Then the data was split into an 80% training set used to fit the model and a 20% test set used to evaluate how the trained model performs.
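An 80/20 split can be sketched with `train_test_split`. The data here is synthetic, and two choices are assumptions that the text does not state: `stratify=y`, which keeps the class ratio the same in both parts, and the fixed `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels; 1 stands for the majority class (83.9%).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 839 + [0] * 161)

# test_size=0.2 gives the 80% / 20% split. stratify=y is an assumption:
# it preserves the class imbalance in both the train and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(f"Majority share in test set: {y_test.mean():.3f}")
```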
Modeling
First, we created a Baseline model that predicts the most frequent class for every sample in the test set. Then, we used a machine learning ensemble method that runs multiple machine learning algorithms at once to obtain better predictive performance than trying one algorithm at a time. Finally, we picked the highest-scoring algorithm and tuned its hyperparameters with GridSearchCV to achieve better results.
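The three steps described here (baseline, several algorithms side by side, then a grid search) can be sketched as follows. This is not the original code: the data is synthetic, the candidate list and the parameter grid are assumptions, and the comparison is a plain loop over models rather than any specific ensemble API. Only the Random Forest and GridSearchCV are named in this article.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (about 84% in class 1).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.16, 0.84], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 1: baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Step 2: fit several candidates and compare test accuracy.
# The candidate list is an assumption made for this sketch.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
best_name = max(scores, key=scores.get)

# Step 3: tune a Random Forest with a small, assumed parameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
grid.fit(X_train, y_train)
tuned_accuracy = grid.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Best untuned model: {best_name} ({scores[best_name]:.3f})")
print(f"Tuned Random Forest: {tuned_accuracy:.3f} with {grid.best_params_}")
```

The baseline accuracy equals the majority-class share of the test set, which is why any real model has to beat that number to be useful.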
The machine learning algorithms used in the method are:
Results
Check this article to learn about Classification Model Performance Metrics.
The Baseline model scored 0.8460, which was expected since the most frequent class was the existing customer. After the ensemble method, the Random Forest Classifier was the highest-scoring model with 0.968 accuracy, and after tuning it with GridSearchCV we achieved 0.969 accuracy. The plot below shows the different metrics used to evaluate the baseline model, the Random Forest, and the Random Forest after the GridSearchCV (After GridSearchCV).
The Random Forest model after the GridSearchCV had the highest accuracy and highest precision, but the original Random Forest model got the highest recall.
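For reference, accuracy, precision, and recall all come from the four cells of a binary confusion matrix. The helper below is a hypothetical function written for this sketch, and the counts passed to it are invented, not results from the models above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total,
        # Of everything predicted positive, how much really was positive.
        "precision": tp / (tp + fp),
        # Of everything really positive, how much was found.
        "recall": tp / (tp + fn),
    }


# Invented counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Precision drops when false positives rise, and recall drops when false negatives rise, so one model can lead on one metric while trailing on the other.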
This means that the Random Forest model after GridSearchCV identified the attrited customers better than the original model, but the original model identified the existing customers better, as you can also see in the confusion matrix below.
With the existing customer as the positive class, the Random Forest model after the GridSearchCV correctly identified more attrited customers, as shown in the TN (True Negative) cell, but correctly identified fewer existing customers than the original Random Forest model, as shown in the TP (True Positive) cell.
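To read the cells of a confusion matrix in code, scikit-learn's `confusion_matrix` can be unpacked into TN, FP, FN, and TP. The labels below are invented, and the sketch assumes the existing customer is encoded as the positive class (1) and the attrited customer as the negative class (0), as the ROC discussion in this article states.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = existing customer (positive), 0 = attrited customer.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]]; ravel() flattens it
# row by row into that same order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"TN (attrited predicted as attrited): {tn}")
print(f"FP (attrited predicted as existing): {fp}")
print(f"FN (existing predicted as attrited): {fn}")
print(f"TP (existing predicted as existing): {tp}")
```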
As seen in the ROC curve, between 0 and 0.1 on the false positive rate axis the Random Forest after GridSearchCV has a higher false positive rate than the original Random Forest model. This means that the model after GridSearchCV wrongly assigned more customers to the positive class (the existing customer), which can also be seen in the FP and TP cells of the confusion matrix above.
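The points of a ROC curve can be computed from predicted probabilities with `roc_curve`. The sketch below runs on synthetic data with an assumed Random Forest configuration, so its numbers say nothing about the models discussed here. It only shows how to inspect the region where the false positive rate is at most 0.1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data; class 1 plays the role of the positive class.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.16, 0.84], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# ROC curves need scores, not hard labels: use the probability of class 1.
positive_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, positive_scores)
auc = roc_auc_score(y_test, positive_scores)

# Keep only the part of the curve with a false positive rate of at most 0.1.
low_fpr = fpr <= 0.1
for f, t in zip(fpr[low_fpr], tpr[low_fpr]):
    print(f"FPR={f:.3f}  TPR={t:.3f}")
print(f"AUC: {auc:.3f}")
```

Comparing two models in this low false positive region means comparing their true positive rates at the same false positive rate, which is what the curves show visually.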
To Conclude:
The best attrition prediction model was the Random Forest after the GridSearchCV, with 96.94% accuracy, 99.01% recall, and 97.42% precision.
You can find my work here on GitHub.
I encourage you to open the code in Google Colab to view and interact with the plots.