Predicting Attrited Bank Customers Using Machine Learning

Nada Alzahrani
5 min read · Feb 24, 2021

Customer attrition, also known as customer churn or customer turnover, is the loss of customers by a business. It is one of the biggest concerns in banking, since customers are the most valuable part of the business.

We use the bank's customer data and machine learning to predict which customers are likely to attrite, so that the bank can act to prevent attrition before it happens.

Dataset

This data was obtained from Kaggle. It is a sample of credit card customers' accounts starting from March 2013, with attrition defined over the following 6 months (April 2016 to October 2013).

The sample covers 10,127 customers and contains features describing their demographic profile (gender, age, education level, etc.) and their transaction history.

Dataset head

Exploratory Data Analysis (EDA)

Our goal here is to understand the bank customers by drawing insights through visualization and identifying the possible reasons for attrition.

Class Distribution

The data is split into two classes that identify the customer's attrition status: Existing Customer and Attrited Customer. Existing customers make up 83.9% of our sample and attrited customers 16.1%. The classes are clearly imbalanced, which might make it difficult for the model to identify attrited customers.
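A quick way to check this kind of imbalance is to compute each class's share of the labels. The sketch below uses a small made-up label list rather than the real data, and the helper name `class_shares` is hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the sample as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only (not the real dataset).
labels = ["Existing Customer"] * 84 + ["Attrited Customer"] * 16
shares = class_shares(labels)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The majority share is also the accuracy that a model would get by always predicting the majority class, which is worth keeping in mind when reading accuracy scores on imbalanced data.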

Type Of Customer (Class Distribution)

Age Range

The highest number of attrited customers is in the 40–54 age range, and the lowest number is in the 65–69 age range.

The highest number of existing customers is also in the 40–54 age range, and the lowest number is in the 75–79 age range.

Customer’s Age Range By Their Account Status

Gender

The number of Female customers is higher in both Attrited and Existing Customers.

Customer’s Gender

Credit Card Type

The highest number of attrited and existing customers are blue cardholders, and the lowest number of attrited and existing customers are platinum cardholders.

Customer’s Credit Card Type

Annual Income

The highest number of attrited and existing customers have less than 40K dollars as an annual income, and the lowest number of attrited and existing customers have more than 120K dollars as an annual income.

Period Of Relationship With The Bank

The longest period that existing and attrited customers spent with the bank is between 35–39 months, and the shortest period they spent is between 10–14 months.

Customer’s Relationship Period With The Bank

Preprocessing The Data

To prepare the data for modeling, we have to encode its categorical features as numbers. The categorical features were encoded ordinally, since most of them are ordinal, such as education level, credit card type, and income category.
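As an illustration of ordinal encoding, here is a minimal sketch using scikit-learn's `OrdinalEncoder` on a toy card-type column. The tier order (including the Silver and Gold tiers) and the choice of encoder are assumptions made for this example, not necessarily what the original notebook does.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering of card tiers, lowest to highest (illustrative only).
card_order = ["Blue", "Silver", "Gold", "Platinum"]

# Toy column of card types, shaped (n_samples, 1) as scikit-learn expects.
cards = np.array(["Blue", "Platinum", "Gold", "Blue"], dtype=object).reshape(-1, 1)

# Passing the categories explicitly makes the integer codes follow our order
# instead of the default alphabetical order.
encoder = OrdinalEncoder(categories=[card_order])
encoded = encoder.fit_transform(cards)

print(encoded.ravel())
```

Spelling out the order matters: left to its default, the encoder sorts categories alphabetically, which would not reflect the ranking of the tiers.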

Then the data was split into 80% for training the model and 20% for testing how the trained model performs.
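An 80/20 split like this one can be sketched with scikit-learn's `train_test_split`. The data below is synthetic, and the `random_state` value is an arbitrary choice that only makes the split reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded customer table (not the real data).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With imbalanced classes, passing `stratify=y` keeps the class ratio the same in both parts. Whether the original split was stratified is not stated here.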

Modeling

First, we created a baseline model that predicts the most frequent class for every sample in the test set. Then we used a machine learning ensemble method that runs multiple machine learning algorithms at once, rather than trying one algorithm at a time, to obtain better predictive performance. Finally, we picked the highest-scoring algorithm and tuned its hyperparameters with GridSearchCV to achieve better results.

The machine learning algorithms used in the method are:

  1. Logistic Regression
  2. K-Nearest Neighbors
  3. Decision Tree Classifier
  4. Random Forest Classifier
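The workflow described above (baseline, side-by-side comparison of the four algorithms, then tuning) might look roughly like the following sketch. It runs on synthetic imbalanced data, uses `DummyClassifier` as an assumed stand-in for the baseline, and the parameter grid is hypothetical. It tunes the Random Forest because that is the model this article reports as the best, though on synthetic data another model may score higher.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 16% / 84%), not the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.16, 0.84], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit every model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")

# Hypothetical grid; the values searched in the original work are not given.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
tuned_accuracy = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print(f"Tuned Random Forest test accuracy: {tuned_accuracy:.3f}")
```

Because the baseline always predicts the majority class, its accuracy equals the majority class's share of the test set, which gives a floor that any useful model has to beat.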

Results

Check this article to learn about Classification Model Performance Metrics.

The baseline model scored 0.8460 accuracy, which was expected since the most frequent class is the existing customer. Among the algorithms in the ensemble method, the Random Forest Classifier scored highest with 0.968 accuracy, and after tuning it with GridSearchCV we achieved 0.969 accuracy. The plot below shows the different metrics used to evaluate the baseline model, the Random Forest, and the Random Forest after GridSearchCV.

Comparing Baseline Model with Random Forest Before & After GridSearchCV

The Random Forest model after the GridSearchCV had the highest accuracy and highest precision, but the original Random Forest model got the highest recall.

This means that the Random Forest model after GridSearchCV identified the attrited customers better than the original model, while the original model identified the existing customers better, as the confusion matrix below also shows.

Confusion Matrix for the Baseline and Random Forest Before & After GridSearchCV

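Since the positive class here is the existing customer, attrited customers that are identified correctly appear as TN (True Negative) and existing customers that are identified correctly appear as TP (True Positive). The Random Forest model after GridSearchCV identified more attrited customers correctly, as shown in the TN part, but identified fewer existing customers correctly than the original Random Forest model, as shown in the TP part.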
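To make the TP, TN, FP, and FN cells concrete, here is a small sketch that builds a confusion matrix with scikit-learn and derives accuracy, precision, and recall from it. The labels are made up, and encoding the existing customer as 1 (the positive class) is an assumption for the example.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels: 1 = existing customer (positive class), 0 = attrited customer.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all samples
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Under this encoding, recall measures how many existing customers were recognized as existing, while precision drops whenever an attrited customer is mistaken for an existing one.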

ROC Curve For The Random Forest Model Before & After GridSearchCV

As the ROC curve shows, in the region between 0 and 0.1 on the false positive rate axis, the Random Forest after GridSearchCV has a higher false positive rate than the original Random Forest model. This means that the model after GridSearchCV falsely predicted the positive class (the existing customer) more often in that region, which can also be seen in the FP and TP parts of the confusion matrix above.
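For readers who want to reproduce this kind of curve, the sketch below computes ROC points and the area under the curve with scikit-learn. It uses synthetic data and a default Random Forest, so the numbers it prints do not correspond to the results in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data, not the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.16, 0.84], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Predicted probability of the positive class (label 1).
proba = model.predict_proba(X_test)[:, 1]

# Each threshold gives one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

print(f"Number of ROC points: {len(fpr)}")
print(f"AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` for two models on the same axes gives the kind of before-and-after comparison discussed above.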

To Conclude:

The best attrition-predicting model was the Random Forest after GridSearchCV, with 96.94% accuracy, 99.01% recall, and 97.42% precision.

You can find my work here on GitHub.

I encourage you to open the code in Google Colab to view and interact with the plots.
