Predicting Customer Churn: A Machine Learning Approach
As part of my roadmap to becoming a data scientist, I knew I needed to tackle tough projects that would help me build a strong foundation in ML—something that I could showcase to demonstrate my skills. When I came across a video by Infinite Codes titled 22 Machine Learning Projects That Will Make You A God At Data Science, I found both inspiration and a structured plan to build foundational projects. One of the suggested projects was predicting customer churn. As Infinite Codes states:
“This is a business favorite, and for good reason. Instead of just working with random datasets, you’ll build something that could actually save the company millions.”
What is Customer Churn?
Customer churn is the percentage of customers who stop doing business with a company within a specific period. Churn is a critical issue for businesses, as it directly impacts revenue and growth. The challenge is identifying which customers are at risk of leaving, allowing for timely intervention.
For this project, I used the Telco Customer Churn dataset from Kaggle. This classic dataset required very little cleaning: I only had to map the Senior Citizen feature to “Yes” and “No,” convert TotalCharges from an “object” dtype to “float64,” and impute the missing values based on the TotalCharges median. The main issue in the dataset was class imbalance—out of 7,043 entries, 73.46% were Not Churned, while only 26.54% were Churned.
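The cleaning steps above can be sketched with pandas (column names follow the Kaggle dataset; treat this as a sketch of the approach rather than my exact notebook code):

```python
import pandas as pd

def clean_telco(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Map the 0/1 SeniorCitizen flag to the same Yes/No labels as the other columns.
    df["SeniorCitizen"] = df["SeniorCitizen"].map({0: "No", 1: "Yes"})
    # TotalCharges loads as an "object" dtype because blank strings appear for
    # brand-new customers; coerce to float64, turning blanks into NaN.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # Impute the missing values with the column median.
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
    return df
```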
Exploratory Data Analysis (EDA)
View my project to see all the visualizations I created
For my EDA, I started by visualizing the class imbalance with a simple count plot, confirming that the dataset was heavily skewed toward non-churned customers. Next, I examined the numeric features using boxplots, differentiating between churned and non-churned customers. These revealed significant median differences in several features. Finally, I created count plots for categorical variables to identify any strong correlations with churn.
Notably, customers with fiber optic internet and those who paid via electronic check showed higher churn rates. This suggests that factors such as service type and payment method might contribute to customer retention issues.
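A quick way to quantify what those count plots show is to compute the churn rate within each category of a feature (a minimal sketch, assuming the Churn column uses Yes/No labels as in the Kaggle dataset):

```python
import pandas as pd

def churn_rate_by(df: pd.DataFrame, col: str) -> pd.Series:
    # Fraction of churned customers within each category, sorted descending
    # so the highest-risk segments appear first.
    return (
        df.groupby(col)["Churn"]
        .apply(lambda s: (s == "Yes").mean())
        .sort_values(ascending=False)
    )
```

Running this on InternetService and PaymentMethod surfaces the same pattern the plots show: fiber optic and electronic check segments churn at noticeably higher rates.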
Preprocessing & Feature Engineering
For preprocessing and feature engineering, I took the following steps:
- Consolidated redundant categories: Downstream features such as MultipleLines and the add-on service columns carry “No phone service” and “No internet service” values that are already implied by PhoneService and InternetService. I converted those values to simply “No” to improve model clarity.
- Dropped non-informative features: customerID and gender did not impact churn, so they were removed.
- Encoded categorical variables: Binary columns were label-encoded, and a correlation heatmap was created to filter out weak correlations (r < 0.2).
- Feature engineering: Using the correlation matrix, I engineered the following features:
- tenureTerm: Groups tenure into “New,” “Mid,” and “Long-term” customers.
- HighSpender: True if a customer spends more monthly than the median customer.
- HasSupportServices: True if a customer has 2 or more support services.
- LongTermContract: True if the customer does not have a month-to-month contract.
These new features showed meaningful correlations with churn, so they were retained for modeling.
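The engineered features can be sketched as follows (the tenure cut points of 12 and 48 months and the set of support-service columns are my illustrative assumptions, not values stated above):

```python
import pandas as pd

# Assumed set of add-on support service columns from the Telco dataset.
SUPPORT_COLS = ["OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport"]

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Bin tenure (in months) into customer-lifetime groups; cut points are illustrative.
    df["tenureTerm"] = pd.cut(
        df["tenure"], bins=[-1, 12, 48, float("inf")],
        labels=["New", "Mid", "Long-term"],
    )
    # True if the customer spends more per month than the median customer.
    df["HighSpender"] = df["MonthlyCharges"] > df["MonthlyCharges"].median()
    # True if the customer has 2 or more support services.
    df["HasSupportServices"] = (df[SUPPORT_COLS] == "Yes").sum(axis=1) >= 2
    # True if the customer does not have a month-to-month contract.
    df["LongTermContract"] = df["Contract"] != "Month-to-month"
    return df
```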
Model Selection & Training
After splitting the data into training and testing sets, I:
- Imputed and scaled numeric data.
- Imputed and one-hot encoded categorical data.
- Created a preprocessing pipeline for feature transformation and model training.
- Addressed class imbalance using Synthetic Minority Oversampling Technique (SMOTE) to balance churned and non-churned samples.
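The preprocessing pipeline can be sketched with scikit-learn (column lists are an illustrative subset; SMOTE itself comes from the imbalanced-learn package and would slot in between the preprocessor and the classifier via imblearn's own Pipeline, which I omit here to keep the sketch dependency-light):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["tenure", "MonthlyCharges", "TotalCharges"]
CATEGORICAL = ["Contract", "PaymentMethod", "InternetService"]  # illustrative subset

def build_pipeline() -> Pipeline:
    # Impute and scale numeric data.
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # Impute and one-hot encode categorical data.
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    pre = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    # With imbalanced-learn, SMOTE would be inserted between "pre" and "clf"
    # (using imblearn.pipeline.Pipeline) so oversampling only touches training folds.
    return Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
```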
I then trained the following models:
- Logistic Regression
- LinearSVC
- Random Forest
- XGBoost (XGBClassifier)
Evaluation Metrics
- Accuracy: Overall correctness: (True Pos + True Neg) / Total Predictions.
- Precision: Accuracy of positive predictions: True Pos / (True Pos + False Pos).
- Recall: Ability to identify actual positive instances: True Pos / (True Pos + False Neg).
- AUC-ROC: Measures how well the model distinguishes between classes. A score of 1.0 indicates a perfect classifier, while 0.5 means the model performs no better than random guessing. The higher the AUC-ROC, the better the model ranks positive instances over negative ones.
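As a tiny worked example of the first three formulas (the confusion-matrix counts here are made up for illustration, not my model's actual results):

```python
# Hypothetical confusion-matrix counts.
tp, fp, tn, fn = 80, 20, 70, 30

accuracy = (tp + tn) / (tp + fp + tn + fn)   # 150 / 200 = 0.75
precision = tp / (tp + fp)                   # 80 / 100 = 0.80
recall = tp / (tp + fn)                      # 80 / 110 ≈ 0.727
```

Note the trade-off baked into the formulas: the 20 false positives hurt precision, while the 30 missed churners hurt recall.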
Results & Model Comparison
After evaluating all models, Logistic Regression emerged as the best performer with the following metrics:
| Metric | Score |
|---|---|
| Accuracy | 79.77% |
| ROC AUC | 82.75% |
| Precision | 79.00% |
| Recall | 80.00% |
This was an ideal outcome since Logistic Regression is the most explainable model among those tested. Model explainability is crucial in real-world business applications—stakeholders must understand why a customer is likely to churn, not just receive a prediction.
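One reason Logistic Regression is so explainable is that each coefficient maps directly to a feature's effect on the log-odds of churn (a toy sketch with made-up features, not my fitted model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: feature 0 tracks churn, feature 1 tracks retention.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 1, 0, 1])
model = LogisticRegression().fit(X, y)

# Each coefficient is the change in log-odds of churn per unit of the feature;
# exponentiating gives odds ratios that stakeholders can reason about directly.
odds_ratios = np.exp(model.coef_[0])
```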
Challenges & Learning
One major takeaway was the impact of SMOTE on my Logistic Regression model. Initially, I applied SMOTE to balance the dataset, but after removing it, my model performance improved significantly. This was likely because the class imbalance wasn’t extreme (73.46% Not Churned, 26.54% Churned), and SMOTE introduced noise that distorted feature distributions.
Moving forward, I’ll explore alternative class imbalance techniques, such as:
- Using weighted loss functions in the model.
- Trying undersampling techniques instead of oversampling.
- Experimenting with hybrid resampling methods to balance minority classes while preserving data integrity.
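The weighted-loss option, for instance, is a one-line change in scikit-learn (a sketch of the idea, not something I have benchmarked on this dataset):

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights the loss by inverse class frequency, so with
# a 26.54% minority class each churned sample counts roughly 2.8x as much as a
# non-churned one -- no synthetic samples and no distorted feature distributions.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
```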
Conclusion
At the end of my project, I provided Key Takeaways, Business Recommendations, and Future Steps. While I won’t repeat those here, I will say that this project was a valuable learning experience.
I can tell that my visualization skills are becoming more refined, and the modeling process is becoming more intuitive. This was also a great exercise in feature engineering and understanding the trade-offs of different preprocessing techniques. As I continue building ML projects, I aim to refine my techniques further and explore advanced modeling approaches—such as ensemble methods and interpretability tools like SHAP.
This is only the beginning of my journey in predictive modeling, and I look forward to tackling even more complex challenges ahead!