Churn Model for Predicting Employees Leaving the Company

📚 Table of Contents 📚

Section	Title
1	Introduction
2	Problem Statement
3	Project Goals
4	Data Overview
5	Approach to Problem
6	Libraries Used
7	Data Preprocessing
8	Exploratory Data Analysis
9	Model Development
10	Model Deployment
11	Business Impact: Influencing Employee Turnover and Guiding Targeted Retention Programs

1. Introduction

Employee turnover can be a costly challenge for any organization, leading to disruptions in operations, lower productivity, and the need for frequent recruitment. To mitigate this, a churn model can help predict which employees are most at risk of leaving the company, allowing HR departments to take proactive steps to retain talent. This project explores the development of a churn prediction model using machine learning algorithms, with a focus on identifying key factors influencing employee turnover and guiding the creation of targeted retention programs.

2. Problem Statement

The goal of this project is to build a scalable churn model that predicts which employees are at risk of leaving an organization. The insights from this model help HR departments proactively address employee concerns, improve retention, and target employees most likely to churn. The model was developed using Random Forest algorithms and deployed for real-time predictions.

3. Project Goals

Identify At-Risk Employees: Use machine learning models to predict which employees may be considering leaving the company.
Understand Turnover Drivers: Analyze data to uncover factors contributing to employee churn, such as job satisfaction, number of projects, working hours, and department.
Enhance Retention Strategies: Use insights from the churn model to inform HR policies and develop programs that improve employee satisfaction and retention.

4. Data Overview

The dataset used for this project consists of several key employee metrics:

Satisfaction Level: Employee’s self-reported job satisfaction.
Last Evaluation: Performance score from the most recent evaluation.
Number of Projects: Total number of projects the employee has worked on.
Average Monthly Hours: Average number of hours worked per month.
Time Spent at Company: Number of years the employee has worked at the company.
Work Accident: Whether the employee has experienced a work-related accident.
Promotion in Last 5 Years: Whether the employee was promoted in the last five years.
Department: The department the employee works in (e.g., IT, Sales, Technical).
Salary: The salary level of the employee (Low, Medium, High).

The target variable is Quit_the_Company, indicating whether the employee left the company.

5. Approach to Problem

We aim to build a predictive model to identify employees at risk of leaving the company.
The dataset was fetched and merged using Google BigQuery SQL queries.
After preprocessing the data, machine learning models such as Random Forest and Gradient Boosting were used for classification.
The focus of the model is on Recall, as it is essential to identify as many employees at risk of churn as possible.
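The final model was trained and selected with PyCaret; its predictions were written back to BigQuery and surfaced in a Google Looker dashboard.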
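
To make the recall focus concrete, the sketch below computes precision and recall from confusion-matrix counts in plain Python. The labels are made-up illustrative values (1 = quit, 0 = stayed), not results from this project, and the helper name confusion_counts is hypothetical.

```python
# Illustrative labels only (1 = employee quit, 0 = employee stayed).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Recall: share of actual leavers the model catches (hurt by false negatives).
recall = tp / (tp + fn)
# Precision: share of predicted leavers who actually leave (hurt by false positives).
precision = tp / (tp + fp)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

A missed leaver (false negative) is an employee HR never gets the chance to retain, which is why recall is weighted more heavily here than precision.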

6. Libraries Used

The project used the following libraries and tools:

Google BigQuery: Used to connect and query datasets stored in Google Cloud's BigQuery.
Pandas: Used for data manipulation and analysis, including reading and merging datasets.
PyCaret: A low-code machine learning library used to train and evaluate multiple machine learning models, including Random Forest, LightGBM, XGBoost, and others.
Scikit-Learn: For machine learning algorithms and preprocessing, including Random Forest Classifier, Decision Trees, and evaluation metrics like accuracy, precision, recall, and F1 scores.
SQL (Google BigQuery): SQL queries were used to combine the tbl_hr_data and tbl_new_employees tables.
Google Colab: Used as the environment for running the notebook and model training.
Google Looker: For creating an interactive data visualization dashboard, allowing users to explore employee churn trends and filter results by department.

7. Data Preprocessing

The employee data was stored in two separate tables: tbl_hr_data for the original dataset and tbl_new_employees for the new employees in the pilot program. These two tables were combined using SQL in Google BigQuery with the following query:

SELECT *, "Original" as Type FROM `data-analysis-end-to-end.employeedata.tbl_hr_data`
UNION ALL
SELECT *, "Pilot" as Type FROM `data-analysis-end-to-end.employeedata.tbl_new_employees`
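
For readers working from local copies of the two tables instead of BigQuery, the same union can be sketched in pandas. The rows below are invented stand-ins, and the variable names hr_data, new_employees, and combined are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-ins for tbl_hr_data and tbl_new_employees; values are invented.
hr_data = pd.DataFrame(
    {"employee_id": [1, 2], "satisfaction_level": [0.38, 0.80]}
)
new_employees = pd.DataFrame(
    {"employee_id": [101], "satisfaction_level": [0.55]}
)

# Equivalent of the UNION ALL query: tag each source, then stack the rows.
combined = pd.concat(
    [hr_data.assign(Type="Original"), new_employees.assign(Type="Pilot")],
    ignore_index=True,
)

print(combined)
```

Like UNION ALL, pd.concat keeps duplicate rows, and the Type column makes it possible to separate the pilot employees again later.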

8. Exploratory Data Analysis

Key insights gained from EDA include:

Satisfaction Level: A major predictor of churn. Employees with lower satisfaction scores are more likely to leave.
Tenure: Employees who have been with the company for more than two years or fewer than one year show higher churn rates.
Work Accident: Surprisingly, having a work accident had little effect on churn probability.
Department Analysis: Certain departments, like Support and Technical, have higher churn rates compared to others.
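
Insights of this kind typically come from simple group-wise summaries. The sketch below shows one way to compute them with pandas; the rows are invented, and it assumes Quit_the_Company is coded as 0/1, which the source does not state.

```python
import pandas as pd

# Invented sample rows; column names follow those used elsewhere in this README.
df = pd.DataFrame(
    {
        "Departments": ["support", "support", "technical", "sales", "sales", "sales"],
        "satisfaction_level": [0.20, 0.75, 0.30, 0.80, 0.65, 0.90],
        "Quit_the_Company": [1, 0, 1, 0, 0, 0],  # assumed coding: 1 = left
    }
)

# Churn rate per department: the mean of a 0/1 target is the share who left.
churn_by_dept = (
    df.groupby("Departments")["Quit_the_Company"].mean().sort_values(ascending=False)
)

# Average satisfaction for employees who stayed (0) versus left (1).
satisfaction_by_outcome = df.groupby("Quit_the_Company")["satisfaction_level"].mean()

print(churn_by_dept)
print(satisfaction_by_outcome)
```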

9. Model Development

The Random Forest model was chosen for its ability to handle large datasets and identify feature importance. Below are the steps involved:

Feature Engineering: Created new features from available data, such as employee engagement based on the number of projects and working hours.
Model Selection: We compared several models, including Random Forest, LightGBM, and XGBoost, using accuracy, recall, AUC, and F1 score as evaluation metrics.
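
The README does not spell out how the engagement feature was built, so the following is only a sketch of what such a feature could look like. The column name average_monthly_hours, the derived columns hours_per_project and high_workload, and the thresholds are all hypothetical.

```python
import pandas as pd

# Invented rows; average_monthly_hours is an assumed column name.
df = pd.DataFrame(
    {
        "number_project": [2, 5, 7],
        "average_monthly_hours": [150, 200, 280],
    }
)

# Hypothetical engagement proxy: monthly hours spent per project.
df["hours_per_project"] = df["average_monthly_hours"] / df["number_project"]

# Hypothetical workload flag; the cut-offs are arbitrary illustration values.
df["high_workload"] = (df["number_project"] >= 6) & (df["average_monthly_hours"] >= 250)

print(df)
```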

The candidate models were trained and ranked with PyCaret's compare_models(). Random Forest performed best, with the highest accuracy and strong results across the other metrics:

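from pycaret.classification import setup, compare_models

# set up the experiment: Quit_the_Company is the target and employee_id is excluded from training
setup(df, target='Quit_the_Company', session_id=123,
      ignore_features=['employee_id'],
      categorical_features=['salary', 'Departments'])

# precision: of the employees predicted to churn, the share who actually do (penalizes false positives)
# recall: of the employees who actually churn, the share the model catches (penalizes false negatives)
# train the candidate models and rank them by their cross-validated metrics
compare_models()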

The best-performing model was the Random Forest Classifier, which achieved an accuracy of 98.86%, an AUC of 0.9912, and a recall of 95.84%.

Next, the predictions for the pilot employees are written back to BigQuery, and a feature importance plot shows which variables most influenced the model's prediction of whether an employee will quit. The key factors are satisfaction_level, time_spend_company, and number_project.

# write back to BigQuery
new_predictions.to_gbq('employeedata.pilot_predictions',
                       project_id,
                       chunksize = None,
                       if_exists = 'replace')

# plot feature importance to see which columns contributed most to the model's churn predictions
plot_model(rf_model, plot='feature')

X-axis (Variable Importance): Shows the relative importance of each feature in influencing the model’s prediction.
A higher value means that the feature has more influence on whether an employee is predicted to stay or quit.
Y-axis (Features): The specific features or variables from the dataset, such as 'satisfaction_level', 'time_spend_company', and 'number_project'.

Key observations from the feature importance plot are listed below. Importance scores show how strongly a feature influences the predictions, not the direction of its effect, so the directional statements draw on the exploratory analysis in Section 8.

'Satisfaction Level' is the most important feature for predicting employee churn. Employees who are less satisfied with their jobs are more likely to quit.
'Time Spent at the Company' is the next most important feature. In the exploratory analysis, churn was higher among employees with more than two years or less than one year of tenure.
'Number of Projects' is another influential factor. Project count appears to be tied to engagement, although very high or very low workloads may also raise churn risk (see Section 11).
'Average Monthly Hours' and 'Last Evaluation' also play a moderate role in predicting churn.
Features like 'Work Accident' and salary level (low, medium, high) have less influence on the model’s predictions, indicating that these factors matter less for employee churn than satisfaction and tenure.
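
A comparable ranking can be read directly from scikit-learn's feature_importances_ attribute. The sketch below trains a Random Forest on synthetic data in which only satisfaction drives the outcome; the data, the work_accident column name, and the 0.4 cut-off are assumptions for illustration and do not reproduce the project's model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: reproducible random features, not the project's dataset.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(
    {
        "satisfaction_level": rng.uniform(0, 1, n),
        "number_project": rng.integers(2, 8, n),
        "work_accident": rng.integers(0, 2, n),  # assumed column name
    }
)

# Toy target: only low satisfaction leads to quitting in this example.
y = (X["satisfaction_level"] < 0.4).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(
    ascending=False
)

print(importances)
```

Because the toy target depends only on satisfaction_level, that feature ranks first. Note that the scores say nothing about whether a higher value raises or lowers churn risk.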

10. Model Deployment

The final churn predictions were published in an interactive Google Looker dashboard. The dashboard allows users to:

Visualize key metrics such as satisfaction levels, time spent at the company, and churn predictions.
Filter the data by specific departments (e.g., Sales, Marketing, IT) or view results for all departments combined.
Gain insights into the factors driving employee churn through interactive charts, with a focus on department-level analysis.

This interactive dashboard provides a user-friendly platform for HR teams to explore employee churn trends and identify actionable insights.

Dashboard Features:

Interactive Department Filtering: Users can filter by individual departments or view data across all departments.
Churn Drivers: Visualizations showing the most important factors contributing to employee churn, such as job satisfaction, time spent at the company, and work accidents.
Employee Retention Metrics: HR teams can quickly identify the number of employees predicted to stay or leave based on the Random Forest model's output.
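
The counts behind such a dashboard can be approximated in pandas. This is a sketch only: the prediction_label column name and its 0/1 coding are assumptions about the prediction output, the rows are invented, and filter_department is a hypothetical helper that mimics the department filter.

```python
import pandas as pd

# Invented prediction rows; prediction_label is an assumed column name.
predictions = pd.DataFrame(
    {
        "Departments": ["sales", "sales", "IT", "IT", "support"],
        "prediction_label": [1, 0, 0, 0, 1],  # assumed coding: 1 = predicted to leave
    }
)

# Count predicted stayers and leavers per department.
summary = pd.crosstab(predictions["Departments"], predictions["prediction_label"])
summary = summary.rename(columns={0: "predicted_stay", 1: "predicted_leave"})


def filter_department(table, department=None):
    """Return one department's row, or the whole table when no filter is set."""
    return table if department is None else table.loc[[department]]


print(filter_department(summary))           # all departments
print(filter_department(summary, "sales"))  # a single department
```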

You may view the dashboard here, or preview a screenshot below:

11. Business Impact: Influencing Employee Turnover and Guiding Targeted Retention Programs

Understanding Employee Turnover

Employee turnover can have a significant financial and operational impact on an organization. High turnover disrupts workflows, decreases productivity, and increases costs associated with recruitment and training. The Churn Prediction Model helps identify employees who are at risk of leaving, allowing HR to address these risks before employees actually resign.

Key Insights from the Model

The model provides insights into which factors are most strongly associated with employee churn. For example:

Satisfaction Level: The model highlights that job satisfaction is the most critical factor in determining whether employees will stay or leave. HR can prioritize strategies to improve employee satisfaction through regular feedback, job enrichment, and recognition programs.
Time Spent at Company: Employees who have spent more time at the company are more likely to churn, which indicates a need for better career progression and development opportunities as employees mature in their roles.
Number of Projects and Average Monthly Hours: Employees who are overworked (or underutilized) may be more prone to leaving. This suggests the importance of workload management and fair distribution of projects to maintain motivation and prevent burnout.

Data-Driven Retention Programs

Using the insights provided by the churn model, HR departments can develop targeted retention programs that focus on the highest-impact areas:

Customized Employee Engagement Plans: Tailoring engagement strategies based on an employee's tenure, department, or project workload to boost satisfaction and reduce turnover.
Career Development Programs: Offering training, mentorship, and promotion opportunities for employees who have been with the company for several years to combat stagnation.
Work-Life Balance Initiatives: For employees showing high project counts and long working hours, implementing flexible work schedules or project rotation can mitigate burnout risks.
Regular Job Satisfaction Surveys: Gathering frequent feedback on job satisfaction and adjusting policies to ensure employees' needs are met, which directly influences retention.

By leveraging these insights, HR can proactively focus resources on the areas most likely to drive turnover, ultimately reducing employee attrition rates and fostering a more engaged workforce.
