Customer Churn Prediction with SQL

Predicting customer churn – the rate at which customers stop doing business with a company – is crucial for business sustainability. This challenge requires you to leverage advanced SQL techniques to build a simple churn prediction model directly within a database, using historical customer data. You'll be using window functions, aggregations, and potentially other advanced SQL features to identify patterns and create a churn risk score.

Problem Description

You are provided with a table named customer_data containing historical data about customers of a telecommunications company. The table includes information such as customer ID, contract length, monthly charges, total charges, usage patterns, and whether the customer churned. Your task is to write SQL queries to calculate a churn risk score for each customer based on their usage and billing patterns.

Specifically, you need to:

Calculate Average Monthly Usage: For each customer, determine their average monthly usage based on their total charges and contract length. Handle cases where contract length is zero to avoid division by zero errors.
Calculate Usage Deviation: Calculate how much each customer's average monthly usage deviates from the overall average monthly usage across all customers. This will highlight customers with unusually high or low usage.
Calculate Billing Anomaly: Determine if a customer's monthly charges are significantly higher than the average monthly charges for customers with similar contract lengths.
Combine Metrics into a Churn Risk Score: Create a churn risk score by combining the usage deviation and billing anomaly scores. A simple approach is to sum the normalized scores (scaling them to a range of 0-1).
Predict Churn: Based on the calculated churn risk score, predict whether a customer is likely to churn. Define a threshold (e.g., a score above 0.7 indicates high churn risk).

Expected Behavior:

The final query should return the customer ID, calculated churn risk score, and a predicted churn flag (1 for likely to churn, 0 for not likely to churn) for each customer in the customer_data table.

Edge Cases to Consider:

Zero Contract Length: Handle cases where contract_length is zero to prevent division by zero errors when calculating average monthly usage.
Null Values: Consider how to handle null values in total_charges and monthly_charges. You might choose to exclude them or impute them with a reasonable value (e.g., 0).
Normalization: Ensure that the usage deviation and billing anomaly scores are normalized to a consistent scale (e.g., 0-1) before combining them into the churn risk score.
Threshold Selection: The churn risk score threshold (0.7 in the example) is arbitrary and may need to be adjusted based on the specific dataset and business requirements.

Examples

Example 1:

Input:
customer_data:
| customerID | contract_length | monthly_charges | total_charges | avg_monthly_usage | churn |
|---|---|---|---|---|---|
| 1 | 12 | 50.00 | 600.00 | 50.00 | 0 |
| 2 | 24 | 75.00 | 1800.00 | 75.00 | 1 |
| 3 | 6 | 60.00 | 360.00 | 60.00 | 0 |
| 4 | 12 | 80.00 | 960.00 | 80.00 | 1 |

Output:
| customerID | churn_risk_score | predicted_churn |
|---|---|---|
| 1 | 0.2 | 0 |
| 2 | 0.8 | 1 |
| 3 | 0.3 | 0 |
| 4 | 0.7 | 1 |

Explanation:  (Simplified - actual calculations would be more complex) Customer 2 has a high billing anomaly and usage deviation, resulting in a high churn risk score and a predicted churn of 1. Customer 1 has low scores and a predicted churn of 0.

Example 2:

Input:
customer_data:
| customerID | contract_length | monthly_charges | total_charges | avg_monthly_usage | churn |
|---|---|---|---|---|---|
| 1 | 0 | 50.00 | 0.00 | NULL | 1 |
| 2 | 24 | 75.00 | 1800.00 | 75.00 | 0 |

Output:
| customerID | churn_risk_score | predicted_churn |
|---|---|---|
| 1 | 0.5 | 1 |
| 2 | 0.3 | 0 |

Explanation: Customer 1 has a zero contract length, resulting in a NULL average monthly usage.  The churn risk score is calculated based on other factors, and the predicted churn is 1. Customer 2 has a normal profile and a predicted churn of 0.

Constraints

contract_length is an integer representing the contract length in months.
monthly_charges and total_charges are numeric values (e.g., DECIMAL, FLOAT).
The dataset may contain up to 10,000 rows.
The query should execute within a reasonable time frame (e.g., less than 5 seconds).
Assume the database system supports window functions and common aggregate functions.

Notes

This is a simplified churn prediction model. Real-world models often involve more sophisticated techniques and features.
Consider using Common Table Expressions (CTEs) to break down the query into smaller, more manageable parts.
Normalization is key to ensuring that the different metrics contribute equally to the churn risk score. You can use techniques like min-max scaling or z-score normalization.
The choice of the churn risk score threshold will depend on the desired balance between precision and recall.
Focus on using SQL features effectively to perform the calculations within the database itself, minimizing the need for external processing.
The avg_monthly_usage column is not provided in the input table; it needs to be calculated.
The churn column is provided for validation purposes but should not be used directly in the prediction. The goal is to predict churn based on other features.

Customer Churn Prediction with SQL

Problem Description

Specifically, you need to:

Calculate Average Monthly Usage: For each customer, determine their average monthly usage based on their total charges and contract length. Handle cases where contract length is zero to avoid division by zero errors.

Calculate Usage Deviation: Calculate how much each customer's average monthly usage deviates from the overall average monthly usage across all customers. This will highlight customers with unusually high or low usage.

Calculate Billing Anomaly: Determine if a customer's monthly charges are significantly higher than the average monthly charges for customers with similar contract lengths.

Combine Metrics into a Churn Risk Score: Create a churn risk score by combining the usage deviation and billing anomaly scores. A simple approach is to sum the normalized scores (scaling them to a range of 0-1).

Predict Churn: Based on the calculated churn risk score, predict whether a customer is likely to churn. Define a threshold (e.g., a score above 0.7 indicates high churn risk).

Expected Behavior:

Edge Cases to Consider:

Zero Contract Length: Handle cases where contract_length is zero to prevent division by zero errors when calculating average monthly usage.

Null Values: Consider how to handle null values in total_charges and monthly_charges. You might choose to exclude them or impute them with a reasonable value (e.g., 0).

Normalization: Ensure that the usage deviation and billing anomaly scores are normalized to a consistent scale (e.g., 0-1) before combining them into the churn risk score.

Threshold Selection: The churn risk score threshold (0.7 in the example) is arbitrary and may need to be adjusted based on the specific dataset and business requirements.

Examples

Example 1:

Input: customer_data: | customerID | contract_length | monthly_charges | total_charges | avg_monthly_usage | churn | |---|---|---|---|---|---| | 1 | 12 | 50.00 | 600.00 | 50.00 | 0 | | 2 | 24 | 75.00 | 1800.00 | 75.00 | 1 | | 3 | 6 | 60.00 | 360.00 | 60.00 | 0 | | 4 | 12 | 80.00 | 960.00 | 80.00 | 1 | Output: | customerID | churn_risk_score | predicted_churn | |---|---|---| | 1 | 0.2 | 0 | | 2 | 0.8 | 1 | | 3 | 0.3 | 0 | | 4 | 0.7 | 1 | Explanation: (Simplified - actual calculations would be more complex) Customer 2 has a high billing anomaly and usage deviation, resulting in a high churn risk score and a predicted churn of 1. Customer 1 has low scores and a predicted churn of 0.

Example 2:

Input: customer_data: | customerID | contract_length | monthly_charges | total_charges | avg_monthly_usage | churn | |---|---|---|---|---|---| | 1 | 0 | 50.00 | 0.00 | NULL | 1 | | 2 | 24 | 75.00 | 1800.00 | 75.00 | 0 | Output: | customerID | churn_risk_score | predicted_churn | |---|---|---| | 1 | 0.5 | 1 | | 2 | 0.3 | 0 | Explanation: Customer 1 has a zero contract length, resulting in a NULL average monthly usage. The churn risk score is calculated based on other factors, and the predicted churn is 1. Customer 2 has a normal profile and a predicted churn of 0.

Constraints

contract_length is an integer representing the contract length in months.

monthly_charges and total_charges are numeric values (e.g., DECIMAL, FLOAT).

The dataset may contain up to 10,000 rows.

The query should execute within a reasonable time frame (e.g., less than 5 seconds).

Assume the database system supports window functions and common aggregate functions.

Notes

This is a simplified churn prediction model. Real-world models often involve more sophisticated techniques and features.

Consider using Common Table Expressions (CTEs) to break down the query into smaller, more manageable parts.

Normalization is key to ensuring that the different metrics contribute equally to the churn risk score. You can use techniques like min-max scaling or z-score normalization.

The choice of the churn risk score threshold will depend on the desired balance between precision and recall.

Focus on using SQL features effectively to perform the calculations within the database itself, minimizing the need for external processing.

The avg_monthly_usage column is not provided in the input table; it needs to be calculated.

The churn column is provided for validation purposes but should not be used directly in the prediction. The goal is to predict churn based on other features.