Feature Engineering for Customer Churn Prediction
Feature engineering is a crucial step in machine learning, often having a greater impact on model performance than the choice of algorithm itself. This challenge asks you to implement several common feature engineering techniques on a dataset representing customer information and their likelihood of churn. The goal is to transform raw data into features that better represent the underlying patterns and improve the predictive power of a churn prediction model.
Problem Description
You are given a dataset (represented as a Pandas DataFrame) containing information about customers of a telecommunications company. The dataset includes features like contract length, monthly charges, total charges, and demographic information. Your task is to implement a series of feature engineering steps to create new, potentially more informative features from the existing ones. Specifically, you need to:
- Handle Missing Values: Address missing values in the 'total_charges' column by imputing them with the median 'total_charges' of customers who share the same 'contract' value.
- Create Interaction Feature: Generate a new feature called 'price_per_month' by dividing 'total_charges' by 'tenure' (customer tenure in months). Pandas does not raise an error on division by zero; it produces inf or NaN instead, so set the result to 0 wherever 'tenure' is 0.
- Binning/Discretization: Create a new feature called 'tenure_group' by binning the 'tenure' column into three groups: 'New' (0-12 months), 'Mid' (13-48 months), and 'Loyal' (49+ months).
- One-Hot Encoding: Convert the 'contract' column (which contains categorical values like 'Month-to-month', 'One year', 'Two year') into a set of binary (0/1) features using one-hot encoding.
- Combine Features: Create a new feature 'family_size_per_tenure' by dividing 'dependents' by 'tenure'. As with 'price_per_month', set the result to 0 wherever 'tenure' is 0.
You should return a Pandas DataFrame containing the original features plus the newly engineered features.
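The missing-value step can be sketched as a group-wise median fill. This is a minimal illustration, not a reference solution: it assumes contract length is represented by the 'contract' column, the helper name impute_total_charges is hypothetical, and the fallback to the overall median for a group with no observed values is an added assumption that the problem statement does not specify.

```python
import numpy as np
import pandas as pd


def impute_total_charges(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing 'total_charges' with the median of the row's 'contract' group."""
    out = df.copy()
    # Coerce to numeric in case the column arrived as strings; bad values become NaN.
    out["total_charges"] = pd.to_numeric(out["total_charges"], errors="coerce")
    # Median of each contract group, broadcast back to every row (NaNs are skipped).
    group_median = out.groupby("contract")["total_charges"].transform("median")
    out["total_charges"] = out["total_charges"].fillna(group_median)
    # Assumption: if a whole group is missing, fall back to the overall median.
    out["total_charges"] = out["total_charges"].fillna(out["total_charges"].median())
    return out


sample = pd.DataFrame({
    "contract": ["Month-to-month", "Month-to-month", "Month-to-month", "Two year", "Two year"],
    "total_charges": [100.0, np.nan, 300.0, 2000.0, np.nan],
})
imputed = impute_total_charges(sample)
print(imputed["total_charges"].tolist())  # [100.0, 200.0, 300.0, 2000.0, 2000.0]
```

Using transform keeps the result aligned with the original index, so no explicit loop or merge is needed.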
Examples
Example 1:
Input:
DataFrame with columns: ['customerID', 'gender', 'seniorCitizen', 'partner', 'dependents', 'tenure', 'phoneService', 'multipleLines', 'internetService', 'onlineSecurity', 'onlineBackup', 'deviceProtection', 'techSupport', 'streamingTV', 'streamingMovies', 'contract', 'paperlessBilling', 'paymentMethod', 'monthlyCharges', 'total_charges', 'churn'] and some sample data including missing values in 'total_charges' and various categorical values.
Output:
DataFrame with the original columns plus 'price_per_month', 'tenure_group', one-hot encoded 'contract' columns (e.g., 'contract_Month-to-month', 'contract_One year', 'contract_Two year'), and 'family_size_per_tenure'. Missing 'total_charges' values are imputed. Division by zero errors are handled.
Explanation: The output DataFrame includes the original data along with the newly engineered features, demonstrating the successful application of the specified feature engineering techniques.
Example 2:
Input:
DataFrame with 'tenure' values ranging from 0 to 72 months.
Output:
DataFrame with 'tenure_group' column containing values 'New', 'Mid', and 'Loyal' based on the defined tenure bins.
Explanation: The 'tenure_group' column correctly categorizes customers based on their tenure, demonstrating the binning functionality.
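The binning in Example 2 might look like the following sketch with pd.cut. The bin edges are one possible choice: because pd.cut uses right-closed intervals by default, a lower edge of -1 is used here so that a tenure of 0 falls into 'New'. The sample values are made up for illustration.

```python
import pandas as pd

tenure = pd.Series([0, 12, 13, 48, 49, 72], name="tenure")

# Right-closed intervals: (-1, 12], (12, 48], (48, inf).
# For integer tenure this maps 0-12 to New, 13-48 to Mid, 49+ to Loyal.
tenure_group = pd.cut(
    tenure,
    bins=[-1, 12, 48, float("inf")],
    labels=["New", "Mid", "Loyal"],
)
print(tenure_group.tolist())  # ['New', 'New', 'Mid', 'Mid', 'Loyal', 'Loyal']
```

The result is a categorical Series, which preserves the New < Mid < Loyal ordering if it is needed later.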
Constraints
- The input DataFrame will always contain the columns specified in the Problem Description.
- 'total_charges' may contain missing values (NaN).
- 'tenure' will always be a non-negative integer.
- The number of unique values in the 'contract' column will be between 2 and 4.
- The solution should be efficient and avoid unnecessary loops where possible. Pandas operations are preferred.
- The solution should handle division by zero errors gracefully, assigning a value of 0 in such cases.
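One way to meet the division-by-zero constraint without loops is a small vectorized helper. The name safe_divide is hypothetical, the sample values are invented, and the sketch assumes 'dependents' is stored as a number rather than a Yes/No flag.

```python
import pandas as pd


def safe_divide(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise division that yields 0 wherever the denominator is 0."""
    nonzero = denominator != 0
    # Mask zero denominators to NaN so the division never produces inf.
    ratio = numerator / denominator.where(nonzero)
    # Write 0 only where the denominator was 0; other NaNs are left untouched.
    return ratio.where(nonzero, 0.0)


df = pd.DataFrame({
    "tenure": [0, 2, 4],
    "total_charges": [0.0, 100.0, 220.0],
    "dependents": [1, 0, 2],
})
df["price_per_month"] = safe_divide(df["total_charges"], df["tenure"])
df["family_size_per_tenure"] = safe_divide(df["dependents"], df["tenure"])
print(df["price_per_month"].tolist())         # [0.0, 50.0, 55.0]
print(df["family_size_per_tenure"].tolist())  # [0.0, 0.0, 0.5]
```

Because the helper only overwrites rows with a zero denominator, a NaN in the numerator stays NaN, so it is safest to impute 'total_charges' before computing 'price_per_month'.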
Notes
- Use Pandas for data manipulation.
- Consider using pd.cut for binning.
- Use pd.get_dummies for one-hot encoding.
- Pay close attention to handling missing values and division by zero errors.
- The order of the columns in the output DataFrame does not matter, but all the required features should be present.
- Focus on clarity and readability of your code. Good variable names and comments are encouraged.
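As a rough sketch of the one-hot encoding step, the snippet below encodes 'contract' with pd.get_dummies and concatenates the result, so the original column is kept alongside the new ones. Keeping the original column is an interpretation of "original features plus the newly engineered features", and the sample rows are invented.

```python
import pandas as pd

df = pd.DataFrame({"contract": ["Month-to-month", "One year", "Two year", "One year"]})

# dtype=int requests 0/1 columns; recent pandas versions default to booleans.
dummies = pd.get_dummies(df["contract"], prefix="contract", dtype=int)

# Concatenate instead of passing columns=[...] to get_dummies,
# which would drop the original 'contract' column.
encoded = pd.concat([df, dummies], axis=1)
print(list(encoded.columns))
# ['contract', 'contract_Month-to-month', 'contract_One year', 'contract_Two year']
```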