Feature Engineering for Customer Churn Prediction

Feature engineering is a crucial step in machine learning, often having a greater impact on model performance than the choice of algorithm itself. This challenge asks you to implement several common feature engineering techniques on a dataset representing customer information and their likelihood of churn. The goal is to transform raw data into features that better represent the underlying patterns and improve the predictive power of a churn prediction model.

Problem Description

You are given a dataset (represented as a Pandas DataFrame) containing information about customers of a telecommunications company. The dataset includes features like contract length, monthly charges, total charges, and demographic information. Your task is to implement a series of feature engineering steps to create new, potentially more informative features from the existing ones. Specifically, you need to:

  1. Handle Missing Values: Address missing values in the 'total_charges' column by imputing them with the median 'total_charges' of customers who have the same 'contract' value.
  2. Create Interaction Feature: Generate a new feature called 'price_per_month' by dividing 'total_charges' by 'tenure' (customer tenure in months). Where 'tenure' is 0, assign 0 rather than dividing.
  3. Binning/Discretization: Create a new feature called 'tenure_group' by binning the 'tenure' column into three groups: 'New' (0-12 months), 'Mid' (13-48 months), and 'Loyal' (49+ months).
  4. One-Hot Encoding: Convert the 'contract' column (which contains categorical values like 'Month-to-month', 'One year', 'Two year') into a set of binary (0/1) features using one-hot encoding.
  5. Combine Features: Create a new feature 'family_size_per_tenure' by dividing 'dependents' by 'tenure'. Where 'tenure' is 0, assign 0 rather than dividing.
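
Step 1 is the least mechanical of the five, so here is a minimal sketch of one way to approach it, not a reference solution. It reads "similar contract lengths" as "same value in the 'contract' column". The helper name impute_total_charges, the numeric coercion, and the fallback to the overall median are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def impute_total_charges(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing 'total_charges' with the median of the row's contract group."""
    out = df.copy()
    # Assumption: coerce to numeric in case the column arrived as strings.
    out["total_charges"] = pd.to_numeric(out["total_charges"], errors="coerce")
    # Median per contract type, broadcast back to every row of that group.
    # NaN values are skipped when the median is computed.
    group_median = out.groupby("contract")["total_charges"].transform("median")
    out["total_charges"] = out["total_charges"].fillna(group_median)
    # Assumption: if an entire group is missing, fall back to the overall median.
    out["total_charges"] = out["total_charges"].fillna(out["total_charges"].median())
    return out


sample = pd.DataFrame({
    "contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year"],
    "total_charges": [100.0, np.nan, 300.0, 1000.0, np.nan],
})
imputed = impute_total_charges(sample)
```

Using groupby(...).transform keeps the result aligned with the original index, so no explicit loop or merge is needed.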

You should return a Pandas DataFrame containing the original features plus the newly engineered features.

Examples

Example 1:

Input:
DataFrame with columns: ['customerID', 'gender', 'seniorCitizen', 'partner', 'dependents', 'tenure', 'phoneService', 'multipleLines', 'internetService', 'onlineSecurity', 'onlineBackup', 'deviceProtection', 'techSupport', 'streamingTV', 'streamingMovies', 'contract', 'paperlessBilling', 'paymentMethod', 'monthlyCharges', 'total_charges', 'churn'] and some sample data including missing values in 'total_charges' and various categorical values.

Output:
DataFrame with the original columns plus 'price_per_month', 'tenure_group', one-hot encoded 'contract' columns (e.g., 'contract_Month-to-month', 'contract_One year', 'contract_Two year'), and 'family_size_per_tenure'. Missing 'total_charges' values are imputed. Division by zero errors are handled.

Explanation: The output DataFrame includes the original data along with the newly engineered features, demonstrating the successful application of the specified feature engineering techniques.

Example 2:

Input:
DataFrame with 'tenure' values ranging from 0 to 72 months.

Output:
DataFrame with 'tenure_group' column containing values 'New', 'Mid', and 'Loyal' based on the defined tenure bins.

Explanation: The 'tenure_group' column correctly categorizes customers based on their tenure, demonstrating the binning functionality.
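
The binning in Example 2 could be sketched with pd.cut as follows. The bin edges are an assumption chosen to match the stated ranges: pd.cut uses right-closed intervals by default, so a lower edge of -1 is used to include a tenure of 0, and an open upper edge covers 49 months and beyond.

```python
import numpy as np
import pandas as pd

# Boundary values on both sides of each bin edge.
tenure = pd.Series([0, 12, 13, 48, 49, 72], name="tenure")

# Intervals are (-1, 12], (12, 48], (48, inf), i.e. 0-12, 13-48, 49+ for integers.
tenure_group = pd.cut(
    tenure,
    bins=[-1, 12, 48, np.inf],
    labels=["New", "Mid", "Loyal"],
)
```

Checking the values directly at the edges (12 and 13, 48 and 49) is a quick way to confirm that the boundaries fall where the problem statement says they should.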

Constraints

  • The input DataFrame will always contain the columns specified in the Problem Description.
  • 'total_charges' may contain missing values (NaN).
  • 'tenure' will always be a non-negative integer.
  • The number of unique values in the 'contract' column will be between 2 and 4.
  • The solution should be efficient and avoid unnecessary loops where possible. Pandas operations are preferred.
  • The solution should handle division by zero errors gracefully, assigning a value of 0 in such cases.
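
One possible way to satisfy the division-by-zero constraint without loops is sketched below. The helper name safe_divide is hypothetical, and the sketch assumes 'dependents' is already numeric. Zero denominators are masked before dividing, and the result is set to 0 only where the denominator was 0, so a genuinely missing numerator still shows up as NaN.

```python
import numpy as np
import pandas as pd


def safe_divide(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise division that yields 0 wherever the denominator is 0."""
    # Mask zeros so the division itself never sees a zero denominator.
    ratio = numerator / denominator.replace(0, np.nan)
    # Keep the ratio where the denominator is non-zero, otherwise assign 0.
    return ratio.where(denominator != 0, 0.0)


df = pd.DataFrame({
    "tenure": [0, 10, 24],
    "total_charges": [0.0, 500.0, 1200.0],
    "dependents": [2, 0, 3],  # assumption: numeric counts, not 'Yes'/'No'
})
df["price_per_month"] = safe_divide(df["total_charges"], df["tenure"])
df["family_size_per_tenure"] = safe_divide(df["dependents"], df["tenure"])
```

The same helper serves both 'price_per_month' and 'family_size_per_tenure', which keeps the zero-handling rule in one place.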

Notes

  • Use Pandas for data manipulation.
  • Consider using pd.cut for binning.
  • Use pd.get_dummies for one-hot encoding.
  • Pay close attention to handling missing values and division by zero errors.
  • The order of the columns in the output DataFrame does not matter, but all the required features should be present.
  • Focus on clarity and readability of your code. Good variable names and comments are encouraged.
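
As a small illustration of the pd.get_dummies note, the sketch below encodes 'contract' while keeping the original column, since the task asks for the original features plus the new ones. The sample rows are made up, and dtype=int is an assumption used to get 0/1 values rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": ["a", "b", "c"],
    "contract": ["Month-to-month", "One year", "Two year"],
})

# Encode the column separately so the original 'contract' column is preserved.
dummies = pd.get_dummies(df["contract"], prefix="contract", dtype=int)
encoded = pd.concat([df, dummies], axis=1)
```

Passing columns=['contract'] to pd.get_dummies on the whole DataFrame would instead replace the original column, which is why the encoded columns are concatenated here.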