Building a Simple ETL Pipeline in Python
This challenge asks you to build a basic Extract, Transform, and Load (ETL) pipeline in Python. ETL processes are fundamental in data engineering, used to move data from various sources, clean and transform it, and load it into a target destination for analysis or reporting. This exercise will help you understand the core concepts of ETL and practice working with data manipulation in Python.
Problem Description
You are tasked with creating an ETL pipeline that extracts data from a CSV file, transforms it by converting a specific column to a different data type and adding a new calculated column, and then loads the transformed data into a new CSV file. The input CSV file contains customer data with columns: customer_id, name, email, and total_spent. The total_spent column is currently stored as a string.
What needs to be achieved:
- Extract: Read the data from an input CSV file.
- Transform:
- Convert the
total_spentcolumn from string to float. Handle potentialValueErrorexceptions if the string cannot be converted to a float (set these values to0.0). - Create a new column called
discounted_spentby applying a 10% discount to thetotal_spent(now a float).
- Convert the
- Load: Write the transformed data (including the new
discounted_spentcolumn) to a new CSV file.
Key Requirements:
- The code should be modular and well-structured.
- Error handling is crucial, especially when converting the
total_spentcolumn. - The output CSV file should have a header row with the column names:
customer_id,name,email,total_spent,discounted_spent.
Expected Behavior:
The program should read the input CSV, perform the transformations as described, and write the transformed data to the output CSV. If a total_spent value cannot be converted to a float, it should be treated as 0.0. The discounted_spent column should accurately reflect a 10% discount on the total_spent.
Edge Cases to Consider:
- Empty input CSV file.
- CSV file with missing values in any column.
total_spentcolumn containing non-numeric values that cannot be converted to floats.- CSV file with incorrect number of columns.
Examples
Example 1:
Input:
customer_id,name,email,total_spent
1,Alice,alice@example.com,100.00
2,Bob,bob@example.com,200.50
3,Charlie,charlie@example.com,abc
4,David,david@example.com,300
Output:
customer_id,name,email,total_spent,discounted_spent
1,Alice,alice@example.com,100.0,90.0
2,Bob,bob@example.com,200.5,180.45
3,Charlie,charlie@example.com,0.0,0.0
4,David,david@example.com,300.0,270.0
Explanation: The total_spent values are converted to floats. "abc" is converted to 0.0. A 10% discount is applied to each total_spent value to calculate discounted_spent.
Example 2:
Input:
customer_id,name,email,total_spent
1,Alice,alice@example.com,
2,Bob,bob@example.com,200.50
Output:
customer_id,name,email,total_spent,discounted_spent
1,Alice,alice@example.com,0.0,0.0
2,Bob,bob@example.com,200.5,180.45
Explanation: An empty string in total_spent is treated as 0.0.
Constraints
- The input CSV file will be relatively small (less than 1MB).
- The input CSV file will be comma-separated.
- The code should be able to handle a reasonable number of rows (up to 1000).
- The output CSV file should also be comma-separated.
- The code should be written in Python 3.
Notes
- You can use the
csvmodule for reading and writing CSV files. - Consider using a
try-exceptblock to handle potentialValueErrorexceptions during the conversion of thetotal_spentcolumn. - Think about how to handle missing values in the input CSV file gracefully. Treating them as 0.0 for
total_spentis a reasonable approach. - Focus on creating a clear, readable, and maintainable solution. Modularity is key.
- The discount calculation should be accurate to two decimal places.