Hone logo
Hone
Problems

Data Quality Assurance with SQL

Data quality is paramount for reliable decision-making. This challenge requires you to implement a series of SQL queries to perform data quality checks on a given dataset, identifying potential issues like missing values, invalid data types, and inconsistencies. Successfully completing this challenge demonstrates your ability to ensure data integrity using SQL.

Problem Description

You are provided with a table named customer_data containing customer information. Your task is to write SQL queries to perform the following data quality checks and report the findings:

  1. Missing Values: Identify the number of rows with missing values in each column.
  2. Data Type Validation: Verify that the order_date column contains valid dates. Report the number of rows where the order_date is not a valid date.
  3. Range Validation: The order_total column should contain values between 0 and 1000 (inclusive). Report the number of rows where order_total falls outside this range.
  4. Uniqueness Validation: Ensure that the customer_id column contains unique values. Report the number of duplicate customer_id values.
  5. Consistency Check: Verify that the country column only contains values from a predefined list: 'USA', 'Canada', 'UK', 'Germany', 'France'. Report the number of rows where the country value is not in this list.

The output for each check should be a single row with two columns: check_name and error_count.

Table Schema:

customer_data

Column NameData Type
customer_idINT
first_nameVARCHAR(255)
last_nameVARCHAR(255)
emailVARCHAR(255)
order_dateVARCHAR(255)
order_totalDECIMAL(10, 2)
countryVARCHAR(255)

Examples

Example 1:

Input:
customer_data table with some missing values, invalid dates in order_date, order_totals outside the range, duplicate customer_ids, and countries not in the allowed list.

Output:
| check_name | error_count |
|---|---|
| Missing Values | 15 |
| Data Type Validation | 5 |
| Range Validation | 3 |
| Uniqueness Validation | 2 |
| Consistency Check | 7 |

Explanation:
The output reflects the number of errors found for each data quality check. For instance, 15 rows have missing values across various columns, 5 rows have invalid dates in the `order_date` column, and so on.

Example 2:

Input:
customer_data table with all values valid and no missing data.

Output:
| check_name | error_count |
|---|---|
| Missing Values | 0 |
| Data Type Validation | 0 |
| Range Validation | 0 |
| Uniqueness Validation | 0 |
| Consistency Check | 0 |

Explanation:
Since all data is valid, the error count for each check is 0.

Example 3: (Edge Case - Empty Table)

Input:
customer_data table is empty.

Output:
| check_name | error_count |
|---|---|
| Missing Values | 0 |
| Data Type Validation | 0 |
| Range Validation | 0 |
| Uniqueness Validation | 0 |
| Consistency Check | 0 |

Explanation:
An empty table results in zero errors for all checks.

Constraints

  • The customer_data table will contain between 0 and 1000 rows.
  • The order_date column will contain strings representing dates in various formats (e.g., 'YYYY-MM-DD', 'MM/DD/YYYY', invalid strings).
  • The order_total column will contain decimal numbers.
  • The country column will contain strings.
  • The SQL dialect should be compatible with standard SQL (e.g., PostgreSQL, MySQL, SQL Server). Assume a standard date format conversion function is available (e.g., TRY_CAST in SQL Server, TO_DATE in PostgreSQL/MySQL).
  • Performance is not a primary concern for this challenge, but avoid excessively inefficient queries.

Notes

  • Consider using COUNT(*) and CASE statements to efficiently count errors.
  • For data type validation, use appropriate functions to attempt conversion to the expected data type and handle errors gracefully. TRY_CAST or similar functions are recommended to avoid query failures due to invalid data.
  • For range validation, use WHERE clauses with appropriate comparison operators.
  • For uniqueness validation, use GROUP BY and HAVING clauses to identify duplicate values.
  • For consistency checks, use WHERE clauses with IN operators to filter for allowed values.
  • The check_name column in the output should be exactly as specified in the problem description.
  • The error_count column should be an integer representing the number of errors found for each check.
  • You are expected to provide a single query that combines all the checks into a single result set. Using multiple queries is acceptable, but a single query demonstrating efficient data quality assessment is preferred.
  • Assume that the database system supports common SQL functions like COUNT, CASE, GROUP BY, HAVING, IN, and date conversion functions.
Loading editor...
plaintext