Hone logo
Hone
Problems

Implementing a Data Archiving Strategy in SQL

Data archiving is a crucial practice for maintaining database performance and managing storage costs. This challenge asks you to design and implement a strategy for archiving older data from a table to a separate archive table, ensuring data integrity and efficient retrieval of archived records. This is useful for compliance, historical analysis, and reducing the load on the primary database.

Problem Description

You are tasked with creating a SQL-based archiving strategy for a table named orders. The orders table contains order information, including an order_id, customer_id, order_date, and total_amount. The goal is to move orders older than a specified date (e.g., one year ago) to an archive table named orders_archive.

What needs to be achieved:

  1. Create an orders_archive table with the same structure as the orders table.
  2. Move all rows from the orders table where the order_date is older than the specified archive date to the orders_archive table.
  3. Delete the archived rows from the orders table.
  4. Ensure data integrity by preserving all data from the original table in the archive.

Key Requirements:

  • The archiving process should be efficient to minimize downtime.
  • The orders_archive table should have the same schema as the orders table.
  • The archiving process should be transactional, meaning that either all rows are archived and deleted, or none are. This prevents data inconsistencies if the process is interrupted.
  • The archive date should be configurable (passed as a parameter).

Expected Behavior:

  • The script should successfully move all qualifying rows to the archive table.
  • The original orders table should be updated to reflect the removal of archived rows.
  • The script should handle cases where the orders_archive table already exists.
  • The script should handle cases where no rows need to be archived.

Edge Cases to Consider:

  • What happens if the orders_archive table already exists? (It should be dropped and recreated).
  • What happens if the order_date column contains NULL values? (NULL values should not be archived).
  • What happens if the archive date is in the future? (No rows should be archived).
  • What happens if the orders table is very large? (Consider batch processing or other optimization techniques, though full optimization is not required for this challenge).

Examples

Example 1:

Input: archive_date = '2022-12-31'
orders table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------
1        | 101         | 2023-01-15  | 100.00
2        | 102         | 2022-11-20  | 50.00
3        | 101         | 2022-10-05  | 75.00
4        | 103         | 2023-02-10  | 200.00
Output:
orders table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------
1        | 101         | 2023-01-15  | 100.00
4        | 103         | 2023-02-10  | 200.00

orders_archive table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------
2        | 102         | 2022-11-20  | 50.00
3        | 101         | 2022-10-05  | 75.00

Explanation: Orders with order_date before '2022-12-31' are moved to orders_archive and deleted from orders.

Example 2:

Input: archive_date = '2024-01-01'
orders table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------
1        | 101         | 2023-01-15  | 100.00
2        | 102         | 2022-11-20  | 50.00
Output:
orders table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------
1        | 101         | 2023-01-15  | 100.00

orders_archive table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------
2        | 102         | 2022-11-20  | 50.00

Explanation: Only the order with order_date before '2024-01-01' is archived.

Example 3:

Input: archive_date = '2023-01-01'
orders table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------
1        | 101         | 2023-01-15  | 100.00
2        | 102         | 2023-01-01  | 50.00
Output:
orders table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------
1        | 101         | 2023-01-15  | 100.00
2        | 102         | 2023-01-01  | 50.00

orders_archive table:
order_id | customer_id | order_date  | total_amount
---------|-------------|-------------|--------------

Explanation: No rows are archived because no order_date is before '2023-01-01'.

Constraints

  • The order_date column is of type DATE or DATETIME.
  • The archive date will be provided as a string in 'YYYY-MM-DD' format.
  • The database system is assumed to be a standard SQL implementation (e.g., PostgreSQL, MySQL, SQL Server). Specific syntax may need to be adjusted for the target database.
  • The solution should be reasonably efficient for tables with up to 1 million rows. While full optimization is not required, avoid excessively inefficient queries.

Notes

  • Consider using a transaction to ensure atomicity.
  • You may need to drop and recreate the orders_archive table if it already exists.
  • Think about how to handle NULL values in the order_date column. They should not be archived.
  • Pseudocode is preferred over actual SQL syntax, but the pseudocode should be clear and easily translatable to SQL.
  • Focus on the logic and structure of the archiving process. Error handling and detailed logging are not required for this challenge.

Pseudocode:

BEGIN TRANSACTION;

-- Drop the archive table if it exists
DROP TABLE IF EXISTS orders_archive;

-- Create the archive table with the same structure as the orders table
CREATE TABLE orders_archive AS SELECT * FROM orders WHERE 1=0; -- Creates table with same structure but no data
ALTER TABLE orders_archive ADD CONSTRAINT PK_orders_archive PRIMARY KEY (order_id); -- Add primary key if it exists in original table

-- Archive rows older than the specified date
INSERT INTO orders_archive
SELECT *
FROM orders
WHERE order_date < archive_date AND order_date IS NOT NULL;

-- Delete archived rows from the orders table
DELETE FROM orders
WHERE order_date < archive_date AND order_date IS NOT NULL;

COMMIT TRANSACTION;
Loading editor...
plaintext