Hone logo
Hone
Problems

Optimizing Sales Data Analysis with SQL OLAP Queries

Businesses often need to analyze historical sales data to identify trends, understand customer behavior, and make informed decisions. This challenge focuses on designing efficient SQL queries for Online Analytical Processing (OLAP) on a sales dataset, emphasizing performance and clarity. You'll be tasked with crafting queries that aggregate and summarize data across different dimensions (time, product, region) to answer common business questions.

Problem Description

You are given a simplified sales dataset represented by a table named sales. The sales table has the following columns:

  • sale_id (INTEGER): Unique identifier for each sale.
  • sale_date (DATE): Date of the sale.
  • product_id (INTEGER): Identifier for the product sold.
  • region_id (INTEGER): Identifier for the region where the sale occurred.
  • quantity (INTEGER): Quantity of the product sold in the sale.
  • price (DECIMAL): Price per unit of the product.

Your task is to write SQL queries to answer the following business questions efficiently. Efficiency is paramount; consider indexing and query optimization techniques. Assume the database system is capable of standard SQL operations and indexing.

  1. Monthly Sales Trend: Calculate the total sales revenue for each month in the year 2023.
  2. Top Performing Products: Identify the top 5 products (by product_id) with the highest total sales revenue in 2023.
  3. Regional Sales Comparison: Determine the total sales revenue for each region in 2023, sorted in descending order.
  4. Sales by Product Category (Simulated): Assume you have a separate table products with columns product_id and category_id. Write a query to calculate the total sales revenue for each category_id in 2023.
  5. Year-over-Year Growth: Calculate the percentage growth in total sales revenue between 2022 and 2023.

Examples

Example 1: Monthly Sales Trend

Input: sales table with data spanning 2023
Output:
Month | Total Revenue
-------|---------------
2023-01 | 12345.67
2023-02 | 15678.90
...
2023-12 | 18901.23

Explanation: The query aggregates sales revenue by month for the year 2023. The sale_date is extracted to get the month, and the quantity and price are multiplied to calculate revenue for each sale, then summed for each month.

Example 2: Top Performing Products

Input: sales table with data spanning 2023
Output:
product_id | Total Revenue
------------|---------------
101         | 56789.01
102         | 43210.56
103         | 38901.23
104         | 32109.87
105         | 29876.43

Explanation: The query calculates the total sales revenue for each product in 2023 and then returns the top 5 products with the highest revenue, ordered by revenue in descending order.

Example 3: Year-over-Year Growth

Input: sales table with data spanning 2022 and 2023
Output:
Year-over-Year Growth: 10.5%

Explanation: The query calculates the total sales revenue for 2022 and 2023 separately. It then calculates the percentage growth using the formula: ((Revenue_2023 - Revenue_2022) / Revenue_2022) * 100.

Constraints

  • The sales table will contain data for at least two years (2022 and 2023).
  • The sale_date column will always contain valid dates.
  • The quantity column will always contain non-negative integers.
  • The price column will always contain non-negative decimal values.
  • The products table (for question 4) will always exist and contain valid product_id and category_id pairs.
  • Queries should be optimized for performance, especially for large datasets (assume millions of rows). Consider indexing strategies.
  • All queries must return results in the specified format.

Notes

  • Focus on writing efficient SQL queries. Consider using appropriate aggregate functions, WHERE clauses, ORDER BY clauses, and indexing.
  • You do not need to create the sales or products tables. Assume they already exist with the specified schema.
  • The database system supports standard SQL syntax.
  • For the Year-over-Year Growth calculation, round the percentage to one decimal place.
  • Pseudocode for the general approach to each query is acceptable if you cannot provide the full SQL query. However, providing the SQL query is strongly preferred.
  • Assume that product_id is unique within the products table.
  • Consider using Common Table Expressions (CTEs) to improve query readability and modularity.
  • Think about how indexing the sale_date, product_id, and region_id columns might improve query performance.
Loading editor...
plaintext