Designing a Star Schema for a Retail Sales Data Warehouse

This challenge asks you to design a star schema data warehouse in SQL to support analytical reporting for a retail business. A well-designed data warehouse allows for efficient querying and analysis of sales data, enabling business users to gain insights into customer behavior, product performance, and overall sales trends. You will define the tables and relationships necessary to represent sales transactions, products, customers, and time dimensions.

Problem Description

You are tasked with designing a star schema data warehouse for a retail company. The core of the schema should revolve around a central fact table representing sales transactions. This fact table will be surrounded by dimension tables that provide context for the sales data. The dimensions should include:

Customer Dimension: Stores information about customers (customer ID, name, address, city, state, zip code, gender, age group).
Product Dimension: Stores information about products (product ID, product name, category, sub-category, brand, price).
Time Dimension: Stores information about dates and times (date ID, date, day of week, month, quarter, year).
Store Dimension: Stores information about stores (store ID, store name, city, state, zip code, region).

The fact table (Sales Fact) should contain:

Sales transaction ID (primary key)
Foreign keys referencing the Customer Dimension, Product Dimension, Time Dimension, and Store Dimension.
Sales quantity
Sales amount
Cost of goods sold

Key Requirements:

Design the schema using SQL CREATE TABLE statements.
Define appropriate data types for each column.
Establish primary and foreign key relationships between tables.
Ensure the schema is optimized for analytical queries (star schema principles).
Consider potential future expansion of the data warehouse (e.g., adding new dimensions).

Expected Behavior:

The resulting SQL code should create a functional star schema that can be populated with sales data. The schema should allow for efficient querying to answer business questions such as:

What are the total sales for a specific product category in a given month?
Which customers are the most valuable (based on total spending)?
What is the average sales amount per transaction for a specific store?
How do sales trends vary by day of the week?

Edge Cases to Consider:

Handling missing values in dimension attributes (e.g., a customer without an address). Consider using NULL or default values appropriately.
Dealing with changes in dimension attributes over time (e.g., a product price change). The Time Dimension should facilitate this.
Scalability: While not a primary focus, consider how the schema might perform with a large volume of sales data.

Examples

Example 1:

Let's say we want to analyze sales by product category and year. The schema should allow us to easily join the Sales Fact table with the Product Dimension and Time Dimension to aggregate sales data by category and year.

Example 2:

If a customer moves, their address in the Customer Dimension should be updated. The Time Dimension allows us to track historical addresses if needed.

Example 3:

A new store opens. A new record should be added to the Store Dimension with the store's details, and sales transactions for that store will reference the new store ID.

Constraints

The schema must adhere to star schema principles.
All tables must be created using standard SQL CREATE TABLE statements.
Data types should be chosen appropriately for the data being stored.
Assume a reasonable number of products, customers, stores, and transactions (e.g., millions of records). While performance optimization is not the primary focus, avoid excessively complex data types that could hinder query performance.
The schema should be designed to be extensible to accommodate future dimensions (e.g., promotion dimension).

Notes

Focus on the structure of the data warehouse. You do not need to provide sample data or write queries to populate the tables.
Consider using surrogate keys (auto-incrementing integer IDs) for dimension tables to simplify relationships and improve performance.
Think about the granularity of the fact table. In this case, each row represents a single sales transaction.
The Time Dimension is crucial for time-based analysis. Ensure it includes relevant attributes like day of week, month, quarter, and year.
Document your design choices briefly (e.g., why you chose a particular data type). This is not required but can be helpful.
Pseudocode for the table creation:

// Create Customer Dimension Table
CREATE TABLE Customer (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(255),
    address VARCHAR(255),
    city VARCHAR(255),
    state VARCHAR(2),
    zip_code VARCHAR(10),
    gender CHAR(1),
    age_group VARCHAR(50)
);

// Create Product Dimension Table
CREATE TABLE Product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(255),
    category VARCHAR(255),
    sub_category VARCHAR(255),
    brand VARCHAR(255),
    price DECIMAL(10, 2)
);

// Create Time Dimension Table
CREATE TABLE Time (
    date_id INT PRIMARY KEY,
    date DATE,
    day_of_week INT,
    month INT,
    quarter INT,
    year INT
);

// Create Store Dimension Table
CREATE TABLE Store (
    store_id INT PRIMARY KEY,
    store_name VARCHAR(255),
    city VARCHAR(255),
    state VARCHAR(2),
    zip_code VARCHAR(10),
    region VARCHAR(255)
);

// Create Sales Fact Table
CREATE TABLE SalesFact (
    sales_transaction_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    date_id INT,
    store_id INT,
    sales_quantity INT,
    sales_amount DECIMAL(10, 2),
    cost_of_goods_sold DECIMAL(10, 2),
    FOREIGN KEY (customer_id) REFERENCES Customer(customer_id),
    FOREIGN KEY (product_id) REFERENCES Product(product_id),
    FOREIGN KEY (date_id) REFERENCES Time(date_id),
    FOREIGN KEY (store_id) REFERENCES Store(store_id)
);