Hone logo
Hone
Problems

Real-Time Data Aggregation and Alerting with SQL

Real-time data processing is crucial for many applications, allowing for immediate insights and actions based on incoming data streams. This challenge focuses on implementing a system that aggregates data in real-time within a SQL database and triggers alerts when certain conditions are met. You'll design and implement SQL queries to achieve this, simulating a continuous data feed.

Problem Description

You are tasked with designing and implementing SQL queries to process a stream of sensor readings in real-time. The sensor readings are represented as events arriving continuously. Your goal is to aggregate these readings over a sliding time window (e.g., every 5 minutes) and trigger an alert if the average temperature exceeds a predefined threshold. The system should be able to handle a continuous stream of data and provide near real-time aggregation and alerting.

What needs to be achieved:

  1. Data Ingestion: Assume a table named sensor_readings exists with the following columns: timestamp (TIMESTAMP), sensor_id (VARCHAR), temperature (DECIMAL). Data is continuously inserted into this table.
  2. Sliding Window Aggregation: Calculate the average temperature for each sensor over a 5-minute sliding window. This means the window moves forward with each new reading, including only readings within the last 5 minutes.
  3. Alerting: Define a threshold temperature (e.g., 80 degrees Celsius). If the average temperature for any sensor within the 5-minute window exceeds this threshold, generate an alert. The alert should include the sensor_id, the average temperature, and the timestamp of the alert.
  4. Continuous Operation: The solution should be designed to operate continuously, processing new data as it arrives.

Key Requirements:

  • The solution must use standard SQL (compatible with most major database systems like PostgreSQL, MySQL, SQL Server).
  • The solution should be efficient to minimize latency in processing the data stream.
  • The solution should be robust and handle potential edge cases (e.g., missing data, sensor failures).

Expected Behavior:

The system should continuously monitor the sensor_readings table. Whenever a new reading is inserted, the system should:

  1. Calculate the 5-minute sliding window average temperature for each sensor.
  2. Check if any sensor's average temperature exceeds the threshold.
  3. If the threshold is exceeded, generate an alert record in a table named alerts with columns: sensor_id (VARCHAR), average_temperature (DECIMAL), alert_timestamp (TIMESTAMP).

Edge Cases to Consider:

  • Insufficient Data: What happens if there are fewer than two readings within a 5-minute window for a particular sensor? (Consider returning NULL or a default value).
  • Sensor Downtime: How should the system handle periods of sensor inactivity?
  • High Data Volume: The solution should be reasonably efficient even with a high volume of incoming data.

Examples

Example 1:

Input:
sensor_readings table:
timestamp             | sensor_id | temperature
----------------------|-----------|-------------
2024-10-27 10:00:00 | sensor1   | 70.0
2024-10-27 10:01:00 | sensor1   | 72.0
2024-10-27 10:02:00 | sensor2   | 75.0
2024-10-27 10:03:00 | sensor1   | 75.0
2024-10-27 10:04:00 | sensor2   | 85.0
Threshold: 80.0

Output:
alerts table:
sensor_id | average_temperature | alert_timestamp
-----------|---------------------|-----------------
sensor2   | 85.0                | 2024-10-27 10:04:00

Explanation: The average temperature for sensor2 over the 5-minute window (2024-10-27 10:00:00 to 2024-10-27 10:04:00) is 85.0, which exceeds the threshold of 80.0, triggering an alert. Sensor1's average is below the threshold.

Example 2:

Input:
sensor_readings table:
timestamp             | sensor_id | temperature
----------------------|-----------|-------------
2024-10-27 10:00:00 | sensor1   | 70.0
2024-10-27 10:01:00 | sensor1   | 72.0
Threshold: 80.0

Output:
alerts table: (empty)

Explanation: There are only two readings for sensor1 within the 5-minute window. The average temperature (71.0) is below the threshold, so no alert is triggered.

Constraints

  • Data Volume: The system should be able to handle at least 1000 sensor readings per minute.
  • Latency: The time between a new reading arriving and an alert being generated (if applicable) should be less than 10 seconds.
  • Window Size: The sliding window size is fixed at 5 minutes.
  • Threshold: The temperature threshold is a configurable parameter (e.g., 80.0).
  • SQL Standard: Use standard SQL syntax. Avoid database-specific extensions unless absolutely necessary.

Notes

  • Consider using window functions (e.g., AVG() OVER (ORDER BY timestamp ROWS BETWEEN 5 MINUTE PRECEDING AND CURRENT ROW)) to efficiently calculate the sliding window average.
  • The alerts table should be pre-created.
  • The challenge focuses on the SQL queries for aggregation and alerting. The data ingestion mechanism is assumed to be handled separately.
  • Think about how to optimize the query for performance, especially with a large number of sensors and high data volume. Indexing the timestamp and sensor_id columns in the sensor_readings table is highly recommended.
  • The solution should be designed to be easily adaptable to different window sizes and thresholds.
Loading editor...
plaintext