Advanced SQL — Window Functions Use Cases.

Bluetick Consultants Inc
9 min readMar 16, 2023

--

Welcome to this exploration of Advanced SQL — Window Functions Use Cases.

  • To learn about frame type — ROWS, RANGE, and GROUPS with Postgres and Django ORM, refer to the part 1 of this blog series
  • To learn about functions like PARTITION_BY, EXCLUDE, LEAD/LAG with Postgres and Django ORM, refer to the part 2of this blog series
  • To learn more about FIRST_VALUE, LAST_VALUE, NTH_VALUE, ROW_NUMBER, DENSE_RANK, RANK, NTILE, refer to the part 3 of this blog series

USE CASE - What is the likelihood of nice weather on weekends?

Input - Daily weather sensor readings consisting of day, weekday, temperature, rain.
The weather is fine on day d if—on d and the two days prior—the minimum temperature is above 15°C and the overall rainfall is less than 600ml/m².

DROP TABLE IF EXISTS sensors;
CREATE TABLE sensors (
day int PRIMARY KEY, -- day of month
weekday text, -- day of week (Mon...Sun)
temp float, -- temperature in °C
rain float); -- rainfall in ml

INSERT INTO sensors(day, weekday, temp, rain) VALUES
( 1, 'Thu', 13, 0),
( 2, 'Fri', 10, 800),
( 3, 'Sat', 12, 300),
( 4, 'Sun', 16, 100),
( 5, 'Mon', 20, 400),
( 6, 'Tue', 20, 80),
( 7, 'Wed', 18, 500),
( 8, 'Thu', 14, 0),
( 9, 'Fri', 10, 0),
(10, 'Sat', 12, 500),
(11, 'Sun', 14, 300),
(12, 'Mon', 14, 800),
(13, 'Tue', 16, 0),
(14, 'Wed', 15, 0),
(15, 'Thu', 18, 100),
(16, 'Fri', 17, 100),
(17, 'Sat', 15, 0),
(18, 'Sun', 16, 300),
(19, 'Mon', 16, 400),
(20, 'Tue', 19, 200),
(21, 'Wed', 19, 100),
(22, 'Thu', 18, 0),
(23, 'Fri', 17, 0),
(24, 'Sat', 16, 200);

As per the problem statement, the weather is fine on day d if—on d and the two days prior—the minimum temperature is above 15°C and the overall rainfall is less than 600ml/m².

Pseudo code

Step 1
1. Collect the weather data for each day and 2 days prior as a frame.
2. Find the minimum temperature over the 3 days in the window
3. Find the overall rainfall over the 3 days in the window

weather data for each day and 2 days prior

ORDER BY s.day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW

minimum temperature over the 3 days in the window -
MIN(s.temp) OVER three_days

overall rainfall over the 3 days in the window
SUM(s.rain) OVER three_days

Combining all the above into a query will be

SELECT s.day, s.weekday,
MIN(s.temp) OVER three_days AS temp,
SUM(s.rain) OVER three_days AS rain
FROM sensors AS s
WINDOW three_days AS (ORDER BY s.day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

Step 2:
Over the collected data for current day and 2 prior, check if minimum temperature is above 15°C and total rainfall is less than 600 ml/m2

SELECT s.day, s.weekday,
CASE WHEN s.temp >= 15 and s.rain <= 600
THEN 'sunny'
ELSE 'rainy'
END AS condition
FROM three_day_sensors AS s

Step 3:
1. Now that we have determined whether a day is sunny or rainy. From this data collect only the weekend data (saturday and sunday)
2. Group all the weekends and weekdays data separately
3. For each group, count the number of days it was sunny and calculate its percentage.

SELECT w.weekday IN ('Sat', 'Sun') AS "weekend?",
(COUNT(*) FILTER (WHERE w.condition = '☀') * 100.0 /
COUNT(*)) :: int AS "% fine"
FROM weather AS w
GROUP BY "weekend?";

Final Query will be

WITH
three_day_sensors(day, weekday, temp, rain) AS (
SELECT s.day, s.weekday,
MIN(s.temp) OVER three_days AS temp,
SUM(s.rain) OVER three_days AS rain
FROM sensors AS s
WINDOW three_days AS (ORDER BY s.day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
),

weather(day, weekday, condition) AS (
SELECT s.day, s.weekday,
CASE WHEN s.temp >= 15 and s.rain <= 600
THEN 'sunny'
ELSE 'rainy'
END AS condition
FROM three_day_sensors AS s
)

SELECT w.weekday IN ('Sat', 'Sun') AS "weekend?",
(COUNT(*) FILTER (WHERE w.condition = 'sunny') * 100.0 /
COUNT(*)) :: int AS "% fine"
FROM weather AS w
GROUP BY "weekend?";

OUTPUT -

weekend? | % fine
---------+--------
f | 29
t | 43

Weekdays have pleasant weather 29% of the time and weekends have pleasant weather 43% of the time.

USE CASE - Sessionization

Input:
A table that contains timestamps of 2 user's logs. For a particular user, if the difference between 2 timestamps is more than 30 seconds, it should have a new session id

DROP TABLE IF EXISTS log;
CREATE TABLE log(uid text NOT NULL,ts timestamp NOT NULL);

INSERT INTO log(uid, ts) VALUES
('User1', '05-25-2020 07:25:12'), -- assumes datestyle = 'ISO, MDY'
('User1', '05-25-2020 07:25:18'),
('User1', '05-25-2020 07:25:21'),
('User2', '05-25-2020 08:01:55'),
('User2', '05-25-2020 08:05:07'),
('User2', '05-25-2020 08:05:30'),
('User2', '05-25-2020 08:05:53'),
('User2', '05-25-2020 08:06:19'), -- ⎱ sessions by User2 and User1 within
('User1', '05-25-2020 08:06:30'), -- ⎰ 30 seconds, still *two* sessions ⇒ partition by uid ⁑
('User1', '05-25-2020 08:06:42'),
('User1', '05-25-2020 18:32:07'),
('User1', '05-25-2020 18:32:27'),
('User1', '05-25-2020 18:32:44'),
('User1', '05-25-2020 18:33:00'),
('User2', '05-25-2020 22:20:06'),
('User2', '05-25-2020 22:20:16');

Pseudo code

1. Create 2 groups, for user 1 and user 2, and order the timestamps in ascending order.
PARTITION BY uid ORDER BY ts
Partition by is used to separate user1 and user2 logs and the timestamps are ordered.

2. If the time difference between the current row and its previous row is greater than 30 secs then mark the session as 1 else mark it as 0.
Note : 1 marks the beginning of new session
Current.ts > LAG(ts, 1, '-infinity) OVER window_function

SELECT l.*,
CASE WHEN l.ts > LAG (l.ts, 1, '-infinity') OVER (PARTITION BY l.uid ORDER BY l.ts) + :'inactivity'
THEN 1
ELSE 0
END AS sos
FROM log AS l
ORDER BY l.uid, l.ts -- ← for presentation purposes only

OUTPUT -

uid  |         ts          | sos
-------+---------------------+---------------
User1 | 2020-05-25 07:25:12 | 1
User1 | 2020-05-25 07:25:18 | 0
User1 | 2020-05-25 07:25:21 | 0
User1 | 2020-05-25 08:06:30 | 1
User1 | 2020-05-25 08:06:42 | 0
User1 | 2020-05-25 18:32:07 | 1
User1 | 2020-05-25 18:32:27 | 0
User1 | 2020-05-25 18:32:44 | 0
User1 | 2020-05-25 18:33:00 | 0
User2 | 2020-05-25 08:01:55 | 1
User2 | 2020-05-25 08:05:07 | 1
User2 | 2020-05-25 08:05:30 | 0
User2 | 2020-05-25 08:05:53 | 0
User2 | 2020-05-25 08:06:19 | 0
User2 | 2020-05-25 22:20:06 | 1
User2 | 2020-05-25 22:20:16 | 0

3. Perform a running sum over the start of session tag to assign session IDs

SELECT t.*,
SUM (t.sos) OVER (PARTITION BY t.uid ORDER BY t.ts) AS session
FROM tagged AS t
ORDER BY t,uid, t.ts
uid   |         ts          | sos | session
--------+---------------------+-----+------------
User1 | 2020-05-25 07:25:12 | 1 | 1
User1 | 2020-05-25 07:25:18 | 0 | 1
User1 | 2020-05-25 07:25:21 | 0 | 1
User1 | 2020-05-25 08:06:30 | 1 | 2
User1 | 2020-05-25 08:06:42 | 0 | 2
User1 | 2020-05-25 18:32:07 | 1 | 3
User1 | 2020-05-25 18:32:27 | 0 | 3
User1 | 2020-05-25 18:32:44 | 0 | 3
User1 | 2020-05-25 18:33:00 | 0 | 3
User2 | 2020-05-25 08:01:55 | 1 | 1
User2 | 2020-05-25 08:05:07 | 1 | 2
User2 | 2020-05-25 08:05:30 | 0 | 2
User2 | 2020-05-25 08:05:53 | 0 | 2
User2 | 2020-05-25 08:06:19 | 0 | 2
User2 | 2020-05-25 22:20:06 | 1 | 3
User2 | 2020-05-25 22:20:16 | 0 | 3

4. With this information, we can calculate the analytical data such as time per session, total sessions etc.

SELECT s.uid,
s.session,
MAX(s.ts) - MIN(s.ts) AS duration
FROM sessionized AS s
GROUP BY s.uid, s.session
ORDER BY s.uid, s.session
uid   | session | duration
------+---------+----------
User1 | 1 | 00:00:09
User1 | 2 | 00:00:12
User1 | 3 | 00:00:53
User2 | 1 | 00:00:00
User2 | 2 | 00:01:12
User2 | 3 | 00:00:10

FINAL QUERY

\set inactivity '30 seconds'

WITH

tagged(uid, ts, sos) AS (
SELECT l.*,
CASE WHEN l.ts >
LAG (l.ts, 1, '-infinity') OVER (PARTITION BY l.uid ORDER BY l.ts) + :'inactivity'
THEN 1
ELSE 0
END AS sos
FROM log AS l
ORDER BY l.uid, l.ts
),

sessionized(uid, ts, sos, session) AS (
SELECT t.*,
SUM (t.sos) OVER (PARTITION BY t.uid ORDER BY t.ts) AS session

FROM tagged AS t
ORDER BY t,uid, t.ts
),

measured(uid, session, duration) AS (
SELECT s.uid,
s.session,
MAX(s.ts) - MIN(s.ts) AS duration
FROM sessionized AS s
GROUP BY s.uid, s.session
ORDER BY s.uid, s.session
)
TABLE measured;

USE CASE - Identify Consecutive Ranges

Input -
Table of random numbers - example - [5,2,14,3,1,42,6,10,7,13] shall be converted to [1-3,5-7,10,13&14,42].

Input - [5,2,14,3,1,42,6,10,7,13]
On order by sorting - [1,2,3,5,6,7,10,13,14,42 ]
Reducing -
[ 1,2,3 ] to 1-3 [5,6,7] to 5-7 [13,14] to 13&14 42 - 42

Approach -
1. Sort the numbers
2. Create row numbers to the table using ROW_NUMBER()
3. Subtract the number with ROW NUMBER.
4. Sort the numbers
5. Create row numbers to the table using ROW_NUMBER()
6. Subtract the number with ROW NUMBER.
7. Similar values are grouped together.

  ┌─────
│ ref │
├─────
│ 1 │ - 1 = 0 ⎫
│ 2 │ - 2 = 0 ⎬ range 0
│ 3 │ - 3 = 0 __ ⎭
│ 5 │ - 4 = 1 ⎫
│ 6 │ - 5 = 1 ⎬ range 1
│ 7 │ - 6 = 1 __ ⎭
│ 10 │ - 7 = 3 __ } range 3
│ 13 │ - 8 = 5 ⎱ range 5
│ 14 │ - 9 = 5 __ ⎰
│ 42 │ - 10 = 32 } range 32
└─────
DROP TABLE IF EXISTS citations;
CREATE TABLE citations(ref int PRIMARY KEY);

INSERT INTO citations VALUES
(5), (2), (14), (3), (1), (42), (6), (10), (7), (13);
ref - ROW_NUMBER() OVER (ORDER BY c.ref) AS range

SELECT c.ref,
c.ref - ROW_NUMBER() OVER (ORDER BY c.ref) AS range
FROM citations AS c
ref | range
----+-------
1 | 0
2 | 0
3 | 0
5 | 1
6 | 1
7 | 1
10 | 3
13 | 5
14 | 5
42 | 32

From a given range, find the min and max values. These are the first and last values of the range

SELECT r.range, MIN(r.ref) AS first, MAX(r.ref) AS last  
FROM ranges AS r
GROUP BY r.range
range  | first | last
-------+-------+------
3 | 10 | 10
5 | 13 | 14
0 | 1 | 3
32 | 42 | 42
1 | 5 | 7

Next step is to add ‘-’ or ‘&’ between first and last values based on their difference.

FINAL QUERY

WITH ranges(ref, range) AS (
SELECT c.ref,
c.ref - ROW_NUMBER() OVER (ORDER BY c.ref) AS range
FROM citations AS c
),
outputs(range, first, last) AS (
SELECT r.range, MIN(r.ref) AS first, MAX(r.ref) AS last
FROM ranges AS r
GROUP BY r.range
)
SELECT string_agg(CASE o.last - o.first
WHEN 0 THEN o.first :: text
WHEN 1 THEN o.first || '&' || o.last
ELSE o.first || '-' || o.last
END,
','
ORDER BY o.range) AS citations
FROM outputs AS o;

citations
-----------------------------
1-3,5-7,10,13&14,42

--

--

Bluetick Consultants Inc

Bluetick Consultants LLP: Driving Digital Transformation with Innovations like Generative AI, Cloud Migration, Talent Augmentation & More.