SQL Window Functions: The Complete Guide for Data Engineers

Window functions are the most powerful feature most data professionals learned too late. They look intimidating at first, mostly because of the OVER() syntax and the distinction between PARTITION BY and GROUP BY. Once that clicks, the functions themselves are straightforward and the patterns they unlock solve a huge category of analytical problems elegantly.

This guide covers the mechanics, the common patterns, and the real-world use cases that come up repeatedly in data engineering and analytics work.

How Window Functions Work

A window function operates on a set of rows related to the current row, without collapsing them into a single result. This is the key difference from GROUP BY aggregation. GROUP BY reduces rows. Window functions preserve them while adding computed values based on related rows.

-- GROUP BY collapses 5 rows to 1
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;

-- Window function keeps all rows and adds the total
SELECT
  order_id,
  customer_id,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;

The OVER() clause defines the window. PARTITION BY is the grouping equivalent (which rows are included in the calculation). ORDER BY inside the window specifies the sort order for ordered functions. ROWS or RANGE defines the frame for running calculations.

Ranking Functions

SELECT
  customer_id,
  order_date,
  amount,
  -- Unique rank, no ties, sequential
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS rn,
  -- Ties get same rank, gaps after ties
  RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk,
  -- Ties get same rank, no gaps after ties
  DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rnk
FROM orders;

ROW_NUMBER assigns a unique sequential integer starting at 1 within each partition. Use this when you need exactly one row per group, like the most recent order per customer.

RANK gives tied rows the same rank but skips subsequent ranks. DENSE_RANK gives tied rows the same rank without skipping.

-- Most recent order per customer
SELECT * FROM (
  SELECT
    order_id,
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY order_date DESC
    ) AS rn
  FROM orders
) ranked
WHERE rn = 1;

LAG and LEAD

LAG and LEAD access values from previous or following rows within a partition. They are essential for period-over-period comparisons, churn analysis, and identifying state transitions.

SELECT
  customer_id,
  order_date,
  amount,
  LAG(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS prev_order_amount,
  amount - LAG(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS amount_change,
  LEAD(order_date) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS next_order_date
FROM orders;

The optional second argument to LAG/LEAD specifies how many rows to look back or forward (default 1). The third argument provides a default value when the offset goes out of bounds.

Running Aggregates and Frames

Window aggregates with an ORDER BY clause compute running totals by default. The frame clause (ROWS BETWEEN or RANGE BETWEEN) controls exactly which rows are included.

SELECT
  order_date,
  daily_revenue,
  -- Running total (all rows from first to current)
  SUM(daily_revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total,
  -- 7-day rolling average
  AVG(daily_revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7d_avg,
  -- Trailing 30-day sum
  SUM(daily_revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS trailing_30d_revenue
FROM daily_summary
ORDER BY order_date;

ROWS BETWEEN counts physical rows. RANGE BETWEEN includes rows with equal values in the ORDER BY column. For time-series work, ROWS is usually more predictable.

NTILE and Percentiles

SELECT
  customer_id,
  lifetime_value,
  -- Divide into 4 equal buckets (quartiles)
  NTILE(4) OVER (ORDER BY lifetime_value) AS ltv_quartile,
  -- Percentile rank (0 to 1)
  PERCENT_RANK() OVER (ORDER BY lifetime_value) AS ltv_pct_rank,
  -- Cumulative distribution
  CUME_DIST() OVER (ORDER BY lifetime_value) AS ltv_cume_dist
FROM customer_ltv;

Sessionization

Sessionization groups events into sessions based on a time gap threshold. It is one of the most common real-world window function patterns for clickstream and product analytics.

WITH events_with_gaps AS (
  SELECT
    user_id,
    event_timestamp,
    LAG(event_timestamp) OVER (
      PARTITION BY user_id
      ORDER BY event_timestamp
    ) AS prev_event_timestamp
  FROM events
),
session_starts AS (
  SELECT
    user_id,
    event_timestamp,
    CASE
      WHEN prev_event_timestamp IS NULL
        OR DATEDIFF('minute', prev_event_timestamp, event_timestamp) > 30
      THEN 1
      ELSE 0
    END AS is_session_start
  FROM events_with_gaps
)
SELECT
  user_id,
  event_timestamp,
  SUM(is_session_start) OVER (
    PARTITION BY user_id
    ORDER BY event_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS session_id
FROM session_starts;

Gaps and Islands

Gaps and islands problems involve finding contiguous sequences. The classic approach uses a difference between row number and a ranked column to identify group boundaries.

-- Find consecutive streaks of daily activity
WITH activity AS (
  SELECT
    user_id,
    activity_date,
    -- Subtract sequential row number from date
    -- Dates in the same streak produce the same difference
    DATEADD(day,
      -ROW_NUMBER() OVER (
        PARTITION BY user_id
        ORDER BY activity_date
      ),
      activity_date
    ) AS streak_group
  FROM user_activity
)
SELECT
  user_id,
  MIN(activity_date) AS streak_start,
  MAX(activity_date) AS streak_end,
  COUNT(*) AS streak_days
FROM activity
GROUP BY user_id, streak_group
ORDER BY user_id, streak_start;

First and Last Values

SELECT
  customer_id,
  order_date,
  -- First order date for this customer
  FIRST_VALUE(order_date) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first_order_date,
  -- Most recent order amount for this customer
  LAST_VALUE(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS latest_amount
FROM orders;

Note the frame specification for LAST_VALUE: without ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, the default frame only goes up to the current row and LAST_VALUE will equal the current row value for most rows.

Performance Considerations

Window functions are often much faster than self-joins or correlated subqueries for the same logic. However, multiple window functions on the same partition can be expensive if the optimizer materializes the sort for each one. Most modern warehouses optimize multiple window functions over the same window into a single sort pass, but it is worth verifying with explain plans on critical queries.

Avoid applying window functions inside WHERE clauses directly. They are computed after filtering but before the outer select. If you need to filter on a window function result, wrap the query in a CTE or subquery first.