How to do a rolling count distinct in BigQuery? Analytic functions are not a good solution

Huijie Wang
5 min read · Apr 30, 2021

BigQuery analytic functions compute values over a group of rows and return a single result for each row, making them a powerful tool for feature engineering, especially on time series data. However, analytic functions are not always a good solution, and rolling count distinct is one example.

I recently got a question about calculating a rolling count of how many users logged in during the past 30 and 365 days. The first solution that came to my mind was an analytic function in Standard SQL performing a COUNT DISTINCT over a window ordered by date. However, it proved not to be a good solution. This article explains what the problems are, and I will share my final solution.

Data Preparation

Let’s start with some synthetic data for our experiments. We create a two-year history using the following query. Each row is a login date – user ID pair, i.e. if a user logged in at any time during a day, that day gets one entry in the table.

CREATE OR REPLACE TABLE `demos.login_data`
PARTITION BY date
AS
SELECT
  date, userid
FROM
  UNNEST(GENERATE_DATE_ARRAY(
    DATE_SUB(CURRENT_DATE(), INTERVAL 730 DAY),
    CURRENT_DATE(),
    INTERVAL 1 DAY
  )) AS date
CROSS JOIN
  UNNEST(GENERATE_ARRAY(1, 10000, 1)) AS userid -- 10000 users
WHERE
  RAND() < 0.7 -- keep each (date, user) pair with 70% probability

You can examine your data with:

SELECT
  date,
  COUNT(userid) AS count_daily
FROM
  `demos.login_data`
GROUP BY date
ORDER BY date

With 10,000 users and a 0.7 keep rate, you should see roughly 7,000 logins per date.

Initial failed solutions with BigQuery analytic functions

The first solution that came to my mind was BigQuery analytic functions. You define the window with an OVER clause, indicating how to partition and order the rows the window operates on.
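As a quick illustration of the syntax (a sketch on the table above, not part of the final solution; the rolling_7day_logins name is mine), a seven-day rolling sum of daily login counts can be written as:

SELECT
  date,
  -- SUM works in a date-range window; COUNT(DISTINCT ...) in the same spot does not
  SUM(count_daily) OVER (
    ORDER BY UNIX_DATE(date)
    RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7day_logins
FROM (
  SELECT date, COUNT(userid) AS count_daily
  FROM `demos.login_data`
  GROUP BY date
) AS daily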

In this case, we want a 30 (or 365)-day window, issue a COUNT(DISTINCT userid) over it, and that’s it! To achieve this, we first aggregate the data by date, so that each date has a single row containing an array of user IDs. Then, within a window, we count the distinct users over the 30-day range; we don’t need a partition in this case. Sounds easy! Here is the first solution I wrote:

WITH dl AS (
  SELECT
    date,
    UNIX_DATE(date) AS udate,
    ARRAY_AGG(userid) AS users
  FROM `demos.login_data`
  GROUP BY 1, 2
)
SELECT
  date,
  "rolling30day" AS Timeperiod,
  -- this line fails: DISTINCT is not allowed with a window ORDER BY
  COUNT(DISTINCT user) OVER (ORDER BY udate RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING) AS users
FROM dl, UNNEST(users) AS user

The problem is that, in BigQuery, COUNT(DISTINCT ...) is not supported in a rolling aggregation: a window function cannot combine DISTINCT with an ORDER BY clause, so a 30-day rolling count of distinct logged-in users written as above is rejected.

BigQuery rejects the query with the error: “Window ORDER BY is not allowed if DISTINCT is specified”.

(Besides, I actually also made a mistake here: once the users column is unnested, the date column is also flattened, which leads to duplicated rows. The correct array aggregation also needs a workaround, which can be seen here: https://stackoverflow.com/questions/52485871/distinct-count-across-bigquery-arrays. However, it doesn’t solve the rolling count distinct problem, so we will skip it.)
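To see the flattening concretely, here is a tiny self-contained sketch with made-up rows:

-- Hypothetical two-day example: after UNNEST, each date repeats once per user,
-- so any window over these rows sees duplicated dates instead of one row per date.
WITH dl AS (
  SELECT DATE '2021-01-01' AS date, [1, 2, 3] AS users
  UNION ALL
  SELECT DATE '2021-01-02', [2, 3]
)
SELECT date, user
FROM dl, UNNEST(users) AS user
-- returns 5 rows: 2021-01-01 three times, 2021-01-02 twice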

With some research it turned out that rolling count distinct is not a new problem; here is one solution: https://stackoverflow.com/questions/35174397/google-bigquery-rolling-count-distinct. A workaround is needed, so we update our query to:

WITH dl AS (
  SELECT
    date,
    UNIX_DATE(date) AS udate,
    userid
  FROM `demos.login_data`
),
last_days AS (
  SELECT
    date,
    -- collect the raw (possibly duplicated) user IDs of the trailing 30 days
    ARRAY_AGG(userid)
      OVER (ORDER BY udate RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING) AS active_users
  FROM dl
)
SELECT
  date,
  "rolling30day" AS Timeperiod,
  -- deduplicate inside each window's array with a correlated subquery
  (SELECT COUNT(DISTINCT userid) FROM UNNEST(active_users) userid) AS active_users
FROM last_days
GROUP BY 1, 2, 3
ORDER BY 1

As we aggregate the users into an array first, we no longer need COUNT(DISTINCT ...) inside the window itself. In the next step, we unnest each window’s array and count the distinct users it contains.

However, more problems popped up. I first tested it with a smaller dataset containing only 100 users, and it worked well. But on the dataset we created earlier, with 10,000 users, the query ran for a huge amount of time and eventually failed with an out-of-memory error. A rough estimate shows why: each day has about 7,000 logins (10,000 users × 0.7), so every window array holds roughly 30 × 7,000 = 210,000 user IDs, and last_days materializes one such array for each of the roughly 5 million login rows.

Are BigQuery analytic functions really needed?

Now it was time to take a step back. Are BigQuery analytic functions really needed? I had been relying on them too much because they are convenient and straightforward, even though their performance is not always as good as plain aggregation queries.

The question is actually simple once you look at which data belongs to which window. In short, each date’s login user IDs should be counted in each of the following 30 days’ windows; for example, a login on 2021-01-05 contributes to the rolling counts for 2021-01-05 through 2021-02-03.

Keeping this in mind, we can break the problem down step by step.

  1. How to aggregate on date? Each user ID should belong to 30 dates instead of one day only. Instead of keeping the data as login date – user ID pairs, we duplicate each row into 30 copies, one for each day of the following 30 days. Thus, as the first step, we generate a new column from the Date column that is an array of the 30 dates starting at the login date: GENERATE_DATE_ARRAY(Date, DATE_ADD(Date, INTERVAL 29 DAY), INTERVAL 1 DAY) (see the single-row sketch after the snippet below).
  2. Aggregation. Unnest (flatten) the generated dates array, so we can perform COUNT(DISTINCT userid) grouped by date.
  3. Correct the date range. The generated dates cover more than we want. The first 29 days do not have a full 30-day history, so their counts would be wrong and we discard them; the last valid date also generates dates up to 29 days ahead, which lie in the future. So we compute the min/max dates of the original data and only keep the correct range:
WITH min_max AS (
  SELECT
    MIN(Date) AS min_date,
    MAX(Date) AS max_date
  FROM
    `demos.login_data`
),
...
WHERE
  Date BETWEEN (SELECT DATE_ADD(min_date, INTERVAL 29 DAY) FROM min_max)
           AND (SELECT max_date FROM min_max)
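To make step 1 concrete, here is what the generated array looks like for a single login row (the date is illustrative):

-- A login on 2021-01-05 expands into the 30 output dates it should count toward:
SELECT GENERATE_DATE_ARRAY(
  DATE '2021-01-05',
  DATE_ADD(DATE '2021-01-05', INTERVAL 29 DAY),
  INTERVAL 1 DAY
) AS rollingwindow
-- → ['2021-01-05', '2021-01-06', ..., '2021-02-03']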

The final solution looks like this and took only 1.3 s to execute:

WITH min_max AS (
  SELECT
    MIN(Date) AS min_date,
    MAX(Date) AS max_date
  FROM
    `demos.login_data`
),
member_30_day_window AS (
  SELECT
    userid,
    Date AS base_date,
    -- step 1: each login counts toward its own date and the following 29 days
    GENERATE_DATE_ARRAY(Date, DATE_ADD(Date, INTERVAL 29 DAY), INTERVAL 1 DAY) AS rollingwindow
  FROM
    `demos.login_data`
)
SELECT
  Date,
  "rolling30day" AS Timeperiod,
  COUNT(DISTINCT userid) AS Members -- step 2: flatten and count distinct users per date
FROM
  member_30_day_window, UNNEST(rollingwindow) AS Date
WHERE
  -- step 3: keep only dates with a full 30-day history, and none in the future
  Date BETWEEN (SELECT DATE_ADD(min_date, INTERVAL 29 DAY) FROM min_max)
           AND (SELECT max_date FROM min_max)
GROUP BY Date
ORDER BY Date
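The same pattern covers the 365-day window from the original question; here is a sketch with only the interval lengths changed (a UNION ALL of the two queries would return both Timeperiods in one result):

WITH min_max AS (
  SELECT MIN(Date) AS min_date, MAX(Date) AS max_date
  FROM `demos.login_data`
),
member_365_day_window AS (
  SELECT
    userid,
    -- each login now counts toward its own date and the following 364 days
    GENERATE_DATE_ARRAY(Date, DATE_ADD(Date, INTERVAL 364 DAY), INTERVAL 1 DAY) AS rollingwindow
  FROM `demos.login_data`
)
SELECT
  Date,
  "rolling365day" AS Timeperiod,
  COUNT(DISTINCT userid) AS Members
FROM member_365_day_window, UNNEST(rollingwindow) AS Date
WHERE
  Date BETWEEN (SELECT DATE_ADD(min_date, INTERVAL 364 DAY) FROM min_max)
           AND (SELECT max_date FROM min_max)
GROUP BY Date
ORDER BY Date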
