How to do a rolling count distinct in BigQuery? Analytic functions are not a good solution

BigQuery analytic functions compute values over a group of rows and return a single result for each row, making them a powerful tool for feature engineering, especially on time-series data. However, analytic functions are not always a good solution, and rolling count distinct is one example.

I recently got a question about calculating a rolling count of how many users logged in during the past 30 and 365 days. The first solution that came to my mind was an analytic function in Standard SQL, performing a COUNT(DISTINCT) over a window ordered by date. However, it proved not to be a good solution. This article explains what the problems are, and shares my final solution.

Data Preparation

Let’s start with some synthetic data for our experiments. We create a 2-year login history using the following query. Each row is a Login date / User ID pair, i.e. if a user logged in at any time during a day, one entry is added to the table.
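The original query is not shown; here is a minimal sketch of how such a history could be generated. The dataset/table name `mydataset.login_history`, the column names, the date range, and the 30% login probability are all assumptions for illustration:

```sql
-- Sketch: build a 2-year login history for 10,000 synthetic users,
-- where each user logs in on roughly 30% of days.
CREATE OR REPLACE TABLE mydataset.login_history AS
SELECT
  day AS Date,
  user_id
FROM
  UNNEST(GENERATE_DATE_ARRAY('2019-01-01', '2020-12-31')) AS day
CROSS JOIN
  UNNEST(GENERATE_ARRAY(1, 10000)) AS user_id
WHERE RAND() < 0.3;  -- keep a random subset of (day, user) pairs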

You can examine your data with:
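For example (table and column names as assumed above):

```sql
-- Peek at the first few rows, ordered by date.
SELECT Date, user_id
FROM mydataset.login_history
ORDER BY Date, user_id
LIMIT 10;
```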

The first rows are shown below:

Initial failure solutions with BigQuery analytic functions

The first solution that came to my mind was to use BigQuery analytic functions. As mentioned above, they compute values over a group of rows and return a single result for each row. You define the window with an OVER clause, indicating which PARTITION BY and ORDER BY to apply to it.

In this case, we want to generate a 30 (or 365)-day window, issue a COUNT(DISTINCT MemberID), and that’s it! To achieve this, we first aggregate the data by date, so that each date has a single row with an array of user IDs. Then we do the windowed aggregation: within each window we count the distinct users in the 30-day range. We don’t need a partition in this case. Sounds easy! Here is the first solution I wrote:
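The original query is not shown; a sketch of this kind of attempt looks like the following (table and column names are assumptions). As explained below, BigQuery rejects it:

```sql
-- First attempt: COUNT(DISTINCT ...) over a 30-day RANGE window.
-- BigQuery does not allow DISTINCT inside a windowed aggregation,
-- so this query fails with an error instead of running.
SELECT
  Date,
  COUNT(DISTINCT user_id) OVER (
    ORDER BY UNIX_DATE(Date)
    RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS distinct_users_30d
FROM mydataset.login_history;
```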

The problem became complicated when calculating the historical data: in BigQuery, count distinct is NOT supported in rolling aggregations, which means that if I want to calculate a 30-day rolling count of distinct logged-in users, the query above is NOT supported.

(Besides, I also made a mistake here: once the users column is unnested, the date column is flattened along with it, which leads to duplicated rows. The correct array aggregation also needs a workaround, which can be seen here: However, it doesn’t solve the rolling count distinct problem, so we will skip it.)

With some research, it turned out that rolling count distinct is not a new problem; here is one solution: A workaround is needed, so we update our query to:
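The updated query is not shown; a sketch of this style of workaround follows. Table and column names are assumptions, and this variant collects each day’s users with STRING_AGG, since string aggregation is one of the aggregate functions BigQuery allows in an OVER clause:

```sql
-- Workaround sketch:
-- 1) collapse each day into one row holding its distinct users,
-- 2) concatenate the per-day lists over a 30-day window,
-- 3) count the distinct users in each concatenated list.
SELECT
  Date,
  (SELECT COUNT(DISTINCT u)
   FROM UNNEST(SPLIT(users_30d)) AS u) AS distinct_users_30d
FROM (
  SELECT
    Date,
    STRING_AGG(users) OVER (
      ORDER BY UNIX_DATE(Date)
      RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS users_30d
  FROM (
    SELECT Date, STRING_AGG(DISTINCT CAST(user_id AS STRING)) AS users
    FROM mydataset.login_history
    GROUP BY Date
  )
);
```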

Since we aggregate the users into an array first, we no longer need to count distinct users within each window using an analytic function. Instead, in the next step, we unnest the array and count the distinct users from each window.

However, more problems popped up. I first tested it with a smaller dataset containing only 100 users, and it worked well. But on the dataset we created earlier, which contains 10,000 users, the query took a huge amount of time to process, and eventually we received an OOM error.

Are BigQuery analytic functions really needed?

Now it was time to take a step back. Are BigQuery analytic functions really needed? I had been relying too much on them because they are convenient and straightforward, even though their performance is often worse than that of plain aggregation queries.

The question is actually simple if you look at which data should belong to which window. In short, each date’s login user IDs should be counted in each of the 30 daily windows that follow, as shown here:

Keeping this in mind, we can start to disassemble the problem step by step.

  1. How do we aggregate on date? Each user ID should belong to 30 dates instead of only one. Instead of keeping the data as Login date / User ID pairs, we duplicate each row into 30 copies, one for each of the following 30 days. Thus, as the first step, we generate a new column from the Date column: an array of the 30 dates starting from it: GENERATE_DATE_ARRAY(Date, DATE_ADD(Date, INTERVAL 29 DAY), INTERVAL 1 DAY)
  2. Aggregation. Unnest (flatten) the generated date array, so we can perform COUNT(DISTINCT userid) grouped by the generated date.
  3. Correct the date range. The generated dates cover more than we want. The first 29 days do not have a full 30-day history, so their counts are incomplete and should be discarded; the last valid date also generates 29 dates ahead of it, which lie in the future. So we calculate the min/max dates from the original data and select only the correct date range.

The final solution looks like this, and it took only 1.3 seconds to execute:
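The query itself is not shown; the three steps above can be sketched as follows (table and column names are assumptions):

```sql
-- Final solution sketch: duplicate, aggregate, then trim the date range.
WITH expanded AS (
  -- Step 1: each login contributes to the 30 windows ending on
  -- Date .. Date + 29 days.
  SELECT window_end, user_id
  FROM mydataset.login_history,
       UNNEST(GENERATE_DATE_ARRAY(Date, DATE_ADD(Date, INTERVAL 29 DAY),
                                  INTERVAL 1 DAY)) AS window_end
),
bounds AS (
  SELECT MIN(Date) AS min_date, MAX(Date) AS max_date
  FROM mydataset.login_history
)
-- Steps 2 and 3: count distinct users per window end, keeping only
-- dates with a full 30-day history and none in the future.
SELECT
  window_end AS Date,
  COUNT(DISTINCT user_id) AS distinct_users_30d
FROM expanded, bounds
WHERE window_end BETWEEN DATE_ADD(min_date, INTERVAL 29 DAY) AND max_date
GROUP BY window_end
ORDER BY window_end;
```

This replaces the per-row window machinery with a plain GROUP BY, which BigQuery can parallelize far more cheaply than materializing a 30-day array per row.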