Google BigQuery — how to unwrap an array-like field into a wide-format table with one column per array element

Paddy Alton
6 min readApr 16, 2022

--

Introduction

Over the Easter holiday I found myself well-and-truly nerd sniped by a question on Stack Overflow. It happens.

Anyway, I am reproducing my answer here (to justify the amount of time I spent on it…). Let me first describe the question at hand.

Imagine a case where someone has given you data with a field containing arrays of fixed length. Something like this:

name  |    a
----------------
A | [1,2,3]
B | [3,2,1]

You really wanted them to give it to you in a wide format, like this:

name  |   a_1  |    a_2  |    a_3
-----------------------------------
A | 1 | 2 | 3
B | 3 | 2 | 1

… but for some reason they refuse to change it and tell you it’s your problem. Seems like it should be a simple transformation, right?

Hahahaha.

Sadly, this is going to be much more complicated than most people expect.

It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to tend to be more performant. Here is an approach which achieves that.

Cross-joining to turn array fields into long-format tables

The first thing we want to do is get the values out of the arrays and into rows.

Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:

WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format

Here raw is a Common Table Expression (CTE) that builds a table similar to my example in the introduction.

UNNEST(raw.a) then takes those arrays of values in the field a and turns each array into a set of (five) rows (one per array element), every single one of which is then joined to the corresponding value of name. This is the definition of a CROSS JOIN. In this way we can 'unwrap' a table with an array field.

This will yields a result like:

 name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1

There is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:

WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals -- SEE HERE --
)
SELECT * FROM long_format

This is more compact but may be confusing if you haven’t seen it before.

For many use-cases this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. Pretty useful.

What the questioner was asking for is harder to produce — a wide-format table containing the same information (relying on the fact that each array was the same length).

Pivot tables in BigQuery

The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:

WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)

This makes use of WITH OFFSET to generate an extra offset column in long_format(so that we know which order the values in the array originally had).

Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE (which non-deterministically selects a value from the group you’re aggregating over). Since, in this case, each group has exactly one value, that’s the value retrieved.

The query yields results like:

name    vals_0  vals_1  vals_2  vals_3  vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1

This is starting to look pretty good, but we have a fundamental issue: the column names are hard-coded. The questioner wanted them generated dynamically, because this won’t scale readily to arrays with hundreds of elements.

Unfortunately expressions for the pivot column values aren’t something PIVOT can accept out-of-the-box.

Note that BigQuery has no way to know that the long-format table we built will resolve neatly to a fixed number of columns (that relies on offset having the values 0-4 for each and every set of records we extracted via our CROSS JOIN).

Dynamically building/executing the pivot

And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.

We must use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct a standard SQL query as a String and execute it!

(as an aside, I bet you weren’t expecting to end up this far down the rabbit hole…)

This solution is, of course, inelegant to say the least. If you find yourself doing anything like this you really ought to rethink whether you really need to do it. That said: below is an extension of the toy example above, implemented using EXECUTE IMMEDIATE.

The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string. Recall that || can be used as a string concatenation operator:

EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""

Ouch. I’ve tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffices (the pivoted values of offset, 0–4):

(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)

This generates the string "0,1,2,3,4" which is then concatenated into the query we are building to give us ...FOR offset IN (0,1,2,3,4)... inside the final query, which is then executed.

Note that this produces a query identical to the hard-coded one in the example before this one.

REALLY dynamically executing the pivot

That might have been a good place to stop, but I realised I was still technically insisting on knowing up-front how long those arrays are!

It’s a big improvement (in the narrow sense of avoiding painful repetitive code — as opposed to any more practical sense) to use GENERATE_ARRAY(0,4) instead of hard-coding (0,1,2,3,4) (like I did in the first PIVOT example), but it's not quite what was requested.

Unfortunately, I can’t provide a working toy example that

  • gets around this, and
  • can be simply pasted into the BigQuery console

… but I can tell you how to do it. You would simply replace the ‘pivot values’ expression with:

(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)

Once again this should resolve to "0,1,2,3,4".

But simply doing this in the example above won’t work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statements in that block won't be executed until after building the query … so at build-time long_format has yet to be defined!

With that caveat, this will work just fine:

SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)

… provided you first define a BigQuery VIEW (for example) called long_format in a dataset d (or, better, use some more expressive names). That way, both the job that builds the query and the job that runs it will have access to the values.

If successful, you should see both jobs execute and succeed. You should then click ‘VIEW RESULTS’ on the job that ran the query (the other one simply returns the text of the query you’ve built).

As a final aside, this assumes you are working from the BigQuery console. If you’re instead working from a scripting language, that gives you plenty of better options! You could

  • load and manipulate the data in your scripting language instead
  • build the query in your scripting language (rather than massaging BigQuery into doing it for you)

Other options might include loading the long-format table into a spreadsheet . Don’t turn up your nose like that! Pivoting tables is something spreadsheet software tends to be good at, so (if you can tolerate the extra step in your workflow and the number of rows is not colossal) you might like to export your long-format table from BigQuery into a Google Sheet and work on it there.

--

--

Paddy Alton

Expect articles on data science, engineering, and analysis.