How to use DBT with BigQuery to wrangle your Google Analytics data
The advent of Google Analytics 4 (GA4) brought with it a killer feature: the ability to directly export your data to BigQuery.
Previously, this kind of functionality had only been available with the (expensive!) Analytics 360 plan. Those of us without deep pockets would use the Google Analytics API to export batches of data, or rig up a custom solution to stream our analytics data directly into the warehouse. A direct export simplifies matters.
However, the exported data may seem a trifle confusing at first:
- data arrive in a large number of different tables
- these tables have a heavily nested, denormalised structure with arrays of key: value pairs and structured fields
The reason for this is understandable: GA4 is very flexible, with the ability to add custom fields and change your schema as you go along. Any export format that has to meet everyone’s needs is therefore going to rely on a few tricks.
The downsides are clear: the incoming data are hard to understand and work with; they are not readily digestible by business intelligence tools.
But there is a way out!
Crucially, you don’t have to support everyone’s use cases. Therefore you can transform your incoming data into a more normalised form that can be consumed by your business intelligence and analysis tools of choice.
Fortunately we have DBT (data build tool) to manage this sort of transformation for us. I have created a template project that you can use (under an MIT license) to do just that.
(N.B. this template is based on work I did for my previous employer, Apolitical, who generously agreed that I could release an open-source version)
Wait, what are you talking about?
If your response to the above wasn’t ‘great, thanks’, followed by a swift fork of the repository, then we should take a step back.
Feel free to read any of the following subsections to find out more about the tools I have mentioned … or skip to ‘Bringing it together’.
Google Analytics
GA4 is a client-side analytics tool. When people visit your site, a server you control (the ‘server side’) sends a web page to the visitor, which is interpreted by their browser (the ‘client side’). To make the web interactive, the web page you send will include some code that gets executed by the browser (i.e. on the client side).
This code can do all sorts of things. One thing it can do is send signals (‘telemetry’) back to you — we call this client-side analytics. This is important because most of the interesting, interactive things visitors do on your website actually happen on their computer; client-side analytics is essential for understanding what they are doing.
GA4 works with a tool called Google Tag Manager (GTM), which configures the analytics code included in your web pages. This determines what signals — we call these ‘analytics events’ — are sent to your GA4 property (project). These events package up a lot of information about what action a visitor took, on what page of your site, and when — along with a bunch of information about the visitor themself (type of device and browser they are using, what country they are in, etc).
The GA4 user interface is a great way to explore these events, but the problems start when you want to combine your client-side analytics with other data.
For example, if you are running marketing campaigns via email, you may want to join the data from your mass-mailing system to discover whether the people you successfully bring to your website actually behave in the expected way (whether that’s creating an account, signing up for a special promotion, or making a purchase, etc.). Without this, you don’t really know whether your campaign was effective.
BigQuery
Data warehouses were designed to solve the problem of combining data from different sources. The idea is that you regularly/continuously load your data from different sources into a single, central location that is well-optimised for responding to analytical queries.
(I wrote more about this in a previous article)
From this location, analysts and business intelligence tools can efficiently and easily retrieve rich, combined data for their needs.
Assuming you have GA4 and BigQuery already set up, you can stream data from GA4 directly. Instructions for setting up such a connection may be found here.
DBT
The rise of the data warehouse has made the ‘ELT’ (Extract, Load, Transform) approach to data pipelines very tractable. One can land data from various sources in warehouse tables (with a schema with a close-to-raw form) then use the processing power of the warehouse to transform it (e.g. using SQL Views and scheduled queries) into a new set of processed data assets.
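At its simplest, such a transformation is just a SQL view defined over a raw table. Here is a minimal sketch in BigQuery SQL (the project, dataset, and column names are all hypothetical):

```sql
-- A warehouse-side transformation at its simplest: a view that tidies up a
-- hypothetical raw_orders table for analysts.
CREATE OR REPLACE VIEW `my-project.analytics.orders_cleaned` AS
SELECT
  order_id,
  CAST(order_timestamp AS DATE) AS order_date,
  LOWER(TRIM(customer_email)) AS customer_email,
  total_amount
FROM `my-project.raw.raw_orders`
WHERE total_amount IS NOT NULL;
```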
These transformations can become unwieldy if you have a lot of them. This led to the emergence of data transformation tools, among which DBT is especially well known. DBT captures the transformations in version control as templated, modular SQL and can run these transformations against the warehouse (e.g. on a schedule).
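To give a flavour, the view above might instead live in version control as a DBT model: a templated SELECT statement that DBT materialises for you. A sketch (again with hypothetical names, and assuming a matching source definition exists):

```sql
-- models/orders_cleaned.sql: a dbt model is just a templated SELECT statement.
-- {{ source(...) }} (and {{ ref(...) }} for other models) resolve to real table
-- names at run time, so transformations can be chained into a dependency graph.
SELECT
  order_id,
  CAST(order_timestamp AS DATE) AS order_date,
  LOWER(TRIM(customer_email)) AS customer_email,
  total_amount
FROM {{ source('raw', 'raw_orders') }}
WHERE total_amount IS NOT NULL
```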
You can read a lot more about why this is a good thing in my previous articles, starting here.
Bringing it together
When you set up your GA4 to BigQuery connection, you’ll end up with a workflow like this:
The exported data is in a slightly curious format. You can read about that in detail here, but in short:
- a BigQuery dataset called `analytics_<your ID>` is created (where the ID comes from your GA4 property)
- this dataset contains ‘date sharded’ tables: tables with a common schema, created once per day, with a name like `events_YYYYMMDD`, where the suffix denotes the date of the export
- there are also special tables, `events_intraday_YYYYMMDD`, which receive a continuous stream of data (albeit with some processing not yet complete)
- the data schema for these tables is also highly nested, with key/value repeated fields in many cases (see the sketch just below)
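Here is what that nesting means in practice: a sketch of pulling a single parameter out of the repeated `event_params` field, assuming the standard GA4 export schema (the project and dataset IDs are placeholders):

```sql
-- Pull the page_location parameter out of the repeated event_params field
-- for a single daily shard (project and dataset IDs are placeholders).
SELECT
  event_date,
  event_name,
  (
    SELECT ep.value.string_value
    FROM UNNEST(event_params) AS ep
    WHERE ep.key = 'page_location'
  ) AS page_location
FROM `my-project.analytics_123456789.events_20240101`;
```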
This is somewhat different from the usual way of handling large amounts of data with a date[time] field, namely a date-partitioned table (a single table with data for different dates collocated in the same storage partition).
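For comparison, a date-partitioned table is a single table declared along these lines (hypothetical names again):

```sql
-- A single date-partitioned table: rows are stored together by event_date,
-- so queries that filter on that column scan only the relevant partitions.
CREATE TABLE `my-project.analytics.events_partitioned` (
  event_date DATE,
  event_name STRING,
  page_location STRING
)
PARTITION BY event_date;
```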
However, your Google Tag Manager/GA4 setup is highly customisable and might evolve over time; the flexibility required to handle this can be provided by date sharding.
BigQuery provides a handy way of querying a collection of date-sharded tables: you can use the wildcard character `*` to refer to multiple tables (i.e. `SELECT * FROM events_*` will treat all the exported tables like a single table called `events_*`). Because of this, it is straightforward to use the collection of tables as a DBT source, which means we can define transformations to wrangle the data into normalised, partitioned tables suitable for analysing your unique data.
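For example, the following query (with a placeholder dataset ID) reads across every daily shard, using the `_TABLE_SUFFIX` pseudo-column to limit which shards are actually scanned:

```sql
-- Query every daily export shard at once, restricting the scan to January 2024.
SELECT
  event_date,
  event_name,
  COUNT(*) AS event_count
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
GROUP BY event_date, event_name;
```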
Using incremental data loads, we can create or refresh only those storage partitions that need to be updated, processing a limited amount of data every day. Only if we make a big change to the table schema do we need to rebuild from scratch.
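In DBT terms this looks roughly like the sketch below: an incremental model on BigQuery using the `insert_overwrite` strategy, so that each run replaces only the recent date partitions. The source name, lookback window, and columns are illustrative rather than the template’s exact code:

```sql
-- models/marts/events.sql (simplified sketch, not the template's exact code)
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'}
  )
}}

SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  event_name,
  user_pseudo_id
FROM {{ source('ga4', 'events') }}  -- assumed to point at the events_* shards

{% if is_incremental() %}
-- On scheduled runs, reprocess only the last few days' worth of shards;
-- the date partitions they map to are overwritten in place.
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
{% endif %}
```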
The template project
This project defines some boilerplate DBT infrastructure plus DBT incremental models that
- load your GA4 export data into a series of incrementally-materialised tables
- execute transformations to normalise the schema, eliminating nesting
- yield two ‘data mart’ models (one for event-scoped data and the other for session-scoped data; the session-scoped idea is sketched below)
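To illustrate the session-scoped idea, such a mart can be built by grouping the flattened events on visitor and session identifiers, roughly like this (a simplified sketch; `stg_ga4__events` is a hypothetical upstream model that has already extracted `ga_session_id` from the event parameters):

```sql
-- One row per session: group the flattened events by visitor and session IDs.
SELECT
  user_pseudo_id,
  ga_session_id,
  MIN(event_timestamp) AS session_start,
  MAX(event_timestamp) AS session_end,
  COUNT(*) AS event_count,
  COUNTIF(event_name = 'page_view') AS page_views
FROM {{ ref('stg_ga4__events') }}
GROUP BY user_pseudo_id, ga_session_id
```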
The tables have been set up to minimise the amount of data that will be processed every time DBT runs. This requires multiple discrete transformations and also that the outputs of these transformations be stored persistently. Thus there is a storage-space vs processing efficiency tradeoff. However, BigQuery pricing tends to be weighted more heavily towards processing than storage.
This setup is overkill if your website doesn’t have much traffic! If that is your case, it would be better to remove the incremental logic and fall back on the default `base view -> ephemeral intermediate -> materialised mart` setting defined in the project’s `dbt_project.yml` file (i.e. materialising only the final ‘data mart’ tables). However, beyond a certain scale of daily traffic the incremental model will likely become faster and cheaper.
It is expected that you will have your own custom parameters defined, and will therefore extend these models. If your transformation logic changes, you should simply run the models in ‘full refresh’ mode, which will rebuild all of them from scratch. You could consider doing this as a matter of course at the point of deployment, but that is beyond the scope of the template project’s infrastructure.
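With the dbt CLI, a full refresh is just a flag on the usual run command:

```
dbt run --full-refresh
```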
Adding this DBT setup to the direct pipeline from GA4 to BigQuery yields the following workflow:
The ‘marts’ tables are suitable for use by analysts and business intelligence tools alike. The schemas are documented in the template project, but it is expected that you’ll modify these to suit your needs (e.g. adding custom fields).
Wrapping up
I hope that this template project will prove to be a useful resource! I have skipped over discussion of how to get up-and-running with the template, but you can find more details in the readme.
Let me know whether you found this article and the template project helpful — your feedback is very welcome.