Querying Google Analytics data as flat tables from BigQuery: The Definitive Guide and Recipe
If you're using Google Analytics 360, the premium version of Google Analytics, then chances are you have setup an automatic export of your Google Analytics data into BigQuery. This allows your hit-level data in Google Analytics to be exported to BigQuery datasets every day.
But Google does not provide much help with figuring out how to query your data!
Google Analytics provides a cookbook of sample BigQuery queries. But these are very limited in scope. And they omit the most trivial -- yet important -- query that you will ever run:
Flat tables are essential to perform further work on the results with Python, R and other data science languages.
Another flaw in the cookbook is that it uses BigQuery's older Legacy SQL. BigQuery is already moving to its Standard SQL. But examples based on Google Analytics data were either difficult to find or based on guesswork that had not been tested.
To query a flat table of your Google Analytics data using BigQuery's Standard SQL, follow these rules:
Instead, use this with Standard SQL:
FROM
`[dataset ID].ga_sessions_*`
WHERE
_table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE('YYYY-MM-DD')) AND FORMAT_DATE('%Y%m%d', DATE('YYYY-MM-DD'))
All Google Analytics table names end with a date. So this statement tells BigQuery to look across all your tables, but then limit the data to only those with name suffixes that match your desired dates.
IMPORTANT! Take note of the backticks used with the `[dataset ID].ga_sessions_*`. On a standard computer QWERTY keyboard, this is the button to the left of the "1" at the top of your keyboard.
For example, if you want to query data for 1 December 2016 to 28 February 2017 you will use this:
FROM
`12345678.ga_sessions_*`
WHERE
_table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE('2016-12-01')) AND FORMAT_DATE('%Y%m%d', DATE('2017-02-28'))
You might be aware that Google Analytics data appears as nested data in BigQuery. If you're coming from a regular SQL database world, this will blow your mind. But nested data is bad for getting a flat table. You really should unnest all of your table's records.
So when you are setting where to get the data from, your FROM statement should look like this:
FROM
`[dataset ID].ga_sessions_*`
LEFT JOIN UNNEST(customDimensions) AS cd
LEFT JOIN UNNEST(customMetrics) AS cm
LEFT JOIN UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customDimensions) AS hcd
LEFT JOIN UNNEST(h.customMetrics) AS hcm
LEFT JOIN UNNEST(h.product) AS p
LEFT JOIN UNNEST(p.customDimensions) AS pcd
LEFT JOIN UNNEST(p.customMetrics) AS pcm
This is more-or-less similar to BigQuery's Legacy SQL's FLATTEN function, but with less headache.
Then, when you want to query fields from these unnested records, you can use something like this:
SELECT
h.transaction.currencyCode,
p.productSKU
...
FROM
`12345678.ga_sessions_*`
LEFT JOIN UNNEST(customDimensions) AS cd
LEFT JOIN UNNEST(customMetrics) AS cm
LEFT JOIN UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customDimensions) AS hcd
LEFT JOIN UNNEST(h.customMetrics) AS hcm
LEFT JOIN UNNEST(h.product) AS p
LEFT JOIN UNNEST(p.customDimensions) AS pcd
LEFT JOIN UNNEST(p.customMetrics) AS pcm
WHERE
h.eCommerceAction.action_type = "6"
...
Because I have aliased each unnested record with an abbreviation, I have to use that abbreviation when querying, i.e. "h" instead of "hits", "p" instead of "hits.product".
I discovered this method from a post in the Firebase blog.
As a result, you need to be familiar with the MAX function. This is provided in the Google Analytics BigQuery cookbook for querying Custom Dimensions.
What is not immediately obvious from that cookbook is that MAX is an aggregate function. So if you're querying other non-aggregate fields, you need to GROUP your results BY those non-aggregated fields.
In this example, I'm querying Custom Dimension 13 (hit-scoped) and Custom Dimension 26 (product-scoped):
SELECT
MAX(IF(hcd.index = 13,
hcd.value,
NULL)) WITHIN h AS customDimension13,
h.transaction.currencyCode,
p.productSKU,
MAX(IF(pcd.index = 26,
pcd.value,
NULL)) WITHIN p AS productCustomDimension26
...
GROUP BY
h.transaction.currencyCode,
p.productSKU
For both of these Custom Dimensions, I need to use MAX to get their values correctly from the row-wise data. Then I need to group all of the results by the non-aggregated fields, h.transaction.currencyCode and p.productSKU.
SELECT
MAX(IF(hcd.index = 13,
hcd.value,
NULL)) WITHIN h AS customDimension13,
h.transaction.currencyCode,
p.productSKU,
MAX(IF(pcd.index = 26,
pcd.value,
NULL)) WITHIN p AS productCustomDimension26
FROM
`12345678.ga_sessions_*`
LEFT JOIN UNNEST(customDimensions) AS cd
LEFT JOIN UNNEST(customMetrics) AS cm
LEFT JOIN UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customDimensions) AS hcd
LEFT JOIN UNNEST(h.customMetrics) AS hcm
LEFT JOIN UNNEST(h.product) AS p
LEFT JOIN UNNEST(p.customDimensions) AS pcd
LEFT JOIN UNNEST(p.customMetrics) AS pcm
WHERE
_table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE('2016-12-01')) AND FORMAT_DATE('%Y%m%d', DATE('2017-02-28'))
AND h.eCommerceAction.action_type = "6"
GROUP BY
h.transaction.currencyCode,
p.productSKU;
More information:
But Google does not provide much help with figuring out how to query your data!
Google Analytics provides a cookbook of sample BigQuery queries. But these are very limited in scope. And they omit the most trivial -- yet important -- query that you will ever run:
Get a flat table of results that you can export into a CSV file or a SQL database
Flat tables are essential to perform further work on the results with Python, R and other data science languages.
Another flaw in the cookbook is that it uses BigQuery's older Legacy SQL. BigQuery is already moving to its Standard SQL. But examples based on Google Analytics data were either difficult to find or based on guesswork that had not been tested.
Rules for Querying a Flat Table with BigQuery Standard SQL
To query a flat table of your Google Analytics data using BigQuery's Standard SQL, follow these rules:
- Use _table_suffix with FORMAT_DATE for querying across multiple date-based datasets.
- Unnest records – customDimensions, hits, hits.product, hits.customDimensions, hits.product.customDimensions, etc.
- Use MAX and GROUP BY to query Custom Dimension and Custom Metric data.
Query across multiple date-based datasets
If you're familiar with BigQuery's Legacy SQL, then you might be familiar with the TABLE_DATE_RANGE function to query across multiple date-based datasets. If you don't know this, never mind. BigQuery's Standard SQL doesn't use this any more.Instead, use this with Standard SQL:
FROM
`[dataset ID].ga_sessions_*`
WHERE
_table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE('YYYY-MM-DD')) AND FORMAT_DATE('%Y%m%d', DATE('YYYY-MM-DD'))
All Google Analytics table names end with a date. So this statement tells BigQuery to look across all your tables, but then limit the data to only those with name suffixes that match your desired dates.
IMPORTANT! Take note of the backticks used with the `[dataset ID].ga_sessions_*`. On a standard computer QWERTY keyboard, this is the button to the left of the "1" at the top of your keyboard.
For example, if you want to query data for 1 December 2016 to 28 February 2017 you will use this:
FROM
`12345678.ga_sessions_*`
WHERE
_table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE('2016-12-01')) AND FORMAT_DATE('%Y%m%d', DATE('2017-02-28'))
Unnest all records
You might be aware that Google Analytics data appears as nested data in BigQuery. If you're coming from a regular SQL database world, this will blow your mind. But nested data is bad for getting a flat table. You really should unnest all of your table's records.
So when you are setting where to get the data from, your FROM statement should look like this:
FROM
`[dataset ID].ga_sessions_*`
LEFT JOIN UNNEST(customDimensions) AS cd
LEFT JOIN UNNEST(customMetrics) AS cm
LEFT JOIN UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customDimensions) AS hcd
LEFT JOIN UNNEST(h.customMetrics) AS hcm
LEFT JOIN UNNEST(h.product) AS p
LEFT JOIN UNNEST(p.customDimensions) AS pcd
LEFT JOIN UNNEST(p.customMetrics) AS pcm
This is more-or-less similar to BigQuery's Legacy SQL's FLATTEN function, but with less headache.
Then, when you want to query fields from these unnested records, you can use something like this:
SELECT
h.transaction.currencyCode,
p.productSKU
...
FROM
`12345678.ga_sessions_*`
LEFT JOIN UNNEST(customDimensions) AS cd
LEFT JOIN UNNEST(customMetrics) AS cm
LEFT JOIN UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customDimensions) AS hcd
LEFT JOIN UNNEST(h.customMetrics) AS hcm
LEFT JOIN UNNEST(h.product) AS p
LEFT JOIN UNNEST(p.customDimensions) AS pcd
LEFT JOIN UNNEST(p.customMetrics) AS pcm
WHERE
h.eCommerceAction.action_type = "6"
...
Because I have aliased each unnested record with an abbreviation, I have to use that abbreviation when querying, i.e. "h" instead of "hits", "p" instead of "hits.product".
I discovered this method from a post in the Firebase blog.
Getting Custom Dimension and Custom Metric data
One of the most difficult-to-understand parts of the Google Analytics export to BigQuery is that all of the Custom Dimension and Metric data get saved as individual rows. Normally, we think of these custom data as being in the same "data row" as the hit or session, because that's how the custom data was set in the Google Analytics tracking code. So it's quite mind-blowing to find out that each Custom Dimension / Metric takes up a row in the table data.As a result, you need to be familiar with the MAX function. This is provided in the Google Analytics BigQuery cookbook for querying Custom Dimensions.
What is not immediately obvious from that cookbook is that MAX is an aggregate function. So if you're querying other non-aggregate fields, you need to GROUP your results BY those non-aggregated fields.
In this example, I'm querying Custom Dimension 13 (hit-scoped) and Custom Dimension 26 (product-scoped):
SELECT
MAX(IF(hcd.index = 13,
hcd.value,
NULL)) WITHIN h AS customDimension13,
h.transaction.currencyCode,
p.productSKU,
MAX(IF(pcd.index = 26,
pcd.value,
NULL)) WITHIN p AS productCustomDimension26
...
GROUP BY
h.transaction.currencyCode,
p.productSKU
For both of these Custom Dimensions, I need to use MAX to get their values correctly from the row-wise data. Then I need to group all of the results by the non-aggregated fields, h.transaction.currencyCode and p.productSKU.
BigQuery Standard SQL Recipe for Querying a Flat Table
SELECT
MAX(IF(hcd.index = 13,
hcd.value,
NULL)) WITHIN h AS customDimension13,
h.transaction.currencyCode,
p.productSKU,
MAX(IF(pcd.index = 26,
pcd.value,
NULL)) WITHIN p AS productCustomDimension26
FROM
`12345678.ga_sessions_*`
LEFT JOIN UNNEST(customDimensions) AS cd
LEFT JOIN UNNEST(customMetrics) AS cm
LEFT JOIN UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customDimensions) AS hcd
LEFT JOIN UNNEST(h.customMetrics) AS hcm
LEFT JOIN UNNEST(h.product) AS p
LEFT JOIN UNNEST(p.customDimensions) AS pcd
LEFT JOIN UNNEST(p.customMetrics) AS pcm
WHERE
_table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE('2016-12-01')) AND FORMAT_DATE('%Y%m%d', DATE('2017-02-28'))
AND h.eCommerceAction.action_type = "6"
GROUP BY
h.transaction.currencyCode,
p.productSKU;
More information:
Hi there,
ReplyDeleteI learned a lot from you article, but I can't wrap my head around one thing. MAX and MIN gives only one value of course per sku. How would one achieve to repeat the number of rows of skus, productname per value of a custom dimension? Example my customDimension13 can contain multiple values hence a sku, productname but the value stored in customDimension13 may vary. In the example I'm querying on particular product "Mountain Jacket Norway"
SELECT
p.productSKU,
p.v2ProductName,
MAX(IF(pcd.index = 13,
pcd.value,
NULL)) AS productCustomDimension13
FROM
`XXXXXXX.ga_sessions_*`
LEFT JOIN UNNEST(customDimensions) AS cd
LEFT JOIN UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customDimensions) AS hcd
LEFT JOIN UNNEST(h.product) AS p
LEFT JOIN UNNEST(p.customDimensions) AS pcd
WHERE
_table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE('2020-01-01')) AND FORMAT_DATE('%Y%m%d', DATE('2020-05-11'))
AND h.eCommerceAction.action_type = "6"
AND p.v2ProductName = "Mountain Jacket Norway"
GROUP BY
p.productSKU,
p.v2ProductName
ORDER BY
p.v2ProductName
You need to remember that in BigQuery, for every hit, every Custom Dimension (and Custom Metric) is listed in *its own row*. So you have one big RECORD for the hit, and in that record, there is one row for Custom Dimension index = 1, another row for Custom Dimension index = 2, and so on. And for each hit, a Custom Dimension at a specific index can contain one-and-only-one value.
DeleteSo MAX still works in your case.
In your query, can you try the following to see if it works?
MAX(IF(pcd.index = 13,
pcd.value,
NULL)) WITHIN p AS productCustomDimension13
Hi Yuhui,
DeleteThanks for your swift reply and explaination. I get a syntax error: Syntax error: Unexpected keyword WITHIN at [6:14]
Any idea why that would be?
Try using the "Custom SQL UDF" solution at https://robertsahlin.com/flatten-google-analytics-custom-dimensions-with-a-bigquery-udf/. Does that work?
Delete