If you are new to Google’s BigQuery, it can be a steep (and expensive) learning curve. Here are some helpful tips and facts that will get you started a whole lot more quickly than search through and for docs or learning as you go.
Most of what I’ll be speaking to regards BigQuery when used alongside Google Analytics, which seems like the most prevalent (but not necessarily the majority) of BQ uses. Even if you aren’t dumping your GA data into BQ, these items should still be helpful.
- Fundamentally, BigQuery is column-based. This means it’s very good at looking up of one type of data at a time, but rather bad (and very expensive) to look up all the columns associated with one row. This is because you have to query the whole data set to do so.
- A good example of how this works is that a COUNT * query will be free, because no attributes are used.
- Charges are done by the amount of data processed, not the complexity of your query. This means you should limit data ranges and/or attributes in order to limit charges when fine-tuning your queries.
- There is no SLA on BQ export time! So if you want to rely on BigQuery for your Google Analytics data, keep this in mind. Normally exports will happen at about 7am local time, but they can go much later sometimes.
- Can’t figure out the difference between a view and a table in your dataset? It’s simpler than you think: a view is basically a pre-written query.
- You’ll want to keep in mind that a view is not cached
- How do you choose between a view and a table?
- If working with static data, do a table (ie: data for January 2019)
- If working dynamic data, do a view (ie: previous week)
- BigQuery exports — how you get data out of BQ — can happen in three ways:
- Full daily export – This is the default.
- Intraday – Similar to full day but sends data every 2-3 hours
- There can be some small differences when comparing intraday vs full export, including delayed hits, and slowed auto-tagging from Adwords for traffic source.
- Within a single site you should be fine but some traffic source etc might be off (external information).
- This is free
- Streaming – Every hit is streamed directly into BQ
- Data arrives in BQ within 15 minutes
- There is a charge of 5 cents per gigabyte streamed in
- Unfortunately there is lots of duplicated data. So you must build a de-dupe view! Which costs more money.
- Given the intraday frequency there are not that as many good uses for this as it might seem.
- Most BQ questions, as least as they pertain to Google Analytics, can be answered by reviewing the BigQuery Export Schema.
- If you need to do something slightly out-of-the-box BQ supports user-defined functions (UDFs). These will let you use a non-SQL expression inside your SQL. This would let you use, say, Javascript in your query. For example:
CREATE TEMP FUNCTION getMinutes(x FLOAT64)
RETURNS STRING
LANGUAGE js AS """
var min = x/60;
return min;""";
SELECT [myStuff],
FROM `myProject.myDataset`
- Sharding and Partitioning are different acts that often achieve the same goal, but in terms of BigQuery, they have specific differences. The best practice way in BQ is generally partitioning.
- Daily tables can be made with sharding – Using time-based naming, like
[name]_YYYYMMDD
you can create date-sharded tables. You can then UNION
the tables to get the data you want
- Partitioning will split a table by day – that is, split data horizontally. As per Google: “Partitioned tables allow you to bind the partitioning scheme to a specific TIMESTAMP or DATE column”
- Clustering will split data horizontally on one attribute (like have all the columns on a individual user)
Related
Sal January 21st, 2019
Posted In: Google Cloud Platform
Tags: BigQuery, Databases, Google Analytics