BigQuery: Querying nested arrays

Reading Time: 2 minutes

In a previous blog, we had seen BigQuery facilitate efficient data warehouse schema design. BigQuery supports the nested & repeated columns. We can use a combination of ARRAY and STRUCT data types to define our schema in BigQuery. It enables to denormalize data efficiently in single table. In this blog, for the same schema of sales data, we will execute a few DML operations on nested array fields.

Schema

In the schema, products is of type RECORD means nested or struct type. And mode is REPEATED means array. So, products is array of struct type.

Sample data

{"odr_id":"1","str_loc":"NewYork","amount":"1450","cust_id" : "C012","products":[{"prdct_id":"1234","prdct_nm" : "Shampoo","pdct_prc": "140"},{"prdct_id":"421", "prdct_nm" : "Grinder","pdct_prc": "240"}]}
{"odr_id":"2","str_loc":"NewYork","amount":"1250","cust_id" : "C034","products":[{"prdct_id":"34P","prdct_nm" : "Table","pdct_prc": "140"},{"prdct_id":"42A","prdct_nm" : "Chair","pdct_prc": "440"}]}

Select Queries

SELECT * FROM demo.test_nested.sales

Recommends to not execute Select * query. In BigQuery, you can use preview option which is free of cost.

Retrieve product id and product name for customer C012

Note: we can’t get product id and product name with below query.

SELECT prdct_id, prdct_nm FROM l-ulti-data-dwh-6816.test_nested.sales where cust_id = 'C012'

ERROR: Unrecognized name: prdct_id at [1:8]

To get from nested field need to cross join or use UNNEST.

SELECT prdct_id, prdct_nm FROM l-ulti-data-dwh-6816.test_nested.sales, Unnest(products) where cust_id = 'C012'

Number of products purchases by customer C012

This we can get by finding the length of the array of products.

SELECT ARRAY_LENGTH(products) FROM l-ulti-data-dwh-6816.test_nested.sales where cust_id = 'C012';

Update Queries

Update product id of Shampoo p-103 for customer C012

UPDATE l-ulti-data-dwh-6816.test_nested.sales SET prdct_id = 'p103' where prdct_nm = 'Shampoo' and cust_id = 'C012'

Again the above query will not work for the nested field.

Unrecognized name: prdct_nm at [1:75]

To update nested field the correct query would be:

UPDATE l-ulti-data-dwh-6816.test_nested.sales SET products = ARRAY(SELECT AS STRUCT * REPLACE('p-103' AS prdct_id) FROM UNNEST(products) where prdct_nm = 'Shampoo') where cust_id='C012'

There can be more complex use cases where more complex SQL will be needing.

Written by 

Exploring Big Data Technologies.