BigQuery: Efficient Data Warehouse Schema Design

Table of contents

Reading Time: 3 minutes

Conventional data warehouses support data models based on star schema and snowflake schema. In these models, there are a number of fact tables and dimension tables. In order to minimize redundancy it is recommends to split data into multiple tables in . This is a normalization process. Normalization is the technique of eliminating the redundant data. It minimize the insertion, deletion, and update anomalies. It saves the disk space and maintains data integrity.

However, the downside of normalization is that it causes the queries to slow down. Because there are more tables to join. For that need to execute complex quires. This is where denormalized databases come in. Denormalizing is the strategy of allowing duplicate field values for a column in a table. Everything is on the same table. The query doesn’t have to jump from table to table. It results in making the queries more efficient. It is easy to process denormalized data in parallel. However, there are cases where denormalizing data is bad for performance. Like, when huge shuffling is involved in distributed systems. because of denormalized data. So, in some cases, we want conventional data models and sometimes we want to denormalize our data in a data warehouse.

However, BigQuery supports conventional data models based on star schema and snowflake schema. When the data is denormalized, BigQuery performs best. It facilitate efficient data warehouse schema design which is using the nested & repeated columns. To do this, we can use a combination of ARRAY and STRUCT data types to define your schema.

BigQuery Nested and Repeated column

Suppose there are transaction, order and customer details data. They sits on different tables. Means data is normalized into three different tables.

Corresponding denormalized data into single tables can be –

Besides, denormalizing data in this way is bad for performance when order by or group by. BigQuery enables, denormalizing the data more efficiently over a distribute system. BigQuery supports columns with nested and repeated data. The same table in BigQuery can be stored like this:

In this table, only the nested columns are repeated. In the above table, there were many products within a single order_id. That’s why it created an array of those products. Since the product details table had multiple columns, it would be an array of struct type. For the above table, data in JSON format will look like this:

{"odr_id":"1","str_loc":"NewYork","amount":"1450","cust_id" : "C012","products":[{"prdct_id":"1234","prdct_nm" : "Shampoo","pdct_prc": "140"},{"prdct_id":"421", "prdct_nm" : "Grinder","pdct_prc": "240"}]}
{"odr_id":"2","str_loc":"NewYork","amount":"1250","cust_id" : "C034","products":[{"prdct_id":"34P","prdct_nm" : "Table","pdct_prc": "140"},{"prdct_id":"42A","prdct_nm" : "Chair","pdct_prc": "440"}]}

Above the products field is repeated and nested. Repeated because it is an array and nested because the array elements have a struct type.

Equivalent BigQuery Schema

Corresponding JSON BigQuery Schema –

{
"fields": [
 	{
   	  "name": "odr_id",
   	  "type": "STRING",
   	  "mode": "REQUIRED"
	 },
 	{
   	  "name": "str_loc",
   	  "type": "STRING",
   	  "mode": "NULLABLE"
 	},
 	{
   	  "name": "amount",
   	  "type": "STRING",
   	  "mode": "NULLABLE"
 	},
 	{
   	  "name": "cust_id",
   	  "type": "STRING",
   	  "mode": "NULLABLE"
 	},
 	{
      	  "name": "products",
      	  "type": "RECORD",
      	  "mode": "REPEATED",
      	  "fields": [
        	{
          	  "name": "prdct_id",
          	  "type": "STRING",
          	  "mode": "REQUIRED"
        	},
        	{
          	  "name": "prdct_nm",
          	  "type": "STRING",
          	  "mode": "NULLABLE"
        	},
        	{
          	 "name": "pdct_prc",
          	  "type": "STRING",
          	  "mode": "NULLABLE"
        	}
      		]
  	}
  ]
}