Google BigQuery: An Introduction to Big Data Analytics Platform.

Reading Time: 6 minutes
Big Query Introductory image

Hey Folks, Today we going to discuss Google BigQuery, an enterprise data warehouse with built-in machine learning capabilities.

Before going to BigQuery, let’s understand what is Google Cloud Platform?
Google Cloud Platform is a suite of public cloud computing services offered by Google. The platform includes a range of hosted services for compute, storage and application development that run on Google hardware.

Google Cloud protects your data, applications, infrastructure, and customers from fraudulent activity, spam, and abuse with the same infrastructure and security services Google uses. Google Cloud’s networking, data storage, and compute services provide data encryption at rest, in transit, and in use.

Let’s see some use cases where Google Cloud Platform is overpowering traditional tools and technologies.

Use Cases:

When You NeedLocal/Traditional ToolsGoogle Cloud Platform
Storage for Computer, Block StoragePersistent(Hard Drives), SSDPersistent(Hard Drives), SSD
Storing Media, Blob StorageFile System- may be HDFSCloud Storage
SQL Interface ATOP File DataHive (SQL-like, but MapReduce on HDFS)BigQuery
Document database, NoSQLCouchDB, MongoDB (key-value/indexed database)Data Store
Fast ScanningHBase (Columnar Database)Big Table
Transaction Processing (OLTP)RDBMSCloud SQL, Cloud Spanner
Analytics/Data Warehouse (OLAP)Hive (SQL-like but MapReduce on HDFS)BigQuery

What is BigQuery?

Process Flow in Big Query
  • BigQuery is a fully-managed, serverless, enterprise data warehouse that enables scalable analysis over petabytes of data with all of its implications of complex grading facilities.
  • On large distributed datasets operations like partitioning, bucket, and windowing are nice to have in traditional RDBMS but must have OLAP.
  • A Platform as a Service (PaaS) that supports querying using ANSI SQL.
  • Fully managed platform- no server, no resources deployed. Can be accessed through Web UI, REST API, Client SDK.
  • Google BigQuery has built-in machine learning capabilities.

Architecture of BigQuery

BigQuery Architecture

Data Model

  • Dataset: Set of tables and views.
  • Table must belong to a dataset.
  • Dataset must belong to a project.
  • Tables contains records with rows and columns (fields).
  • Nested and repeated fields are OK too.

Table Schema

  • Can be specified while creating the table.
  • Can also specify schema during intial load.
  • Schema can be updated later too.

Table Types

  • Native tables: BigQuery storage.
  • External tables
    • Big Table
    • Cloud Storage
    • Google Drive
  • Views

Schema Auto-Detection

  • Available while
    • Loading data
    • Querying external data
  • BigQuery selects random file in the data source and scans upto 100 rows of the data to use as representative sample.
  • Then examine each field and attempt to assign a data type to that field based on the values in the sample.

Loading Data….

  • Batch loads
    • CSV
    • JSON (new line delimited)
    • Avro (Open Source storage format or GCP data storage backups)
    • GCP Datastore backups
  • Streaming loads
    • High volume event tracking logs
    • Realtime dashboards.

Data Formats

  • CSV
  • JSON (newline delimited)
  • Avro (open source data format that bundles serialized data with the data’s schema in the same file)
  • Cloud Datastore backups (BigQuery converts data from each entity in cloud datastore backup files to BigQuery’s data types).

Alternatives to loading

  • Public datasets
  • Shared datasets
  • Stack driven log tiles.

Exploring SQL Workspace of BigQuery (Web UI).

GCP provides a very interactive user-friendly web console to interact with the data and perform the analytics on it.

BigQuery Web Console

Let’s explore the SQL workspace of BigQuery.

Creating Dataset and Loading data.

Firstly, we need to create a dataset for our project to work on.

After that, we need to create/upload tables to the corresponding dataset.

Creating/Uploading Tables.

We can create an empty table within our dataset or upload tables to our dataset.

Creating Tables in BigQuery
Source
  • Select from where you want to create you table.
  • Upload you table from your desired location i.e.,
    • Empty table
    • Google Cloud Storage.
    • Upload
    • Drive
    • Google Cloud Bigtable
    • Amazon S3
  • Select the file.
  • Specify the file format.
Destination
  • Select the project name
  • Enter Dataset ID
  • Enter table name
  • Select the table type.
Schema

You can either opt for the option Auto Detect and the schema of the table will be automatically detected.
Or you can manually define the schema for your table.

And at the end, Click on CREATE TABLE.

Query Output

Once the table got created, you can view the schema of the table, metadata, and can see the preview of the table also.

Querying Data

After the table got ingested into the dataset then you can use SQL queries to fetch the data and perform different operations.

SQL stands for structured query language and is a standardized way to interact with relational (or other) databases. MySQL, PostgreSQL, SQL Server, Oracle, MariaDB, SQLite, etc are some of the common databases that use SQL as the interface.

BigQuery is a database product from Google that also uses SQL as the interface to query and manipulate data. BigQuery is a completely managed solution meaning you don’t set up any servers or install any software – you just send it data and then query directly.

It’s one of the best systems available for querying massive amounts of data and supports lots of unique functionality while only charging for the amount of data that is actively processed in running a query.

BigQuery supports two kinds of SQL queries.

The first one is Legacy SQL, in which the table name should be specified into square brackets.

[project_id:datasetname.tablename]

and the second one is Standard SQL, in which the table name will be specified in tilde.

`project_id.datasetname.tablename`

You can change your SQL type from the More > Query Settings > SQL dialect.

Now, let’s run a query to fetch the first name and the last name column.

SELECT first_name, last_name from `helical-decoder-333208.personData.personsData` LIMIT 1000
Table Preview

You can also visualize your data with the help of Google Data Studio, Just click on EXPLORE DATA.

SQL workspace web console is a very interactive platform, as much you play around it you will more explore your data.

Accessing BigQuery through Cloud Shell

Yes, you can also perform different operations on BigQuery through the command line.

bq command is used to manipulate BigQuery through the command line.

If you want help, you just need to type

bq help

It will show all the manipulation that can be performed on the command line.

If you want more help on a particular command, you just need to add that command after bq help.

bq help mk
Big Query Command line
bq help show   #show gives all information about an object
bq ls          #list all dataset in a particular project

If you execute your BigQuery commands in an interactive mode then we can switch to BigQuery Shell also.

bq shell

shell will eliminate the “bq” keyword while specifying the BigQuery Commands.

For example:

mk dataset_name   # will create a dataset in the current project.
show              # will show list of helps
show dataset_name # will show information about the dataset.

To exit the interactive shell

exit

We can run these BigQuery commands on the public datasets as well

bq show publicdata:samples.shakespeare
Accessing Public Datasets

To preview some rows we will use

bq head -n 10 publicdata:samples.shakespeare
View public dataset

Uploading data from cloud storage using command line.

You can upload your tables from Google Cloud Storage to BigQuery using the command line using the following commands.

bq load --source_format=CSV babyname.names_2001 gs://babynames-demo/yob2001.txt name:string,gender:string,count:integer

The above command will ingest a table ‘names_2001’ containing the baby names in the year 2001.

We can use regular expressions to load multiple data in the table.

bq load --source_format=CSV babyname.all_names gs://babynames-demo/yob2*.txt name:string,gender:string,count:integer

The above command will ingest all the baby names after the year 2000 in the table all_names.

Running SQL queries through command lines

We can run queries on the command line as well.

For example:

bq query --use_legacy_sql=false "SELECT name, count FROM helical-decoder-333208.babyname.all_names where gender='F' ORDER BY count DESC  LIMIT 10"
Query output

Export CSV to cloud storage from BigQuery table

We can export our tables stored in BigQuery to a GCS bucket using the command line.

bq extract babyname.all_names gs://babynames-demo/export/all_names.csv

all_names.csv will be exported to the GCS bucket and you can download the CSV file from the bucket.

Conclusion

So In this blog, we discussed in brief Google BigQuery, its architecture. We have learned how to create datasets, upload tables on BigQuery using both web consoles, and the command line. We have also learned how we can perform analytics on data by running SQL queries on BigQuery.

References

https://cloud.google.com/bigquery/

Scala Future

Written by 

Durgesh Gupta is a Software Consultant working in the domain of AI/ML.