Dynamic Schema & Data Loading Into BigQuery

Reading Time: 3 minutes

BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence.

BigQuery’s serverless architecture lets you use SQL queries to answer your organization’s biggest questions with zero infrastructure management. BigQuery’s scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes.

Scope – Dynamic schema & Data loading

In this blog, we will see how to –

  1. Upload local JSON data into a GCS bucket
  2. Create datasets
  3. Create a table by loading JSON data from the bucket
  4. Append data to a table with dynamic schema updates
  5. Overwrite a table's data

Prerequisites

  - A GCP project with billing enabled and the BigQuery API enabled
  - The Google Cloud SDK installed and authenticated (the script uses the bq and gsutil CLIs)
  - jq installed (used to compact the JSON before upload)
  - An existing GCS bucket to stage the JSON data

Usage

1. Clone the repo or download the template

git clone https://github.com/knoldus/Dynamic-schema-loading-into-BigQuery
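
Then move into the project directory and make the script executable (assuming the cloned directory keeps the repository name):

cd Dynamic-schema-loading-into-BigQuery
chmod +x loadTables.sh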

2. Export Required Variables

Set these variables to match your environment in env_vars.sh:

export PROJECT_ID=your_project_id
export GCS_BUCKET_ID=example-bucket-name
export DATASET=any_name_of_dataset
export BIGQUERY_TABLE=any_name_of_table
export GCS_BUCKET_PATH=json/data
export JSON_DATA_FILE=data.json

Then source the file so the variables are available in your current shell:

source env_vars.sh
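
A quick, optional sanity check that the variables were picked up:

echo "$PROJECT_ID $GCS_BUCKET_ID $DATASET $BIGQUERY_TABLE $GCS_BUCKET_PATH $JSON_DATA_FILE"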

3. Directory Structure

tree

Output should look like this:

├── data.json
├── env_vars.sh
├── loadTables.sh
├── README.md
└── vfv.json

0 directories, 5 files

4. Run the Script

./loadTables.sh

Output should look like:

Options...

1) Create a dataset in gcp-learn285 
2) Create table with JSON data 
3) Dynamic Append data in table 
4) Overwrite JSON Data into table 
5) Upload local JSON data into GCS bucket.

Make a choice [  1,2,3,4,5 or q to exit  ]

Make your choice and run the operations to update your BigQuery tables.

BigQuery Operations

1. Upload JSON to the GCS bucket

Run the script and choose this option. The function in loadTables.sh that handles it looks like this:

upload_local_data(){
  echo ""
  echo -e "Listing all JSON files in $(pwd) directory ... \n"
  ls | grep json
  echo ""
  read -p "Enter the json data file: (like: data.json) " json
  echo -e "\nUploading...."

  # Compact the JSON so each record sits on a single line,
  # which is what BigQuery's NEWLINE_DELIMITED_JSON format expects.
  cat "$json" | jq -c > "$JSON_DATA_FILE"

  # Copy the compacted file to the staging path in the GCS bucket.
  gsutil cp "$JSON_DATA_FILE" "gs://$GCS_BUCKET_ID/$GCS_BUCKET_PATH/$JSON_DATA_FILE"

  # Remove the temporary compacted copy.
  rm "$JSON_DATA_FILE"
}
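
After the upload, you can confirm the object landed in the bucket, for example:

gsutil ls gs://$GCS_BUCKET_ID/$GCS_BUCKET_PATH/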

2. Create Datasets

Run the script and choose the dataset option. The script prompts for a name and falls back to $DATASET if you just press Enter:

echo "Creating dataset into bigquery"
 read -p "Enter name of your Dataset. [default: $DATASET] " dataset

 if [[ -z "$dataset" ]];then
  dataset=$DATASET
 fi
 bq --location=us mk --dataset $PROJECT_ID:$dataset
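
You can confirm the dataset exists by listing the project's datasets:

bq ls --datasets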

3. Create a Table when loading

Run the script and choose this option to create the table while loading JSON data from the bucket. With --autodetect, BigQuery infers the schema from the data:

echo "Listing Datasets in $PROJECT_ID project..."   
bq ls --datasets
read -p "Enter name of your Dataset. [default: $DATASET] " dataset
if [[ -z "$dataset" ]];then
  dataset=$DATASET
 fi

read -p "Choose name of your BigQuery Table to be created in $dataset Dataset. [default: $BIGQUERY_TABLE] " table

if [[ -z "$table" ]];then
  table=$BIGQUERY_TABLE
 fi


bq --location=us load \
--noreplace \
--autodetect \
--schema_update_option=ALLOW_FIELD_ADDITION \
--source_format=NEWLINE_DELIMITED_JSON \
$PROJECT_ID:$dataset.$table \
gs://$GCS_BUCKET_ID/$GCS_BUCKET_PATH/$JSON_DATA_FILE
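
Once the load finishes, you can inspect the schema that --autodetect produced, for example:

bq show --schema --format=prettyjson $PROJECT_ID:$DATASET.$BIGQUERY_TABLE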

4. Update table with dynamic schema

Run the script and choose this option to append data to an existing table with dynamic schema loading. New fields found in the JSON are added to the table schema instead of failing the load:

echo "Listing Datasets in $PROJECT_ID project..."   
bq ls --datasets
read -p "Enter name of your Dataset. [default: $DATASET] " dataset
if [[ -z "$dataset" ]];then
  dataset=$DATASET
 fi
 echo "Listing Tables in $dataset dataset...."
bq ls $dataset
read -p "Choose name of your BigQuery Table in $dataset Dataset. [default: $BIGQUERY_TABLE] " table

if [[ -z "$table" ]];then
  table=$BIGQUERY_TABLE
 fi

bq --location=us load \
--noreplace \
--autodetect \
--schema_update_option=ALLOW_FIELD_ADDITION \
--source_format=NEWLINE_DELIMITED_JSON \
$PROJECT_ID:$dataset.$table \
gs://$GCS_BUCKET_ID/$GCS_BUCKET_PATH/$JSON_DATA_FILE 
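
For example (with hypothetical sample data), if the table was originally loaded from records like:

{"name": "Alice", "age": 30}

and you later append a file whose records carry an extra field:

{"name": "Bob", "age": 25, "city": "Pune"}

then ALLOW_FIELD_ADDITION lets the load succeed and BigQuery adds a nullable city column instead of rejecting the job.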

5. Overwrite Tables

Run the script and choose the overwrite option to replace the table's existing rows with the data from GCS. Note that bq load appends by default, so the --replace flag below is what makes this an overwrite:

echo "Listing Datasets in $PROJECT_ID project..."   
bq ls --datasets
read -p "Enter name of your Dataset. [default: $DATASET] " dataset
if [[ -z "$dataset" ]];then
  dataset=$DATASET
 fi
 echo "Listing Tables in $dataset dataset...."
bq ls $dataset
read -p "Choose name of your BigQuery Table in $dataset Dataset. [default: $BIGQUERY_TABLE] " table

if [[ -z "$table" ]];then
  table=$BIGQUERY_TABLE
 fi

bq --location=us load \
--autodetect \
--schema_update_option=ALLOW_FIELD_ADDITION \
--source_format=NEWLINE_DELIMITED_JSON \
$PROJECT_ID:$dataset.$table \
gs://$GCS_BUCKET_ID/$GCS_BUCKET_PATH/$JSON_DATA_FILE
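
To verify the overwrite, a quick row count over the table works, for example:

bq query --use_legacy_sql=false \
  "SELECT COUNT(*) AS row_count FROM \`$PROJECT_ID.$DATASET.$BIGQUERY_TABLE\`"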

Conclusion

With this script, you can upload local JSON data to a GCS bucket, create BigQuery datasets, and create, append to, or overwrite BigQuery tables while letting BigQuery detect and extend the schema automatically.

Written by 

Rahul Soni is a Software Consultant at Knoldus Software. He is always charged up for new things & learnings. He is dedicated to his work and believes in quality output. He loves to take deep dives into cloud technologies & different tools.
