BigQuery Machine Learning using GCP

Reading Time: 4 minutes

BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries.

in this blog we will cover How to create, evaluate and use machine learning models in BigQuery

Setup and requirements

first create a profile on GCP platform and login using your credentials. and in the API option , enable the BigQuery API.

Open BigQuery Console

In the Google Cloud Console, select Navigation menu > BigQuery:

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.

Click Done.

The BigQuery console opens.

Create a dataset

To create a dataset, click on the View actions icon next to your project ID and select Create dataset

for the demo we will use dataset name as bqml_lab

Next, name your Dataset ID bqml_lab and click Create dataset.

Create a model

Now, move on to your task!

Go to BigQuery EDITOR,

type or paste the following query to create a model that predicts whether a visitor will make a transaction:

#standardSQL
CREATE OR REPLACE MODEL `bqml_lab.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170631'
LIMIT 100000;

Click RUN.

Here the visitor’s device’s operating system is used, whether said device is a mobile device, the visitor’s country and the number of page views as the criteria for whether a transaction has been made.

In this case, bqml_lab is the name of the dataset and sample_model is the name of the model. The model type specified is binary logistic regression. In this case, label is what you’re trying to fit to.

Running the CREATE MODEL command creates a Query Job that will run asynchronously so you can, for example, close or refresh the BigQuery UI window.

Model information & training statistics

If interested, you can get information about the model by expanding bqml_lab dataset and then clicking the sample_model model in the UI. Under the Details tab you should find some basic model info and training options used to produce the model. Under Training, you should see a table either a table or graphs, depending on your View as settings:

Evaluate the model

in the query write code as following and then click Run:

#standardSQL
SELECT
  *
FROM
  ml.EVALUATE(MODEL `bqml_lab.sample_model`, (
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'));

If used with a linear regression model, the above query returns the following columns:

  • mean_absolute_errormean_squared_errormean_squared_log_error,
  • median_absolute_errorr2_scoreexplained_variance.

If used with a logistic regression model, the above query returns the following columns:

  • precisionrecall
  • accuracyf1_score
  • log_lossroc_auc

Please consult the machine learning glossary or run a Google search to understand how each of these metrics are calculated and what they mean.

You’ll realize the SELECT and FROM portions of the query is identical to that used during training. The WHERE portion reflects the change in time frame and the FROM portion shows that you’re calling ml.EVALUATE.

You should see a table similar to this:

Use the Model

Predict purchases per country

With this query you will try to predict the number of transactions made by visitors of each country, sort the results, and select the top 10 countries by purchases:

#standardSQL
SELECT
  country,
  SUM(predicted_label) as total_predicted_purchases
FROM
  ml.PREDICT(MODEL `bqml_lab.sample_model`, (
SELECT
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(geoNetwork.country, "") AS country
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10;

This query is very similar to the evaluation query demonstrated in the previous section. Instead of ml.EVALUATE, you’re using ml.PREDICT and the BigQuery ML portion of the query is wrapped with standard SQL commands. For this lab you’re interested in the country and the sum of purchases for each country, so that’s why SELECTGROUP BY and ORDER BYLIMIT is used to ensure you only get the top 10 results.

You should see a table similar to this:

Predict purchases per user

Here is another example. This time you will try to predict the number of transactions each visitor makes, sort the results, and select the top 10 visitors by transactions:

Replace the query with the following and then click Run:

#standardSQL
SELECT
  fullVisitorId,
  SUM(predicted_label) as total_predicted_purchases
FROM
  ml.PREDICT(MODEL `bqml_lab.sample_model`, (
SELECT
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(geoNetwork.country, "") AS country,
  fullVisitorId
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10;

You should see a table similar to this:

Conclusion

In this blog we have created a binary logistic regression model, evaluated the model, and used the model to make predictions using the Google cloud Platform .

Written by 

Lokesh Kumar is intern in AI/ML studio at Knoldus. He is passionate about Artificial Intelligence and Machine Learning , having knowledge of C , C++ , Python and Data Analytics and much more. He is recognised as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner & curious to learn new technologies.