Empowering Google Cloud AI Platform Notebooks with powerful AutoML

With the AI Platform’s powerful ecosystem, almost all modeling workflows can be completed on Google Cloud.

Introduction

The solution to all of this is… Google Cloud AI Platform Notebooks!

The advantages are as follows:
・Authorized access to resources in the project (GCS, BigQuery, etc.)
・You can load BigQuery tables into a pandas DataFrame by writing SQL with simple magic commands.
・You can increase the number of CPUs and the amount of memory when you need more computation.
・You can train models with BigQuery ML or AutoML in a one-liner. (This is what this article is about!)

So, What is AutoML Tables?

https://cloud.google.com/automl-tables/docs/beginners-guide

This photo is one I took when Jeff Dean came to Tokyo for ML Summit Tokyo 2019. It shows that it is (almost) impossible for a human to design such a complex deep learning architecture by hand. But AutoML Tables makes it possible: the machine tries everything and never gets tired.

At the beginning, AutoML only supported AutoML Vision (image classification). Support for tabular data was announced at Google Cloud Next ’19. Please see the link above for details. A tool this powerful that anyone can use right away makes me feel that we are truly in the era of data democratization.

Let’s take a deeper look inside AutoML.

The article says the model search uses a “greedy” beam search across multiple trainers (it even tries RNNs such as LSTMs), tunes the depth of the layers and their connections, and eventually builds ensembles. Finally, it produces a model written in TensorFlow.

Model Search schematic illustrating the distributed search and ensembling. Each trainer runs independently to train and evaluate a given model. The results are shared with the search algorithm, which it stores. The search algorithm then invokes mutation over one of the best architectures and then sends the new model back to a trainer for the next iteration. S is the set of training and validation examples and A are all the candidates used during training and search.

The article does not describe in detail what it is trying to do, so let’s try running the model search ourselves. The code is available here.

The sample code tries 200 models on tabular data. Of course, this value can be changed, and the number of trials can be arbitrarily large. (However, this repository seems to be incomplete; I expect it will be fixed soon.)
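For reference, here is a sketch based on the repository’s README at the time of writing; the exact module paths, argument names, and constants may change, and the CSV file and trial counts are just the README’s example values, not my exact run:

```
# Sketch adapted from the google/model_search README (subject to change).
from model_search import constants
from model_search import single_trainer
from model_search.data import csv_data

# A trainer over a small CSV file shipped with the repository.
trainer = single_trainer.SingleTrainer(
    data=csv_data.Provider(
        label_index=0,
        logits_dimension=2,
        record_defaults=[0, 0, 0, 0],
        filename="model_search/data/testdata/csv_random_data.csv"),
    spec=constants.DEFAULT_DNN)

# Try 200 candidate models; outputs (including .pb files) land under root_dir.
trainer.try_models(
    number_models=200,
    train_steps=1000,
    eval_steps=100,
    root_dir="/tmp/run_example",
    batch_size=32,
    experiment_name="example",
    experiment_owner="model_search_user")
```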

Here’s what happened when I actually tried it.


I ran it from the terminal of an 8-CPU AI Platform Notebook. The only way to see what the model search is actually doing is by reading the logs. Still, I was surprised to see it tuning hyperparameters, layer connections, learning rates, normalizations, activation functions, and even the ensembling. You can also see that .pb files have been generated, which means that AutoML models can be immediately deployed on the AI Platform and quantized for the edge. It’s a well-designed ecosystem.

Modeling with AutoML from AI Platform Notebooks

Creating an instance of Notebooks

https://cloud.google.com/ai-platform/notebooks/docs/images

There are many custom containers to choose from, including CUDA, TensorFlow, and PyTorch, and you can even attach a GPU to the instance, so you can set up a deep learning environment within a minute. Apache Spark and Apache Hive are also available. Conda is included, so you can use “conda” to add the libraries you need from the terminal in JupyterLab. (The container seems to be Debian-based.)

Click “Create,” and your instance will be ready in a few tens of seconds. Here I created a notebook in the Tokyo region, and you can freely choose the machine type. Billing is based on the GCE instance size, and if you use GCE regularly, you know that it is very inexpensive.

https://cloud.google.com/compute/all-pricing

Now let’s start Notebooks.

If you start the instance and click on the red circle, JupyterLab will start up in a few seconds. Of course, other authorized project members can also access Notebooks on the same instance at the same time.

Loading data from BigQuery

Let’s quickly load 1,000 rows from the public dataset “bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018” into a pandas DataFrame.

You only need a cell like the one below to load the data into a DataFrame. If the dataset is in the same project, it can be loaded without explicitly specifying the project ID.
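For reference, a sketch of the cell magic (the DataFrame name df and the LIMIT value are just illustrative). First, load the magic once per kernel if it is not already available:

```
%load_ext google.cloud.bigquery
```

Then, in its own cell, run the query and capture the result as a pandas DataFrame named df:

```
%%bigquery df
SELECT *
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
LIMIT 1000
```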

https://googleapis.dev/python/bigquery/latest/magics.html

Once the data is in a pandas DataFrame, you can plot, view statistics, and do feature engineering as you normally would in Jupyter. The manipulated DataFrame can also be exported to a BigQuery table using “df.to_gbq”. If the data is large, you may want to push it back into BigQuery and manipulate it there, so you can flexibly choose between pandas and BigQuery.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html
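A minimal sketch of the export step; the destination table and project ID are hypothetical placeholders, and to_gbq relies on the pandas-gbq package (installable with conda if it is missing):

```
# Export the manipulated DataFrame back to BigQuery.
# "my_dataset.taxi_features" and "my-project" are hypothetical names.
df.to_gbq("my_dataset.taxi_features",
          project_id="my-project",
          if_exists="replace")  # overwrite the table if it already exists
```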

Modeling with BigQuery ML

BigQuery ML supports the following types of models:

Linear regression
Binary logistic regression
Multiclass logistic regression
K-means clustering
Matrix Factorization
Time series (Auto ARIMA)
Boosted Tree (XGBoost based)
Deep Neural Network (DNN)
AutoML Tables
TensorFlow model importing

Here is a simple logistic regression model trained on census data. You can do the modeling with just this SQL.
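A minimal sketch of such a query, run from a %%bigquery cell; the destination dataset is a placeholder, the census table is the public one used in the BigQuery ML tutorials, and the selected columns are illustrative:

```
%%bigquery
CREATE OR REPLACE MODEL `my_dataset.census_logistic_reg`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['income_bracket']  -- predict whether income is above or below 50K
) AS
SELECT
  age, workclass, education, occupation, hours_per_week, income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
```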

I won’t go into detail here, but for most basic modeling, it’s very fast and simple. You can train/evaluate/predict the model from Jupyter!

Modeling with AutoML Tables in Notebooks

https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl?hl=en

You can specify classification or regression, the variable you want to predict, the training budget in hours, and so on. The options available in the AutoML Tables UI, such as the optimization objective (MAXIMIZE_AU_ROC, MINIMIZE_LOG_LOSS, MAXIMIZE_AU_PRC, MINIMIZE_RMSE, MINIMIZE_MAE, MINIMIZE_RMSLE), can be specified here as well. You select the variables you want to feed the model in the AS clause. That’s very reasonable.

Now let’s do some modeling and create a model that predicts the taxi fare. A sketch of the statement follows the documentation link below.

https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl
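A minimal sketch of such a statement; the destination dataset, the selected feature columns, the LIMIT, and the one-hour budget are illustrative assumptions, not the exact query I ran:

```
%%bigquery
CREATE OR REPLACE MODEL `my_dataset.taxi_fare_automl`
OPTIONS (
  model_type = 'AUTOML_REGRESSOR',
  input_label_cols = ['fare_amount'],
  budget_hours = 1.0,                       -- training budget in hours
  optimization_objective = 'MINIMIZE_RMSE'  -- optimization target function
) AS
SELECT
  fare_amount, trip_distance, passenger_count, payment_type
FROM
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
WHERE
  fare_amount > 0
LIMIT 1000000  -- keep the input within AutoML's row and size limits (see below)
```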

Note that there are currently some restrictions on modeling with queries. CREATE MODEL statements for AutoML Tables models must comply with the following rules:

  • The input data to AutoML Tables must be between 1,000 and 100 million rows and less than 100 GB.
  • AXT and CMEK are currently not supported.
  • The models are not visible in the AutoML Tables UI and are not available for AutoML Tables batch or online predictions.

If you need to train on more than 100 million rows, or if you want to run batch predictions or deploy the model for online prediction, use AutoML Tables from the web UI instead. In fact, the job I submitted here does not appear in the AutoML Tables model UI, so BigQuery ML and AutoML Tables seem to be independent. I’m looking forward to seeing them merged in the future.

https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-importance?hl=en

Once the modeling is done, you can make predictions and inspect feature importance. The results can be pulled back into a DataFrame with a one-liner, which is very useful. You can also view other information, such as training statistics. You can find more details here.
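For reference, a sketch of the prediction step, again using the %%bigquery magic so the results land directly in a DataFrame named predictions; the model and column names match the hypothetical ones above, and evaluation metrics and feature importance can be queried in the same way (see the linked documentation for the exact functions):

```
%%bigquery predictions
SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.taxi_fare_automl`,
  (SELECT trip_distance, passenger_count, payment_type
   FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
   LIMIT 100)
)
```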

Now you can import data from BigQuery, manipulate it in Notebooks, export it back to BigQuery, and do modeling and prediction with AutoML, all from AI Platform Notebooks alone.

Summary

Google Cloud offers powerful products such as:

・BigQuery: an ultra-fast data infrastructure
・AutoML: a highly accurate automatic modeling tool
・AI Platform Notebooks: Jupyter with easy access to resources on Google Cloud

Combining these, we can quickly run efficient iterations of feature engineering, modeling, evaluation, and prediction to improve accuracy. The AI Platform is very well thought out and is still being enhanced and evolving. If you are a data scientist or data engineer, it is definitely worth trying out!

Thank you for reading!

Data Scientist Mgr at Coca-ColaBJ, Google Developer Expert (ML), Linguistics, Statistics, MBA. (Opinions are my own) https://www.linkedin.com/in/minorimatsuda/