Objectives
In this tutorial you will perform the following tasks:
- Create a logistic regression model.
- Evaluate the logistic regression model.
- Make predictions using the logistic regression model.
Costs
This tutorial uses billable components of Cloud Platform, including:
- BigQuery
- BigQuery ML
For more information on BigQuery costs, see the BigQuery pricing page.
For more information on BigQuery ML costs, see the BigQuery ML pricing page.
Before you begin
- If you’re new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
Go to project selector
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.Enable the API
Introduction
A common problem in machine learning is to classify data into one of two types, known as labels. For example, a retailer may want to predict whether a given customer will purchase a new product, based on other information about that customer. In that case, the two labels might be “will buy” and “won’t buy.” You can construct your dataset such that one column represents the label. The data you can use to train such a binary logistic regression model include the customer’s location, their previous purchases, the customer’s reported preferences, and so on.
In this tutorial, you use BigQuery ML to create a binary logistic regression model that predicts whether a US Census respondent’s income falls into one of two ranges based on the respondent’s demographic attributes.
Creating your logistic regression model consists of the following steps.
-
- Step one: Create a dataset to store your model.
- The first step is to create a BigQuery dataset to store your model.
-
- Step two: Examine your data.
- In this step, examine the dataset and identify which columns to use as training data for your logistic regression model.
-
- Step three: Select your training data.
- The next step is to prepare the data you use to train your binary logistic regression model by running a query against the
census_adult_income
table. This step identifies the relevant features and stores them in a view for later queries to use as input data.
-
- Step four: Create a logistic regression model.
- In this step, use the
CREATE MODEL
statement to create your logistic regression model.
-
- Step five: Use the
ML.EVALUATE
function to evaluate your model. - Then, use the
ML.EVALUATE
function to provide statistics about model performance.
- Step five: Use the
-
- Step six: Use the
ML.PREDICT
function to predict a participant’s income. - Finally, you use the
ML.PREDICT
function to predict the income bracket for a given set of census participants.
- Step six: Use the
Step one: Create your dataset
The first step is to create a BigQuery dataset to store your model. To create your dataset, follow these steps:
- In the Cloud Console, go to the BigQuery page.Go to the BigQuery page
- In the navigation panel, in the Resources section, click your project name.
- On the right side, in the details panel, click Create dataset.
- On the Create dataset page, for Dataset ID, enter
census
. - Leave all of the other default settings in place and click Create dataset.
Step two: Examine your data
The next step is to examine the dataset and identify which columns to use as training data for your logistic regression model. You can use a standard SQL query to return rows from the dataset.
The following query returns 100 rows from the US Census Dataset:
SELECT * FROM `bigquery-public-data.ml_datasets.census_adult_income` LIMIT 100;
Run the query
To run the query that returns rows from your dataset:
- In the Cloud Console, click the Compose new query button.
- Enter the following standard SQL query in the Query editor text area:
SELECT * FROM `bigquery-public-data.ml_datasets.census_adult_income` LIMIT 100;
- Click Run.
- When the query is complete, click the Results tab below the query text area. The results should look like the following:
- The query results show that the
income_bracket
column in thecensus_adult_income
table has only one of two values:<=50K
or>50K
. It also shows that the columnseducation
andeducation_num
in thecensus_adult_income
table express the same data in different formats. Thefunctional_weight
column is the number of individuals that the Census Organizations believes a particular row represents; the values of this column appear unrelated to the value ofincome_bracket
for a particular row.
Step three: Select your training data
Next, you select the data used to train your logistic regression model. In this tutorial, you predict census respondent income based on the following attributes:
- Age
- Type of work performed
- Country of origin
- Marital status
- Level of education
- Occupation
- Race
- Hours worked per week
The following query creates a view that compiles your training data. This view is included in your
CREATE MODEL
statement later in this tutorial.CREATE OR REPLACE VIEW `census.input_view` AS SELECT age, workclass, native_country, marital_status, education_num, occupation, race, hours_per_week, income_bracket, CASE WHEN MOD(functional_weight, 10) < 8 THEN 'training' WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation' WHEN MOD(functional_weight, 10) = 9 THEN 'prediction' END AS dataframe FROM `bigquery-public-data.ml_datasets.census_adult_income`
Query details
This query extracts data on census respondents, including
education_num
, which represents the respondent’s level of education, andworkclass
, which represents the type of work the respondent performs. This query excludes several categories that duplicate data: for example, the columnseducation
andeducation_num
in thecensus_adult_income
table express the same data in different formats, so this query excludes theeducation
column. Thedataframe
column uses the excludedfunctional_weight
column to label 80% of the data source for training, and reserves the remaining data for evaluation and prediction. The query creates a view containing these columns, so that you can use them to perform training and prediction later.Run the query
To run the query that compiles the training data for your model:
- In the Cloud Console, click the Compose new query button.
- Enter the following standard SQL query in the Query editor text area:
CREATE OR REPLACE VIEW `census.input_view` AS SELECT age, workclass, native_country, marital_status, education_num, occupation, race, hours_per_week, income_bracket, CASE WHEN MOD(functional_weight, 10) < 8 THEN 'training' WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation' WHEN MOD(functional_weight, 10) = 9 THEN 'prediction' END AS dataframe FROM `bigquery-public-data.ml_datasets.census_adult_income`
- Click Run.
- In the navigation panel, in the Resources section, click your project name. Your view should appear beneath it.
- Click your view. The schema for the view appears in the Schema tab under the query editor.
Step four: Create a logistic regression model
Now that you have examined your training data, the next step is to create a logistic regression model using the data.
You can create and train a logistic regression model using the
CREATE MODEL
statement with the option'LOGISTIC_REG'
. The following query uses aCREATE MODEL
statement to train a new binary logistic regression model on the view from the previous query.CREATE OR REPLACE MODEL `census.census_model` OPTIONS ( model_type='LOGISTIC_REG', auto_class_weights=TRUE, input_label_cols=['income_bracket'] ) AS SELECT * EXCEPT(dataframe) FROM `census.input_view` WHERE dataframe = 'training'
Query details
The
CREATE MODEL
statement trains a model using the training data in theSELECT
statement.The
OPTIONS
clause specifies the model type and training options. Here, theLOGISTIC_REG
option specifies a logistic regression model type. It is not necessary to specify a binary logistic regression model versus a multiclass logistic regression model: BigQuery ML can determine which to train based on the number of unique values in the label column.The
input_label_cols
option specifies which column in theSELECT
statement to use as the label column. Here, the label column isincome_bracket
, so the model will learn which of the two values ofincome_bracket
is most likely based on the other values present in each row.The ‘auto_class_weights=TRUE’ option balances the class labels in the training data. By default, the training data is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily. In this case, most of the respondents in the dataset are in the lower income bracket. This may lead to a model that predicts the lower income bracket too heavily. Class weights balance the class labels by calculating the weights for each class in inverse proportion to the frequency of that class.
The
SELECT
statement queries the view from Step 2. This view contains only the columns containing feature data for training the model. TheWHERE
clause filters the rows ininput_view
so that only those rows belonging to the training dataframe are included in the training data.Run the
CREATE MODEL
queryTo run the query that creates your logistic regression model, complete the following steps:
- In the Cloud Console, click the Compose new query button.
- Enter the following standard SQL query in the Query editor text area:
CREATE OR REPLACE MODEL `census.census_model` OPTIONS ( model_type='LOGISTIC_REG', auto_class_weights=TRUE, data_split_method='NO_SPLIT', input_label_cols=['income_bracket'], max_iterations=15) AS SELECT * EXCEPT(dataframe) FROM `census.input_view` WHERE dataframe = 'training'
- Click Run.
- In the navigation panel, in the Resources section, expand [PROJECT_ID] > census and then click census_model.
- Click the Schema tab. The model schema lists the attributes that BigQuery ML used to perform logistic regression. The schema should look like the following:
Step five: Use the
ML.
function to evaluate your modelEVALUATE After creating your model, evaluate the performance of the model using the
ML.EVALUATE
function. TheML.EVALUATE
function evaluates the predicted values against the actual data.The query to evaluate the model is as follows:
SELECT * FROM ML.EVALUATE (MODEL `census.census_model`, ( SELECT * FROM `census.input_view` WHERE dataframe = 'evaluation' ) )
Query details
The
ML.EVALUATE
function takes the model trained in Step 1 and evaluation data returned by aSELECT
subquery. The function returns a single row of statistics about the model. This query uses data frominput_view
as evaluation data. TheWHERE
clause filters the input data so that the subquery contains only rows in theevaluation
dataframe.
- The query results show that the
Run the ML. EVALUATE
query
To run the ML.EVALUATE
query that evaluates the model, follow these steps:
-
-
-
- In the Cloud Console, click the Compose new query button.
- Enter the following standard SQL query in the Query editor text area:
SELECT * FROM ML.EVALUATE (MODEL `census.census_model`, ( SELECT * FROM `census.input_view` WHERE dataframe = 'evaluation' ) )
- (Optional) To set the processing location, click More > Query settings. For Processing location, choose
US
. This step is optional because the processing location is automatically detected based on the dataset’s location
-
-
- Click Run.
- When the query is complete, click the Results tab below the query text area. The results should look like the following:
- Because you performed a logistic regression, the results include the following columns:
precision
recall
accuracy
f1_score
log_loss
roc_auc
The accuracy of the model describes the ratio of correctly predicted values to total predictions: an accuracy of .80 means that 80% of predictions are correct.
Step six: Use the ML. PREDICT
function to predict income bracket
To identify the income bracket to which a particular respondent belongs, use the ML.PREDICT
function. The following query predicts the income bracket of every respondent in the prediction
dataframe.
SELECT * FROM ML.PREDICT (MODEL `census.census_model`, ( SELECT * FROM `census.input_view` WHERE dataframe = 'prediction' ) )
Query details
The ML.PREDICT
function predicts results using your model and the data from input_view
, filtered to include only rows in the ‘prediction’ dataframe. The top-most SELECT
statement retrieves the output of the ML.PREDICT
function.
Run the ML. PREDICT
query
To run the ML.PREDICT
query, follow these steps:
- In the Cloud Console, click the Compose new query button.
- Enter the following standard SQL query in the Query editor text area.
SELECT * FROM ML.PREDICT (MODEL `census.census_model`, ( SELECT * FROM `census.input_view` WHERE dataframe = 'prediction' ) )
- Click Run.
- When the query is complete, click the Results tab below the query text area. The results should look like the following:
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Deleting your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
- If necessary, open the BigQuery page in the Cloud Console.Go to the BigQuery page
- In the navigation, click the census dataset you created.
- Click Delete dataset on the right side of the window. This action deletes the dataset and the model.
- In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (
census
) and then click Delete.
Deleting your project
To delete the project:
- In the Cloud Console, go to the Manage resources page.Go to Manage resources
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
This article has been published from the source link without modifications to the text. Only the headline has been changed.