HomeMachine LearningMachine Learning DIYBuilding a classification model on census data

Building a classification model on census data

BigQuery ML supports supervised learning  with the logistic regression model type. You can use the binary logistic regression model type to predict whether a value falls into one of two categories; or, you can use the multi-class regression model type to predict whether a value falls into one of multiple categories. These are known as classification problems, because they attempt to classify data into two or more categories.In this tutorial, you use a binary logistic regression model in BigQuery ML to predict the income range of respondents in the US Census Dataset. This dataset contains the demographic and income information of US residents from 2000 and 2010. The data includes employment status, job category, level of education, and income data.

In this tutorial you will perform the following tasks:

  • Create a logistic regression model.
  • Evaluate the logistic regression model.
  • Make predictions using the logistic regression model.

This tutorial uses billable components of Cloud Platform, including:

  • BigQuery
  • BigQuery ML

For more information on BigQuery costs, see the BigQuery pricing page.

For more information on BigQuery ML costs, see the BigQuery ML pricing page.

  1. If you’re new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.Enable the API

A common problem in machine learning is to classify data into one of two types, known as labels. For example, a retailer may want to predict whether a given customer will purchase a new product, based on other information about that customer. In that case, the two labels might be “will buy” and “won’t buy.” You can construct your dataset such that one column represents the label. The data you can use to train such a binary logistic regression model include the customer’s location, their previous purchases, the customer’s reported preferences, and so on.

In this tutorial, you use BigQuery ML to create a binary logistic regression model that predicts whether a US Census respondent’s income falls into one of two ranges based on the respondent’s demographic attributes.

Creating your logistic regression model consists of the following steps.

  • Step one: Create a dataset to store your model.
    The first step is to create a BigQuery dataset to store your model.
  • Step two: Examine your data.
    In this step, examine the dataset and identify which columns to use as training data for your logistic regression model.
  • Step three: Select your training data.
    The next step is to prepare the data you use to train your binary logistic regression model by running a query against the census_adult_income table. This step identifies the relevant features and stores them in a view for later queries to use as input data.
  • Step four: Create a logistic regression model.
    In this step, use the CREATE MODEL statement to create your logistic regression model.
  • Step five: Use the ML.EVALUATE function to evaluate your model.
    Then, use the ML.EVALUATE function to provide statistics about model performance.
  • Step six: Use the ML.PREDICT function to predict a participant’s income.
    Finally, you use the ML.PREDICT function to predict the income bracket for a given set of census participants.

The first step is to create a BigQuery dataset to store your model. To create your dataset, follow these steps:

  1. In the Cloud Console, go to the BigQuery page.Go to the BigQuery page
  2. In the navigation panel, in the Resources section, click your project name.
  3. On the right side, in the details panel, click Create dataset.
  1. On the Create dataset page, for Dataset ID, enter census.
  2. Leave all of the other default settings in place and click Create dataset.

The next step is to examine the dataset and identify which columns to use as training data for your logistic regression model. You can use a standard SQL query to return rows from the dataset.

The following query returns 100 rows from the US Census Dataset:

SELECT
  *
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
LIMIT
  100;

To run the query that returns rows from your dataset:

  1. In the Cloud Console, click the Compose new query button.
  2. Enter the following standard SQL query in the Query editor text area:
    SELECT
      *
    FROM
      `bigquery-public-data.ml_datasets.census_adult_income`
    LIMIT
      100;
    
  3. Click Run.
  4. When the query is complete, click the Results tab below the query text area. The results should look like the following:Building a classification model on census data 2
    1. The query results show that the income_bracket column in the census_adult_income table has only one of two values: <=50K or >50K. It also shows that the columns education and education_num in the census_adult_income table express the same data in different formats. The functional_weight column is the number of individuals that the Census Organizations believes a particular row represents; the values of this column appear unrelated to the value of income_bracket for a particular row.

    Next, you select the data used to train your logistic regression model. In this tutorial, you predict census respondent income based on the following attributes:

    • Age
    • Type of work performed
    • Country of origin
    • Marital status
    • Level of education
    • Occupation
    • Race
    • Hours worked per week

    The following query creates a view that compiles your training data. This view is included in your CREATE MODEL statement later in this tutorial.

    CREATE OR REPLACE VIEW
      `census.input_view` AS
    SELECT
      age,
      workclass,
      native_country,
      marital_status,
      education_num,
      occupation,
      race,
      hours_per_week,
      income_bracket,
      CASE
        WHEN MOD(functional_weight, 10) < 8 THEN 'training'
        WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
        WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
      END AS dataframe
    FROM
      `bigquery-public-data.ml_datasets.census_adult_income`
    

    This query extracts data on census respondents, including education_num, which represents the respondent’s level of education, and workclass, which represents the type of work the respondent performs. This query excludes several categories that duplicate data: for example, the columns education and education_num in the census_adult_income table express the same data in different formats, so this query excludes the education column. The dataframe column uses the excluded functional_weight column to label 80% of the data source for training, and reserves the remaining data for evaluation and prediction. The query creates a view containing these columns, so that you can use them to perform training and prediction later.

    To run the query that compiles the training data for your model:

    1. In the Cloud Console, click the Compose new query button.
    2. Enter the following standard SQL query in the Query editor text area:
      CREATE OR REPLACE VIEW
        `census.input_view` AS
      SELECT
        age,
        workclass,
        native_country,
        marital_status,
        education_num,
        occupation,
        race,
        hours_per_week,
        income_bracket,
        CASE
          WHEN MOD(functional_weight, 10) < 8 THEN 'training'
          WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
          WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
        END AS dataframe
      FROM
        `bigquery-public-data.ml_datasets.census_adult_income`
      
    3. Click Run.
    4. In the navigation panel, in the Resources section, click your project name. Your view should appear beneath it.
    5. Click your view. The schema for the view appears in the Schema tab under the query editor.Building a classification model on census data 3

      Now that you have examined your training data, the next step is to create a logistic regression model using the data.

      You can create and train a logistic regression model using the CREATE MODEL statement with the option 'LOGISTIC_REG'. The following query uses a CREATE MODEL statement to train a new binary logistic regression model on the view from the previous query.

      CREATE OR REPLACE MODEL
        `census.census_model`
      OPTIONS
        ( model_type='LOGISTIC_REG',
          auto_class_weights=TRUE,
          input_label_cols=['income_bracket']
        ) AS
      SELECT
        * EXCEPT(dataframe)
      FROM
        `census.input_view`
      WHERE
        dataframe = 'training'
      

      The CREATE MODEL statement trains a model using the training data in the SELECT statement.

      The OPTIONS clause specifies the model type and training options. Here, the LOGISTIC_REG option specifies a logistic regression model type. It is not necessary to specify a binary logistic regression model versus a multiclass logistic regression model: BigQuery ML can determine which to train based on the number of unique values in the label column.

      The input_label_cols option specifies which column in the SELECT statement to use as the label column. Here, the label column is income_bracket, so the model will learn which of the two values of income_bracket is most likely based on the other values present in each row.

      The ‘auto_class_weights=TRUE’ option balances the class labels in the training data. By default, the training data is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily. In this case, most of the respondents in the dataset are in the lower income bracket. This may lead to a model that predicts the lower income bracket too heavily. Class weights balance the class labels by calculating the weights for each class in inverse proportion to the frequency of that class.

      The SELECT statement queries the view from Step 2. This view contains only the columns containing feature data for training the model. The WHERE clause filters the rows in input_view so that only those rows belonging to the training dataframe are included in the training data.

      To run the query that creates your logistic regression model, complete the following steps:

      1. In the Cloud Console, click the Compose new query button.
      2. Enter the following standard SQL query in the Query editor text area:
      CREATE OR REPLACE MODEL
        `census.census_model`
      OPTIONS
        ( model_type='LOGISTIC_REG',
          auto_class_weights=TRUE,
          data_split_method='NO_SPLIT',
          input_label_cols=['income_bracket'],
          max_iterations=15) AS
      SELECT
        * EXCEPT(dataframe)
      FROM
        `census.input_view`
      WHERE
        dataframe = 'training'
      
      1. Click Run.
      2. In the navigation panel, in the Resources section, expand [PROJECT_ID] > census and then click census_model.
      3. Click the Schema tab. The model schema lists the attributes that BigQuery ML used to perform logistic regression. The schema should look like the following:Building a classification model on census data 4

        After creating your model, evaluate the performance of the model using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values against the actual data.

        The query to evaluate the model is as follows:

        SELECT
          *
        FROM
          ML.EVALUATE (MODEL `census.census_model`,
            (
            SELECT
              *
            FROM
              `census.input_view`
            WHERE
              dataframe = 'evaluation'
            )
          )
        

        The ML.EVALUATE function takes the model trained in Step 1 and evaluation data returned by a SELECT subquery. The function returns a single row of statistics about the model. This query uses data from input_view as evaluation data. The WHERE clause filters the input data so that the subquery contains only rows in the evaluation dataframe.

To run the ML.EVALUATE query that evaluates the model, follow these steps:

        1. In the Cloud Console, click the Compose new query button.
        2. Enter the following standard SQL query in the Query editor text area:
          SELECT
            *
          FROM
            ML.EVALUATE (MODEL `census.census_model`,
              (
              SELECT
                *
              FROM
                `census.input_view`
              WHERE
                dataframe = 'evaluation'
              )
            )
          
        3. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset’s location

Building a classification model on census data 5

  1. Click Run.
  2. When the query is complete, click the Results tab below the query text area. The results should look like the following:Building a classification model on census data 6
  1. Because you performed a logistic regression, the results include the following columns:
    • precision
    • recall
    • accuracy
    • f1_score
    • log_loss
    • roc_auc

    The accuracy of the model describes the ratio of correctly predicted values to total predictions: an accuracy of .80 means that 80% of predictions are correct.

To identify the income bracket to which a particular respondent belongs, use the ML.PREDICT function. The following query predicts the income bracket of every respondent in the prediction dataframe.

SELECT
  *
FROM
  ML.PREDICT (MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'prediction'
     )
  )

The ML.PREDICT function predicts results using your model and the data from input_view, filtered to include only rows in the ‘prediction’ dataframe. The top-most SELECT statement retrieves the output of the ML.PREDICT function.

To run the ML.PREDICT query, follow these steps:

  1. In the Cloud Console, click the Compose new query button.
  2. Enter the following standard SQL query in the Query editor text area.
SELECT
  *
FROM
  ML.PREDICT (MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'prediction'
     )
  )
  1. Click Run.
  2. When the query is complete, click the Results tab below the query text area. The results should look like the following:

Building a classification model on census data 7

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Cloud Console.Go to the BigQuery page
  2. In the navigation, click the census dataset you created.
  3. Click Delete dataset on the right side of the window. This action deletes the dataset and the model.
  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (census) and then click Delete.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.Go to Manage resources
  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

This article has been published from the source link without modifications to the text. Only the headline has been changed.

Source link

Most Popular