Build, Train, and Deploy a Machine Learning Model with Amazon SageMaker
In this tutorial, you will learn how to use Amazon SageMaker to build, train, and deploy a machine learning (ML) model in Python 3 using the popular XGBoost algorithm. Amazon SageMaker is a fully managed machine learning platform that enables data scientists and developers to build and train machine learning models and deploy them into production applications.
In the example scenario for this tutorial, you are a machine learning developer for the marketing department of a bank, and you need to predict whether a customer will enroll for a certificate of deposit. Your marketing dataset contains bank information on customer demographics, responses to marketing events, and external environment factors. The data is labeled, meaning there is a column in the dataset that identifies whether the customer is enrolled for a product offered by the bank. For this example scenario, you will use a publicly available dataset from the ML repository curated by the University of California, Irvine.
Ordinarily, building an ML model to solve a challenge like this is complex. You have to manage large amounts of data for model training, choose the best algorithm, manage the compute capacity to train the model, and then deploy the model into a production environment. Amazon SageMaker removes these complexities, making it easy to build ML models by providing everything you need to quickly connect to your training data and select the best algorithm and framework for your application, while managing all of the underlying infrastructure, so you can train models at petabyte scale.
To build a machine learning model with Amazon SageMaker in this tutorial, you will create a notebook instance, prepare the data, train the model to learn from the data, deploy the model, and then evaluate your machine learning model's performance.
This tutorial requires an Amazon Web Services account.
Some of the resources you create in this tutorial are Free Tier eligible.
Step 1. Enter the SageMaker console
Navigate to the Amazon SageMaker console.
Open the Amazon Web Services Management Console in a new browser window, so you can keep this step-by-step guide open. Next, begin typing SageMaker in the search bar and select Amazon SageMaker to open the service console.
Step 2. Create a SageMaker notebook instance
In this step, you will create a SageMaker notebook instance.
2a. From the Amazon SageMaker dashboard, select Create notebook instance.
2b. On the Create notebook instance pane, in the Notebook instance name field, type MySageMakerInstance. So that your new instance can securely access Amazon S3 and other services, SageMaker can create a new IAM role with the right permissions and assign it to your instance for you. Instruct SageMaker to create this IAM role by selecting Create a new role from the IAM role drop-down list.
2c. In the Create an IAM role box, select Any S3 bucket. This allows your SageMaker instance to access all the S3 buckets in your account. Then select Create role.
2d. Notice that SageMaker created a role called AmazonSageMaker-ExecutionRole-*** for you. On the Create notebook instance panel, you can also optionally place your instance in a VPC, set a lifecycle configuration, and specify an encryption key. For this tutorial, leave the rest of the fields with the default options. Choose Create notebook instance.
Step 3. Prepare the data
In this step, you will use your Amazon SageMaker notebook to preprocess the data you need to train your machine learning model.
3a. On the Notebook instances pane, after MySageMakerInstance has transitioned from Pending to InService, select Open from the Actions column.
3b. After Jupyter opens up MySageMakerInstance, in the Files tab, click on New, and then choose conda_python3.
3c. To prepare the data, train the machine learning model, and deploy it, you will need to import some libraries and define a few environment variables in your Jupyter notebook environment. Copy the following code into the code cell in your instance and select Run.
While the code runs, an asterisk (*) appears between the square brackets. After a few seconds, the code execution completes, the * is replaced with the number 1, and you see a success message.
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.display import display
from time import gmtime, strftime
from sagemaker.predictor import csv_serializer
# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its own XGBoost container
my_region = boto3.session.Session().region_name # set the region of the instance
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + containers[my_region] + " container for your SageMaker endpoint.")
3d. You will need an S3 bucket to store your training data once you have processed it. Copy the following code into the next code cell in your notebook to create an S3 bucket.
Because S3 bucket names must be globally unique and have some restrictions, change the your_s3_bucket_name string in the code to a unique string. Then, in your notebook, select Run.
If you don't receive a success message after running the code, change the bucket name and try again.
bucket_name = 'your_s3_bucket_name' # <--- change this variable to a unique name for your bucket
s3 = boto3.resource('s3')
try:
    if my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={ 'LocationConstraint': my_region })
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)
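If you would rather not invent a unique name by hand, one option is to derive a candidate name from your account ID and region, as in the sketch below. The base string demo-xgboost-dm is only an illustrative placeholder; paste the printed value into bucket_name above before running that cell.
# Optional: suggest a globally unique bucket name from the account ID and region
# ('demo-xgboost-dm' is an arbitrary example base string)
account_id = boto3.client('sts').get_caller_identity()['Account']
suggested_bucket_name = 'demo-xgboost-dm-{}-{}'.format(account_id, my_region)
print(suggested_bucket_name)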
3e. Next, you need to download the data to your SageMaker instance and load it into a dataframe. Copy and Run the following code:
try:
    urllib.request.urlretrieve("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
    print('Success: downloaded bank_clean.csv.')
except Exception as e:
    print('Data load error: ', e)

try:
    model_data = pd.read_csv('./bank_clean.csv', index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ', e)
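As an optional sanity check, you can confirm the download by looking at the shape and first few rows of the DataFrame before moving on:
# Optional check: confirm the dataset loaded as expected
print(model_data.shape)  # (number of rows, number of columns)
model_data.head()        # preview the first five rows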
3f. Then, shuffle the data and split it into training data and test data. The training data (70% of customers) will be used during an iterative cycle called gradient optimization to learn model parameters that infer the class label from the input features with the least possible error. The test data (the remaining 30% of customers) will be used to evaluate the performance of the model. Copy and paste the following code into the next code cell and select Run to shuffle and split the data:
train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data))])
print(train_data.shape, test_data.shape)
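To confirm the split behaves as expected, the optional check below prints the split proportions and the share of enrolled customers in each set (the label column y_yes is used in the following steps as well):
# Optional check: split sizes and class balance
print('train fraction:', len(train_data) / len(model_data))
print('test fraction: ', len(test_data) / len(model_data))
print('enrolled rate (train):', train_data['y_yes'].mean())
print('enrolled rate (test): ', test_data['y_yes'].mean())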
Step 4. Train the model from the data
In this step, you will train your machine learning model with the training dataset.
4a. To use a SageMaker pre-built XGBoost model, you will need to reformat the header and first column of the training data and upload the data to your S3 bucket; the pre-built XGBoost container expects a CSV file with the target variable in the first column and no header row. Copy and paste the following code into a new code cell and select Run to reformat and upload the data:
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
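As an optional check, you can verify that train.csv is in the format the container expects, with the y_yes label as the first column and no header row:
# Optional check: the first column should be the 0/1 label, with no header
pd.read_csv('train.csv', header=None).head()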
4b. Next, you need to set up the SageMaker session, create an instance of the XGBoost model (an estimator), and define the model’s hyperparameters. Copy, paste and Run the following code:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(containers[my_region],
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)
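These hyperparameters control the depth of each tree (max_depth), the learning rate (eta), regularization (gamma and min_child_weight), the fraction of rows sampled per tree (subsample), and the number of boosting rounds (num_round). If you want to confirm what will be passed to the training job, the estimator exposes the values you set (assuming the SageMaker Python SDK v1.x generic Estimator):
# Optional: inspect the hyperparameters the training job will receive
print(xgb.hyperparameters())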
4c. With the data loaded and XGBoost estimator set up, train the model using gradient optimization on a ml.m4.xlarge instance by pasting the code and selecting Run.
xgb.fit({'train': s3_input_train})
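Training takes a few minutes, and the cell streams the training job's log output as it runs. When it completes, you can optionally verify that the model artifact was written to the output path configured above:
# Optional check: list the model artifact(s) produced by the training job
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='{}/output'.format(prefix))
for obj in response.get('Contents', []):
    print(obj['Key'])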
Step 5. Deploy the model
In this step, you will deploy the trained model to an endpoint, reformat and load the CSV test data, and then run the model to create predictions.
5a. To deploy the model on a server and create an endpoint you can access, copy, paste, and Run the following code:
xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')
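The deploy call waits while SageMaker provisions the hosting instance, which typically takes several minutes. If you want to confirm the endpoint's status afterwards, one option is to query it with the low-level boto3 client; xgb_predictor.endpoint holds the generated endpoint name:
# Optional check: confirm the endpoint is in service
sm_client = boto3.client('sagemaker')
status = sm_client.describe_endpoint(EndpointName=xgb_predictor.endpoint)['EndpointStatus']
print(status)  # expect 'InService' once deployment has finished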
5b. To predict whether customers in the test data enrolled for the bank product or not, copy and paste the following code into the next code cell and select Run:
test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values # load the test features into an array
xgb_predictor.content_type = 'text/csv' # set the data type for an inference
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)
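The endpoint returns a comma-separated string of probabilities between 0 and 1, one per customer, which the code above parses into a NumPy array. As an optional aside, you can preview a few of the probabilities and convert them to hard 0/1 labels with a 0.5 threshold, which is what the rounding in the next step does implicitly:
# Optional: preview predicted probabilities and threshold them at 0.5
print(predictions_array[:5])                       # probabilities of enrollment
predicted_labels = (predictions_array > 0.5).astype(int)
print(predicted_labels[:5])                        # 1 = predicted to enroll, 0 = not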
Step 6. Evaluate model performance
In this step, you will evaluate the performance and accuracy of the machine learning model.
6a. Copy and paste the code below and select Run to compare actual vs. predicted values in a table called a confusion matrix.
You can conclude that you predicted the outcome accurately for 90% of customers in the test data, with a precision of 65% (278/429) for enrolled and 90% (10,785/11,928) for didn't enroll. The model could now be fine-tuned, but this performance is already better than in the original paper.
cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1]; p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))
Step 7. Terminate your resources
In this step, you will terminate your Amazon SageMaker-related resources. Important: terminating resources that are not actively being used reduces costs and is a best practice. Not terminating your resources will result in charges.
7a. To delete the SageMaker endpoint and the objects in your S3 bucket, copy, paste and Run the following code:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()
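The code above removes the endpoint and empties your S3 bucket, but the notebook instance you created in Step 2 keeps running (and accruing charges) until you stop it. You can stop and delete it from the SageMaker console, or, as an optional alternative, with the boto3 SageMaker client; note that running this from inside the notebook will stop the very environment you are working in:
# Optional: stop the notebook instance created in Step 2b
sm_client = boto3.client('sagemaker')
sm_client.stop_notebook_instance(NotebookInstanceName='MySageMakerInstance')
# Once the instance reaches the 'Stopped' state, it can be deleted:
# sm_client.delete_notebook_instance(NotebookInstanceName='MySageMakerInstance')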
Congratulations!
You have learned how to use Amazon SageMaker to prepare, train, deploy and evaluate a machine learning model. Amazon SageMaker makes it easy to build ML models by providing everything you need to quickly connect to your training data and select the best algorithm and framework for your application, while managing all of the underlying infrastructure, so you can train models at petabyte scale.
Recommended Next:
Learn More
Amazon SageMaker comes with pre-built machine learning algorithms that can be used for various use cases. Learn more about using the built-in algorithms that come with Amazon SageMaker.
Dive Deeper
You can use Machine Learning with Automatic Model Tuning in Amazon SageMaker. This allows you to automatically tune hyperparameters in your models to achieve the best possible outcome. Check out the documentation for Automatic Model Tuning and the blog post to dive deeper into this capability.
See it in action
Amazon SageMaker has a number of libraries on GitHub that you can use in your machine learning use cases. Check out GitHub for the SageMaker libraries.