A typical introductory task in machine learning (the “Hello World” equivalent) is one that uses a dataset to predict whether a customer will enroll in a term deposit at a bank after one or more phone calls. For more information about the task and the dataset used, see the Bank Marketing Data Set.
Direct marketing, through mail, email, phone, etc., is a common tactic to acquire customers. Because resources and a customer's attention are limited, the goal is to only target the subset of prospects who are likely to engage with a specific offer. Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem. You can imagine that this task would readily translate to marketing lead prioritization in your own organization.
This example demonstrates how you can use Autopilot on this dataset to get the most accurate ML pipeline by exploring a number of potential options, or “candidates”. Each candidate generated by Autopilot consists of two steps: the first performs automated feature engineering on the dataset, and the second trains and tunes an algorithm to produce a model. When you deploy this model, it follows the same two steps: feature engineering followed by inference, to decide whether the lead is worth pursuing. The notebook contains instructions on how to train the model as well as how to deploy it to perform batch predictions on a set of leads. Where possible, it uses the Amazon SageMaker Python SDK, a high-level SDK, to simplify the way you interact with Amazon SageMaker. Let's explore two ways to run this example:
Clicking “Create Experiment” in the previous step opens the following tab.
Track the progress of your experiment using the progress bar shown below. In this case, SageMaker Autopilot has finished the “Analyzing data” and “Feature Engineering” steps and is currently running “Model tuning”.
The trials tab helps you keep track of the training jobs that are completed, the ones in progress, and the training job that is the best so far:
To get more details about a particular trial, right-click it and choose “Open in Trial details”.
Click “Objective” to sort the trial rows by objective; this puts the best trial at the top.
Right-click the best trial and choose “Open in Trial component list”. Highlight the first 15 rows by holding down Ctrl and clicking through them, or by clicking the first row and then clicking the 15th row while holding down Shift. Then click “Add chart”.
Scroll down and click the “New Chart” area.
Under Chart Properties in the right sidebar, select Data type: “Summary Statistics” and Chart type: “Histogram”, then use the drop-down under X-axis to select validation:accuracy_avg. Feel free to explore other charts!
In the Trial components list, click any trial followed by “Deploy model”. Add an endpoint name and any other optional parameters, then click the [Deploy model] button!
(Optional) Enable model monitoring and run through the notebook that opens up. Read more about model monitoring here.
To run cells in your notebook, click a cell followed by the [Run] button on the top toolbar. You can also click a cell and press Shift+Enter.
Start by running the cells that import necessary packages such as sagemaker (the Python SDK) and set parameters such as region, bucket, prefix, and role for later use. Then run the cell that downloads the data.
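The setup cell can be sketched roughly as follows. The AWS-dependent calls need credentials and a configured region, so they are left as comments here; the `prefix` value is a hypothetical S3 key prefix, not necessarily what the notebook uses.

```python
# Sketch of the setup cell, assuming a SageMaker notebook environment.
# The AWS-dependent calls are commented out because they need
# credentials and a region to succeed:
#
#   import sagemaker
#   session = sagemaker.Session()
#   region = session.boto_region_name      # e.g. "us-east-1"
#   bucket = session.default_bucket()      # default SageMaker bucket
#   role = sagemaker.get_execution_role()  # IAM role for the notebook

# Hypothetical S3 key prefix under which the job's data will live.
prefix = "sagemaker/autopilot-dm"
```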
Read the data using pandas' read_csv.
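A minimal sketch of the read step: the Bank Marketing CSV is semicolon-delimited, so `read_csv` needs `sep=";"`. An inline sample stands in for the downloaded file here.

```python
import io
import pandas as pd

# Inline stand-in for the downloaded CSV, which uses ";" as its field
# separator; the real notebook would pass the downloaded filename instead.
csv_text = "age;job;y\n30;admin.;no\n45;technician;yes\n"
data = pd.read_csv(io.StringIO(csv_text), sep=";")
```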
Split the data into training and test datasets.
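The split itself takes only a couple of lines. This sketch uses an 80/20 pandas sample on a toy frame rather than whichever helper the notebook actually calls.

```python
import pandas as pd

# Toy frame standing in for the bank marketing data.
data = pd.DataFrame({"age": range(10), "y": ["no", "yes"] * 5})

# 80/20 split: sample the training rows, keep the remainder for testing.
train_data = data.sample(frac=0.8, random_state=200)
test_data = data.drop(train_data.index)
```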
Run the next cell to set up and launch the Autopilot job.
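At the API level, an Autopilot job is described by a handful of configuration dictionaries. The sketch below builds them for the bank marketing target column `y`; the bucket name, prefix, and job name are placeholders, and the actual boto3 call is left commented since it needs AWS credentials and an IAM role.

```python
# Placeholder names: replace bucket/prefix with your own values.
bucket, prefix = "my-bucket", "sagemaker/autopilot-dm"

input_data_config = [{
    "DataSource": {"S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": f"s3://{bucket}/{prefix}/train",
    }},
    "TargetAttributeName": "y",  # the column Autopilot should predict
}]
output_data_config = {"S3OutputPath": f"s3://{bucket}/{prefix}/output"}
auto_ml_job_config = {"CompletionCriteria": {"MaxCandidates": 10}}

# With credentials and a role in place, the job would be launched via boto3:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_auto_ml_job(
#     AutoMLJobName="automl-banking",
#     InputDataConfig=input_data_config,
#     OutputDataConfig=output_data_config,
#     AutoMLJobConfig=auto_ml_job_config,
#     RoleArn=role,
# )
```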
Run and track the status of the job:
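The tracking cell boils down to polling the job status until it reaches a terminal state. This generic sketch separates the polling loop from the AWS call so the loop itself can run anywhere; the boto3 usage at the bottom is commented because it needs credentials, and the job name is a placeholder.

```python
import time

def wait_for_autopilot_job(describe_fn, poll_seconds=30):
    """Poll until the job reaches a terminal status and return it.

    describe_fn is any callable returning the current AutoMLJobStatus
    string, e.g. one wrapping boto3's describe_auto_ml_job.
    """
    while True:
        status = describe_fn()
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(poll_seconds)

# With boto3 it would be driven like this (needs AWS credentials):
# sm = boto3.client("sagemaker")
# wait_for_autopilot_job(
#     lambda: sm.describe_auto_ml_job(
#         AutoMLJobName="automl-banking")["AutoMLJobStatus"])
```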
Run the rest of the notebook to get the best candidate job, and to view the “Candidate generation” and “Data exploration” notebooks that were generated by SageMaker Autopilot.
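Picking the best candidate amounts to maximizing the job's objective metric over the candidate list. The records below mimic the shape returned by boto3's `list_candidates_for_auto_ml_job`, with made-up candidate names and metric values.

```python
# Fabricated candidate records in the shape returned by boto3's
# list_candidates_for_auto_ml_job (names and values are made up).
candidates = [
    {"CandidateName": "tuning-job-001",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy",
                                       "Value": 0.89}},
    {"CandidateName": "tuning-job-002",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy",
                                       "Value": 0.92}},
]

# The best candidate is the one with the highest objective value.
best_candidate = max(
    candidates, key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"])
```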