As we've seen, the performance of our very simple TensorFlow model is mediocre. While we should revisit the basic network architecture and feature engineering, we can also tune hyperparameters such as the learning rate. SageMaker hyperparameter optimization (HPO) helps find the best set of hyperparameters.
We need to make two changes to the TensorFlow script: accept the learning rate as a parameter, and emit the loss (error) as a recognizable metric.
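As a minimal sketch of those two changes, the excerpt below parses the learning rate from the command line (SageMaker passes hyperparameters as script arguments) and prints the loss in a fixed format a metric regex can match. The function names and the exact `tferror = <value>` log format are illustrative assumptions, chosen to be consistent with the `tferror` metric name used later in this section.

```python
# Hypothetical excerpt from tftab-lr.py.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # SageMaker passes each hyperparameter to the script as --<name> <value>.
    parser.add_argument("--learning_rate", type=float, default=0.01)
    return parser.parse_args(argv)

def report_loss(loss):
    # Print the loss with a recognizable prefix; the tuning job's metric
    # definition (assumed regex) extracts the number from this line.
    print(f"tferror = {loss:.6f}")

if __name__ == "__main__":
    args = parse_args()
    # ... build and train the model with args.learning_rate, then call
    # report_loss(final_loss) with the evaluation loss.
```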
Download the training and serving script. Create a gzipped tar file containing the script, and upload it to your S3 bucket. In the snippet below, BUCKET is the name of your S3 bucket.

tar cvf sourcedir.tar tftab-lr.py
gzip sourcedir.tar
aws s3 cp sourcedir.tar.gz s3://BUCKET/hpo/source/sourcedir.tar.gz
You'll see in the HPO job definition that we want to minimize the loss (error). In a production case, you should consider which metric you want to minimize or maximize, and which hyperparameters you want to explore.
hyperParameterTuningJobObjective:
  type: Minimize
  metricName: tferror
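For the tuner to observe tferror at all, the job definition also needs a metric definition whose regex matches the line the training script prints. A sketch of such a fragment, assuming the script logs lines of the form `tferror = <value>` (the regex and the surrounding field names follow the usual SageMaker metric-definition shape, and should be checked against the template job definition):

```yaml
metricDefinitions:
  - name: tferror
    regex: "tferror = ([0-9\\.]+)"
```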
Download the template job definition and make the following changes. Then apply the job definition:
kubectl apply -f tf-hpo-job.yaml
You can list all HPO jobs:
kubectl get hyperparametertuningjob
You'll see the job status in the output of that command. You can get more details on a job, including the best values found for the hyperparameters, by describing it:
kubectl describe hyperparametertuningjob <JOBNAME>
In this case, it seems that the best value for learning_rate is about 0.005:

Status:
  Best Training Job:
    Creation Time: 2020-04-28T22:34:37Z
    Final Hyper Parameter Tuning Job Objective Metric:
      Metric Name: tferror
      Value:
    Objective Status: Succeeded
    Training End Time: 2020-04-28T22:38:13Z
    Training Job Status: Completed
    Training Start Time: 2020-04-28T22:36:52Z
    Tuned Hyper Parameters:
      Name: learning_rate
      Value: 0.005037803259063151
Detailed logs for a job are available with:
kubectl smlogs hyperparametertuningjob <JOBNAME>