Let's manually run a SageMaker training job using the SageMaker training job operator.
Download the training and serving script. Create a gzipped tar file containing this script and upload it to your S3 bucket. In the snippet below, JOBNAME is a training job identifier that you define:
tar cvf sourcedir.tar tftab.py
gzip sourcedir.tar
aws s3 cp sourcedir.tar.gz s3://BUCKET/JOBNAME/source/sourcedir.tar.gz
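The packaging steps above can be scripted end to end. The sketch below uses a stub tftab.py purely so the tar and gzip steps can be verified locally; replace the stub with the real script, and note that BUCKET and JOBNAME are placeholders you must fill in before uploading.

```shell
set -e

# Stub training script so the packaging can be tested locally;
# substitute the real tftab.py you downloaded.
printf 'print("training placeholder")\n' > tftab.py

# Package and compress, exactly as in the steps above.
tar cvf sourcedir.tar tftab.py
gzip -f sourcedir.tar

# Confirm the artifact exists before uploading.
ls -l sourcedir.tar.gz

# Upload (requires AWS credentials; BUCKET and JOBNAME are placeholders):
# aws s3 cp sourcedir.tar.gz s3://BUCKET/JOBNAME/source/sourcedir.tar.gz
```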
We need an IAM role for our training job to use. Create a new IAM role in the IAM console; for the trusted entity type, choose AWS service, and set the service to SageMaker.
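If you prefer to create the role from the command line instead of the console, the trust policy below is the standard document that lets SageMaker assume a role; the role name and any attached permissions policy are your choice (a common starting point is the AWS-managed AmazonSageMakerFullAccess policy).

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Save this as trust.json and pass it to `aws iam create-role --assume-role-policy-document file://trust.json` along with a role name of your choosing.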
Download the template job definition and make the following changes:
You'll notice on line 32 that we use an ml.m5.xlarge instance, which does not have a GPU. Normally we'd want GPU-powered instances for TensorFlow training, but new AWS accounts often have service limits that prevent launching GPU instances.
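For orientation, here is a hedged sketch of roughly what such a job definition contains. The field names follow the SageMaker Operators for Kubernetes TrainingJob CRD, but the account ID, role name, bucket, region, and training image URI are all placeholders, and the downloaded template is the authoritative version:

```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: tf-job
spec:
  trainingJobName: JOBNAME            # your job identifier
  roleArn: arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME
  region: us-east-1
  algorithmSpecification:
    trainingImage: TF_TRAINING_IMAGE  # TensorFlow training container URI
    trainingInputMode: File
  resourceConfig:
    instanceType: ml.m5.xlarge        # no GPU; see the note above
    instanceCount: 1
    volumeSizeInGB: 5
  outputDataConfig:
    s3OutputPath: s3://BUCKET/JOBNAME/output
  stoppingCondition:
    maxRuntimeInSeconds: 3600
```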
Apply the job definition to submit the training job:

kubectl apply -f tf-job.yaml
You can list all training jobs:
kubectl get trainingjobs
You'll see the job status in the output of that command. You can get more details on a job by describing it:
kubectl describe trainingjob <JOBNAME>
To see the full logs from the job:
kubectl smlogs trainingjob <JOBNAME>