Ensemble Learning with Sagemaker and Step-Functions | Dr. Benjamin Weigel | PowerPoint PPT Presentation



slide-1
SLIDE 1

Ensemble Learning with Sagemaker and Step-Functions

  • Dr. Benjamin Weigel | 09.09.2019

Hamburg, Germany

slide-2
SLIDE 2

Benjamin Weigel

Data Engineer & Cloud Coordinator Europace AG

https://www.europace.de/

slide-3
SLIDE 3
slide-4
SLIDE 4

There is manual effort in obtaining a mortgage

slide-5
SLIDE 5

Smart Document Classification

slide-6
SLIDE 6

Smart Document Classification

  • Text Model: trained on OCR-extracted text
  • Image Model: trained on the page bitmap
  • Sequence Model: trained on sequence information (i.e. “Page 1-4 is a contract”); uses the output of the other models as input

slide-7
SLIDE 7

Options to Build a Model Training-Pipeline

slide-8
SLIDE 8

AWS Sagemaker

slide-9
SLIDE 9

AWS Step-Functions

  • define a distributed workflow as a series of steps
  • visual workflow
  • long-running workflows (max. 1 year)
  • 4,000 transitions/month for free
  • after that: 0.025 USD per 1,000 transitions
  • can get expensive quickly
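As a rough sanity check of the pricing above, here is a hypothetical back-of-the-envelope sketch (the function name and numbers-as-defaults are illustrative, not an official AWS calculator):

```python
def monthly_step_function_cost(transitions: int,
                               free_tier: int = 4_000,
                               usd_per_1000: float = 0.025) -> float:
    """Estimate the monthly Step Functions bill for a number of state transitions."""
    billable = max(0, transitions - free_tier)
    return billable / 1_000 * usd_per_1000

# e.g. a 50-transition workflow executed 1,000 times a month:
cost = monthly_step_function_cost(50 * 1_000)  # 46,000 billable transitions, about 1.15 USD
```

Per-execution cost is tiny, but transitions multiply quickly with parallel branches and retries.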
slide-10
SLIDE 10

AWS Step-Functions

slide-11
SLIDE 11

Amazon States Language

  • define a state-machine
  • JSON-based
  • describe:
    ○ a state and the transition to the next
    ○ error-conditions etc.

{
  "Comment": "An example of the Amazon States Language.",
  "StartAt": "FirstState",
  "States": {
    "FirstState": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:...",
      "Next": "ChoiceState"
    },
    ...
  }
}

"ChoiceState": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.foo",
      "NumericEquals": 1,
      "Next": "FirstMatchState"
    },
    {
      "Variable": "$.foo",
      "NumericEquals": 2,
      "Next": "SecondMatchState"
    }
  ],
  "Default": "DefaultState"
}

slide-12
SLIDE 12

Step Functions & Sagemaker

easy peasy

  • Step Functions can control Sagemaker Transform and Training Jobs directly via these Resources:

"arn:aws:states:::sagemaker:createTransformJob.sync" "arn:aws:states:::sagemaker:createTrainingJob.sync"

https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html

slide-13
SLIDE 13

sagemaker:createTrainingJob.sync

  • configure the job via the Parameters section

https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html

"Image Model Training": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
  "Parameters": {
    "TrainingJobName": "ImageModel",
    "AlgorithmSpecification": {
      "TrainingImage": "520713654638.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-mxnet:1.3-gpu-py3",
      "TrainingInputMode": "File"
    },
    "HyperParameters": {
      "epochs": "80",
      "batch_size": "10",
      "conv_block_length": "2",
      "cycle_length": "10",
      "depth": "5",
      "dropout": "0.5",
      "max_lr": "0.1",
      "min_lr": "0.0001",
      ...
      "start_filter": "4",
      "worker": "4"
    },
    "InputDataConfig.$": "$.generated.image_model.InputDataConfig",
    "OutputDataConfig": {
      "S3OutputPath.$": "$.generated.output_artifact_paths.image_model_prefix"
    },
    "ResourceConfig": {
      "InstanceCount": 4,
      "InstanceType": "ml.p2.xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": "arn:aws:iam::123456789012:role/sm-stepfunction-iam-role",
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 172800
    }
  }
}

https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestSyntax

slide-14
SLIDE 14

The Good

Photo by Joshua Ness on Unsplash

slide-15
SLIDE 15

Start simple

{
  "StartAt": "Train Text Model",
  "States": {
    "Train Text Model": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Parameters": { ... },
      "End": true
    }
  }
}

slide-16
SLIDE 16

Expand from there

{
  "StartAt": "Fetch Preprocessed Data",
  "States": {
    "Fetch Preprocessed Data": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Next": "Train Text Model",
      "Parameters": {
        "JobName": "FetchPreparedData",
        "JobDefinition": "arn:aws:batch:us-east-1:1234567890:job-definition/job:2",
        "JobQueue": "arn:aws:batch:us-east-1:1234567890:job-queue/queue",
        "Parameters": {
          "DATA_INPUT_PATH.$": "$.input_data",
          "OUTPUT_PATH.$": "$.ready_to_use_artifacts"
        }
      }
    },
    "Train Text Model": { ... }
  }
}

slide-17
SLIDE 17

Retry if possible

"Train Text Model": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
  ...
  "Retry": [
    {
      "ErrorEquals": [ "SageMaker.AmazonSageMakerException" ],
      "IntervalSeconds": 1,
      "MaxAttempts": 100,
      "BackoffRate": 1.1
    },
    ...
  ]
}
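The Retry fields translate into a wait schedule: IntervalSeconds is the wait before the first retry, and each subsequent wait is the previous one multiplied by BackoffRate. A small sketch (the helper name is made up) of what the block above configures:

```python
def retry_wait_times(interval_seconds: float, backoff_rate: float, max_attempts: int):
    """Wait before retry i is IntervalSeconds * BackoffRate**(i-1)."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]

# The schedule configured above: 1 s, 1.1 s, 1.21 s, ... over 100 attempts.
waits = retry_wait_times(1, 1.1, 100)
```

A gentle BackoffRate with many attempts keeps retrying for a long time without ever waiting very long between attempts.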

slide-18
SLIDE 18

If all else fails ...

"Train Text Model": {
  "Type": "Task",
  ...
  "Catch": [{
    "ErrorEquals": ["States.ALL"],
    "Next": "Notify Failure"
  }]
},
"Notify Failure": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "End": true,
  "Parameters": {
    "Subject": "[ERROR] Model Training failed!",
    "Message": "Error during model training!",
    "TopicArn": "arn:aws:sns:*:123456789012:alerting_topic",
    "MessageAttributes": { ... }
  }
}

slide-19
SLIDE 19

But there is a catch

...it’s a valid state after all

slide-20
SLIDE 20

Fail successfully!

"Notify Failure": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "Next": "Fail",
  ...
},
"Fail": {
  "Type": "Fail"
}

slide-21
SLIDE 21

Add a few more models ...

  • Text Model: trained on OCR-extracted text
  • Image Model: trained on the page bitmap
  • Sequence Model: trained on sequence information (i.e. “Page 1-4 is a contract”); uses the output of the other models as input

slide-22
SLIDE 22

Use concurrency for time efficiency

"Fetch Preprocessed Data": {
  ...
  "Next": "Base Model Training"
},
"Base Model Training": {
  "Type": "Parallel",
  "Next": "Train Sequence Model",
  "Branches": [
    {
      "StartAt": "Train Image Model",
      "States": {
        "Train Image Model": { ... "End": true }
      }
    },
    {
      "StartAt": "Train Text Model",
      "States": {
        "Train Text Model": { ... "End": true }
      }
    }
  ]
},
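Since every branch of a Parallel state has the same shape (a StartAt plus a States map ending in a terminal state), the definition can be assembled programmatically. A hypothetical sketch, not from the talk:

```python
def parallel_state(next_state: str, branches: list) -> dict:
    """Assemble a Parallel state whose branches each wrap a single training task."""
    return {
        "Type": "Parallel",
        "Next": next_state,
        "Branches": [
            # each branch is its own mini state machine with one terminal state
            {"StartAt": name, "States": {name: {**task, "End": True}}}
            for name, task in branches
        ],
    }

base = parallel_state("Train Sequence Model",
                      [("Train Image Model", {"Type": "Task"}),
                       ("Train Text Model", {"Type": "Task"})])
```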

slide-23
SLIDE 23

Beware of “silent” errors

The notification trigger won't fire because there is no state defined for this scenario → an unexpected failure.
slide-24
SLIDE 24

Everything should fail the same

"Base Model Training": {
  "Type": "Parallel",
  "Next": "Train Sequence Model",
  "Branches": [...],
  "Catch": [
    {
      "ErrorEquals": [ "States.ALL" ],
      "Next": "Notify Failure"
    }
  ]
}

slide-25
SLIDE 25

Some jobs are long running and expensive

then something fails and you have to debug (rerun) …

slide-26
SLIDE 26

Save time & money

skip some steps...

"States": {
  "Skip Image Model Training?": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.train_image_model",
        "BooleanEquals": false,
        "Next": "Skip Fetch Preprocessing Artifacts"
      }
    ],
    "Default": "Train Image Model"
  },
  "Skip Fetch Preprocessing Artifacts": {
    "Type": "Pass",
    "End": true
  },
  "Train Image Model": {
    ...
    "End": true
  }
}
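Every skippable step needs the same Choice/Pass boilerplate, so it can be generated. A hypothetical helper (not from the talk) that wraps a task in this pattern:

```python
def skippable(state_name: str, state: dict, flag_path: str) -> dict:
    """Wrap a state in a Choice that jumps to a Pass state when the flag is false."""
    return {
        f"Skip {state_name}?": {
            "Type": "Choice",
            "Choices": [
                {"Variable": flag_path, "BooleanEquals": False,
                 "Next": f"Skip {state_name}"}
            ],
            "Default": state_name,
        },
        # the Pass state is the "skipped" branch's terminal state
        f"Skip {state_name}": {"Type": "Pass", "End": True},
        state_name: state,
    }

states = skippable("Train Image Model",
                   {"Type": "Task", "End": True},
                   "$.train_image_model")
```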

slide-27
SLIDE 27

Rinse and repeat

...and add a little sprinkle on top

slide-28
SLIDE 28

Our Model Training Workflow

  • Lambda
  • Batch Job
  • Sagemaker
  • SNS
  • Choice
  • Pass (to skip steps)
  • Fail
  • Wait
slide-29
SLIDE 29

Our Model Training Workflow

  • Input to setup state machine execution
  • define where the data is
  • (Hyper)Parameterization
  • Data & Models stored on S3 (each execution gets its own copy of the data)

[Diagram: “Setup” Input → Step Function, reading Data (S3) and writing Models & Data (S3)]

slide-30
SLIDE 30

The Bad

Photo by Markus Spiske on Unsplash

slide-31
SLIDE 31

“Configure” step functions via initial input

{
  "initialization": {
    "fetch_data": {
      "image_model_artifact_path": "",
      "preprocessed_data_path": "s3://data/2019-08-31T21:53:12+0200",
      "text_model_artifact_path": ""
    },
    "image_model": {
      "batch_size": "128",
      "instance_type": "ml.p2.xlarge"
    },
    "output_artifact_target_base_path": "s3://data/model_training_data",
    "run_training_steps": {
      "image_model": true,
      "text_model": true,
      "fetch_preprocessing_artifacts": true,
      "generate_image_model_split": true
    },
    "sequence_model": { ... },
    "text_model": { ... }
  }
}

slide-32
SLIDE 32

“Generate” additional parameterization for execution

The initial input is expanded into the generated parametrization of the state machine via a “setup” function. Beware: max. 32,768 characters for input/result!
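Exceeding the payload limit only fails at runtime, so it is worth checking the serialized input up front. A small pre-flight sketch (the limit is the one quoted on the slide; the helper name is made up):

```python
import json

MAX_PAYLOAD_CHARS = 32_768  # Step Functions input/result limit quoted in the talk

def check_execution_input(payload: dict) -> str:
    """Serialize the execution input and fail fast if it exceeds the payload limit."""
    body = json.dumps(payload)
    if len(body) > MAX_PAYLOAD_CHARS:
        raise ValueError(f"input is {len(body)} chars, limit is {MAX_PAYLOAD_CHARS}")
    return body
```

The returned string is what you would hand to `start_execution` as the input.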

slide-33
SLIDE 33

Reference parameters in steps via JSON path expressions

"Image Model Training": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
  "Parameters": {
    "TrainingJobName.$": "$.generated.image_model.TrainingJobName",
    "HyperParameters": {
      "epochs": "80",
      "batch_size.$": "$.initialization.image_model.batch_size",
      "bucket.$": "$.initialization.image_model.log_bucket",
      "conv_block_length": "2",
      "cycle_length": "10",
      "depth": "5",
      "dropout": "0.5",
      "job_name.$": "$.generated.image_model.TrainingJobName",
      "max_lr": "0.1",
      "min_lr": "0.0001",
      "worker": "4"
    },
    "InputDataConfig.$": "$.generated.image_model.InputDataConfig",
    "ResourceConfig": {
      "InstanceCount": 4,
      "InstanceType.$": "$.initialization.image_model.instance_type"
      ...

slide-34
SLIDE 34

Anecdotal evidence: ml.pX instances are a scarce commodity!

"Is Capacity Error?": {
  "Type": "Choice",
  "Comment": "Retry if capacity error.",
  "Choices": [
    {
      "Variable": "$.error-info.Cause.FailureReason",
      "StringEquals": "CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.",
      "Next": "Wait 10 Minutes"
    }
  ],
  "Default": "FailImageModel"
},

slide-35
SLIDE 35

Retry on Capacity Error

No notification about failure!

slide-36
SLIDE 36


slide-37
SLIDE 37

Retry on Capacity Error - JSONify Error Cause with Lambda

slide-38
SLIDE 38

The Ugly

Photo by Zoltan Tasi on Unsplash

slide-39
SLIDE 39

Hyperparameters in the beginning be like...

"HyperParameters": {
  "s3_log_folder": "\"logs/sagemaker\"",
  "job_name.$": "$.generated.text_model.TrainingJobName",
  "sagemaker_container_log_level": "20",
  "sagemaker_enable_cloudwatch_metrics": "false",
  "sagemaker_program": "\"sagemaker_entry_point.py\"",
  "sagemaker_region": "\"${AWS::Region}\"",
  "sagemaker_submit_directory": "\"${textModelArtifactPath}\""
},
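The escaped quotes come from the fact that SageMaker hyperparameter values are always strings, and containers launched by the Python SDK expect each value to be JSON-serialized and json.loads it back. A sketch of that round trip (the dict contents are illustrative):

```python
import json

hyperparameters = {"s3_log_folder": "logs/sagemaker", "epochs": 80, "dropout": 0.5}

# The SDK JSON-serializes every value, which is why plain strings
# end up with embedded escaped quotes in the state machine definition.
serialized = {k: json.dumps(v) for k, v in hyperparameters.items()}

# Inside the container the values are json.loads'ed back to their types:
deserialized = {k: json.loads(v) for k, v in serialized.items()}
```

Hand-written definitions have to reproduce this double encoding by hand, hence the `"\"...\""` values above.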

slide-40
SLIDE 40


slide-41
SLIDE 41


Expect a bumpy ride...

slide-42
SLIDE 42


class TrainingEnvironment(ContainerEnvironment):
    # TODO expecting serialized hyperparams might break containers
    # that aren't launched by python sdk
    @staticmethod
    def _deserialize_hyperparameters(hp):
        ...
        for (k, v) in hp.items():
            ...
            hyperparameter_dict[k] = json.loads(v)

*Fixed in MXNet ≥1.3 Images

slide-43
SLIDE 43

Infrastructure as Code for Step Functions

slide-44
SLIDE 44

Infrastructure-as-Code and the Amazon States Language

  • we started out with Cloudformation
  • stringified JSON-definition
  • no JSON linting possible
  • can you spot what’s wrong?

SendSnsStateMachine:
  Type: 'AWS::StepFunctions::StateMachine'
  Properties:
    StateMachineName: 'send-hello-world-sns'
    RoleArn: !GetAtt Role.Arn
    DefinitionString: |-
      {
        "StartAt": "HelloWorld",
        "States": {
          "HelloWorld": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
              "TopicArn": "arn:aws:sns:eu-central-1:0123456789:hello-world",
              "Message": {
                "Input": "Hello from Step Functions!",
              }
            },
            "End": true
          }
        }
      }
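Because CloudFormation treats the definition as an opaque string, nothing lints it at deploy time. A simple pre-deploy json.loads catches mistakes like the trailing comma hidden in the example above:

```python
import json

# The state machine fragment from the slide, including its bug:
definition = '{ "Message": { "Input": "Hello from Step Functions!" , } }'

try:
    json.loads(definition)
    valid = True
except json.JSONDecodeError:
    valid = False  # the trailing comma before '}' is rejected here
```

Running such a check in CI is cheaper than a failed stack deployment.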

slide-45
SLIDE 45

Infrastructure-as-Code and the Amazon States Language

ModelTrainingStateMachine:
  Type: 'AWS::StepFunctions::StateMachine'
  Properties:
    StateMachineName: !Sub '${Service}-model-training-${Stage}'
    RoleArn: !GetAtt TrainingStateMachineExecutionRole.Arn
    DefinitionString:
      Fn::Sub:
        - |-
          { "Comment": "Trains the SmartCat model. Input is ", "StartAt": "Setup Statemachine", "TimeoutSeconds": 172800, "States": {
          "Setup Statemachine": { "Type": "Task", "ResultPath": "$.generated", "Resource": "arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:smartcat-model-training-setup-lambda-${Stage}", "Comment": "Generates dynamic input parameters (paths for models, image model splits...)", "Next": "Fetch Preprocessing Artifacts?" },
          "Fetch Preprocessing Artifacts?": { "Type": "Choice", "Comment": "If 'initialization.run_training_steps.fetch_preprocessing_artifacts' is false ...", "Choices": [ { "Variable": "$.initialization.run_training_steps.fetch_preprocessing_artifacts", "BooleanEquals": false, "Next": "Skip Fetch Preprocessing Artifacts" } ], "Default": "Fetch Preprocessing Artifacts" },
          "Skip Fetch Preprocessing Artifacts": { "Type": "Pass", "Next": "Base Model Training" },
          "Fetch Preprocessing Artifacts": { "Type": "Task", "Comment": "Fetches the preprocessed data (& any existing models) and l...", "Resource": "arn:aws:states:::batch:submitJob.sync", "ResultPath": null, "InputPath": "$", "Parameters": { "JobDefinition": "${FetchDataBatchJob}", "JobName": "FetchSmartCatPreprocessedData", "JobQueue": "arn:aws:batch:${AWS::Region}:${AWS::AccountId}:job-queue/MediumPriority-DnaBatchCompute-JobQueue", "Parameters": { "INPUT_PATH.$": "$.initialization.fetch_data.preprocessed_data_path", "OUTPUT_PATH.$": "$.generated.output_artifact_paths.raw_data" } }, "Catch": [ { "ErrorEquals": [ "States.ALL" ], "Next": "Notify Failure" } ], "Next": "Base Model Training" },
          "Base Model Training": { "Type": "Parallel", "Next": "Sequence Model Dataframe generation", "ResultPath": null, "Catch": [ { "ErrorEquals": [ "States.ALL" ], "Next": "Notify Failure" } ], "Branches": [ { "StartAt": "Image Model trainieren?", "States": { "Image Model trainieren?": { "Type": "Choice", "Comment": "Behavior controlled via 'initialization.training_steps.image_model'", "Choices": [ { "And": [ { "Variable": "$.initialization.run_training_steps.image_model", "BooleanEquals": true }, { "Variable": "$.initialization.run_training_steps.generate_image_model_split", "BooleanEquals": true } ], "Next": "Create Image Model Split Set" }, { "And": [ { "Variable": "$.initialization.run_training_steps.image_model", "BooleanEquals": true }, { "Variable": "$.initialization.run_training_steps.generate_image_model_split", "BooleanEquals": false } ], "Next": "Image Model Training" } ], "Default": "Skip Image Training" }, "Skip Image Training": { "Type": "Pass", "End": true }, "Create Image Model Split Set": { "Type": "Task", "Comment": "Creates the split test set for the image model. Writes the InputDataConfig.", "Resource": "arn:aws:states:::batch:submitJob.sync", "Parameters": { "JobName": "GenerateImageModelSplitSet", "JobDefinition": "${SplitImageDataBatchJob}", "JobQueue": "arn:aws:batch:${AWS::Region}:${AWS::AccountId}:job-queue/MediumPriority-DnaBatchCompute-JobQueue", "Parameters": { "K_FOLDS": "4", "INPUT_PATH.$": "$.generated.output_artifact_paths.raw_data", "OUTPUT_PATH.$": "$.generated.output_artifact_paths.image_model_training_split_set" } }, "ResultPath": null, "Next": "Image Model Training" }, "Image Model Training": { "Type": "Task", "Comment": "Trains the image model and puts the artifact into the directory for this job run.", "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync", "End": true, "Parameters": { "TrainingJobName.$": "$.generated.image_model.TrainingJobName", "AlgorithmSpecification": { "TrainingImage": "520713654638.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-mxnet:1.3-gpu-py3", "TrainingInputMode": "File" }, "HyperParameters": { "epochs": "80", "batch_size.$": "$.initialization.image_model.batch_size", "bucket.$": "$.initialization.image_model.log_bucket", "conv_block_length": "2", "cycle_length": "10", "depth": "5", "dropout": "0.5", "job_name.$": "$.generated.image_model.TrainingJobName", "max_lr": "0.1", "min_lr": "0.0001", "s3_log_folder": "logs/sagemaker", "sagemaker_container_log_level": "20", "sagemaker_enable_cloudwatch_metrics": "false", "sagemaker_program": "train_n_folds.py", "sagemaker_region": "${AWS::Region}", "sagemaker_submit_directory": "${imageModelArtifactPath}", "start_filter": "4", "worker": "4" }, "InputDataConfig.$": "$.generated.image_model.InputDataConfig", "OutputDataConfig": { "S3OutputPath.$": "$.generated.output_artifact_paths.image_model_prefix" }, "ResourceConfig": { "InstanceCount": 4, "InstanceType.$": "$.initialization.image_model.instance_type", "VolumeSizeInGB": 10 }, "RoleArn": "${SagemakerTrainModelRoleArn}", "StoppingCondition": { "MaxRuntimeInSeconds": 172800 }, "Tags": [ { "Key": "service", "Value": "${Service}" }, { "Key": "subservice", "Value": "training" } ] }, "Catch": [ { "ErrorEquals": [ "States.TaskFailed" ], "Next": "JSONify Error Cause", "ResultPath": "$.error-info" } ] }, "JSONify Error Cause": { "Type": "Task", "Resource": "arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:smartcat-model-training-jsonify-error-lambda-${Stage}", "Next": "Is Capacity Error?" }, "Is Capacity Error?": { "Type": "Choice", "Comment": "Retry if capacity error.", "Choices": [ { "Variable": "$.error-info.Cause.FailureReason", "StringEquals": "CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.", "Next": "Wait 10 Minutes" } ], "Default": "FailImageModel" }, "Wait 10 Minutes": { "Type": "Wait", "Seconds": 60, "Next": "Image Model Training" }, "FailImageModel": { "Type": "Fail" } } }, { "StartAt": "Text Model trainieren?", "States": { "Text Model trainieren?": { "Type": "Choice", "Comment": "If 'initialization.training_steps.text_model' is false, do not run this training step.", "Choices": [ { "Variable": "$.initialization.run_training_steps.text_model", "BooleanEquals": true, "Next": "Text Model Training" } ], "Default": "Skip Text Training" }, "Skip Text Training": { "Type": "Pass", "End": true }, "Text Model Training": { "Type": "Task", "Comment": "Trains the text model and puts the artifact into the directory for this job run.", "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync", "End": true, "Parameters": { "TrainingJobName.$": "$.generated.text_model.TrainingJobName", "AlgorithmSpecification": { "TrainingImage": "${SagemakerTextTrainingImage}", "TrainingInputMode": "File" }, "HyperParameters": { "bucket.$": "$.initialization.text_model.log_bucket", "s3_log_folder": "\"logs/sagemaker\"", "job_name.$": "$.generated.text_model.TrainingJobName", "sagemaker_container_log_level": "20", "sagemaker_enable_cloudwatch_metrics": "false", "sagemaker_program": "\"sagemaker_entry_point.py\"", "sagemaker_region": "\"${AWS::Region}\"", "sagemaker_submit_directory": "\"${textModelArtifactPath}\"" }, "InputDataConfig.$": "$.generated.text_model.InputDataConfig", "OutputDataConfig": { "S3OutputPath.$": "$.generated.output_artifact_paths.text_model_prefix" }, "ResourceConfig": { "InstanceCount": 1, "InstanceType": "ml.m5.2xlarge", "VolumeSizeInGB": 10 }, "RoleArn": "${SagemakerTrainModelRoleArn}", "StoppingCondition": { "MaxRuntimeInSeconds": 14400 }, "Tags": [ { "Key": "service", "Value": "${Service}" }, { "Key": "subservice", "Value": "training" } ] } } } } ] },
          "Sequence Model Dataframe generation": { "Type": "Task", "Resource": "arn:aws:states:::batch:submitJob.sync", "Comment": "Builds the DataFrame for the sequence model from the text and image model data", "ResultPath": null, "Parameters": { "JobName": "GenerateSequenceModelDataframe", "JobDefinition": "${PredictionsMergerBatchJob}", "JobQueue": "arn:aws:batch:${AWS::Region}:${AWS::AccountId}:job-queue/MediumPriority-DnaBatchCompute-JobQueue", "Parameters": { "INPUT_PATH_SVM.$": "$.generated.output_artifact_paths.text_model_prefix", "INPUT_PATH_CNN.$": "$.generated.output_artifact_paths.image_model_prefix", "OUTPUT_PATH.$": "$.generated.output_artifact_paths.sequence_model_training_dataframe" } }, "Catch": [ { "ErrorEquals": [ "States.ALL" ], "Next": "Notify Failure" } ], "Next": "Sequence Model Training" },
          "Sequence Model Training": { "Type": "Task", "Comment": "Trains the sequence model and puts the artifact into the directory for this job run.", "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync", "Next": "Post Training Artifact Fetcher", "ResultPath": null, "Parameters": { "TrainingJobName.$": "$.generated.sequence_model.TrainingJobName", "AlgorithmSpecification": { "TrainingImage": "520713654638.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-mxnet:1.2-cpu-py3", "TrainingInputMode": "File" }, "HyperParameters": { "bucket.$": "$.initialization.sequence_model.log_bucket", "job_name": "\"smartcat-SequenceModelTraining\"", "dropout": "0", "epoch_seq": "10000", "epochs": "20", "learning_rate": "0.001", "lr_schedule_epoch": "1", "lr_schedule_factor": "0.5", "mode": "\"both\"", "net_layers": "2", "net_width": "300", "optimizer": "\"rmsprop\"", "rnn_type": "\"gru\"", "val_seq": "3000", "s3_log_folder": "\"logs/sagemaker\"", "sagemaker_container_log_level": "20", "sagemaker_enable_cloudwatch_metrics": "false", "sagemaker_program": "\"train_seq_mx12.py\"", "sagemaker_region": "\"${AWS::Region}\"", "sagemaker_submit_directory": "\"${sequenceModelArtifactPath}\"" }, "InputDataConfig.$": "$.generated.sequence_model.InputDataConfig", "OutputDataConfig": { "S3OutputPath.$": "$.generated.output_artifact_paths.sequence_model_prefix" }, "ResourceConfig": { "InstanceCount": 1, "InstanceType": "ml.m5.large", "VolumeSizeInGB": 10 }, "RoleArn": "${SagemakerTrainModelRoleArn}", "StoppingCondition": { "MaxRuntimeInSeconds": 172800 }, "Tags": [ { "Key": "service", "Value": "${Service}" }, { "Key": "subservice", "Value": "training" } ] } },
          "Post Training Artifact Fetcher": { "Type": "Task", "Comment": "Copies the artifacts of the individual training steps into a target directory", "End": true, "Resource": "arn:aws:states:::batch:submitJob.sync", "ResultPath": null, "Parameters": { "JobName": "PostTrainingArtifactsFetcherJob", "JobDefinition": "${PostTrainingFetchBatchJob}", "JobQueue": "arn:aws:batch:${AWS::Region}:${AWS::AccountId}:job-queue/MediumPriority-DnaBatchCompute-JobQueue", "Parameters": { "INPUT_PATH_TXT.$": "$.generated.output_artifact_paths.text_model_prefix", "INPUT_PATH_IMAGE.$": "$.generated.output_artifact_paths.image_model_prefix", "INPUT_PATH_SEQ.$": "$.generated.output_artifact_paths.sequence_model_prefix", "OUTPUT_PATH.$": "$.generated.output_artifact_paths.ready_to_use_artifacts" } }, "Catch": [ { "ErrorEquals": [ "States.ALL" ], "Next": "Notify Failure" } ] },
          "Notify Failure": { "Type": "Task", "Resource": "arn:aws:states:::sns:publish", "Parameters": { "Subject": "[ERROR] - ${Service} - Model Training failed!", "Message": "There was an error during the model training!", "TopicArn": "${AlertingTopic}", "MessageAttributes": { "ErrorType": { "DataType": "String", "StringValue.$": "$.Error" }, "CauseOfError": { "DataType": "String", "StringValue.$": "$.Cause" } } }, "Next": "Fail" },
          "Fail": { "Type": "Fail" } } }
        - { AlertingTopic: !FindInMap [ETLMappings, !Ref Stage, 'alertingTopic'],
            SagemakerTextTrainingImage: !FindInMap [ETLMappings, !Ref Stage, 'sageMakerTextModelTrainingImage'],
            SagemakerTrainModelRoleArn: !GetAtt SagemakerTrainModelRole.Arn,
            FetchDataBatchJob: !Ref TrainingFetchBatchJob,
            PredictionsMergerBatchJob: !Ref TrainingPredictionsMergerBatchJob,
            SplitImageDataBatchJob: !Ref ImageTrainingSplitBatchJob,
            PostTrainingFetchBatchJob: !Ref PostTrainingFetchBatchJob }

Reality isn’t as easily debuggable ...

slide-46
SLIDE 46

IaC & ASL ...it gets better with AWS CDK

import cdk = require('@aws-cdk/core');
import stepfunction = require('@aws-cdk/aws-stepfunctions');
import stepfunctionTasks = require('@aws-cdk/aws-stepfunctions-tasks');
...
new stepfunction.CfnStateMachine(this, "state-machine", {
  definitionString: fs.readFileSync("./lib/statemachine.json").toString(),
  roleArn: "..."
});

const dataBucket = new S3.Bucket(this, "data-bucket")
const startState = new stepfunction.Pass(this, 'StartState');
const trainText = new stepfunction.Task(this, "SageMaker", {
  task: new stepfunctionTasks.SagemakerTrainTask({
    trainingJobName: "TextModelTraining",
    inputDataConfig: [
      {
        channelName: "channel_1",
        dataSource: {
          s3DataSource: {
            s3Location: stepfunctionTasks.S3Location.fromBucket(dataBucket, "input")
          }
        }
      }
    ],
    ...
  })
});
const definition = startState.next(trainText)
new stepfunction.StateMachine(this, 'StateMachine', { definition: definition });

Generated definition:

{
  "StartAt": "Train Text Model",
  "States": {
    "Train Text Model": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Parameters": { ... },
      "End": true
    }
  }
}

slide-47
SLIDE 47

Lifehack #1: Rerunning a job via “New Execution”

save yourself some typing or copy-pasting

slide-48
SLIDE 48

Lifehack #2: Use the Pass state to get started

  • define the result you want
  • go from there
  • use to debug
  • use to get started and see the visual workflow

"PrepareXY": {
  "Type": "Pass",
  "Result": {
    "x": 0.381018,
    "y": 622.2269926397355
  },
  "ResultPath": "$.coords",
  "Next": "FindArealPhoto"
},
"FindArealPhoto": {
  "Type": "Task",
  ...
  "End": true
}

  • supports Result, ResultPath and Parameters
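A Pass state does no work; it simply injects its Result into the input at ResultPath and moves on, which is what makes it useful for stubbing out steps. Roughly, simplified to single-level `$.key` paths (the helper is illustrative):

```python
def apply_pass_state(state_input: dict, result, result_path: str) -> dict:
    """Mimic a Pass state: place Result at ResultPath (simple '$.key' paths only)."""
    assert result_path.startswith("$.")
    out = dict(state_input)           # the rest of the input passes through untouched
    out[result_path[2:]] = result
    return out

out = apply_pass_state({"job": "demo"},
                       {"x": 0.381018, "y": 622.2269926397355},
                       "$.coords")
```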

slide-49
SLIDE 49

Take-Aways

  • no “infrastructure” to manage (vs. tooling like Airflow)
  • easy to get started
  • extendable, evolves with your requirements
  • lots of service integrations
  • complicated logic means a complicated definition
  • need to drag state along (if you want to use JSON-path expressions to parametrize steps)

slide-50
SLIDE 50

@dreigelb