
Ensemble Learning with Sagemaker and Step-Functions

Ensemble Learning with Sagemaker and Step-Functions. Dr. Benjamin Weigel | 09.09.2019, Hamburg, Germany. Benjamin Weigel, Data Engineer & Cloud Coordinator, Europace AG, https://www.europace.de/. There is manual effort in obtaining a mortgage.


  1. Ensemble Learning with Sagemaker and Step-Functions Dr. Benjamin Weigel | 09.09.2019 Hamburg, Germany

  2. Benjamin Weigel Data Engineer & Cloud Coordinator Europace AG https://www.europace.de/

  3. There is manual effort in obtaining a mortgage

  4. Smart Document Classification

  5. Smart Document Classification - Text Model: trained on OCR-extracted text - Image Model: trained on page-bitmap - use output as input - Sequence Model: trained on sequence information (e.g. “Pages 1-4 are a contract”)

  6. Options to Build a Model Training-Pipeline

  7. AWS Sagemaker

  8. AWS Step-Functions - define a distributed workflow as a series of steps - visual workflow - long-running workflows (max 1 year) - 4000 transitions/month for free - after that: 0.025 USD per 1000 transitions - can get expensive quickly
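To get a feeling for "can get expensive quickly", the pricing on this slide can be turned into a rough estimate. This is a minimal sketch that uses only the two numbers quoted above (as of 2019); the function name is made up for illustration, and current AWS pricing may differ.

```python
# Numbers as quoted on the slide (2019); check current AWS pricing before relying on them.
FREE_TRANSITIONS_PER_MONTH = 4000
USD_PER_1000_TRANSITIONS = 0.025


def monthly_cost_usd(transitions: int) -> float:
    """Estimate the monthly bill for a given number of state transitions."""
    if transitions < 0:
        raise ValueError("transitions must be non-negative")
    # Only transitions beyond the free tier are billed.
    billable = max(0, transitions - FREE_TRANSITIONS_PER_MONTH)
    return billable / 1000 * USD_PER_1000_TRANSITIONS


for n in (4_000, 104_000, 10_000_000):
    print(f"{n:>10} transitions -> {monthly_cost_usd(n):.2f} USD")
```

Every retry, Choice, and Pass state counts as a transition, so workflows with tight retry loops add up faster than the number of "real" steps suggests.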

  9. AWS Step-Functions

  10. Amazon States Language - define a state-machine - JSON-based - describe: a state and the transition to the next, error-conditions etc.

      {
        "Comment": "An example of the Amazon States Language.",
        "StartAt": "FirstState",
        "States": {
          "FirstState": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:...",
            "Next": "ChoiceState"
          },
          ...
        }
      }

      "ChoiceState": {
        "Type": "Choice",
        "Choices": [
          { "Variable": "$.foo", "NumericEquals": 1, "Next": "FirstMatchState" },
          { "Variable": "$.foo", "NumericEquals": 2, "Next": "SecondMatchState" }
        ],
        "Default": "DefaultState"
      }
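How the Choice state above picks its transition can be illustrated with a small local evaluator. This is only a sketch under simplifying assumptions: it handles top-level variables of the form `$.name` and the `NumericEquals` / `BooleanEquals` comparisons, while the real Amazon States Language supports many more operators and full paths. The function name `next_state` is hypothetical.

```python
def next_state(choice_state: dict, data: dict) -> str:
    """Return the state name a Choice state would transition to for `data`."""
    for rule in choice_state["Choices"]:
        variable = rule["Variable"]
        if not variable.startswith("$."):
            raise ValueError(f"unsupported variable path: {variable}")
        value = data.get(variable[2:])
        # bool is a subclass of int in Python, so exclude it from numeric checks.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if "NumericEquals" in rule and is_number and value == rule["NumericEquals"]:
            return rule["Next"]
        if "BooleanEquals" in rule and isinstance(value, bool) and value == rule["BooleanEquals"]:
            return rule["Next"]
    if "Default" not in choice_state:
        raise LookupError("no rule matched and no Default is defined")
    return choice_state["Default"]


choice_state = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.foo", "NumericEquals": 1, "Next": "FirstMatchState"},
        {"Variable": "$.foo", "NumericEquals": 2, "Next": "SecondMatchState"},
    ],
    "Default": "DefaultState",
}

print(next_state(choice_state, {"foo": 2}))
```

Rules are checked in order and the first match wins; `Default` is only used when nothing matches.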

  11. Step Functions & Sagemaker - Step Functions can control Sagemaker Transform and Training Jobs directly via these Resources:

      "arn:aws:states:::sagemaker:createTransformJob.sync"
      "arn:aws:states:::sagemaker:createTrainingJob.sync"

      easy peasy

      https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html

  12. sagemaker:createTrainingJob.sync - configure job via Parameters section

      "Image Model Training": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
        "Parameters": {
          "TrainingJobName": "ImageModel",
          "AlgorithmSpecification": {
            "TrainingImage": "520713654638.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-mxnet:1.3-gpu-py3",
            "TrainingInputMode": "File"
          },
          "HyperParameters": {
            "epochs": "80",
            "batch_size": "10",
            "conv_block_length": "2",
            "cycle_length": "10",
            "depth": "5",
            "dropout": "0.5",
            "max_lr": "0.1",
            "min_lr": "0.0001",
            ...
            "start_filter": "4",
            "worker": "4"
          },
          "InputDataConfig.$": "$.generated.image_model.InputDataConfig",
          "OutputDataConfig": {
            "S3OutputPath.$": "$.generated.output_artifact_paths.image_model_prefix"
          },
          "ResourceConfig": {
            "InstanceCount": 4,
            "InstanceType": "ml.p2.xlarge",
            "VolumeSizeInGB": 10
          },
          "RoleArn": "arn:aws:iam::123456789012:role/sm-stepfunction-iam-role",
          "StoppingCondition": {
            "MaxRuntimeInSeconds": 172800
          }
        }
      }

      https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestSyntax
      https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html

  13. The Good (photo by Joshua Ness on Unsplash)

  14. Start simple

      {
        "StartAt": "Train Text Model",
        "States": {
          "Train Text Model": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": { ... },
            "End": true
          }
        }
      }

  15. Expand from there

      {
        "StartAt": "Fetch Preprocessed Data",
        "States": {
          "Fetch Preprocessed Data": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Next": "Train Text Model",
            "Parameters": {
              "JobName": "FetchPreparedData",
              "JobDefinition": "arn:aws:batch:us-east-1:1234567890:job-definition/job:2",
              "JobQueue": "arn:aws:batch:us-east-1:1234567890:job-queue/queue",
              "Parameters": {
                "DATA_INPUT_PATH.$": "$.input_data",
                "OUTPUT_PATH.$": "$.ready_to_use_artifacts"
              }
            }
          },
          "Train Text Model": { ... }
        }
      }
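As the workflow grows state by state, it is easy to mistype a state name in `StartAt` or `Next`. A small local check before deploying can catch that. The following is a sketch, not an official validator: `dangling_references` is a hypothetical helper, and the `Parameters` of the two states from the slides are left empty here because the slides elide them.

```python
import json


def dangling_references(definition: dict) -> list:
    """List "source -> target" pairs whose target state is not defined."""
    states = definition["States"]
    problems = []
    if definition["StartAt"] not in states:
        problems.append(f'StartAt -> {definition["StartAt"]}')
    for name, state in states.items():
        # Collect every place a state can name a successor.
        targets = [state[key] for key in ("Next", "Default") if key in state]
        targets += [rule["Next"] for rule in state.get("Choices", []) if "Next" in rule]
        targets += [catcher["Next"] for catcher in state.get("Catch", [])]
        problems += [f"{name} -> {t}" for t in targets if t not in states]
    return problems


definition = {
    "StartAt": "Fetch Preprocessed Data",
    "States": {
        "Fetch Preprocessed Data": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {},  # elided on the slide
            "Next": "Train Text Model",
        },
        "Train Text Model": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {},  # elided on the slide
            "End": True,
        },
    },
}

print(dangling_references(definition))
print(json.dumps(definition, indent=2)[:60], "...")
```

The check only looks at the top level; branches of a Parallel state would need the same treatment applied to each branch's own `States`.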

  16. Retry if possible

      "Train Text Model": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
        ...
        "Retry": [
          {
            "ErrorEquals": [ "SageMaker.AmazonSageMakerException" ],
            "IntervalSeconds": 1,
            "MaxAttempts": 100,
            "BackoffRate": 1.1
          },
          ...
        ]
      }
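What the retry settings above mean in wall-clock time can be worked out locally. In the Amazon States Language, `IntervalSeconds` is the wait before the first retry and each later wait is multiplied by `BackoffRate`. The sketch below computes that schedule; the helper name is made up, and it ignores the runtime of the attempts themselves.

```python
def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list:
    """Waiting time in seconds before each retry (first retry comes first)."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Values taken from the Retry block on the slide.
delays = retry_delays(interval_seconds=1, max_attempts=100, backoff_rate=1.1)
total_seconds = sum(delays)

print(f"first wait: {delays[0]:.1f} s")
print(f"last wait:  {delays[-1] / 3600:.1f} h")
print(f"all waits:  {total_seconds / 3600:.1f} h")
```

With these values the waits start at one second but add up to roughly 38 hours if all 100 retries are used, so a gentle `BackoffRate` with a high `MaxAttempts` still bounds how long a broken job keeps retrying.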

  17. If all else fails ...

      "Train Text Model": {
        "Type": "Task",
        ...
        "Catch": [
          { "ErrorEquals": [ "States.ALL" ], "Next": "Notify Failure" }
        ]
      },
      "Notify Failure": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "End": true,
        "Parameters": {
          "Subject": "[ERROR] Model Training failed!",
          "Message": "Error during model training!",
          "TopicArn": "arn:aws:sns:*:123456789012:alerting_topic",
          "MessageAttributes": { ... }
        }
      }

  18. But there is a catch ... “Notify Failure” is a valid state after all, so an execution that ends there still counts as succeeded

  19. Fail successfully!

      "Notify Failure": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Next": "Fail",
        ...
      },
      "Fail": {
        "Type": "Fail"
      }
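The difference between the two versions of "Notify Failure" can be checked mechanically: an execution is only marked as failed if the error path ends in a state of type `Fail`. The following sketch follows the `Next` chain from a handler state and reports the type of the last state. It assumes a linear chain (no Choice or Parallel states on the error path), and `terminal_type` is a hypothetical helper.

```python
def terminal_type(states: dict, start: str) -> str:
    """Follow "Next" links from `start` and return the Type of the final state."""
    seen = set()
    name = start
    while True:
        if name in seen:
            raise ValueError(f"cycle detected at {name!r}")
        seen.add(name)
        state = states[name]
        if "Next" not in state:
            return state["Type"]
        name = state["Next"]


# Error path as first written: the notification is the last state.
before = {
    "Notify Failure": {"Type": "Task", "Resource": "arn:aws:states:::sns:publish", "End": True},
}

# Error path after the fix: the notification hands over to a Fail state.
after = {
    "Notify Failure": {"Type": "Task", "Resource": "arn:aws:states:::sns:publish", "Next": "Fail"},
    "Fail": {"Type": "Fail"},
}

print(terminal_type(before, "Notify Failure"))
print(terminal_type(after, "Notify Failure"))
```

Anything other than `Fail` as the terminal type means the execution would show up as succeeded even though training failed.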

  20. Add a few more models ... - Text Model: trained on OCR-extracted text - Image Model: trained on page-bitmap - use output as input - Sequence Model: trained on sequence information (e.g. “Pages 1-4 are a contract”)

  21. Use concurrency for time efficiency

      "Fetch Preprocessed Data": {
        ...
        "Next": "Base Model Training"
      },
      "Base Model Training": {
        "Type": "Parallel",
        "Next": "Train Sequence Model",
        "Branches": [
          {
            "StartAt": "Train Image Model",
            "States": {
              "Train Image Model": { ... "End": true }
            }
          },
          {
            "StartAt": "Train Text Model",
            "States": {
              "Train Text Model": { ... "End": true }
            }
          }
        ]
      },

  22. Beware of “silent” errors - the notification trigger won’t fire because there is no state defined for this scenario -> unexpected failure

  23. Everything should fail the same

      "Base Model Training": {
        "Type": "Parallel",
        "Next": "Train Sequence Model",
        "Branches": [...],
        "Catch": [
          { "ErrorEquals": [ "States.ALL" ], "Next": "Notify Failure" }
        ]
      }
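"Silent" errors of this kind can be found before they happen by scanning the definition for states that lack a catch-all. The sketch below lists top-level Task and Parallel states without a `States.ALL` catcher. It is an illustration with a hypothetical helper name; states that are themselves Catch targets are skipped, and states inside Parallel branches are not inspected because their errors surface at the enclosing Parallel state.

```python
def uncaught_states(states: dict) -> list:
    """Names of top-level Task/Parallel states without a States.ALL catcher."""
    # States that act as error handlers are exempt from the check.
    handlers = {c["Next"] for s in states.values() for c in s.get("Catch", [])}
    missing = []
    for name, state in states.items():
        if state["Type"] not in ("Task", "Parallel") or name in handlers:
            continue
        catches_all = any("States.ALL" in c["ErrorEquals"] for c in state.get("Catch", []))
        if not catches_all:
            missing.append(name)
    return missing


CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "Next": "Notify Failure"}]

states = {
    "Fetch Preprocessed Data": {"Type": "Task", "Next": "Base Model Training", "Catch": CATCH_ALL},
    "Base Model Training": {"Type": "Parallel", "Next": "Train Sequence Model", "Branches": []},
    "Train Sequence Model": {"Type": "Task", "End": True, "Catch": CATCH_ALL},
    "Notify Failure": {"Type": "Task", "Next": "Fail"},
    "Fail": {"Type": "Fail"},
}

print(uncaught_states(states))  # the Parallel state has no Catch yet

states["Base Model Training"]["Catch"] = CATCH_ALL
print(uncaught_states(states))
```

Resources, parameters, and branches are left out of the example because only the error wiring matters for this check.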

  24. Some jobs are long-running and expensive - then something fails and you have to debug (rerun) …

  25. Save time & money - skip some steps...

      "States": {
        "Skip Image Model Training?": {
          "Type": "Choice",
          "Choices": [
            {
              "Variable": "$.train_image_model",
              "BooleanEquals": false,
              "Next": "Skip Fetch Preprocessing Artifacts"
            }
          ],
          "Default": "Train Image Model"
        },
        "Skip Fetch Preprocessing Artifacts": {
          "Type": "Pass",
          "End": true
        },
        "Train Image Model": { ... "End": true }
      }
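Since the same Choice/Pass pattern is repeated for every skippable step, it can be generated instead of written by hand. This is a sketch under the assumption that each skippable step lives in its own branch (a `StartAt` plus `States` block, as in a Parallel state); the helper `skippable` and the generated state names are hypothetical and differ slightly from the names on the slide.

```python
import json


def skippable(name: str, flag: str, state: dict) -> dict:
    """Wrap `state` in a branch that skips it when the input flag is false."""
    choice_name = f"Skip {name}?"
    pass_name = f"Skip {name}"
    return {
        "StartAt": choice_name,
        "States": {
            choice_name: {
                "Type": "Choice",
                "Choices": [
                    {"Variable": f"$.{flag}", "BooleanEquals": False, "Next": pass_name}
                ],
                # Run the real step unless the flag is explicitly false.
                "Default": name,
            },
            pass_name: {"Type": "Pass", "End": True},
            name: state,
        },
    }


branch = skippable(
    name="Train Image Model",
    flag="train_image_model",
    state={"Type": "Task", "End": True},  # Resource and Parameters elided as on the slide
)
print(json.dumps(branch, indent=2))
```

A rerun after a failure can then pass `"train_image_model": false` in the execution input to reuse the artifacts of an earlier run instead of training again.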

  26. Rinse and repeat ...and add a little sprinkle on top

  27. Our Model Training Workflow - Lambda - Batch Job - Sagemaker - SNS - Choice - Pass (to skip steps) - Fail - Wait

  28. Our Model Training Workflow “Setup” - input to set up the state machine execution - define where the data is - (Hyper)Parameterization - Data & Models stored on S3 (each execution gets its own copy of the data) - diagram: Data (S3) -> Input -> Step Function -> Models & Data (S3)
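What such an execution input might look like can be sketched as follows. The keys `input_data`, `ready_to_use_artifacts` and `train_image_model` are the ones referenced by the JSONPaths on earlier slides; everything else (the bucket name, the `executions/<run id>` layout, the `hyperparameters` key and the function itself) is an assumption for illustration, not the authors' actual setup.

```python
import json
from datetime import datetime, timezone


def build_execution_input(bucket, input_data, hyperparameters,
                          train_image_model=True, now=None):
    """Build the JSON input for one execution with its own S3 prefix."""
    now = now or datetime.now(timezone.utc)
    run_id = now.strftime("%Y%m%dT%H%M%SZ")
    # Hypothetical layout: each execution writes below its own prefix.
    prefix = f"s3://{bucket}/executions/{run_id}"
    return {
        "input_data": input_data,
        "ready_to_use_artifacts": f"{prefix}/data",
        "train_image_model": train_image_model,
        "hyperparameters": hyperparameters,
    }


execution_input = build_execution_input(
    bucket="example-bucket",                      # placeholder bucket name
    input_data="s3://example-bucket/raw/latest",  # placeholder path
    hyperparameters={"epochs": "80", "batch_size": "10"},
    now=datetime(2019, 9, 9, 12, 0, 0, tzinfo=timezone.utc),
)
print(json.dumps(execution_input, indent=2))
```

The resulting JSON string is what would be handed to the state machine when an execution is started, for example through the console or an SDK call.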

  29. The Bad (photo by Markus Spiske on Unsplash)
