Amazon SageMaker (Batch Transform Jobs, Endpoint Instances, Endpoints, Ground Truth, Processing Jobs, Training Jobs) monitoring

Dynatrace ingests metrics for multiple preselected namespaces, including Amazon SageMaker. You can view metrics for each service instance, split metrics into multiple dimensions, and create custom charts that you can pin to your dashboards.

Prerequisites

To enable monitoring for this service, you need

  • ActiveGate version 1.181+, as follows:

    • For Dynatrace SaaS deployments, you need an Environment ActiveGate or a Multi-environment ActiveGate.

    • For Dynatrace Managed deployments, you can use any kind of ActiveGate.

      For role-based access (whether in a SaaS or Managed deployment), you need an Environment ActiveGate installed on an Amazon EC2 host.

  • Dynatrace version 1.182+

  • An updated AWS monitoring policy to include the additional AWS services.
    To update the AWS IAM policy, use the JSON below, containing the monitoring policy (permissions) for all supporting services.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"acm-pca:ListCertificateAuthorities",
"apigateway:GET",
"apprunner:ListServices",
"appstream:DescribeFleets",
"appsync:ListGraphqlApis",
"athena:ListWorkGroups",
"autoscaling:DescribeAutoScalingGroups",
"cloudformation:ListStackResources",
"cloudfront:ListDistributions",
"cloudhsm:DescribeClusters",
"cloudsearch:DescribeDomains",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics",
"codebuild:ListProjects",
"datasync:ListTasks",
"dax:DescribeClusters",
"directconnect:DescribeConnections",
"dms:DescribeReplicationInstances",
"dynamodb:ListTables",
"dynamodb:ListTagsOfResource",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeInstances",
"ec2:DescribeNatGateways",
"ec2:DescribeSpotFleetRequests",
"ec2:DescribeTransitGateways",
"ec2:DescribeVolumes",
"ec2:DescribeVpnConnections",
"ecs:ListClusters",
"eks:ListClusters",
"elasticache:DescribeCacheClusters",
"elasticbeanstalk:DescribeEnvironmentResources",
"elasticbeanstalk:DescribeEnvironments",
"elasticfilesystem:DescribeFileSystems",
"elasticloadbalancing:DescribeInstanceHealth",
"elasticloadbalancing:DescribeListeners",
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:DescribeRules",
"elasticloadbalancing:DescribeTags",
"elasticloadbalancing:DescribeTargetHealth",
"elasticmapreduce:ListClusters",
"elastictranscoder:ListPipelines",
"es:ListDomainNames",
"events:ListEventBuses",
"firehose:ListDeliveryStreams",
"fsx:DescribeFileSystems",
"gamelift:ListFleets",
"glue:GetJobs",
"inspector:ListAssessmentTemplates",
"kafka:ListClusters",
"kinesis:ListStreams",
"kinesisanalytics:ListApplications",
"kinesisvideo:ListStreams",
"lambda:ListFunctions",
"lambda:ListTags",
"lex:GetBots",
"logs:DescribeLogGroups",
"mediaconnect:ListFlows",
"mediaconvert:DescribeEndpoints",
"mediapackage-vod:ListPackagingConfigurations",
"mediapackage:ListChannels",
"mediatailor:ListPlaybackConfigurations",
"opsworks:DescribeStacks",
"qldb:ListLedgers",
"rds:DescribeDBClusters",
"rds:DescribeDBInstances",
"rds:DescribeEvents",
"rds:ListTagsForResource",
"redshift:DescribeClusters",
"robomaker:ListSimulationJobs",
"route53:ListHostedZones",
"route53resolver:ListResolverEndpoints",
"s3:ListAllMyBuckets",
"sagemaker:ListEndpoints",
"sns:ListTopics",
"sqs:ListQueues",
"storagegateway:ListGateways",
"sts:GetCallerIdentity",
"swf:ListDomains",
"tag:GetResources",
"tag:GetTagKeys",
"transfer:ListServers",
"workmail:ListOrganizations",
"workspaces:DescribeWorkspaces"
],
"Resource": "*"
}
]
}

If you don't want to add permissions to all services, and just select permissions for certain services, consult the table below. The table contains a set of permissions that are required for All AWS cloud services and, for each supporting service, a list of optional permissions specific to that service.

Permissions required for AWS monitoring integration:
  • "cloudwatch:GetMetricData"
  • "cloudwatch:GetMetricStatistics"
  • "cloudwatch:ListMetrics"
  • "sts:GetCallerIdentity"
  • "tag:GetResources"
  • "tag:GetTagKeys"
  • "ec2:DescribeAvailabilityZones"
Name
Permissions
All monitored Amazon services required
cloudwatch:GetMetricData,
cloudwatch:GetMetricStatistics,
cloudwatch:ListMetrics,
sts:GetCallerIdentity,
tag:GetResources,
tag:GetTagKeys,
ec2:DescribeAvailabilityZones
AWS Certificate Manager Private Certificate Authority
acm-pca:ListCertificateAuthorities
Amazon MQ
Amazon API Gateway
apigateway:GET
AWS App Runner
apprunner:ListServices
Amazon AppStream
appstream:DescribeFleets

Example of JSON policy for one single service.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"apigateway:GET",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics",
"sts:GetCallerIdentity",
"tag:GetResources",
"tag:GetTagKeys",
"ec2:DescribeAvailabilityZones"
],
"Resource": "*"
}
]
}

In this example, from the complete list of permissions you need to select

  • "apigateway:GET" for Amazon API Gateway
  • "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "sts:GetCallerIdentity", "tag:GetResources", "tag:GetTagKeys", and "ec2:DescribeAvailabilityZones" for All AWS cloud services.
Endpoint
Service
autoscaling.<REGION>.amazonaws.com
Amazon EC2 Auto Scaling (built-in), Amazon EC2 Auto Scaling
lambda.<REGION>.amazonaws.com
AWS Lambda (built-in), AWS Lambda
elasticloadbalancing.<REGION>.amazonaws.com
Amazon Application and Network Load Balancer (built-in), Amazon Elastic Load Balancer (ELB) (built-in)
dynamodb.<REGION>.amazonaws.com
Amazon DynamoDB (built-in), Amazon DynamoDB
ec2.<REGION>.amazonaws.com
Amazon EBS (built-in), Amazon EC2 (built-in), Amazon EBS, Amazon EC2 Spot Fleet, Amazon VPC NAT Gateways, AWS Transit Gateway, AWS Site-to-Site VPN
rds.<REGION>.amazonaws.com
Amazon RDS (built-in), Amazon Aurora, Amazon DocumentDB, Amazon Neptune, Amazon RDS

Enable monitoring

To learn how to enable service monitoring, see Enable service monitoring.

View service metrics

You can view the service metrics in your Dynatrace environment either on the custom device overview page or on your Dashboards page.

View metrics on the custom device overview page

To access the custom device overview page

  1. Go to Technologies & Processes or Technologies & Processes Classic (latest Dynatrace).
  2. Filter by service name and select the relevant custom device group.
  3. Once you select the custom device group, you're on the custom device group overview page.
  4. The custom device group overview page lists all instances (custom devices) belonging to the group. Select an instance to view the custom device overview page.

View metrics on your dashboard

You can also view metrics in the Dynatrace web UI on dashboards. There is no preset dashboard available for this service, but you can create your own dashboard.

To check the availability of preset dashboards for each AWS service, see the list below.

AWS service
Preset dashboard
Amazon EC2 Auto Scaling (built-in)
Not applicable
AWS Lambda (built-in)
Not applicable
Amazon Application and Network Load Balancer (built-in)
Not applicable
Amazon DynamoDB (built-in)
Not applicable
Amazon EBS (built-in)
Not applicable
Amazon EC2 (built-in)
Not applicable

Available metrics

Amazon SageMaker Batch Transform Jobs

Name
Description
Unit
Statistics
Dimensions
Recommended
CPUUtilization
The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to `400%'.
Percent
Average
Region, Host
Applicable
MemoryUtilization
The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%.
Percent
Average
Region, Host
Applicable
GPUMemoryUtilization
The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to `400%'.
Percent
Average
Region, Host
Applicable
GPUUtilization
The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100%and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to `400%'.
Percent
Average
Region, Host
Applicable

Amazon SageMaker Processing Jobs, Amazon SageMaker Training Jobs

Name
Description
Unit
Statistics
Dimensions
Recommended
CPUUtilization
The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to `400%'.
Percent
Average
Region, Host
Applicable
DiskUtilization
The percentage of disk space used by the containers on an instance uses. This value can range between 0% and 100%. This metric is not supported for batch transform jobs.
Percent
Average
EndpointName, VariantName
Applicable
GPUMemoryUtilization
The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to `400%'.
Percent
Average
Region, Host
Applicable
GPUUtilization
The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100%and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to `400%'.
Percent
Average
Region, Host
Applicable
MemoryUtilization
The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%.
Percent
Average
Region, Host
Applicable

Amazon SageMaker Endpoint Instances

EndpointName is the main dimension.

Name
Description
Unit
Statistics
Dimensions
Recommended
CPUUtilization
The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to `400%'.
Percent
Average
EndpointName, VariantName
Applicable
DiskUtilization
The percentage of disk space used by the containers on an instance uses. This value can range between 0% and 100%. This metric is not supported for batch transform jobs.
Percent
Average
EndpointName, VariantName
Applicable
GPUMemoryUtilization
The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100%and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to `400%'.
Percent
Average
EndpointName, VariantName
Applicable
GPUUtilization
The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100%and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to `400%'.
Percent
Average
EndpointName, VariantName
Applicable
LoadedModelCount
The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance.
None
Average
EndpointName, VariantName
LoadedModelCount
None
Sum
EndpointName, VariantName

Amazon SageMaker Endpoints

EndpointName is the main dimension.

Name
Description
Unit
Statistics
Dimensions
Recommended
Invocation4XXErrors
The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent.
None
Average
EndpointName, VariantName
Invocation4XXErrors
None
Sum
EndpointName, VariantName
Invocation5XXErrors
The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent.
None
Average
EndpointName, VariantName
Invocation5XXErrors
None
Sum
EndpointName, VariantName
Applicable
Invocations
The number of InvokeEndpoint requests sent to a model endpoint
None
Sum
EndpointName, VariantName
Applicable
Invocations
None
Count
EndpointName, VariantName

Amazon SageMaker Ground Truth

Name
Description
Dimensions
Statistics
Unit
Recommended
ActiveWorkers
The number of workers on a private work team performing a labeling job
Region, LabelingJobName
Maximum
None
DatasetObjectsAutoAnnotated
The number of dataset objects auto-annotated in a labeling job. This metric is only emitted when automated labeling is enabled.
Region, LabelingJobName
Maximum
None
Applicable
DatasetObjectsHumanAnnotated
The number of dataset objects annotated by a human in a labeling job
Region, LabelingJobName
Maximum
None
Applicable
DatasetObjectsLabelingFailed
The number of dataset objects that failed labeling in a labeling job
Region, LabelingJobName
Maximum
None
Applicable
JobsFailed
The number of labeling jobs that failed
Region
Count
None
JobsFailed
Region
Sum
None
Applicable