Amazon SageMaker (Batch Transform Jobs, Endpoint Instances, Endpoints, Ground Truth, Processing Jobs, Training Jobs) monitoring

Dynatrace ingests metrics for multiple preselected namespaces, including Amazon SageMaker. You can view metrics for each service instance, split metrics into multiple dimensions, and create custom charts that you can pin to your dashboards.

Prerequisites

To enable monitoring for this service, you need:

- ActiveGate version 1.181+, as follows:
  - For Dynatrace SaaS deployments, you need an Environment ActiveGate or a Multi-environment ActiveGate.
  - For Dynatrace Managed deployments, you can use any kind of ActiveGate.

  For role-based access (whether in a SaaS or Managed deployment), you need an Environment ActiveGate installed on an Amazon EC2 host.
- Dynatrace version 1.182+
- An updated AWS monitoring policy to include the additional AWS services.

To update the AWS IAM policy, use the JSON below, containing the monitoring policy (permissions) for all supporting services.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "acm-pca:ListCertificateAuthorities",
        "apigateway:GET",
        "apprunner:ListServices",
        "appstream:DescribeFleets",
        "appsync:ListGraphqlApis",
        "athena:ListWorkGroups",
        "autoscaling:DescribeAutoScalingGroups",
        "cloudformation:ListStackResources",
        "cloudfront:ListDistributions",
        "cloudhsm:DescribeClusters",
        "cloudsearch:DescribeDomains",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "codebuild:ListProjects",
        "datasync:ListTasks",
        "dax:DescribeClusters",
        "directconnect:DescribeConnections",
        "dms:DescribeReplicationInstances",
        "dynamodb:ListTables",
        "dynamodb:ListTagsOfResource",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeNatGateways",
        "ec2:DescribeSpotFleetRequests",
        "ec2:DescribeTransitGateways",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpnConnections",
        "ecs:ListClusters",
        "eks:ListClusters",
        "elasticache:DescribeCacheClusters",
        "elasticbeanstalk:DescribeEnvironmentResources",
        "elasticbeanstalk:DescribeEnvironments",
        "elasticfilesystem:DescribeFileSystems",
        "elasticloadbalancing:DescribeInstanceHealth",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeRules",
        "elasticloadbalancing:DescribeTags",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticmapreduce:ListClusters",
        "elastictranscoder:ListPipelines",
        "es:ListDomainNames",
        "events:ListEventBuses",
        "firehose:ListDeliveryStreams",
        "fsx:DescribeFileSystems",
        "gamelift:ListFleets",
        "glue:GetJobs",
        "inspector:ListAssessmentTemplates",
        "kafka:ListClusters",
        "kinesis:ListStreams",
        "kinesisanalytics:ListApplications",
        "kinesisvideo:ListStreams",
        "lambda:ListFunctions",
        "lambda:ListTags",
        "lex:GetBots",
        "logs:DescribeLogGroups",
        "mediaconnect:ListFlows",
        "mediaconvert:DescribeEndpoints",
        "mediapackage-vod:ListPackagingConfigurations",
        "mediapackage:ListChannels",
        "mediatailor:ListPlaybackConfigurations",
        "opsworks:DescribeStacks",
        "qldb:ListLedgers",
        "rds:DescribeDBClusters",
        "rds:DescribeDBInstances",
        "rds:DescribeEvents",
        "rds:ListTagsForResource",
        "redshift:DescribeClusters",
        "robomaker:ListSimulationJobs",
        "route53:ListHostedZones",
        "route53resolver:ListResolverEndpoints",
        "s3:ListAllMyBuckets",
        "sagemaker:ListEndpoints",
        "sns:ListTopics",
        "sqs:ListQueues",
        "storagegateway:ListGateways",
        "sts:GetCallerIdentity",
        "swf:ListDomains",
        "tag:GetResources",
        "tag:GetTagKeys",
        "transfer:ListServers",
        "workmail:ListOrganizations",
        "workspaces:DescribeWorkspaces"
      ],
      "Resource": "*"
    }
  ]
}

If you don't want to add permissions for all services and prefer to select permissions for certain services only, consult the table below. The table lists the permissions that are required for all AWS cloud services and, for each supporting service, the optional permissions specific to that service.

Permissions required for AWS monitoring integration:

"cloudwatch:GetMetricData"
"cloudwatch:GetMetricStatistics"
"cloudwatch:ListMetrics"
"sts:GetCallerIdentity"
"tag:GetResources"
"tag:GetTagKeys"
"ec2:DescribeAvailabilityZones"

| Name | Permissions |
| --- | --- |
| All monitored Amazon services (required) | cloudwatch:GetMetricData, cloudwatch:GetMetricStatistics, cloudwatch:ListMetrics, sts:GetCallerIdentity, tag:GetResources, tag:GetTagKeys, ec2:DescribeAvailabilityZones |
| AWS Certificate Manager Private Certificate Authority | acm-pca:ListCertificateAuthorities |
| Amazon MQ | |
| Amazon API Gateway | apigateway:GET |
| AWS App Runner | apprunner:ListServices |
| Amazon AppStream | appstream:DescribeFleets |

Example of a JSON policy for a single service:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "apigateway:GET",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "sts:GetCallerIdentity",
        "tag:GetResources",
        "tag:GetTagKeys",
        "ec2:DescribeAvailabilityZones"
      ],
      "Resource": "*"
    }
  ]
}

In this example, from the complete list of permissions you need to select:

- "apigateway:GET" for Amazon API Gateway
- "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "sts:GetCallerIdentity", "tag:GetResources", "tag:GetTagKeys", and "ec2:DescribeAvailabilityZones" for all AWS cloud services

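The selection described above can also be sketched in Python. This is only an illustration, not part of any Dynatrace or AWS tooling: the function name `build_policy` and the `SERVICE_PERMISSIONS` mapping are hypothetical, and the mapping covers only a few of the services from the permissions table.

```python
import json

# Permissions required for all AWS cloud services (copied from the list above).
BASE_PERMISSIONS = [
    "cloudwatch:GetMetricData",
    "cloudwatch:GetMetricStatistics",
    "cloudwatch:ListMetrics",
    "sts:GetCallerIdentity",
    "tag:GetResources",
    "tag:GetTagKeys",
    "ec2:DescribeAvailabilityZones",
]

# Hypothetical lookup of service-specific permissions; extend it from the table.
SERVICE_PERMISSIONS = {
    "Amazon API Gateway": ["apigateway:GET"],
    "AWS App Runner": ["apprunner:ListServices"],
    "Amazon AppStream": ["appstream:DescribeFleets"],
}


def build_policy(services):
    """Return a policy dict with the selected service permissions plus the required ones."""
    actions = []
    for name in services:
        actions.extend(SERVICE_PERMISSIONS[name])
    actions.extend(BASE_PERMISSIONS)
    # Drop duplicates while preserving the original order.
    unique_actions = list(dict.fromkeys(actions))
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": unique_actions,
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(["Amazon API Gateway"]), indent=2))
```

Called with `["Amazon API Gateway"]`, the sketch produces the same set of actions as the single-service example policy above.
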
| Endpoint | Service |
| --- | --- |
| autoscaling.<REGION>.amazonaws.com | Amazon EC2 Auto Scaling (built-in), Amazon EC2 Auto Scaling |
| lambda.<REGION>.amazonaws.com | AWS Lambda (built-in), AWS Lambda |
| elasticloadbalancing.<REGION>.amazonaws.com | Amazon Application and Network Load Balancer (built-in), Amazon Elastic Load Balancer (ELB) (built-in) |
| dynamodb.<REGION>.amazonaws.com | Amazon DynamoDB (built-in), Amazon DynamoDB |
| ec2.<REGION>.amazonaws.com | Amazon EBS (built-in), Amazon EC2 (built-in), Amazon EBS, Amazon EC2 Spot Fleet, Amazon VPC NAT Gateways, AWS Transit Gateway, AWS Site-to-Site VPN |
| rds.<REGION>.amazonaws.com | Amazon RDS (built-in), Amazon Aurora, Amazon DocumentDB, Amazon Neptune, Amazon RDS |
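
To see which hostnames the endpoints listed above expand to for a particular region, the `<REGION>` placeholder can be substituted programmatically. The following is a minimal Python sketch; the function name `endpoints_for_region` is hypothetical, and the service prefixes are simply copied from the endpoint list.

```python
# Service prefixes copied from the endpoint list above.
ENDPOINT_PREFIXES = [
    "autoscaling",
    "lambda",
    "elasticloadbalancing",
    "dynamodb",
    "ec2",
    "rds",
]


def endpoints_for_region(region):
    """Expand the <REGION> placeholder into concrete endpoint hostnames."""
    return [f"{prefix}.{region}.amazonaws.com" for prefix in ENDPOINT_PREFIXES]


if __name__ == "__main__":
    for host in endpoints_for_region("us-east-1"):
        print(host)
```
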

Enable monitoring

To learn how to enable service monitoring, see Enable service monitoring.

View service metrics

You can view the service metrics in your Dynatrace environment either on the custom device overview page or on your Dashboards page.

View metrics on the custom device overview page

To access the custom device overview page:

1. Go to Technologies & Processes or Technologies & Processes Classic (latest Dynatrace).
2. Filter by service name and select the relevant custom device group.
3. Once you select the custom device group, you're on the custom device group overview page.
4. The custom device group overview page lists all instances (custom devices) belonging to the group. Select an instance to view the custom device overview page.

View metrics on your dashboard

You can also view metrics in the Dynatrace web UI on dashboards. There is no preset dashboard available for this service, but you can create your own dashboard.

To check the availability of preset dashboards for each AWS service, see the list below.

AWS service

Preset dashboard

Amazon EC2 Auto Scaling (built-in)

AWS Lambda (built-in)

Amazon Application and Network Load Balancer (built-in)

Amazon DynamoDB (built-in)

Amazon EBS (built-in)

Amazon EC2 (built-in)

Available metrics

Amazon SageMaker Batch Transform Jobs

| Name | Description | Unit | Statistics | Dimensions | Recommended |
| --- | --- | --- | --- | --- | --- |
| CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | |
| GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
| GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
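
Because CPUUtilization, GPUUtilization, and GPUMemoryUtilization are multiplied by the number of CPUs or GPUs, a reading of 250% on a four-CPU instance corresponds to 62.5% of the total capacity. The following minimal Python sketch shows that normalization; the function name and parameters are hypothetical, and the number of CPUs or GPUs has to come from your own knowledge of the instance type.

```python
def normalize_utilization(raw_percent, unit_count):
    """Scale a multi-unit utilization reading (0 to 100 * unit_count) to the 0-100 range."""
    if unit_count <= 0:
        raise ValueError("unit_count must be a positive number")
    upper_bound = 100.0 * unit_count
    if not 0.0 <= raw_percent <= upper_bound:
        raise ValueError(f"raw_percent must be between 0 and {upper_bound}")
    return raw_percent / unit_count


if __name__ == "__main__":
    print(normalize_utilization(250.0, 4))  # 62.5
```
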

Amazon SageMaker Processing Jobs, Amazon SageMaker Training Jobs

| Name | Description | Unit | Statistics | Dimensions | Recommended |
| --- | --- | --- | --- | --- | --- |
| CPUUtilization | | Percent | Average | Region, Host | |
| DiskUtilization | The percentage of disk space used by the containers on an instance. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | EndpointName, VariantName | |
| GPUMemoryUtilization | | Percent | Average | Region, Host | |
| GPUUtilization | | Percent | Average | Region, Host | |
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | |

Amazon SageMaker Endpoint Instances

EndpointName is the main dimension.

| Name | Description | Unit | Statistics | Dimensions | Recommended |
| --- | --- | --- | --- | --- | --- |
| CPUUtilization | | Percent | Average | EndpointName, VariantName | |
| DiskUtilization | The percentage of disk space used by the containers on an instance. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | EndpointName, VariantName | |
| GPUMemoryUtilization | | Percent | Average | EndpointName, VariantName | |
| GPUUtilization | | Percent | Average | EndpointName, VariantName | |
| LoadedModelCount | The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. | None | Average | EndpointName, VariantName | |
| LoadedModelCount | The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. | None | Sum | EndpointName, VariantName | |
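
To cross-check one of these metrics directly in CloudWatch, you can build a query that uses the same metric name, statistic, and dimensions. The sketch below only assembles one entry for the `MetricDataQueries` parameter of the CloudWatch GetMetricData API and makes no AWS call. The function name is hypothetical, and the namespace `/aws/sagemaker/Endpoints`, the endpoint name, and the variant name are assumptions that you should verify against your own account.

```python
def build_metric_query(query_id, namespace, metric_name, statistic, dimensions, period=300):
    """Build one entry for the MetricDataQueries parameter of CloudWatch GetMetricData."""
    return {
        "Id": query_id,  # must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": name, "Value": value} for name, value in dimensions.items()
                ],
            },
            "Period": period,  # granularity in seconds
            "Stat": statistic,
        },
        "ReturnData": True,
    }


if __name__ == "__main__":
    # Assumed values: namespace, endpoint name, and variant name are placeholders.
    query = build_metric_query(
        "cpu",
        "/aws/sagemaker/Endpoints",
        "CPUUtilization",
        "Average",
        {"EndpointName": "my-endpoint", "VariantName": "AllTraffic"},
    )
    print(query)
```

If you use boto3, a dictionary like this can be passed in the `MetricDataQueries` list of `get_metric_data`, together with `StartTime` and `EndTime`.
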

Amazon SageMaker Endpoints

EndpointName is the main dimension.

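| Name | Description | Unit | Statistics | Dimensions | Recommended |
| --- | --- | --- | --- | --- | --- |
| Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
| Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | None | Sum | EndpointName, VariantName | |
| Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
| Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | None | Sum | EndpointName, VariantName | |
| Invocations | The number of InvokeEndpoint requests sent to a model endpoint. | None | Sum | EndpointName, VariantName | |
| Invocations | The number of InvokeEndpoint requests sent to a model endpoint. | None | Count | EndpointName, VariantName | |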
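
Since each InvokeEndpoint request contributes 1 (error) or 0 (no error) to Invocation4XXErrors and Invocation5XXErrors, the Sum statistic gives the number of failed requests. Dividing it by the Sum of Invocations for the same period gives an error rate. The following is a minimal Python sketch with hypothetical function and parameter names.

```python
def error_rate_percent(error_sum, invocation_sum):
    """Return the share of invocations that failed, as a percentage."""
    if error_sum < 0 or invocation_sum < 0:
        raise ValueError("sums must not be negative")
    if invocation_sum == 0:
        return 0.0  # no traffic in the period, so there is nothing to report
    return 100.0 * error_sum / invocation_sum


if __name__ == "__main__":
    print(error_rate_percent(5, 200))  # 2.5
```
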

Amazon SageMaker Ground Truth

| Name | Description | Dimensions | Statistics | Unit | Recommended |
| --- | --- | --- | --- | --- | --- |
| ActiveWorkers | The number of workers on a private work team performing a labeling job. | Region, LabelingJobName | Maximum | None | |
| DatasetObjectsAutoAnnotated | The number of dataset objects auto-annotated in a labeling job. This metric is only emitted when automated labeling is enabled. | Region, LabelingJobName | Maximum | None | |
| DatasetObjectsHumanAnnotated | The number of dataset objects annotated by a human in a labeling job. | Region, LabelingJobName | Maximum | None | |
| DatasetObjectsLabelingFailed | The number of dataset objects that failed labeling in a labeling job. | Region, LabelingJobName | Maximum | None | |
| JobsFailed | The number of labeling jobs that failed. | Region | Count | None | |
| JobsFailed | The number of labeling jobs that failed. | Region | Sum | None | |