Amazon SageMaker (Batch Transform Jobs, Endpoint Instances, Endpoints, Ground Truth, Processing Jobs, Training Jobs) monitoring

Published Oct 16, 2020

Dynatrace ingests metrics for multiple preselected namespaces, including Amazon SageMaker. You can view metrics for each service instance, split metrics into multiple dimensions, and create custom charts that you can pin to your dashboards.

Prerequisites

To enable monitoring for this service, you need:

- ActiveGate version 1.181+, as follows:
  - For Dynatrace SaaS deployments, you need an Environment ActiveGate or a Multi-environment ActiveGate.
  - For Dynatrace Managed deployments, you can use any kind of ActiveGate.

  For role-based access (whether in a SaaS or Managed deployment), you need an Environment ActiveGate installed on an Amazon EC2 host.
- Dynatrace version 1.182+
- An updated AWS monitoring policy to include the additional AWS services.

To update the AWS IAM policy, use the JSON below, containing the monitoring policy (permissions) for all supporting services.

JSON predefined policy for all supporting services

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "acm-pca:ListCertificateAuthorities",
        "apigateway:GET",
        "apprunner:ListServices",
        "appstream:DescribeFleets",
        "appsync:ListGraphqlApis",
        "athena:ListWorkGroups",
        "autoscaling:DescribeAutoScalingGroups",
        "cloudformation:ListStackResources",
        "cloudfront:ListDistributions",
        "cloudhsm:DescribeClusters",
        "cloudsearch:DescribeDomains",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "codebuild:ListProjects",
        "datasync:ListTasks",
        "dax:DescribeClusters",
        "directconnect:DescribeConnections",
        "dms:DescribeReplicationInstances",
        "dynamodb:ListTables",
        "dynamodb:ListTagsOfResource",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeNatGateways",
        "ec2:DescribeSpotFleetRequests",
        "ec2:DescribeTransitGateways",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpnConnections",
        "ecs:ListClusters",
        "eks:ListClusters",
        "elasticache:DescribeCacheClusters",
        "elasticbeanstalk:DescribeEnvironmentResources",
        "elasticbeanstalk:DescribeEnvironments",
        "elasticfilesystem:DescribeFileSystems",
        "elasticloadbalancing:DescribeInstanceHealth",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeRules",
        "elasticloadbalancing:DescribeTags",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticmapreduce:ListClusters",
        "elastictranscoder:ListPipelines",
        "es:ListDomainNames",
        "events:ListEventBuses",
        "firehose:ListDeliveryStreams",
        "fsx:DescribeFileSystems",
        "gamelift:ListFleets",
        "glue:GetJobs",
        "inspector:ListAssessmentTemplates",
        "kafka:ListClusters",
        "kinesis:ListStreams",
        "kinesisanalytics:ListApplications",
        "kinesisvideo:ListStreams",
        "lambda:ListFunctions",
        "lambda:ListTags",
        "lex:GetBots",
        "logs:DescribeLogGroups",
        "mediaconnect:ListFlows",
        "mediaconvert:DescribeEndpoints",
        "mediapackage-vod:ListPackagingConfigurations",
        "mediapackage:ListChannels",
        "mediatailor:ListPlaybackConfigurations",
        "opsworks:DescribeStacks",
        "qldb:ListLedgers",
        "rds:DescribeDBClusters",
        "rds:DescribeDBInstances",
        "rds:DescribeEvents",
        "rds:ListTagsForResource",
        "redshift:DescribeClusters",
        "robomaker:ListSimulationJobs",
        "route53:ListHostedZones",
        "route53resolver:ListResolverEndpoints",
        "s3:ListAllMyBuckets",
        "sagemaker:ListEndpoints",
        "sns:ListTopics",
        "sqs:ListQueues",
        "storagegateway:ListGateways",
        "sts:GetCallerIdentity",
        "swf:ListDomains",
        "tag:GetResources",
        "tag:GetTagKeys",
        "transfer:ListServers",
        "workmail:ListOrganizations",
        "workspaces:DescribeWorkspaces"
      ],
      "Resource": "*"
    }
  ]
}

If you don't want to add permissions for all services and prefer to select permissions for certain services only, consult the table below. The table contains a set of permissions that are required for all AWS cloud services and, for each supporting service, a list of optional permissions specific to that service.

Permissions required for AWS monitoring integration:

"cloudwatch:GetMetricData"
"cloudwatch:GetMetricStatistics"
"cloudwatch:ListMetrics"
"sts:GetCallerIdentity"
"tag:GetResources"
"tag:GetTagKeys"
"ec2:DescribeAvailabilityZones"

Complete list of permissions for cloud services

Name	Permissions
All monitored Amazon services Required	`cloudwatch:GetMetricData`, `cloudwatch:GetMetricStatistics`, `cloudwatch:ListMetrics`, `sts:GetCallerIdentity`, `tag:GetResources`, `tag:GetTagKeys`, `ec2:DescribeAvailabilityZones`
AWS Certificate Manager Private Certificate Authority	`acm-pca:ListCertificateAuthorities`
Amazon MQ
Amazon API Gateway	`apigateway:GET`
AWS App Runner	`apprunner:ListServices`
Amazon AppStream	`appstream:DescribeFleets`
AWS AppSync	`appsync:ListGraphqlApis`
Amazon Athena	`athena:ListWorkGroups`
Amazon Aurora	`rds:DescribeDBClusters`
Amazon EC2 Auto Scaling	`autoscaling:DescribeAutoScalingGroups`
Amazon EC2 Auto Scaling (built-in)	`autoscaling:DescribeAutoScalingGroups`
AWS Billing
Amazon Keyspaces
AWS Chatbot
Amazon CloudFront	`cloudfront:ListDistributions`
AWS CloudHSM	`cloudhsm:DescribeClusters`
Amazon CloudSearch	`cloudsearch:DescribeDomains`
AWS CodeBuild	`codebuild:ListProjects`
Amazon Cognito
Amazon Connect
Amazon Elastic Kubernetes Service (EKS)	`eks:ListClusters`
AWS DataSync	`datasync:ListTasks`
Amazon DynamoDB Accelerator (DAX)	`dax:DescribeClusters`
AWS Database Migration Service (AWS DMS)	`dms:DescribeReplicationInstances`
Amazon DocumentDB	`rds:DescribeDBClusters`
AWS Direct Connect	`directconnect:DescribeConnections`
Amazon DynamoDB	`dynamodb:ListTables`
Amazon DynamoDB (built-in)	`dynamodb:ListTables`, `dynamodb:ListTagsOfResource`
Amazon EBS	`ec2:DescribeVolumes`
Amazon EBS (built-in)	`ec2:DescribeVolumes`
Amazon EC2 API
Amazon EC2 (built-in)	`ec2:DescribeInstances`
Amazon EC2 Spot Fleet	`ec2:DescribeSpotFleetRequests`
Amazon Elastic Container Service (ECS)	`ecs:ListClusters`
Amazon ECS Container Insights	`ecs:ListClusters`
Amazon ElastiCache (EC)	`elasticache:DescribeCacheClusters`
AWS Elastic Beanstalk	`elasticbeanstalk:DescribeEnvironments`
Amazon Elastic File System (EFS)	`elasticfilesystem:DescribeFileSystems`
Amazon Elastic Inference
Amazon Elastic Map Reduce (EMR)	`elasticmapreduce:ListClusters`
Amazon Elasticsearch Service (ES)	`es:ListDomainNames`
Amazon Elastic Transcoder	`elastictranscoder:ListPipelines`
Amazon Elastic Load Balancer (ELB) (built-in)	`elasticloadbalancing:DescribeInstanceHealth`, `elasticloadbalancing:DescribeListeners`, `elasticloadbalancing:DescribeLoadBalancers`, `elasticloadbalancing:DescribeRules`, `elasticloadbalancing:DescribeTags`, `elasticloadbalancing:DescribeTargetHealth`
Amazon EventBridge	`events:ListEventBuses`
Amazon FSx	`fsx:DescribeFileSystems`
Amazon GameLift	`gamelift:ListFleets`
AWS Glue	`glue:GetJobs`
Amazon Inspector	`inspector:ListAssessmentTemplates`
AWS Internet of Things (IoT)
AWS IoT Analytics
Amazon Managed Streaming for Kafka	`kafka:ListClusters`
Amazon Kinesis Data Analytics	`kinesisanalytics:ListApplications`
Amazon Data Firehose	`firehose:ListDeliveryStreams`
Amazon Kinesis Data Streams	`kinesis:ListStreams`
Amazon Kinesis Video Streams	`kinesisvideo:ListStreams`
AWS Lambda	`lambda:ListFunctions`
AWS Lambda (built-in)	`lambda:ListFunctions`, `lambda:ListTags`
Amazon Lex	`lex:GetBots`
Amazon Application and Network Load Balancer (built-in)	`elasticloadbalancing:DescribeInstanceHealth`, `elasticloadbalancing:DescribeListeners`, `elasticloadbalancing:DescribeLoadBalancers`, `elasticloadbalancing:DescribeRules`, `elasticloadbalancing:DescribeTags`, `elasticloadbalancing:DescribeTargetHealth`
Amazon CloudWatch Logs	`logs:DescribeLogGroups`
AWS Elemental MediaConnect	`mediaconnect:ListFlows`
AWS Elemental MediaConvert	`mediaconvert:DescribeEndpoints`
AWS Elemental MediaPackage Live	`mediapackage:ListChannels`
AWS Elemental MediaPackage Video on Demand	`mediapackage-vod:ListPackagingConfigurations`
AWS Elemental MediaTailor	`mediatailor:ListPlaybackConfigurations`
Amazon VPC NAT Gateways	`ec2:DescribeNatGateways`
Amazon Neptune	`rds:DescribeDBClusters`
AWS OpsWorks	`opsworks:DescribeStacks`
Amazon Polly
Amazon QLDB	`qldb:ListLedgers`
Amazon RDS	`rds:DescribeDBInstances`
Amazon RDS (built-in)	`rds:DescribeDBInstances`, `rds:DescribeEvents`, `rds:ListTagsForResource`
Amazon Redshift	`redshift:DescribeClusters`
Amazon Rekognition
AWS RoboMaker	`robomaker:ListSimulationJobs`
Amazon Route 53	`route53:ListHostedZones`
Amazon Route 53 Resolver	`route53resolver:ListResolverEndpoints`
Amazon S3	`s3:ListAllMyBuckets`
Amazon S3 (built-in)	`s3:ListAllMyBuckets`
Amazon SageMaker Batch Transform Jobs
Amazon SageMaker Endpoint Instances	`sagemaker:ListEndpoints`
Amazon SageMaker Endpoints	`sagemaker:ListEndpoints`
Amazon SageMaker Ground Truth
Amazon SageMaker Processing Jobs
Amazon SageMaker Training Jobs
AWS Service Catalog
Amazon Simple Email Service (SES)
Amazon Simple Notification Service (SNS)	`sns:ListTopics`
Amazon Simple Queue Service (SQS)	`sqs:ListQueues`
AWS Systems Manager - Run Command
AWS Step Functions
AWS Storage Gateway	`storagegateway:ListGateways`
Amazon SWF	`swf:ListDomains`
Amazon Textract
AWS IoT Things Graph
AWS Transfer Family	`transfer:ListServers`
AWS Transit Gateway	`ec2:DescribeTransitGateways`
Amazon Translate
AWS Trusted Advisor
AWS API Usage
AWS Site-to-Site VPN	`ec2:DescribeVpnConnections`
AWS WAF Classic
AWS WAF
Amazon WorkMail	`workmail:ListOrganizations`
Amazon WorkSpaces	`workspaces:DescribeWorkspaces`

Example of a JSON policy for a single service:

JSON policy for Amazon API Gateway

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "apigateway:GET",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "sts:GetCallerIdentity",
        "tag:GetResources",
        "tag:GetTagKeys",
        "ec2:DescribeAvailabilityZones"
      ],
      "Resource": "*"
    }
  ]
}

In this example, you need to select the following permissions from the complete list:

- "apigateway:GET" for Amazon API Gateway
- "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "sts:GetCallerIdentity", "tag:GetResources", "tag:GetTagKeys", and "ec2:DescribeAvailabilityZones" for all AWS cloud services.

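The same selection can be sketched in Python for the SageMaker services covered on this page. This is a minimal illustration, not a Dynatrace tool: the `build_policy` helper and the `SERVICE_PERMISSIONS` mapping are hypothetical names introduced here, and only the permission strings are taken from the tables above.

```python
import json

# Permissions required for every AWS monitoring integration (from the list above).
REQUIRED_PERMISSIONS = [
    "cloudwatch:GetMetricData",
    "cloudwatch:GetMetricStatistics",
    "cloudwatch:ListMetrics",
    "sts:GetCallerIdentity",
    "tag:GetResources",
    "tag:GetTagKeys",
    "ec2:DescribeAvailabilityZones",
]

# Hypothetical excerpt of the per-service table; extend it with the rows you need.
SERVICE_PERMISSIONS = {
    "Amazon API Gateway": ["apigateway:GET"],
    "Amazon SageMaker Endpoints": ["sagemaker:ListEndpoints"],
    "Amazon SageMaker Endpoint Instances": ["sagemaker:ListEndpoints"],
    "Amazon SageMaker Training Jobs": [],  # no service-specific permission listed
}


def build_policy(services):
    """Return an IAM policy dict: service-specific permissions first,
    then the required permissions, without duplicates."""
    actions = []
    for service in services:
        for permission in SERVICE_PERMISSIONS[service]:
            if permission not in actions:
                actions.append(permission)
    for permission in REQUIRED_PERMISSIONS:
        if permission not in actions:
            actions.append(permission)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": actions,
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    policy = build_policy(
        ["Amazon SageMaker Endpoints", "Amazon SageMaker Endpoint Instances"]
    )
    print(json.dumps(policy, indent=2))
```

Because both SageMaker endpoint services share `sagemaker:ListEndpoints`, the helper adds that permission only once. Review the generated JSON before attaching it to a role.
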
AWS endpoints that need to be reachable from ActiveGate with corresponding AWS services

Endpoint	Service
`autoscaling.<REGION>.amazonaws.com`	Amazon EC2 Auto Scaling (built-in), Amazon EC2 Auto Scaling
`lambda.<REGION>.amazonaws.com`	AWS Lambda (built-in), AWS Lambda
`elasticloadbalancing.<REGION>.amazonaws.com`	Amazon Application and Network Load Balancer (built-in), Amazon Elastic Load Balancer (ELB) (built-in)
`dynamodb.<REGION>.amazonaws.com`	Amazon DynamoDB (built-in), Amazon DynamoDB
`ec2.<REGION>.amazonaws.com`	Amazon EBS (built-in), Amazon EC2 (built-in), Amazon EBS, Amazon EC2 Spot Fleet, Amazon VPC NAT Gateways, AWS Transit Gateway, AWS Site-to-Site VPN
`rds.<REGION>.amazonaws.com`	Amazon RDS (built-in), Amazon Aurora, Amazon DocumentDB, Amazon Neptune, Amazon RDS
`s3.<REGION>.amazonaws.com`	Amazon S3 (built-in)
`acm-pca.<REGION>.amazonaws.com`	AWS Certificate Manager Private Certificate Authority
`apigateway.<REGION>.amazonaws.com`	Amazon API Gateway
`apprunner.<REGION>.amazonaws.com`	AWS App Runner
`appstream2.<REGION>.amazonaws.com`	Amazon AppStream
`appsync.<REGION>.amazonaws.com`	AWS AppSync
`athena.<REGION>.amazonaws.com`	Amazon Athena
`cloudfront.amazonaws.com`	Amazon CloudFront
`cloudhsmv2.<REGION>.amazonaws.com`	AWS CloudHSM
`cloudsearch.<REGION>.amazonaws.com`	Amazon CloudSearch
`codebuild.<REGION>.amazonaws.com`	AWS CodeBuild
`datasync.<REGION>.amazonaws.com`	AWS DataSync
`dax.<REGION>.amazonaws.com`	Amazon DynamoDB Accelerator (DAX)
`dms.<REGION>.amazonaws.com`	AWS Database Migration Service (AWS DMS)
`directconnect.<REGION>.amazonaws.com`	AWS Direct Connect
`ecs.<REGION>.amazonaws.com`	Amazon Elastic Container Service (ECS), Amazon ECS Container Insights
`elasticfilesystem.<REGION>.amazonaws.com`	Amazon Elastic File System (EFS)
`eks.<REGION>.amazonaws.com`	Amazon Elastic Kubernetes Service (EKS)
`elasticache.<REGION>.amazonaws.com`	Amazon ElastiCache (EC)
`elasticbeanstalk.<REGION>.amazonaws.com`	AWS Elastic Beanstalk
`elastictranscoder.<REGION>.amazonaws.com`	Amazon Elastic Transcoder
`es.<REGION>.amazonaws.com`	Amazon Elasticsearch Service (ES)
`events.<REGION>.amazonaws.com`	Amazon EventBridge
`fsx.<REGION>.amazonaws.com`	Amazon FSx
`gamelift.<REGION>.amazonaws.com`	Amazon GameLift
`glue.<REGION>.amazonaws.com`	AWS Glue
`inspector.<REGION>.amazonaws.com`	Amazon Inspector
`kafka.<REGION>.amazonaws.com`	Amazon Managed Streaming for Kafka
`models.lex.<REGION>.amazonaws.com`	Amazon Lex
`logs.<REGION>.amazonaws.com`	Amazon CloudWatch Logs
`api.mediatailor.<REGION>.amazonaws.com`	AWS Elemental MediaTailor
`mediaconnect.<REGION>.amazonaws.com`	AWS Elemental MediaConnect
`mediapackage.<REGION>.amazonaws.com`	AWS Elemental MediaPackage Live
`mediapackage-vod.<REGION>.amazonaws.com`	AWS Elemental MediaPackage Video on Demand
`opsworks.<REGION>.amazonaws.com`	AWS OpsWorks
`qldb.<REGION>.amazonaws.com`	Amazon QLDB
`redshift.<REGION>.amazonaws.com`	Amazon Redshift
`robomaker.<REGION>.amazonaws.com`	AWS RoboMaker
`route53.amazonaws.com`	Amazon Route 53
`route53resolver.<REGION>.amazonaws.com`	Amazon Route 53 Resolver
`api.sagemaker.<REGION>.amazonaws.com`	Amazon SageMaker Endpoints, Amazon SageMaker Endpoint Instances
`sns.<REGION>.amazonaws.com`	Amazon Simple Notification Service (SNS)
`sqs.<REGION>.amazonaws.com`	Amazon Simple Queue Service (SQS)
`storagegateway.<REGION>.amazonaws.com`	AWS Storage Gateway
`swf.<REGION>.amazonaws.com`	Amazon SWF
`transfer.<REGION>.amazonaws.com`	AWS Transfer Family
`workmail.<REGION>.amazonaws.com`	Amazon WorkMail
`workspaces.<REGION>.amazonaws.com`	Amazon WorkSpaces
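
To check the endpoints in the table above from an ActiveGate host, you could expand the `<REGION>` placeholder and attempt a TCP connection. The following is a rough sketch only: the helper names are hypothetical, the region list is an example, and the check assumes the endpoints are served over HTTPS on port 443.

```python
import socket

# Endpoint templates copied from the table above.
ENDPOINT_TEMPLATES = {
    "Amazon SageMaker Endpoints": "api.sagemaker.<REGION>.amazonaws.com",
    "Amazon CloudFront": "cloudfront.amazonaws.com",  # global, no region placeholder
}


def expand_endpoints(template, regions):
    """Replace the <REGION> placeholder with each region name.
    Templates without a placeholder are returned once, unchanged."""
    if "<REGION>" not in template:
        return [template]
    return [template.replace("<REGION>", region) for region in regions]


def is_reachable(host, port=443, timeout=3.0):
    """Best-effort TCP connectivity check (assumption: HTTPS on port 443)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    regions = ["us-east-1", "eu-west-1"]  # example regions, adjust to your setup
    for service, template in ENDPOINT_TEMPLATES.items():
        for host in expand_endpoints(template, regions):
            print(service, host, "reachable" if is_reachable(host) else "unreachable")
```

A successful TCP connection only shows that the network path is open; it does not confirm that the IAM permissions are correct.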

Enable monitoring

To learn how to enable service monitoring, see Enable service monitoring.

View service metrics

You can view the service metrics in your Dynatrace environment either on the custom device overview page or on your Dashboards page.

View metrics on the custom device overview page

To access the custom device overview page

Go to Technologies & Processes (previous Dynatrace) or Technologies & Processes Classic.
Filter by service name and select the relevant custom device group.
Once you select the custom device group, you're on the custom device group overview page.
The custom device group overview page lists all instances (custom devices) belonging to the group. Select an instance to view the custom device overview page.

View metrics on your dashboard

You can also view metrics in the Dynatrace web UI on dashboards. There is no preset dashboard available for this service, but you can create your own dashboard.

To check the availability of preset dashboards for each AWS service, see the list below.

Preset dashboard availability list

AWS service	Preset dashboard
Amazon EC2 Auto Scaling (built-in)
AWS Lambda (built-in)
Amazon Application and Network Load Balancer (built-in)
Amazon DynamoDB (built-in)
Amazon EBS (built-in)
Amazon EC2 (built-in)
Amazon Elastic Load Balancer (ELB) (built-in)
Amazon RDS (built-in)
Amazon S3 (built-in)
AWS Certificate Manager Private Certificate Authority
All monitored Amazon services
Amazon API Gateway
AWS App Runner
Amazon AppStream
AWS AppSync
Amazon Athena
Amazon Aurora
Amazon EC2 Auto Scaling
AWS Billing
Amazon Keyspaces
AWS Chatbot
Amazon CloudFront
AWS CloudHSM
Amazon CloudSearch
AWS CodeBuild
Amazon Cognito
Amazon Connect
AWS DataSync
Amazon DynamoDB Accelerator (DAX)
AWS Database Migration Service (AWS DMS)
Amazon DocumentDB
AWS Direct Connect
Amazon DynamoDB
Amazon EBS
Amazon EC2 Spot Fleet
Amazon EC2 API
Amazon Elastic Container Service (ECS)
Amazon ECS Container Insights
Amazon Elastic File System (EFS)
Amazon Elastic Kubernetes Service (EKS)
Amazon ElastiCache (EC)
AWS Elastic Beanstalk
Amazon Elastic Inference
Amazon Elastic Transcoder
Amazon Elastic Map Reduce (EMR)
Amazon Elasticsearch Service (ES)
Amazon EventBridge
Amazon FSx
Amazon GameLift
AWS Glue
Amazon Inspector
AWS Internet of Things (IoT)
AWS IoT Things Graph
AWS IoT Analytics
Amazon Managed Streaming for Kafka
Amazon Kinesis Data Analytics
Amazon Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
AWS Lambda
Amazon Lex
Amazon CloudWatch Logs
AWS Elemental MediaTailor
AWS Elemental MediaConnect
AWS Elemental MediaConvert
AWS Elemental MediaPackage Live
AWS Elemental MediaPackage Video on Demand
Amazon MQ
Amazon VPC NAT Gateways
Amazon Neptune
AWS OpsWorks
Amazon Polly
Amazon QLDB
Amazon RDS
Amazon Redshift
Amazon Rekognition
AWS RoboMaker
Amazon Route 53
Amazon Route 53 Resolver
Amazon S3
Amazon SageMaker Batch Transform Jobs
Amazon SageMaker Endpoints
Amazon SageMaker Endpoint Instances
Amazon SageMaker Ground Truth
Amazon SageMaker Processing Jobs
Amazon SageMaker Training Jobs
AWS Service Catalog
Amazon Simple Email Service (SES)
Amazon Simple Notification Service (SNS)
Amazon Simple Queue Service (SQS)
AWS Systems Manager - Run Command
AWS Step Functions
AWS Storage Gateway
Amazon SWF
Amazon Textract
AWS Transfer Family
AWS Transit Gateway
Amazon Translate
AWS Trusted Advisor
AWS API Usage
AWS Site-to-Site VPN
AWS WAF Classic
AWS WAF
Amazon WorkMail
Amazon WorkSpaces

Available metrics

Amazon SageMaker Batch Transform Jobs

Name	Description	Unit	Statistics	Dimensions
CPUUtilization	The percentage of CPU units that are used by the containers on an instance. The value can range between `0%` and `100%`, and is multiplied by the number of CPUs. For example, if there are four CPUs, `CPUUtilization` can range from `0%` to `400%`.	Percent	Average	Region, Host
MemoryUtilization	The percentage of memory that is used by the containers on an instance. This value can range between `0%` and `100%`.	Percent	Average	Region, Host
GPUMemoryUtilization	The percentage of GPU memory used by the containers on an instance. The value can range between `0%` and `100%` and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUMemoryUtilization` can range from `0%` to `400%`.	Percent	Average	Region, Host
GPUUtilization	The percentage of GPU units that are used by the containers on an instance. The value can range between `0%` and `100%` and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUUtilization` can range from `0%` to `400%`.	Percent	Average	Region, Host
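
Because `CPUUtilization`, `GPUUtilization`, and `GPUMemoryUtilization` are multiplied by the number of CPUs or GPUs, it can help to scale them back to a `0%` to `100%` range before comparing instances of different sizes. A minimal sketch, assuming you know the unit count of the instance type (the function name is hypothetical):

```python
def normalize_utilization(raw_percent, unit_count):
    """Scale a value reported on a 0..(100 * unit_count) range back to 0..100.

    raw_percent: the metric value as reported (for example, 260.0).
    unit_count:  number of CPUs or GPUs on the instance (assumed to be known).
    """
    if unit_count <= 0:
        raise ValueError("unit_count must be positive")
    return raw_percent / unit_count


# Example: 260% reported on an instance with four CPUs.
print(normalize_utilization(260.0, 4))  # 65.0
```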

Amazon SageMaker Processing Jobs, Amazon SageMaker Training Jobs

Name	Description	Unit	Statistics	Dimensions
CPUUtilization	The percentage of CPU units that are used by the containers on an instance. The value can range between `0%` and `100%`, and is multiplied by the number of CPUs. For example, if there are four CPUs, `CPUUtilization` can range from `0%` to `400%`.	Percent	Average	Region, Host
DiskUtilization	The percentage of disk space used by the containers on an instance. This value can range between `0%` and `100%`. This metric is not supported for batch transform jobs.	Percent	Average	EndpointName, VariantName
GPUMemoryUtilization	The percentage of GPU memory used by the containers on an instance. The value can range between `0%` and `100%` and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUMemoryUtilization` can range from `0%` to `400%`.	Percent	Average	Region, Host
GPUUtilization	The percentage of GPU units that are used by the containers on an instance. The value can range between `0%` and `100%` and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUUtilization` can range from `0%` to `400%`.	Percent	Average	Region, Host
MemoryUtilization	The percentage of memory that is used by the containers on an instance. This value can range between `0%` and `100%`.	Percent	Average	Region, Host

Amazon SageMaker Endpoint Instances

EndpointName is the main dimension.

Name	Description	Unit	Statistics	Dimensions
CPUUtilization	The percentage of CPU units that are used by the containers on an instance. The value can range between `0%` and `100%`, and is multiplied by the number of CPUs. For example, if there are four CPUs, `CPUUtilization` can range from `0%` to `400%`.	Percent	Average	EndpointName, VariantName
DiskUtilization	The percentage of disk space used by the containers on an instance. This value can range between `0%` and `100%`. This metric is not supported for batch transform jobs.	Percent	Average	EndpointName, VariantName
GPUMemoryUtilization	The percentage of GPU memory used by the containers on an instance. The value can range between `0%` and `100%` and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUMemoryUtilization` can range from `0%` to `400%`.	Percent	Average	EndpointName, VariantName
GPUUtilization	The percentage of GPU units that are used by the containers on an instance. The value can range between `0%` and `100%` and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUUtilization` can range from `0%` to `400%`.	Percent	Average	EndpointName, VariantName
LoadedModelCount	The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance.	None	Average	EndpointName, VariantName
LoadedModelCount		None	Sum	EndpointName, VariantName
MemoryUtilization	The percentage of memory that is used by the containers on an instance. This value can range between `0%` and `100%`.	Percent	Average	EndpointName, VariantName

Amazon SageMaker Endpoints

EndpointName is the main dimension.

Name	Description	Unit	Statistics	Dimensions
Invocation4XXErrors	The number of `InvokeEndpoint` requests where the model returned a `4xx` HTTP response code. For each `4xx` response, `1` is sent; otherwise, `0` is sent.	None	Average	EndpointName, VariantName
Invocation4XXErrors		None	Sum	EndpointName, VariantName
Invocation5XXErrors	The number of `InvokeEndpoint` requests where the model returned a `5xx` HTTP response code. For each `5xx` response, `1` is sent; otherwise, `0` is sent.	None	Average	EndpointName, VariantName
Invocation5XXErrors		None	Sum	EndpointName, VariantName
Invocations	The number of `InvokeEndpoint` requests sent to a model endpoint	None	Sum	EndpointName, VariantName
Invocations		None	Count	EndpointName, VariantName
InvocationsPerInstance	The number of invocations sent to a model, normalized by `InstanceCount` in each `ProductionVariant`. `1/numberOfInstances` is sent as the value on each request, where `numberOfInstances` is the number of active instances for the `ProductionVariant` behind the endpoint at the time of the request.	None	Sum	EndpointName, VariantName
ModelCacheHit	The number of `InvokeEndpoint` requests sent to the multi-model endpoint for which the model was already loaded	None	Sum	EndpointName, VariantName
ModelCacheHit		None	Average	EndpointName, VariantName
ModelCacheHit		None	Count	EndpointName, VariantName
ModelLatency	The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container.	Microseconds	Multi	EndpointName, VariantName
ModelLatency		Microseconds	Sum	EndpointName, VariantName
ModelLatency		Microseconds	Count	EndpointName, VariantName
ModelLoadingTime	The interval of time that it took to load the model through the container's `LoadModel` API call.	Microseconds	Multi	EndpointName, VariantName
ModelLoadingTime		Microseconds	Sum	EndpointName, VariantName
ModelLoadingTime		Microseconds	Count	EndpointName, VariantName
ModelLoadingWaitTime	The interval of time that an invocation request has waited for the target model to be downloaded, or loaded, or both in order to perform inference	Microseconds	Multi	EndpointName, VariantName
ModelLoadingWaitTime		Microseconds	Sum	EndpointName, VariantName
ModelLoadingWaitTime		Microseconds	Count	EndpointName, VariantName
ModelDownloadingTime	The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3)	Microseconds	Multi	EndpointName, VariantName
ModelDownloadingTime		Microseconds	Sum	EndpointName, VariantName
ModelDownloadingTime		Microseconds	Count	EndpointName, VariantName
ModelUnloadingTime	The interval of time that it took to unload the model through the container's `UnloadModel` API call	Microseconds	Multi	EndpointName, VariantName
ModelUnloadingTime		Microseconds	Sum	EndpointName, VariantName
ModelUnloadingTime		Microseconds	Count	EndpointName, VariantName
OverheadLatency	The interval of time added to the time taken to respond to a client request by SageMaker overheads. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the `ModelLatency`.	Microseconds	Multi	EndpointName, VariantName
OverheadLatency		Microseconds	Sum	EndpointName, VariantName
OverheadLatency		Microseconds	Count	EndpointName, VariantName
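
The `Sum` and `Count` statistics of the latency metrics above can be combined into a mean, and the descriptions of `InvocationsPerInstance` and `OverheadLatency` translate into simple arithmetic. The sketch below illustrates this with hypothetical helper names and made-up sample values; it is not how Dynatrace computes its charts.

```python
def mean_latency_ms(latency_sum_us, sample_count):
    """Mean latency in milliseconds from the Sum (microseconds) and Count
    statistics of a metric such as ModelLatency."""
    if sample_count == 0:
        return 0.0
    return latency_sum_us / sample_count / 1000.0


def invocations_per_instance(instance_counts_per_request):
    """Sum of 1/numberOfInstances over all requests, where each list entry is
    the number of active instances at the time of that request."""
    return sum(1.0 / count for count in instance_counts_per_request)


def end_to_end_latency_us(model_latency_us, overhead_latency_us):
    """OverheadLatency is the request-to-response time minus ModelLatency,
    so adding the two gives the time as measured by SageMaker."""
    return model_latency_us + overhead_latency_us


# Made-up sample values for illustration only.
print(mean_latency_ms(450000, 3))              # 150.0
print(invocations_per_instance([2, 2, 2, 2]))  # 2.0
print(end_to_end_latency_us(120000, 8000))     # 128000
```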

Amazon SageMaker Ground Truth

Name	Description	Dimensions	Statistics	Unit
ActiveWorkers	The number of workers on a private work team performing a labeling job	Region, LabelingJobName	Maximum	None
DatasetObjectsAutoAnnotated	The number of dataset objects auto-annotated in a labeling job. This metric is only emitted when automated labeling is enabled.	Region, LabelingJobName	Maximum	None
DatasetObjectsHumanAnnotated	The number of dataset objects annotated by a human in a labeling job	Region, LabelingJobName	Maximum	None
DatasetObjectsLabelingFailed	The number of dataset objects that failed labeling in a labeling job	Region, LabelingJobName	Maximum	None
JobsFailed	The number of labeling jobs that failed	Region	Count	None
JobsFailed		Region	Sum	None
JobsStopped	The number of labeling jobs that were stopped	Region	Count	None
JobsStopped		Region	Sum	None
JobsSucceeded	The number of labeling jobs that succeeded	Region	Count	None
JobsSucceeded		Region	Sum	None
TasksSubmitted	The number of tasks submitted/completed by a private work team	Region, LabelingJobName	Maximum	None
TimeSpent	Time spent on a task completed by a private work team	Region, LabelingJobName	Maximum	Seconds
TotalDatasetObjectsLabeled	The number of dataset objects labeled successfully in a labeling job	Region, LabelingJobName	Maximum	None