Amazon SageMaker (Batch Transform Jobs, Endpoint Instances, Endpoints, Ground Truth, Processing Jobs, Training Jobs) monitoring

Dynatrace ingests metrics for multiple preselected namespaces, including Amazon SageMaker. You can view metrics for each service instance, split metrics into multiple dimensions, and create custom charts that you can pin to your dashboards.

Prerequisites

To enable monitoring for this service, you need:

  • ActiveGate version 1.181+, as follows:

    • For Dynatrace SaaS deployments, you need an Environment ActiveGate or a Multi-environment ActiveGate.

    • For Dynatrace Managed deployments, you can use any kind of ActiveGate.

      For role-based access (whether in a SaaS or Managed deployment), you need an Environment ActiveGate installed on an Amazon EC2 host.

  • Dynatrace version 1.182+

  • An updated AWS monitoring policy that includes the additional AWS services.
    To update the AWS IAM policy, use the JSON below, which contains the monitoring policy (permissions) for all supported services.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "acm-pca:ListCertificateAuthorities",
        "apigateway:GET",
        "apprunner:ListServices",
        "appstream:DescribeFleets",
        "appsync:ListGraphqlApis",
        "athena:ListWorkGroups",
        "autoscaling:DescribeAutoScalingGroups",
        "cloudformation:ListStackResources",
        "cloudfront:ListDistributions",
        "cloudhsm:DescribeClusters",
        "cloudsearch:DescribeDomains",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "codebuild:ListProjects",
        "datasync:ListTasks",
        "dax:DescribeClusters",
        "directconnect:DescribeConnections",
        "dms:DescribeReplicationInstances",
        "dynamodb:ListTables",
        "dynamodb:ListTagsOfResource",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeNatGateways",
        "ec2:DescribeSpotFleetRequests",
        "ec2:DescribeTransitGateways",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpnConnections",
        "ecs:ListClusters",
        "eks:ListClusters",
        "elasticache:DescribeCacheClusters",
        "elasticbeanstalk:DescribeEnvironmentResources",
        "elasticbeanstalk:DescribeEnvironments",
        "elasticfilesystem:DescribeFileSystems",
        "elasticloadbalancing:DescribeInstanceHealth",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeRules",
        "elasticloadbalancing:DescribeTags",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticmapreduce:ListClusters",
        "elastictranscoder:ListPipelines",
        "es:ListDomainNames",
        "events:ListEventBuses",
        "firehose:ListDeliveryStreams",
        "fsx:DescribeFileSystems",
        "gamelift:ListFleets",
        "glue:GetJobs",
        "inspector:ListAssessmentTemplates",
        "kafka:ListClusters",
        "kinesis:ListStreams",
        "kinesisanalytics:ListApplications",
        "kinesisvideo:ListStreams",
        "lambda:ListFunctions",
        "lambda:ListTags",
        "lex:GetBots",
        "logs:DescribeLogGroups",
        "mediaconnect:ListFlows",
        "mediaconvert:DescribeEndpoints",
        "mediapackage-vod:ListPackagingConfigurations",
        "mediapackage:ListChannels",
        "mediatailor:ListPlaybackConfigurations",
        "opsworks:DescribeStacks",
        "qldb:ListLedgers",
        "rds:DescribeDBClusters",
        "rds:DescribeDBInstances",
        "rds:DescribeEvents",
        "rds:ListTagsForResource",
        "redshift:DescribeClusters",
        "robomaker:ListSimulationJobs",
        "route53:ListHostedZones",
        "route53resolver:ListResolverEndpoints",
        "s3:ListAllMyBuckets",
        "sagemaker:ListEndpoints",
        "sns:ListTopics",
        "sqs:ListQueues",
        "storagegateway:ListGateways",
        "sts:GetCallerIdentity",
        "swf:ListDomains",
        "tag:GetResources",
        "tag:GetTagKeys",
        "transfer:ListServers",
        "workmail:ListOrganizations",
        "workspaces:DescribeWorkspaces"
      ],
      "Resource": "*"
    }
  ]
}

If you don't want to add permissions for all services and prefer to select permissions for certain services only, consult the table below. It lists the permissions that are required for all AWS cloud services and, for each supported service, the optional permissions specific to that service.

Permissions required for AWS monitoring integration:
  • "cloudwatch:GetMetricData"
  • "cloudwatch:GetMetricStatistics"
  • "cloudwatch:ListMetrics"
  • "sts:GetCallerIdentity"
  • "tag:GetResources"
  • "tag:GetTagKeys"
  • "ec2:DescribeAvailabilityZones"

| Name | Permissions |
| --- | --- |
| All monitored Amazon services (required) | cloudwatch:GetMetricData, cloudwatch:GetMetricStatistics, cloudwatch:ListMetrics, sts:GetCallerIdentity, tag:GetResources, tag:GetTagKeys, ec2:DescribeAvailabilityZones |
| AWS Certificate Manager Private Certificate Authority | acm-pca:ListCertificateAuthorities |
| Amazon MQ | |
| Amazon API Gateway | apigateway:GET |
| AWS App Runner | apprunner:ListServices |
| Amazon AppStream | appstream:DescribeFleets |
| AWS AppSync | appsync:ListGraphqlApis |
| Amazon Athena | athena:ListWorkGroups |
| Amazon Aurora | rds:DescribeDBClusters |
| Amazon EC2 Auto Scaling | autoscaling:DescribeAutoScalingGroups |
| Amazon EC2 Auto Scaling (built-in) | autoscaling:DescribeAutoScalingGroups |
| AWS Billing | |
| Amazon Keyspaces | |
| AWS Chatbot | |
| Amazon CloudFront | cloudfront:ListDistributions |
| AWS CloudHSM | cloudhsm:DescribeClusters |
| Amazon CloudSearch | cloudsearch:DescribeDomains |
| AWS CodeBuild | codebuild:ListProjects |
| Amazon Cognito | |
| Amazon Connect | |
| Amazon Elastic Kubernetes Service (EKS) | eks:ListClusters |
| AWS DataSync | datasync:ListTasks |
| Amazon DynamoDB Accelerator (DAX) | dax:DescribeClusters |
| AWS Database Migration Service (AWS DMS) | dms:DescribeReplicationInstances |
| Amazon DocumentDB | rds:DescribeDBClusters |
| AWS Direct Connect | directconnect:DescribeConnections |
| Amazon DynamoDB | dynamodb:ListTables |
| Amazon DynamoDB (built-in) | dynamodb:ListTables, dynamodb:ListTagsOfResource |
| Amazon EBS | ec2:DescribeVolumes |
| Amazon EBS (built-in) | ec2:DescribeVolumes |
| Amazon EC2 API | |
| Amazon EC2 (built-in) | ec2:DescribeInstances |
| Amazon EC2 Spot Fleet | ec2:DescribeSpotFleetRequests |
| Amazon Elastic Container Service (ECS) | ecs:ListClusters |
| Amazon ECS Container Insights | ecs:ListClusters |
| Amazon ElastiCache (EC) | elasticache:DescribeCacheClusters |
| AWS Elastic Beanstalk | elasticbeanstalk:DescribeEnvironments |
| Amazon Elastic File System (EFS) | elasticfilesystem:DescribeFileSystems |
| Amazon Elastic Inference | |
| Amazon Elastic Map Reduce (EMR) | elasticmapreduce:ListClusters |
| Amazon Elasticsearch Service (ES) | es:ListDomainNames |
| Amazon Elastic Transcoder | elastictranscoder:ListPipelines |
| Amazon Elastic Load Balancer (ELB) (built-in) | elasticloadbalancing:DescribeInstanceHealth, elasticloadbalancing:DescribeListeners, elasticloadbalancing:DescribeLoadBalancers, elasticloadbalancing:DescribeRules, elasticloadbalancing:DescribeTags, elasticloadbalancing:DescribeTargetHealth |
| Amazon EventBridge | events:ListEventBuses |
| Amazon FSx | fsx:DescribeFileSystems |
| Amazon GameLift | gamelift:ListFleets |
| AWS Glue | glue:GetJobs |
| Amazon Inspector | inspector:ListAssessmentTemplates |
| AWS Internet of Things (IoT) | |
| AWS IoT Analytics | |
| Amazon Managed Streaming for Kafka | kafka:ListClusters |
| Amazon Kinesis Data Analytics | kinesisanalytics:ListApplications |
| Amazon Data Firehose | firehose:ListDeliveryStreams |
| Amazon Kinesis Data Streams | kinesis:ListStreams |
| Amazon Kinesis Video Streams | kinesisvideo:ListStreams |
| AWS Lambda | lambda:ListFunctions |
| AWS Lambda (built-in) | lambda:ListFunctions, lambda:ListTags |
| Amazon Lex | lex:GetBots |
| Amazon Application and Network Load Balancer (built-in) | elasticloadbalancing:DescribeInstanceHealth, elasticloadbalancing:DescribeListeners, elasticloadbalancing:DescribeLoadBalancers, elasticloadbalancing:DescribeRules, elasticloadbalancing:DescribeTags, elasticloadbalancing:DescribeTargetHealth |
| Amazon CloudWatch Logs | logs:DescribeLogGroups |
| AWS Elemental MediaConnect | mediaconnect:ListFlows |
| AWS Elemental MediaConvert | mediaconvert:DescribeEndpoints |
| AWS Elemental MediaPackage Live | mediapackage:ListChannels |
| AWS Elemental MediaPackage Video on Demand | mediapackage-vod:ListPackagingConfigurations |
| AWS Elemental MediaTailor | mediatailor:ListPlaybackConfigurations |
| Amazon VPC NAT Gateways | ec2:DescribeNatGateways |
| Amazon Neptune | rds:DescribeDBClusters |
| AWS OpsWorks | opsworks:DescribeStacks |
| Amazon Polly | |
| Amazon QLDB | qldb:ListLedgers |
| Amazon RDS | rds:DescribeDBInstances |
| Amazon RDS (built-in) | rds:DescribeDBInstances, rds:DescribeEvents, rds:ListTagsForResource |
| Amazon Redshift | redshift:DescribeClusters |
| Amazon Rekognition | |
| AWS RoboMaker | robomaker:ListSimulationJobs |
| Amazon Route 53 | route53:ListHostedZones |
| Amazon Route 53 Resolver | route53resolver:ListResolverEndpoints |
| Amazon S3 | s3:ListAllMyBuckets |
| Amazon S3 (built-in) | s3:ListAllMyBuckets |
| Amazon SageMaker Batch Transform Jobs | |
| Amazon SageMaker Endpoint Instances | sagemaker:ListEndpoints |
| Amazon SageMaker Endpoints | sagemaker:ListEndpoints |
| Amazon SageMaker Ground Truth | |
| Amazon SageMaker Processing Jobs | |
| Amazon SageMaker Training Jobs | |
| AWS Service Catalog | |
| Amazon Simple Email Service (SES) | |
| Amazon Simple Notification Service (SNS) | sns:ListTopics |
| Amazon Simple Queue Service (SQS) | sqs:ListQueues |
| AWS Systems Manager - Run Command | |
| AWS Step Functions | |
| AWS Storage Gateway | storagegateway:ListGateways |
| Amazon SWF | swf:ListDomains |
| Amazon Textract | |
| AWS IoT Things Graph | |
| AWS Transfer Family | transfer:ListServers |
| AWS Transit Gateway | ec2:DescribeTransitGateways |
| Amazon Translate | |
| AWS Trusted Advisor | |
| AWS API Usage | |
| AWS Site-to-Site VPN | ec2:DescribeVpnConnections |
| AWS WAF Classic | |
| AWS WAF | |
| Amazon WorkMail | workmail:ListOrganizations |
| Amazon WorkSpaces | workspaces:DescribeWorkspaces |

Example of a JSON policy for a single service:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "apigateway:GET",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "sts:GetCallerIdentity",
        "tag:GetResources",
        "tag:GetTagKeys",
        "ec2:DescribeAvailabilityZones"
      ],
      "Resource": "*"
    }
  ]
}

In this example, you need to select the following from the complete list of permissions:

  • "apigateway:GET" for Amazon API Gateway
  • "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "sts:GetCallerIdentity", "tag:GetResources", "tag:GetTagKeys", and "ec2:DescribeAvailabilityZones" for all AWS cloud services.
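
The same selection can be scripted for the SageMaker services. The following is a minimal sketch, assuming you only want the required permissions plus "sagemaker:ListEndpoints" (the one SageMaker-specific permission listed in the table). The helper name build_policy is hypothetical and not part of any Dynatrace or AWS tooling; the "Sid" value is simply copied from the example above.

```python
import json

# The seven permissions the table lists as required for all monitored services.
REQUIRED_ACTIONS = [
    "cloudwatch:GetMetricData",
    "cloudwatch:GetMetricStatistics",
    "cloudwatch:ListMetrics",
    "sts:GetCallerIdentity",
    "tag:GetResources",
    "tag:GetTagKeys",
    "ec2:DescribeAvailabilityZones",
]


def build_policy(service_actions):
    """Combine the required actions with service-specific ones (hypothetical helper)."""
    actions = sorted(set(REQUIRED_ACTIONS) | set(service_actions))
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": actions,
                "Resource": "*",
            }
        ],
    }


# SageMaker Endpoints and Endpoint Instances both list sagemaker:ListEndpoints.
sagemaker_policy = build_policy(["sagemaker:ListEndpoints"])
print(json.dumps(sagemaker_policy, indent=2))
```

The printed document can be pasted into the IAM policy editor. The other SageMaker services in the table (Batch Transform Jobs, Ground Truth, Processing Jobs, Training Jobs) list no additional permissions, so the same policy would cover them under this assumption.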

| Endpoint | Service |
| --- | --- |
| autoscaling.<REGION>.amazonaws.com | Amazon EC2 Auto Scaling (built-in), Amazon EC2 Auto Scaling |
| lambda.<REGION>.amazonaws.com | AWS Lambda (built-in), AWS Lambda |
| elasticloadbalancing.<REGION>.amazonaws.com | Amazon Application and Network Load Balancer (built-in), Amazon Elastic Load Balancer (ELB) (built-in) |
| dynamodb.<REGION>.amazonaws.com | Amazon DynamoDB (built-in), Amazon DynamoDB |
| ec2.<REGION>.amazonaws.com | Amazon EBS (built-in), Amazon EC2 (built-in), Amazon EBS, Amazon EC2 Spot Fleet, Amazon VPC NAT Gateways, AWS Transit Gateway, AWS Site-to-Site VPN |
| rds.<REGION>.amazonaws.com | Amazon RDS (built-in), Amazon Aurora, Amazon DocumentDB, Amazon Neptune, Amazon RDS |
| s3.<REGION>.amazonaws.com | Amazon S3 (built-in) |
| acm-pca.<REGION>.amazonaws.com | AWS Certificate Manager Private Certificate Authority |
| apigateway.<REGION>.amazonaws.com | Amazon API Gateway |
| apprunner.<REGION>.amazonaws.com | AWS App Runner |
| appstream2.<REGION>.amazonaws.com | Amazon AppStream |
| appsync.<REGION>.amazonaws.com | AWS AppSync |
| athena.<REGION>.amazonaws.com | Amazon Athena |
| cloudfront.amazonaws.com | Amazon CloudFront |
| cloudhsmv2.<REGION>.amazonaws.com | AWS CloudHSM |
| cloudsearch.<REGION>.amazonaws.com | Amazon CloudSearch |
| codebuild.<REGION>.amazonaws.com | AWS CodeBuild |
| datasync.<REGION>.amazonaws.com | AWS DataSync |
| dax.<REGION>.amazonaws.com | Amazon DynamoDB Accelerator (DAX) |
| dms.<REGION>.amazonaws.com | AWS Database Migration Service (AWS DMS) |
| directconnect.<REGION>.amazonaws.com | AWS Direct Connect |
| ecs.<REGION>.amazonaws.com | Amazon Elastic Container Service (ECS), Amazon ECS Container Insights |
| elasticfilesystem.<REGION>.amazonaws.com | Amazon Elastic File System (EFS) |
| eks.<REGION>.amazonaws.com | Amazon Elastic Kubernetes Service (EKS) |
| elasticache.<REGION>.amazonaws.com | Amazon ElastiCache (EC) |
| elasticbeanstalk.<REGION>.amazonaws.com | AWS Elastic Beanstalk |
| elastictranscoder.<REGION>.amazonaws.com | Amazon Elastic Transcoder |
| es.<REGION>.amazonaws.com | Amazon Elasticsearch Service (ES) |
| events.<REGION>.amazonaws.com | Amazon EventBridge |
| fsx.<REGION>.amazonaws.com | Amazon FSx |
| gamelift.<REGION>.amazonaws.com | Amazon GameLift |
| glue.<REGION>.amazonaws.com | AWS Glue |
| inspector.<REGION>.amazonaws.com | Amazon Inspector |
| kafka.<REGION>.amazonaws.com | Amazon Managed Streaming for Kafka |
| models.lex.<REGION>.amazonaws.com | Amazon Lex |
| logs.<REGION>.amazonaws.com | Amazon CloudWatch Logs |
| api.mediatailor.<REGION>.amazonaws.com | AWS Elemental MediaTailor |
| mediaconnect.<REGION>.amazonaws.com | AWS Elemental MediaConnect |
| mediapackage.<REGION>.amazonaws.com | AWS Elemental MediaPackage Live |
| mediapackage-vod.<REGION>.amazonaws.com | AWS Elemental MediaPackage Video on Demand |
| opsworks.<REGION>.amazonaws.com | AWS OpsWorks |
| qldb.<REGION>.amazonaws.com | Amazon QLDB |
| redshift.<REGION>.amazonaws.com | Amazon Redshift |
| robomaker.<REGION>.amazonaws.com | AWS RoboMaker |
| route53.amazonaws.com | Amazon Route 53 |
| route53resolver.<REGION>.amazonaws.com | Amazon Route 53 Resolver |
| api.sagemaker.<REGION>.amazonaws.com | Amazon SageMaker Endpoints, Amazon SageMaker Endpoint Instances |
| sns.<REGION>.amazonaws.com | Amazon Simple Notification Service (SNS) |
| sqs.<REGION>.amazonaws.com | Amazon Simple Queue Service (SQS) |
| storagegateway.<REGION>.amazonaws.com | AWS Storage Gateway |
| swf.<REGION>.amazonaws.com | Amazon SWF |
| transfer.<REGION>.amazonaws.com | AWS Transfer Family |
| workmail.<REGION>.amazonaws.com | Amazon WorkMail |
| workspaces.<REGION>.amazonaws.com | Amazon WorkSpaces |
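
The <REGION> placeholder in the endpoint list stands for an AWS region code. The sketch below shows how the SageMaker endpoint could be resolved for a region and, optionally, probed for connectivity. It assumes the listed endpoints are hosts that the monitoring ActiveGate must be able to reach over HTTPS on port 443; that assumption, the function names, and the us-east-1 region are illustrative and do not come from the page itself.

```python
import socket

# Endpoint template taken from the SageMaker row of the endpoint list.
SAGEMAKER_ENDPOINT_TEMPLATE = "api.sagemaker.<REGION>.amazonaws.com"


def endpoint_for_region(template, region):
    """Substitute the <REGION> placeholder; global endpoints are returned unchanged."""
    return template.replace("<REGION>", region)


def is_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (port 443 is an assumption)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host = endpoint_for_region(SAGEMAKER_ENDPOINT_TEMPLATE, "us-east-1")
print(host)
```

Calling is_reachable(host) from the machine that hosts the ActiveGate gives a quick connectivity check. It only verifies that a TCP connection can be opened, not that the IAM permissions are correct.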

Enable monitoring

To learn how to enable service monitoring, see Enable service monitoring.

View service metrics

You can view the service metrics in your Dynatrace environment either on the custom device overview page or on your Dashboards page.

View metrics on the custom device overview page

To access the custom device overview page

  1. Go to Technologies & Processes or Technologies & Processes Classic (latest Dynatrace).
  2. Filter by service name and select the relevant custom device group.
  3. Once you select the custom device group, you're on the custom device group overview page.
  4. The custom device group overview page lists all instances (custom devices) belonging to the group. Select an instance to view the custom device overview page.

View metrics on your dashboard

You can also view metrics in the Dynatrace web UI on dashboards. There is no preset dashboard available for this service, but you can create your own dashboard.

To check the availability of preset dashboards for each AWS service, see the list below.

| AWS service | Preset dashboard |
| --- | --- |
| Amazon EC2 Auto Scaling (built-in) | Not applicable |
| AWS Lambda (built-in) | Not applicable |
| Amazon Application and Network Load Balancer (built-in) | Not applicable |
| Amazon DynamoDB (built-in) | Not applicable |
| Amazon EBS (built-in) | Not applicable |
| Amazon EC2 (built-in) | Not applicable |
| Amazon Elastic Load Balancer (ELB) (built-in) | Not applicable |
| Amazon RDS (built-in) | Not applicable |
| Amazon S3 (built-in) | Not applicable |
| AWS Certificate Manager Private Certificate Authority | Not applicable |
| All monitored Amazon services | Not applicable |
| Amazon API Gateway | Not applicable |
| AWS App Runner | Not applicable |
| Amazon AppStream | Applicable |
| AWS AppSync | Applicable |
| Amazon Athena | Applicable |
| Amazon Aurora | Not applicable |
| Amazon EC2 Auto Scaling | Applicable |
| AWS Billing | Applicable |
| Amazon Keyspaces | Applicable |
| AWS Chatbot | Applicable |
| Amazon CloudFront | Not applicable |
| AWS CloudHSM | Applicable |
| Amazon CloudSearch | Applicable |
| AWS CodeBuild | Applicable |
| Amazon Cognito | Not applicable |
| Amazon Connect | Applicable |
| AWS DataSync | Applicable |
| Amazon DynamoDB Accelerator (DAX) | Applicable |
| AWS Database Migration Service (AWS DMS) | Applicable |
| Amazon DocumentDB | Applicable |
| AWS Direct Connect | Applicable |
| Amazon DynamoDB | Not applicable |
| Amazon EBS | Not applicable |
| Amazon EC2 Spot Fleet | Not applicable |
| Amazon EC2 API | Applicable |
| Amazon Elastic Container Service (ECS) | Not applicable |
| Amazon ECS Container Insights | Applicable |
| Amazon Elastic File System (EFS) | Not applicable |
| Amazon Elastic Kubernetes Service (EKS) | Applicable |
| Amazon ElastiCache (EC) | Not applicable |
| AWS Elastic Beanstalk | Applicable |
| Amazon Elastic Inference | Applicable |
| Amazon Elastic Transcoder | Applicable |
| Amazon Elastic Map Reduce (EMR) | Not applicable |
| Amazon Elasticsearch Service (ES) | Not applicable |
| Amazon EventBridge | Applicable |
| Amazon FSx | Applicable |
| Amazon GameLift | Applicable |
| AWS Glue | Not applicable |
| Amazon Inspector | Applicable |
| AWS Internet of Things (IoT) | Not applicable |
| AWS IoT Things Graph | Applicable |
| AWS IoT Analytics | Applicable |
| Amazon Managed Streaming for Kafka | Applicable |
| Amazon Kinesis Data Analytics | Not applicable |
| Amazon Data Firehose | Not applicable |
| Amazon Kinesis Data Streams | Not applicable |
| Amazon Kinesis Video Streams | Not applicable |
| AWS Lambda | Not applicable |
| Amazon Lex | Applicable |
| Amazon CloudWatch Logs | Applicable |
| AWS Elemental MediaTailor | Applicable |
| AWS Elemental MediaConnect | Applicable |
| AWS Elemental MediaConvert | Applicable |
| AWS Elemental MediaPackage Live | Applicable |
| AWS Elemental MediaPackage Video on Demand | Applicable |
| Amazon MQ | Applicable |
| Amazon VPC NAT Gateways | Not applicable |
| Amazon Neptune | Applicable |
| AWS OpsWorks | Applicable |
| Amazon Polly | Applicable |
| Amazon QLDB | Applicable |
| Amazon RDS | Not applicable |
| Amazon Redshift | Not applicable |
| Amazon Rekognition | Applicable |
| AWS RoboMaker | Applicable |
| Amazon Route 53 | Applicable |
| Amazon Route 53 Resolver | Applicable |
| Amazon S3 | Not applicable |
| Amazon SageMaker Batch Transform Jobs | Not applicable |
| Amazon SageMaker Endpoints | Not applicable |
| Amazon SageMaker Endpoint Instances | Not applicable |
| Amazon SageMaker Ground Truth | Not applicable |
| Amazon SageMaker Processing Jobs | Not applicable |
| Amazon SageMaker Training Jobs | Not applicable |
| AWS Service Catalog | Applicable |
| Amazon Simple Email Service (SES) | Not applicable |
| Amazon Simple Notification Service (SNS) | Not applicable |
| Amazon Simple Queue Service (SQS) | Not applicable |
| AWS Systems Manager - Run Command | Applicable |
| AWS Step Functions | Applicable |
| AWS Storage Gateway | Applicable |
| Amazon SWF | Applicable |
| Amazon Textract | Applicable |
| AWS Transfer Family | Applicable |
| AWS Transit Gateway | Applicable |
| Amazon Translate | Applicable |
| AWS Trusted Advisor | Applicable |
| AWS API Usage | Applicable |
| AWS Site-to-Site VPN | Applicable |
| AWS WAF Classic | Applicable |
| AWS WAF | Applicable |
| Amazon WorkMail | Applicable |
| Amazon WorkSpaces | Applicable |

Available metrics

Amazon SageMaker Batch Transform Jobs

| Name | Description | Unit | Statistics | Dimensions | Recommended |
| --- | --- | --- | --- | --- | --- |
| CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | Applicable |
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | Applicable |
| GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | Applicable |
| GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | Applicable |
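
Because CPUUtilization, GPUUtilization, and GPUMemoryUtilization are multiplied by the number of CPUs or GPUs, a reported value above 100% is not necessarily a problem. The sketch below divides a reported value by the unit count to get a per-unit percentage. The function name and the sample numbers are illustrative, and the unit count has to come from your own knowledge of the instance type; it is not part of the metric.

```python
def per_unit_utilization(reported_percent, unit_count):
    """Scale a 0..100*N utilization value back to a 0..100 per-unit percentage."""
    if unit_count < 1:
        raise ValueError("unit_count must be at least 1")
    max_percent = 100.0 * unit_count
    if not 0.0 <= reported_percent <= max_percent:
        raise ValueError(
            f"expected a value between 0 and {max_percent}, got {reported_percent}"
        )
    return reported_percent / unit_count


# Four CPUs reporting 300% in total are on average 75% busy each.
print(per_unit_utilization(300.0, 4))  # 75.0
```

MemoryUtilization does not need this treatment, since the table describes it as ranging between 0% and 100% only.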

Amazon SageMaker Processing Jobs, Amazon SageMaker Training Jobs

| Name | Description | Unit | Statistics | Dimensions | Recommended |
| --- | --- | --- | --- | --- | --- |
| CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | Applicable |
| DiskUtilization | The percentage of disk space used by the containers on an instance. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | EndpointName, VariantName | Applicable |
| GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | Applicable |
| GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | Applicable |
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | Applicable |

Amazon SageMaker Endpoint Instances

| Name | Description | Unit | Statistics | Dimensions | Recommended |
| --- | --- | --- | --- | --- | --- |
| CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | Applicable |
| DiskUtilization | The percentage of disk space used by the containers on an instance. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | EndpointName, VariantName | Applicable |
| GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | Applicable |
| GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | Applicable |
| LoadedModelCount | The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. | None | Average | EndpointName, VariantName | |
| LoadedModelCount | | None | Sum | EndpointName, VariantName | |
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | EndpointName, VariantName | Applicable |

Amazon SageMaker Endpoints

| Name | Description | Unit | Statistics | Dimensions | Recommended |
| --- | --- | --- | --- | --- | --- |
| Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
| Invocation4XXErrors | | None | Sum | EndpointName, VariantName | |
| Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
| Invocation5XXErrors | | None | Sum | EndpointName, VariantName | Applicable |
| Invocations | The number of InvokeEndpoint requests sent to a model endpoint. | None | Sum | EndpointName, VariantName | Applicable |
| Invocations | | None | Count | EndpointName, VariantName | |
| InvocationsPerInstance | The number of invocations sent to a model, normalized by InstanceCount in each ProductionVariant. 1/numberOfInstances is sent as the value on each request, where numberOfInstances is the number of active instances for the ProductionVariant behind the endpoint at the time of the request. | None | Sum | EndpointName, VariantName | |
| ModelCacheHit | The number of InvokeEndpoint requests sent to the multi-model endpoint for which the model was already loaded. | None | Sum | EndpointName, VariantName | |
| ModelCacheHit | | None | Average | EndpointName, VariantName | |
| ModelCacheHit | | None | Count | EndpointName, VariantName | |
| ModelLatency | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. | Microseconds | Multi | EndpointName, VariantName | Applicable |
| ModelLatency | | Microseconds | Sum | EndpointName, VariantName | |
| ModelLatency | | Microseconds | Count | EndpointName, VariantName | |
| ModelLoadingTime | The interval of time that it took to load the model through the container's LoadModel API call. | Microseconds | Multi | EndpointName, VariantName | |
| ModelLoadingTime | | Microseconds | Sum | EndpointName, VariantName | |
| ModelLoadingTime | | Microseconds | Count | EndpointName, VariantName | |
| ModelLoadingWaitTime | The interval of time that an invocation request has waited for the target model to be downloaded, or loaded, or both in order to perform inference. | Microseconds | Multi | EndpointName, VariantName | |
| ModelLoadingWaitTime | | Microseconds | Sum | EndpointName, VariantName | |
| ModelLoadingWaitTime | | Microseconds | Count | EndpointName, VariantName | |
| ModelDownloadingTime | The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). | Microseconds | Multi | EndpointName, VariantName | |
| ModelDownloadingTime | | Microseconds | Sum | EndpointName, VariantName | |
| ModelDownloadingTime | | Microseconds | Count | EndpointName, VariantName | |
| ModelUnloadingTime | The interval of time that it took to unload the model through the container's UnloadModel API call. | Microseconds | Multi | EndpointName, VariantName | |
| ModelUnloadingTime | | Microseconds | Sum | EndpointName, VariantName | |
| ModelUnloadingTime | | Microseconds | Count | EndpointName, VariantName | |
| OverheadLatency | The interval of time added to the time taken to respond to a client request by SageMaker overheads. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. | Microseconds | Multi | EndpointName, VariantName | Applicable |
| OverheadLatency | | Microseconds | Sum | EndpointName, VariantName | |
| OverheadLatency | | Microseconds | Count | EndpointName, VariantName | |
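
Several of the endpoint metrics are easier to read once the statistics are combined. The sketch below works on plain Python lists with made-up sample values; it does not call any Dynatrace or AWS API, and the function names are illustrative. It shows three things that follow from the descriptions above: the Average of the 1/0 error samples is an error rate, the Sum of InvocationsPerInstance adds up 1/numberOfInstances per request, and a mean latency can be derived from the Sum and Count statistics and converted from microseconds to milliseconds.

```python
def error_rate(error_flags):
    """Average of the per-request 1/0 samples, i.e. the share of failed requests."""
    if not error_flags:
        return 0.0
    return sum(error_flags) / len(error_flags)


def invocations_per_instance(instance_counts_at_request):
    """Sum of 1/numberOfInstances over all requests in the period."""
    return sum(1.0 / count for count in instance_counts_at_request)


def mean_latency_ms(latency_sum_us, sample_count):
    """Mean latency in milliseconds from the Sum (microseconds) and Count statistics."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return latency_sum_us / sample_count / 1000.0


# One 5xx response out of four requests.
print(error_rate([0, 0, 1, 0]))                # 0.25
# Four requests while two instances were active: each instance handled about two.
print(invocations_per_instance([2, 2, 2, 2]))  # 2.0
# 1,500,000 microseconds of ModelLatency over 300 invocations.
print(mean_latency_ms(1_500_000, 300))         # 5.0
```

Per the OverheadLatency description, adding the mean ModelLatency and the mean OverheadLatency for the same period approximates the time between SageMaker receiving a request and returning the response.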

Amazon SageMaker Ground Truth

| Name | Description | Dimensions | Statistics | Unit | Recommended |
| --- | --- | --- | --- | --- | --- |
| ActiveWorkers | The number of workers on a private work team performing a labeling job. | Region, LabelingJobName | Maximum | None | |
| DatasetObjectsAutoAnnotated | The number of dataset objects auto-annotated in a labeling job. This metric is only emitted when automated labeling is enabled. | Region, LabelingJobName | Maximum | None | Applicable |
| DatasetObjectsHumanAnnotated | The number of dataset objects annotated by a human in a labeling job. | Region, LabelingJobName | Maximum | None | Applicable |
| DatasetObjectsLabelingFailed | The number of dataset objects that failed labeling in a labeling job. | Region, LabelingJobName | Maximum | None | Applicable |
| JobsFailed | The number of labeling jobs that failed. | Region | Count | None | |
| JobsFailed | | Region | Sum | None | Applicable |
| JobsStopped | The number of labeling jobs that were stopped. | Region | Count | None | |
| JobsStopped | | Region | Sum | None | |
| JobsSucceeded | The number of labeling jobs that succeeded. | Region | Count | None | |
| JobsSucceeded | | Region | Sum | None | Applicable |
| TasksSubmitted | The number of tasks submitted/completed by a private work team. | Region, LabelingJobName | Maximum | None | |
| TimeSpent | Time spent on a task completed by a private work team. | Region, LabelingJobName | Maximum | Seconds | |
| TotalDatasetObjectsLabeled | The number of dataset objects labeled successfully in a labeling job. | Region, LabelingJobName | Maximum | None | Applicable |