Amazon SageMaker (Batch Transform Jobs, Endpoint Instances, Endpoints, Ground Truth, Processing Jobs, Training Jobs) monitoring
Dynatrace ingests metrics for multiple preselected namespaces, including Amazon SageMaker. You can view metrics for each service instance, split metrics into multiple dimensions, and create custom charts that you can pin to your dashboards.
Prerequisites
To enable monitoring for this service, you need:
- ActiveGate version 1.181+, as follows:
  - For Dynatrace SaaS deployments, you need an Environment ActiveGate or a Multi-environment ActiveGate.
  - For Dynatrace Managed deployments, you can use any kind of ActiveGate.

  For role-based access (whether in a SaaS or Managed deployment), you need an Environment ActiveGate installed on an Amazon EC2 host.
- Dynatrace version 1.182+
- An updated AWS monitoring policy to include the additional AWS services.
To update the AWS IAM policy, use the JSON below, containing the monitoring policy (permissions) for all supporting services.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "acm-pca:ListCertificateAuthorities",
        "apigateway:GET",
        "apprunner:ListServices",
        "appstream:DescribeFleets",
        "appsync:ListGraphqlApis",
        "athena:ListWorkGroups",
        "autoscaling:DescribeAutoScalingGroups",
        "cloudformation:ListStackResources",
        "cloudfront:ListDistributions",
        "cloudhsm:DescribeClusters",
        "cloudsearch:DescribeDomains",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "codebuild:ListProjects",
        "datasync:ListTasks",
        "dax:DescribeClusters",
        "directconnect:DescribeConnections",
        "dms:DescribeReplicationInstances",
        "dynamodb:ListTables",
        "dynamodb:ListTagsOfResource",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeNatGateways",
        "ec2:DescribeSpotFleetRequests",
        "ec2:DescribeTransitGateways",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpnConnections",
        "ecs:ListClusters",
        "eks:ListClusters",
        "elasticache:DescribeCacheClusters",
        "elasticbeanstalk:DescribeEnvironmentResources",
        "elasticbeanstalk:DescribeEnvironments",
        "elasticfilesystem:DescribeFileSystems",
        "elasticloadbalancing:DescribeInstanceHealth",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeRules",
        "elasticloadbalancing:DescribeTags",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticmapreduce:ListClusters",
        "elastictranscoder:ListPipelines",
        "es:ListDomainNames",
        "events:ListEventBuses",
        "firehose:ListDeliveryStreams",
        "fsx:DescribeFileSystems",
        "gamelift:ListFleets",
        "glue:GetJobs",
        "inspector:ListAssessmentTemplates",
        "kafka:ListClusters",
        "kinesis:ListStreams",
        "kinesisanalytics:ListApplications",
        "kinesisvideo:ListStreams",
        "lambda:ListFunctions",
        "lambda:ListTags",
        "lex:GetBots",
        "logs:DescribeLogGroups",
        "mediaconnect:ListFlows",
        "mediaconvert:DescribeEndpoints",
        "mediapackage-vod:ListPackagingConfigurations",
        "mediapackage:ListChannels",
        "mediatailor:ListPlaybackConfigurations",
        "opsworks:DescribeStacks",
        "qldb:ListLedgers",
        "rds:DescribeDBClusters",
        "rds:DescribeDBInstances",
        "rds:DescribeEvents",
        "rds:ListTagsForResource",
        "redshift:DescribeClusters",
        "robomaker:ListSimulationJobs",
        "route53:ListHostedZones",
        "route53resolver:ListResolverEndpoints",
        "s3:ListAllMyBuckets",
        "sagemaker:ListEndpoints",
        "sns:ListTopics",
        "sqs:ListQueues",
        "storagegateway:ListGateways",
        "sts:GetCallerIdentity",
        "swf:ListDomains",
        "tag:GetResources",
        "tag:GetTagKeys",
        "transfer:ListServers",
        "workmail:ListOrganizations",
        "workspaces:DescribeWorkspaces"
      ],
      "Resource": "*"
    }
  ]
}
If you don't want to add permissions for all services and prefer to select permissions for certain services only, consult the table below. The table contains a set of permissions that are required for all services (All monitored Amazon services) and, for each supporting service, a list of optional permissions specific to that service.
Name | Additional permissions |
---|---|
AWS Certificate Manager Private Certificate Authority | "acm-pca:ListCertificateAuthorities" |
All monitored Amazon services | "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "sts:GetCallerIdentity", "tag:GetResources", "tag:GetTagKeys", "ec2:DescribeAvailabilityZones" |
Amazon MQ | |
Amazon API Gateway | "apigateway:GET" |
AWS App Runner | "apprunner:ListServices" |
Amazon AppStream | "appstream:DescribeFleets" |
AWS AppSync | "appsync:ListGraphqlApis" |
Amazon Athena | "athena:ListWorkGroups" |
Amazon Aurora | "rds:DescribeDBClusters" |
Amazon EC2 Auto Scaling | "autoscaling:DescribeAutoScalingGroups" |
Amazon EC2 Auto Scaling (built-in) | "autoscaling:DescribeAutoScalingGroups" |
AWS Billing | |
Amazon Keyspaces | |
AWS Chatbot | |
Amazon CloudFront | "cloudfront:ListDistributions" |
AWS CloudHSM | "cloudhsm:DescribeClusters" |
Amazon CloudSearch | "cloudsearch:DescribeDomains" |
AWS CodeBuild | "codebuild:ListProjects" |
Amazon Cognito | |
Amazon Connect | |
Amazon Elastic Kubernetes Service (EKS) | "eks:ListClusters" |
AWS DataSync | "datasync:ListTasks" |
Amazon DynamoDB Accelerator (DAX) | "dax:DescribeClusters" |
Amazon Database Migration Service | "dms:DescribeReplicationInstances" |
Amazon DocumentDB | "rds:DescribeDBClusters" |
AWS Direct Connect | "directconnect:DescribeConnections" |
Amazon DynamoDB | "dynamodb:ListTables" |
Amazon DynamoDB (built-in) | "dynamodb:ListTables", "dynamodb:ListTagsOfResource" |
Amazon EBS | "ec2:DescribeVolumes" |
Amazon EBS (built-in) | "ec2:DescribeVolumes" |
Amazon EC2 API | |
Amazon EC2 (built-in) | "ec2:DescribeInstances" |
Amazon EC2 Spot Fleet | "ec2:DescribeSpotFleetRequests" |
Amazon Elastic Container Service (ECS) | "ecs:ListClusters" |
Amazon ECS ContainerInsights | "ecs:ListClusters" |
Amazon ElastiCache (EC) | "elasticache:DescribeCacheClusters" |
AWS Elastic Beanstalk | "elasticbeanstalk:DescribeEnvironments" |
Amazon Elastic File System (EFS) | "elasticfilesystem:DescribeFileSystems" |
Amazon Elastic Inference | |
Amazon Elastic Map Reduce (EMR) | "elasticmapreduce:ListClusters" |
Amazon Elasticsearch Service (ES) | "es:ListDomainNames" |
Amazon Elastic Transcoder | "elastictranscoder:ListPipelines" |
AWS Elastic Load Balancing (ELB) (built-in) | "elasticloadbalancing:DescribeInstanceHealth", "elasticloadbalancing:DescribeListeners", "elasticloadbalancing:DescribeLoadBalancers", "elasticloadbalancing:DescribeRules", "elasticloadbalancing:DescribeTags", "elasticloadbalancing:DescribeTargetHealth" |
Amazon EventBridge | "events:ListEventBuses" |
Amazon FSx | "fsx:DescribeFileSystems" |
Amazon GameLift | "gamelift:ListFleets" |
AWS Glue | "glue:GetJobs" |
Amazon Inspector | "inspector:ListAssessmentTemplates" |
AWS Internet of Things (IoT) | |
AWS IoT Analytics | |
Amazon Managed Streaming for Kafka | "kafka:ListClusters" |
Amazon Kinesis Data Analytics | "kinesisanalytics:ListApplications" |
Amazon Kinesis Data Firehose | "firehose:ListDeliveryStreams" |
Amazon Kinesis Data Streams | "kinesis:ListStreams" |
Amazon Kinesis Video Streams | "kinesisvideo:ListStreams" |
Amazon Lambda | "lambda:ListFunctions" |
AWS Lambda (built-in) | "lambda:ListFunctions", "lambda:ListTags" |
Amazon Lex | "lex:GetBots" |
AWS Application and Network Load Balancer (built-in) | "elasticloadbalancing:DescribeInstanceHealth", "elasticloadbalancing:DescribeListeners", "elasticloadbalancing:DescribeLoadBalancers", "elasticloadbalancing:DescribeRules", "elasticloadbalancing:DescribeTags", "elasticloadbalancing:DescribeTargetHealth" |
Amazon CloudWatch Logs | "logs:DescribeLogGroups" |
AWS Elemental MediaConnect | "mediaconnect:ListFlows" |
Amazon MediaConvert | "mediaconvert:DescribeEndpoints" |
Amazon MediaPackage Live | "mediapackage:ListChannels" |
Amazon MediaPackage Video on Demand | "mediapackage-vod:ListPackagingConfigurations" |
Amazon MediaTailor | "mediatailor:ListPlaybackConfigurations" |
Amazon VPC NAT Gateways | "ec2:DescribeNatGateways" |
Amazon Neptune | "rds:DescribeDBClusters" |
AWS OpsWorks | "opsworks:DescribeStacks" |
Amazon Polly | |
Amazon QLDB | "qldb:ListLedgers" |
Amazon RDS | "rds:DescribeDBInstances" |
Amazon RDS (built-in) | "rds:DescribeDBInstances", "rds:DescribeEvents", "rds:ListTagsForResource" |
Amazon Redshift | "redshift:DescribeClusters" |
Amazon Rekognition | |
AWS RoboMaker | "robomaker:ListSimulationJobs" |
Amazon Route 53 | "route53:ListHostedZones" |
Amazon Route 53 Resolver | "route53resolver:ListResolverEndpoints" |
Amazon S3 | "s3:ListAllMyBuckets" |
Amazon S3 (built-in) | "s3:ListAllMyBuckets" |
Amazon SageMaker Batch Transform Jobs | |
Amazon SageMaker Endpoint Instances | "sagemaker:ListEndpoints" |
Amazon SageMaker Endpoints | "sagemaker:ListEndpoints" |
Amazon SageMaker Ground Truth | |
Amazon SageMaker Processing Jobs | |
Amazon SageMaker Training Jobs | |
AWS Service Catalog | |
Amazon Simple Email Service (SES) | |
Amazon Simple Notification Service (SNS) | "sns:ListTopics" |
Amazon Simple Queue Service (SQS) | "sqs:ListQueues" |
AWS Systems Manager - Run Command | |
AWS Step Functions | |
AWS Storage Gateway | "storagegateway:ListGateways" |
Amazon SWF | "swf:ListDomains" |
Amazon Textract | |
AWS IoT Things Graph | |
Amazon Transfer Family | "transfer:ListServers" |
AWS Transit Gateway | "ec2:DescribeTransitGateways" |
Amazon Translate | |
AWS Trusted Advisor | |
AWS API Usage | |
AWS Site-to-Site VPN | "ec2:DescribeVpnConnections" |
Amazon WAF Classic | |
Amazon WAF | |
Amazon WorkMail | "workmail:ListOrganizations" |
Amazon WorkSpaces | "workspaces:DescribeWorkspaces" |
Example of a JSON policy for a single service:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "apigateway:GET",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "sts:GetCallerIdentity",
        "tag:GetResources",
        "tag:GetTagKeys",
        "ec2:DescribeAvailabilityZones"
      ],
      "Resource": "*"
    }
  ]
}
In this example, from the complete list of permissions you need to select
- "apigateway:GET" for Amazon API Gateway
- "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "sts:GetCallerIdentity", "tag:GetResources", "tag:GetTagKeys", and "ec2:DescribeAvailabilityZones" for All monitored Amazon services.
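The selection described above can be sketched in Python. This is a minimal illustration, not part of Dynatrace or AWS tooling: the names `BASE_ACTIONS`, `SERVICE_ACTIONS`, and `build_policy` are hypothetical, and the mapping contains only a few rows copied from the permissions table. Extend it with the rows you need.

```python
import json

# Permissions required for every service (the "All monitored Amazon services" row).
BASE_ACTIONS = [
    "cloudwatch:GetMetricData",
    "cloudwatch:GetMetricStatistics",
    "cloudwatch:ListMetrics",
    "sts:GetCallerIdentity",
    "tag:GetResources",
    "tag:GetTagKeys",
    "ec2:DescribeAvailabilityZones",
]

# Hypothetical excerpt of the permissions table; an empty list means
# the service needs no additional permissions.
SERVICE_ACTIONS = {
    "Amazon API Gateway": ["apigateway:GET"],
    "Amazon SageMaker Endpoints": ["sagemaker:ListEndpoints"],
    "Amazon SageMaker Endpoint Instances": ["sagemaker:ListEndpoints"],
    "Amazon SageMaker Training Jobs": [],
}


def build_policy(services):
    """Return a policy dict with the base actions plus the selected services' actions."""
    actions = list(BASE_ACTIONS)
    for name in services:
        for action in SERVICE_ACTIONS[name]:
            if action not in actions:  # avoid duplicates when services share an action
                actions.append(action)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": actions,
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    selected = ["Amazon SageMaker Endpoints", "Amazon SageMaker Endpoint Instances"]
    print(json.dumps(build_policy(selected), indent=2))
```

Both SageMaker endpoint services share "sagemaker:ListEndpoints", so the resulting policy lists that action only once.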
Endpoint | Service |
---|---|
autoscaling.<REGION>.amazonaws.com | Amazon EC2 Auto Scaling (built-in), Amazon EC2 Auto Scaling |
elasticbeanstalk.<REGION>.amazonaws.com | AWS Elastic Beanstalk (built-in), AWS Elastic Beanstalk |
lambda.<REGION>.amazonaws.com | AWS Lambda (built-in), Amazon Lambda |
elasticloadbalancing.<REGION>.amazonaws.com | AWS Application and Network Load Balancer (built-in), AWS Elastic Load Balancing (ELB) (built-in) |
dynamodb.<REGION>.amazonaws.com | Amazon DynamoDB (built-in), Amazon DynamoDB |
ec2.<REGION>.amazonaws.com | Amazon EBS (built-in), Amazon EC2 (built-in), Amazon EBS, Amazon EC2, Amazon EC2 Spot Fleet, Amazon VPC NAT Gateways, AWS Transit Gateway, AWS Site-to-Site VPN |
rds.<REGION>.amazonaws.com | Amazon RDS (built-in), Amazon Aurora, Amazon DocumentDB, Amazon Neptune, Amazon RDS |
s3.<REGION>.amazonaws.com | Amazon S3 (built-in) |
acm-pca.<REGION>.amazonaws.com | AWS Certificate Manager Private Certificate Authority |
apigateway.<REGION>.amazonaws.com | Amazon API Gateway |
apprunner.<REGION>.amazonaws.com | AWS App Runner |
appstream2.<REGION>.amazonaws.com | Amazon AppStream |
appsync.<REGION>.amazonaws.com | AWS AppSync |
athena.<REGION>.amazonaws.com | Amazon Athena |
cloudfront.amazonaws.com | Amazon CloudFront |
cloudhsmv2.<REGION>.amazonaws.com | AWS CloudHSM |
cloudsearch.<REGION>.amazonaws.com | Amazon CloudSearch |
codebuild.<REGION>.amazonaws.com | AWS CodeBuild |
datasync.<REGION>.amazonaws.com | AWS DataSync |
dax.<REGION>.amazonaws.com | Amazon DynamoDB Accelerator (DAX) |
dms.<REGION>.amazonaws.com | Amazon Database Migration Service |
directconnect.<REGION>.amazonaws.com | AWS Direct Connect |
ecs.<REGION>.amazonaws.com | Amazon Elastic Container Service (ECS), Amazon ECS ContainerInsights |
elasticfilesystem.<REGION>.amazonaws.com | Amazon Elastic File System (EFS) |
eks.<REGION>.amazonaws.com | Amazon Elastic Kubernetes Service (EKS) |
elasticache.<REGION>.amazonaws.com | Amazon ElastiCache (EC) |
elastictranscoder.<REGION>.amazonaws.com | Amazon Elastic Transcoder |
es.<REGION>.amazonaws.com | Amazon Elasticsearch Service (ES) |
events.<REGION>.amazonaws.com | Amazon EventBridge |
fsx.<REGION>.amazonaws.com | Amazon FSx |
gamelift.<REGION>.amazonaws.com | Amazon GameLift |
glue.<REGION>.amazonaws.com | AWS Glue |
inspector.<REGION>.amazonaws.com | Amazon Inspector |
kafka.<REGION>.amazonaws.com | Amazon Managed Streaming for Kafka |
models.lex.<REGION>.amazonaws.com | Amazon Lex |
logs.<REGION>.amazonaws.com | Amazon CloudWatch Logs |
api.mediatailor.<REGION>.amazonaws.com | Amazon MediaTailor |
mediaconnect.<REGION>.amazonaws.com | AWS Elemental MediaConnect |
mediapackage.<REGION>.amazonaws.com | Amazon MediaPackage Live |
mediapackage-vod.<REGION>.amazonaws.com | Amazon MediaPackage Video on Demand |
opsworks.<REGION>.amazonaws.com | AWS OpsWorks |
qldb.<REGION>.amazonaws.com | Amazon QLDB |
redshift.<REGION>.amazonaws.com | Amazon Redshift |
robomaker.<REGION>.amazonaws.com | AWS RoboMaker |
route53.amazonaws.com | Amazon Route 53 |
route53resolver.<REGION>.amazonaws.com | Amazon Route 53 Resolver |
api.sagemaker.<REGION>.amazonaws.com | Amazon SageMaker Endpoints, Amazon SageMaker Endpoint Instances |
sns.<REGION>.amazonaws.com | Amazon Simple Notification Service (SNS) |
sqs.<REGION>.amazonaws.com | Amazon Simple Queue Service (SQS) |
storagegateway.<REGION>.amazonaws.com | AWS Storage Gateway |
swf.<REGION>.amazonaws.com | Amazon SWF |
transfer.<REGION>.amazonaws.com | Amazon Transfer Family |
workmail.<REGION>.amazonaws.com | Amazon WorkMail |
workspaces.<REGION>.amazonaws.com | Amazon WorkSpaces |
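The `<REGION>` placeholder in the table above stands for an AWS region code. As a small sketch, the following Python snippet expands the placeholder for a given region, for example to prepare a list of hostnames for a firewall or proxy allowlist. The function name `resolve_endpoints` and the chosen templates are illustrative assumptions; copy the rows that match the services you monitor.

```python
# Hypothetical excerpt of the endpoint table; global endpoints have no placeholder.
ENDPOINT_TEMPLATES = [
    "api.sagemaker.<REGION>.amazonaws.com",
    "ec2.<REGION>.amazonaws.com",
    "cloudfront.amazonaws.com",
]


def resolve_endpoints(region, templates=ENDPOINT_TEMPLATES):
    """Replace the <REGION> placeholder in each template with the given region code."""
    return [template.replace("<REGION>", region) for template in templates]


if __name__ == "__main__":
    for host in resolve_endpoints("eu-west-1"):
        print(host)
```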
Enable monitoring
To learn how to enable service monitoring, see Enable service monitoring.
View service metrics
You can view the service metrics in your Dynatrace environment either on the custom device overview page or on your Dashboards page.
View metrics on the custom device overview page
To access the custom device overview page:
- In the Dynatrace menu, go to Technologies and processes.
- Filter by service name and select the relevant custom device group.
- Once you select the custom device group, you're on the custom device group overview page.
- The custom device group overview page lists all instances (custom devices) belonging to the group. Select an instance to view the custom device overview page.
View metrics on your dashboard
You can also view metrics in the Dynatrace web UI on dashboards. There is no preset dashboard available for this service, but you can create your own dashboard.
To check the availability of preset dashboards for each AWS service, see the list below.
AWS service | Preset dashboard |
---|---|
AWS Certificate Manager Private Certificate Authority | no |
Amazon MQ | yes |
Amazon API Gateway | no |
AWS App Runner | no |
Amazon AppStream | yes |
AWS AppSync | yes |
Amazon Athena | yes |
Amazon Aurora | no |
Amazon EC2 Auto Scaling | yes |
Amazon EC2 Auto Scaling (built-in) | no |
AWS Billing | yes |
Amazon Keyspaces | yes |
AWS Chatbot | yes |
Amazon CloudFront | no |
AWS CloudHSM | yes |
Amazon CloudSearch | yes |
AWS CodeBuild | yes |
Amazon Cognito | no |
Amazon Connect | yes |
Amazon Elastic Kubernetes Service (EKS) | yes |
AWS DataSync | yes |
Amazon DynamoDB Accelerator (DAX) | yes |
Amazon Database Migration Service | yes |
Amazon DocumentDB | yes |
AWS Direct Connect | yes |
Amazon DynamoDB | no |
Amazon DynamoDB (built-in) | no |
Amazon EBS | no |
Amazon EBS (built-in) | no |
Amazon EC2 API | yes |
Amazon EC2 (built-in) | no |
Amazon EC2 Spot Fleet | no |
Amazon Elastic Container Service (ECS) | no |
Amazon ECS ContainerInsights | yes |
Amazon ElastiCache (EC) | no |
AWS Elastic Beanstalk | yes |
Amazon Elastic File System (EFS) | no |
Amazon Elastic Inference | yes |
Amazon Elastic Map Reduce (EMR) | no |
Amazon Elasticsearch Service (ES) | no |
Amazon Elastic Transcoder | yes |
AWS Elastic Load Balancing (ELB) (built-in) | no |
Amazon EventBridge | yes |
Amazon FSx | yes |
Amazon GameLift | yes |
AWS Glue | no |
Amazon Inspector | yes |
AWS Internet of Things (IoT) | no |
AWS IoT Analytics | yes |
Amazon Managed Streaming for Kafka | yes |
Amazon Kinesis Data Analytics | no |
Amazon Kinesis Data Firehose | no |
Amazon Kinesis Data Streams | no |
Amazon Kinesis Video Streams | no |
Amazon Lambda | no |
AWS Lambda (built-in) | no |
Amazon Lex | yes |
AWS Application and Network Load Balancer (built-in) | no |
Amazon CloudWatch Logs | yes |
AWS Elemental MediaConnect | yes |
Amazon MediaConvert | yes |
Amazon MediaPackage Live | yes |
Amazon MediaPackage Video on Demand | yes |
Amazon MediaTailor | yes |
Amazon VPC NAT Gateways | no |
Amazon Neptune | yes |
AWS OpsWorks | yes |
Amazon Polly | yes |
Amazon QLDB | yes |
Amazon RDS | no |
Amazon RDS (built-in) | no |
Amazon Redshift | no |
Amazon Rekognition | yes |
AWS RoboMaker | yes |
Amazon Route 53 | yes |
Amazon Route 53 Resolver | yes |
Amazon S3 | no |
Amazon S3 (built-in) | no |
Amazon SageMaker Batch Transform Jobs | no |
Amazon SageMaker Endpoint Instances | no |
Amazon SageMaker Endpoints | no |
Amazon SageMaker Ground Truth | no |
Amazon SageMaker Processing Jobs | no |
Amazon SageMaker Training Jobs | no |
AWS Service Catalog | yes |
Amazon Simple Email Service (SES) | no |
Amazon Simple Notification Service (SNS) | no |
Amazon Simple Queue Service (SQS) | no |
AWS Systems Manager - Run Command | yes |
AWS Step Functions | yes |
AWS Storage Gateway | yes |
Amazon SWF | yes |
Amazon Textract | yes |
AWS IoT Things Graph | yes |
Amazon Transfer Family | yes |
AWS Transit Gateway | yes |
Amazon Translate | yes |
AWS Trusted Advisor | yes |
AWS API Usage | yes |
AWS Site-to-Site VPN | yes |
Amazon WAF Classic | yes |
Amazon WAF | yes |
Amazon WorkMail | yes |
Amazon WorkSpaces | yes |
Available metrics
Amazon SageMaker Batch Transform Jobs
Name | Description | Unit | Statistics | Dimensions | Recommended |
---|---|---|---|---|---|
CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | |
GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
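
Because the CPU and GPU metrics are multiplied by the number of CPUs or GPUs, a reported value can exceed 100%. The following Python sketch scales such a value back to a 0% to 100% range. The function name `per_unit_utilization` is hypothetical, and the number of CPUs or GPUs is assumed to be known from the instance type.

```python
def per_unit_utilization(reported_percent, unit_count):
    """Scale a utilization value in the range 0..100*unit_count back to 0..100.

    reported_percent: value as reported, e.g. CPUUtilization on a 4-CPU instance.
    unit_count: number of CPUs (or GPUs) on the instance; assumed known.
    """
    if unit_count <= 0:
        raise ValueError("unit_count must be positive")
    return reported_percent / unit_count


if __name__ == "__main__":
    # A reading of 300% on an instance with four CPUs.
    print(per_unit_utilization(300.0, 4))
```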
Amazon SageMaker Processing Jobs, Amazon SageMaker Training Jobs
Name | Description | Unit | Statistics | Dimensions | Recommended |
---|---|---|---|---|---|
CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
DiskUtilization | The percentage of disk space used by the containers on an instance. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | EndpointName, VariantName | |
GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | |
MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | |
Amazon SageMaker Endpoint Instances
Name | Description | Unit | Statistics | Dimensions | Recommended |
---|---|---|---|---|---|
CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | |
DiskUtilization | The percentage of disk space used by the containers on an instance. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | EndpointName, VariantName | |
GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | |
GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | |
LoadedModelCount | The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. | None | Average | EndpointName, VariantName | |
LoadedModelCount | | None | Sum | EndpointName, VariantName | |
MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | EndpointName, VariantName | |
Amazon SageMaker Endpoints
Name | Description | Unit | Statistics | Dimensions | Recommended |
---|---|---|---|---|---|
Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
Invocation4XXErrors | | None | Sum | EndpointName, VariantName | |
Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
Invocation5XXErrors | | None | Sum | EndpointName, VariantName | |
Invocations | The number of InvokeEndpoint requests sent to a model endpoint | None | Sum | EndpointName, VariantName | |
Invocations | | None | Count | EndpointName, VariantName | |
InvocationsPerInstance | The number of invocations sent to a model, normalized by InstanceCount in each ProductionVariant. 1/numberOfInstances is sent as the value on each request, where numberOfInstances is the number of active instances for the ProductionVariant behind the endpoint at the time of the request. | None | Sum | EndpointName, VariantName | |
ModelCacheHit | The number of InvokeEndpoint requests sent to the multi-model endpoint for which the model was already loaded | None | Sum | EndpointName, VariantName | |
ModelCacheHit | | None | Average | EndpointName, VariantName | |
ModelCacheHit | | None | Count | EndpointName, VariantName | |
ModelLatency | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. | Microseconds | Multi | EndpointName, VariantName | |
ModelLatency | | Microseconds | Sum | EndpointName, VariantName | |
ModelLatency | | Microseconds | Count | EndpointName, VariantName | |
ModelLoadingTime | The interval of time that it took to load the model through the container's LoadModel API call. | Microseconds | Multi | EndpointName, VariantName | |
ModelLoadingTime | | Microseconds | Sum | EndpointName, VariantName | |
ModelLoadingTime | | Microseconds | Count | EndpointName, VariantName | |
ModelLoadingWaitTime | The interval of time that an invocation request has waited for the target model to be downloaded, or loaded, or both in order to perform inference | Microseconds | Multi | EndpointName, VariantName | |
ModelLoadingWaitTime | | Microseconds | Sum | EndpointName, VariantName | |
ModelLoadingWaitTime | | Microseconds | Count | EndpointName, VariantName | |
ModelDownloadingTime | The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3) | Microseconds | Multi | EndpointName, VariantName | |
ModelDownloadingTime | | Microseconds | Sum | EndpointName, VariantName | |
ModelDownloadingTime | | Microseconds | Count | EndpointName, VariantName | |
ModelUnloadingTime | The interval of time that it took to unload the model through the container's UnloadModel API call | Microseconds | Multi | EndpointName, VariantName | |
ModelUnloadingTime | | Microseconds | Sum | EndpointName, VariantName | |
ModelUnloadingTime | | Microseconds | Count | EndpointName, VariantName | |
OverheadLatency | The interval of time added to the time taken to respond to a client request by SageMaker overheads. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. | Microseconds | Multi | EndpointName, VariantName | |
OverheadLatency | | Microseconds | Sum | EndpointName, VariantName | |
OverheadLatency | | Microseconds | Count | EndpointName, VariantName | |
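
Several of the endpoint metrics above are offered as separate Sum and Count statistics, which can be combined into derived values. The Python sketch below computes an average latency in milliseconds and an error rate. It assumes that Count is the number of samples behind Sum and that both values cover the same time window; the function names `average_latency_ms` and `error_rate` are hypothetical.

```python
def average_latency_ms(latency_sum_us, latency_count):
    """Mean latency in milliseconds from a Sum (microseconds) and a Count.

    Returns None when there were no samples in the window.
    """
    if latency_count == 0:
        return None
    return latency_sum_us / latency_count / 1000.0


def error_rate(errors_sum, invocations_sum):
    """Fraction of invocations that returned an error (0.0 when idle).

    errors_sum: Sum of Invocation4XXErrors or Invocation5XXErrors.
    invocations_sum: Sum of Invocations over the same window.
    """
    if invocations_sum == 0:
        return 0.0
    return errors_sum / invocations_sum


if __name__ == "__main__":
    # Example inputs only; real values come from your monitoring data.
    print(average_latency_ms(250_000, 50))
    print(error_rate(5, 200))
```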
Amazon SageMaker Ground Truth
Name | Description | Dimensions | Statistics | Unit | Recommended |
---|---|---|---|---|---|
ActiveWorkers | The number of workers on a private work team performing a labeling job | Region, LabelingJobName | Maximum | None | |
DatasetObjectsAutoAnnotated | The number of dataset objects auto-annotated in a labeling job. This metric is only emitted when automated labeling is enabled. | Region, LabelingJobName | Maximum | None | |
DatasetObjectsHumanAnnotated | The number of dataset objects annotated by a human in a labeling job | Region, LabelingJobName | Maximum | None | |
DatasetObjectsLabelingFailed | The number of dataset objects that failed labeling in a labeling job | Region, LabelingJobName | Maximum | None | |
JobsFailed | The number of labeling jobs that failed | Region | Count | None | |
JobsFailed | | Region | Sum | None | |
JobsStopped | The number of labeling jobs that were stopped | Region | Count | None | |
JobsStopped | | Region | Sum | None | |
JobsSucceeded | The number of labeling jobs that succeeded | Region | Count | None | |
JobsSucceeded | | Region | Sum | None | |
TasksSubmitted | The number of tasks submitted/completed by a private work team | Region, LabelingJobName | Maximum | None | |
TimeSpent | Time spent on a task completed by a private work team | Region, LabelingJobName | Maximum | Seconds | |
TotalDatasetObjectsLabeled | The number of dataset objects labeled successfully in a labeling job | Region, LabelingJobName | Maximum | None | |