This guide describes how to migrate Kubernetes metrics from Metrics Classic to Grail. Typically, a Grail metric is equivalent to a Metrics Classic metric; in some cases, however, there is no one-to-one relation.

The following Classic and Grail metrics have the same level of detail and the same dimensions available. The only difference is the metric key (see the example query after the table).

| Metric key (Grail) | Metric key (Classic) |
|---|---|
| dt.kubernetes.cluster.readyz | builtin:kubernetes.cluster.readyz |
| dt.kubernetes.container.oom_kills | builtin:kubernetes.container.oom_kills |
| dt.kubernetes.container.restarts | builtin:kubernetes.container.restarts |
| dt.kubernetes.node.conditions | builtin:kubernetes.node.conditions |
| dt.kubernetes.node.cpu_allocatable | builtin:kubernetes.node.cpu_allocatable |
| dt.kubernetes.node.memory_allocatable | builtin:kubernetes.node.memory_allocatable |
| dt.kubernetes.node.pods_allocatable | builtin:kubernetes.node.pods_allocatable |
| dt.kubernetes.nodes | builtin:kubernetes.nodes |
| dt.kubernetes.persistentvolumeclaim.available | builtin:kubernetes.persistentvolumeclaim.available |
| dt.kubernetes.persistentvolumeclaim.capacity | builtin:kubernetes.persistentvolumeclaim.capacity |
| dt.kubernetes.persistentvolumeclaim.used | builtin:kubernetes.persistentvolumeclaim.used |
| dt.kubernetes.resourcequota.limits_cpu | builtin:kubernetes.resourcequota.limits_cpu |
| dt.kubernetes.resourcequota.limits_cpu_used | builtin:kubernetes.resourcequota.limits_cpu_used |
| dt.kubernetes.resourcequota.limits_memory | builtin:kubernetes.resourcequota.limits_memory |
| dt.kubernetes.resourcequota.limits_memory_used | builtin:kubernetes.resourcequota.limits_memory_used |
| dt.kubernetes.resourcequota.pods | builtin:kubernetes.resourcequota.pods |
| dt.kubernetes.resourcequota.pods_used | builtin:kubernetes.resourcequota.pods_used |
| dt.kubernetes.resourcequota.requests_cpu | builtin:kubernetes.resourcequota.requests_cpu |
| dt.kubernetes.resourcequota.requests_cpu_used | builtin:kubernetes.resourcequota.requests_cpu_used |
| dt.kubernetes.resourcequota.requests_memory | builtin:kubernetes.resourcequota.requests_memory |
| dt.kubernetes.resourcequota.requests_memory_used | builtin:kubernetes.resourcequota.requests_memory_used |
| dt.kubernetes.workload.conditions | builtin:kubernetes.workload.conditions |
| dt.kubernetes.workload.pods_desired | builtin:kubernetes.workload.pods_desired |
| dt.kubernetes.workloads | builtin:kubernetes.workloads |
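For the metrics in the table above, migrating a query is just a matter of exchanging the metric key. As a minimal sketch, the following DQL query reads the node count via its Grail key; splitting by k8s.cluster.name is an assumption for illustration and presumes that dimension is available on the metric.

```
// Node count per cluster, read via the Grail metric key
// (the Classic counterpart is builtin:kubernetes.nodes);
// the k8s.cluster.name dimension is an assumption for illustration
timeseries nodes = sum(dt.kubernetes.nodes),
  by: { k8s.cluster.name }
```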
The following metrics have been consolidated. The Grail metrics that supersede the Classic metrics offer an increased level of detail. To get back to the coarser Classic values, the Grail metrics are first aggregated to the granularity of the Classic metric; the same set of filters can then be applied, and the output of the Classic and Grail metrics is identical (see the example query after the table).

The following table contains the pod and container count metrics and the Kubernetes event count metric, which were available at a lower level of detail as Classic metrics.
Kubernetes events and container/pod count metrics

| Metric key (Grail) | Metric key (Classic) |
|---|---|
| dt.kubernetes.containers | builtin:kubernetes.containers |
| dt.kubernetes.pod.containers_desired | builtin:kubernetes.workload.containers_desired |
| dt.kubernetes.events | builtin:kubernetes.events |
| dt.kubernetes.pods | builtin:kubernetes.node.pods<br>builtin:kubernetes.pods |
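As a sketch of the aggregation described above, the following query rolls the more detailed dt.kubernetes.pods Grail metric up to a coarser granularity comparable to the Classic pod count; the k8s.cluster.name and k8s.namespace.name dimensions are assumptions for illustration.

```
// Roll the detailed Grail pod count up to the namespace level;
// the splitting dimensions are assumptions for illustration
timeseries pods = sum(dt.kubernetes.pods),
  by: { k8s.cluster.name, k8s.namespace.name }
```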
The following table contains the workload and node resource metrics that were available as separate workload-level and node-level Classic metrics. With Grail, there is a single metric at the container level.
Example: The following DQL query returns the amount of memory consumed on the workload level based on aggregated container-level data.

```
timeseries memory_working_set = sum(dt.kubernetes.container.memory_working_set),
  by: { k8s.cluster.name, k8s.namespace.name, k8s.workload.name }
```
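The same container-level metric can also be rolled up to the node level. A minimal sketch, assuming k8s.node.name is available as a dimension of the container metric:

```
// Node-level memory working set based on aggregated container-level data;
// k8s.node.name as a dimension is an assumption for illustration
timeseries memory_working_set = sum(dt.kubernetes.container.memory_working_set),
  by: { k8s.cluster.name, k8s.node.name }
```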
Workload- and node-level resource consumption metrics

| Metric key (Grail) | Metric key (Classic) |
|---|---|
| dt.kubernetes.container.cpu_usage | builtin:kubernetes.node.cpu_usage<br>builtin:kubernetes.workload.cpu_usage |
| dt.kubernetes.container.cpu_throttled | builtin:kubernetes.node.cpu_throttled<br>builtin:kubernetes.workload.cpu_throttled |
| dt.kubernetes.container.requests_cpu | builtin:kubernetes.node.requests_cpu<br>builtin:kubernetes.workload.requests_cpu |
| dt.kubernetes.container.limits_cpu | builtin:kubernetes.node.limits_cpu<br>builtin:kubernetes.workload.limits_cpu |
| dt.kubernetes.container.memory_working_set | builtin:kubernetes.node.memory_working_set<br>builtin:kubernetes.workload.memory_working_set |
| dt.kubernetes.container.requests_memory | builtin:kubernetes.node.requests_memory<br>builtin:kubernetes.workload.requests_memory |
| dt.kubernetes.container.limits_memory | builtin:kubernetes.node.limits_memory<br>builtin:kubernetes.workload.limits_memory |
This group consists of Classic metric keys that were never made available as Grail metrics; they were deprecated as part of a cleanup of duplicate metric keys. For each deprecated key, the most similar Classic metric is used to determine the Grail replacement. For the following metrics, complete identity of values between the Classic and the Grail metric is not feasible, but the values are closely related and deviate only slightly (see the example query after the table).

| Metric key (Grail) | Metric key (Classic) | Superseding Classic Metric |
|---|---|---|
| dt.kubernetes.container.limits_cpu | builtin:containers.cpu.limit | n.a. |
| dt.kubernetes.container.oom_kills | builtin:kubernetes.container.outOfMemoryKills | builtin:kubernetes.container.oom_kills |
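For example, queries that used the deprecated builtin:kubernetes.container.outOfMemoryKills metric can be moved to the Grail metric dt.kubernetes.container.oom_kills. A minimal sketch, assuming k8s.cluster.name, k8s.namespace.name, and k8s.workload.name are available as dimensions:

```
// Out-of-memory kills per workload, using the Grail replacement for the
// deprecated builtin:kubernetes.container.outOfMemoryKills metric;
// the splitting dimensions are assumptions for illustration
timeseries oom_kills = sum(dt.kubernetes.container.oom_kills),
  by: { k8s.cluster.name, k8s.namespace.name, k8s.workload.name }
```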
The following set of Classic container metrics is superseded by Grail container metrics. For most of the CPU metrics in this section, the Classic metrics use the unit millicores, while the Grail metrics use the unit nanoseconds per minute. To arrive at the same values, divide the Grail metric by the number of nanoseconds in a minute (60 * 1000 * 1000 * 1000 = 60,000,000,000) to convert to cores, and multiply by 1,000 to convert cores to millicores. For example, a container consuming 30,000,000,000 ns of CPU time per minute corresponds to 0.5 cores, or 500 millicores.

Each of the following Classic metrics is listed with an equivalent DQL query based on the Grail metrics.
builtin:containers.cpu.throttledMilliCores

```
timeseries { throttled_time = avg(dt.containers.cpu.throttled_time, rollup: sum, rate: 1m) }
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd throttled_milli_cores = throttled_time[] * milli_core_per_core / ns_per_min
| summarize { throttled_milli_cores = sum(throttled_milli_cores[]) }, by: { timeframe, interval }
```
builtin:containers.cpu.usageUserMilliCores

```
timeseries { usage_user_time = avg(dt.containers.cpu.usage_user_time) }
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd usage_user_milli_cores = usage_user_time[] * milli_core_per_core / ns_per_min
| summarize { usage_user_milli_cores = sum(usage_user_milli_cores[]) }, by: { timeframe, interval }
```
builtin:containers.cpu.usageSystemMilliCores

```
timeseries { usage_system_time = avg(dt.containers.cpu.usage_system_time) }
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd usage_system_milli_cores = usage_system_time[] * milli_core_per_core / ns_per_min
| summarize { usage_system_milli_cores = sum(usage_system_milli_cores[]) }, by: { timeframe, interval }
```
builtin:containers.cpu.usageMilliCores

```
timeseries {
    usage_user_time = avg(dt.containers.cpu.usage_user_time),
    usage_system_time = avg(dt.containers.cpu.usage_system_time)
  }
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd usage_milli_cores = (usage_user_time[] + usage_system_time[]) * milli_core_per_core / ns_per_min
| summarize { usage_milli_cores = sum(usage_milli_cores[]) }, by: { timeframe, interval }
```
builtin:containers.cpu.usagePercent

```
timeseries {
    // for total usage, user and system cpu usage are added
    userCpuUsage = avg(dt.containers.cpu.usage_user_time),
    systemCpuUsage = avg(dt.containers.cpu.usage_system_time),
    // cpu logical counts are the fallback, if the throttling ratio doesn't exist
    cpuLogicalCount = avg(dt.containers.cpu.logical_cores)
  }
// filter statement ...
// leftOuter join allows the throttling ratio to be null
| join [
    timeseries {
      throttlingRatio = avg(dt.containers.cpu.throttling_ratio)
      // same filter statement as above ...
    }
  ], on: { interval, timeframe }, fields: { throttlingRatio }, kind: leftOuter
| fieldsAdd
    // sum of system and user cpu usage
    numerator = userCpuUsage[] + systemCpuUsage[],
    // throttling ratio, or as a fallback cpu logical count
    denominator = coalesce(throttlingRatio, cpuLogicalCount),
    nanoseconds_per_minute = 60 * 1000 * 1000 * 1000
| fields interval, timeframe, cpuUsagePercent = 100.0 * numerator[] / (denominator[] * nanoseconds_per_minute)
```
builtin:containers.cpu.usageTime

```
timeseries {
    usageUserTime = avg(dt.containers.cpu.usage_user_time),
    usageSystemTime = avg(dt.containers.cpu.usage_system_time)
  }, by: { dt.entity.container_group_instance }
| fields interval, timeframe, usageTime = usageSystemTime[] + usageUserTime[]
```
builtin:containers.memory.limitPercent

```
timeseries {
    limit_bytes = avg(dt.containers.memory.limit_bytes),
    physical_total_bytes = avg(dt.containers.memory.physical_total_bytes)
  }
| fieldsAdd limit_percent = (limit_bytes[] / physical_total_bytes[]) * 100
| summarize { limit_percent = sum(limit_percent[]) }, by: { timeframe, interval }
```
builtin:containers.memory.usagePercent

```
timeseries {
    memoryLimits = avg(dt.containers.memory.limit_bytes),
    totalPhysicalMemory = avg(dt.containers.memory.physical_total_bytes),
    residentSetBytes = avg(dt.containers.memory.resident_set_bytes)
  }, by: { dt.entity.container_group_instance }
| fieldsAdd denominator = if(arrayFirst(memoryLimits) > 0, then: memoryLimits, else: totalPhysicalMemory)
| fields dt.entity.container_group_instance, interval, timeframe,
    memoryUsagePercent = 100 * residentSetBytes[] / denominator[]
```
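To compare a converted value with the Classic metric for a single container, the queries above can be narrowed down with a filter. The following sketch scopes the CPU usage conversion to one container group instance; the entity ID is a placeholder, and filtering on dt.entity.container_group_instance is an assumption based on the dimension already used in the queries above.

```
timeseries {
    usage_user_time = avg(dt.containers.cpu.usage_user_time),
    usage_system_time = avg(dt.containers.cpu.usage_system_time)
  },
  by: { dt.entity.container_group_instance },
  // placeholder entity ID; replace with a real container group instance
  filter: dt.entity.container_group_instance == "CONTAINER_GROUP_INSTANCE-0000000000000000"
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd usage_milli_cores = (usage_user_time[] + usage_system_time[]) * milli_core_per_core / ns_per_min
```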