This guide describes how to migrate Kubernetes metrics from Metrics Classic to Grail. Typically, a Grail metric is equivalent to a Metrics Classic metric; in some cases, however, there is no one-to-one relation:
The following Classic metrics and Grail metrics offer the same level of detail and the same dimensions; the only difference is the metric key.
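Example: To migrate a query for one of these metrics, replace the Classic metric key with the Grail metric key. The following DQL query is a minimal sketch that reads the node count from the Grail metric dt.kubernetes.nodes instead of builtin:kubernetes.nodes; the k8s.cluster.name dimension is assumed to be available on the Grail metric.
// dimension name is an assumption; adjust to the dimensions available in your environment
timeseries node_count = sum(dt.kubernetes.nodes), by: { k8s.cluster.name }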
Metric key (Grail) | Metric key (Classic) |
---|---|
dt.kubernetes.cluster.readyz | builtin:kubernetes.cluster.readyz |
dt.kubernetes.container.oom_kills | builtin:kubernetes.container.oom_kills |
dt.kubernetes.container.restarts | builtin:kubernetes.container.restarts |
dt.kubernetes.node.conditions | builtin:kubernetes.node.conditions |
dt.kubernetes.node.cpu_allocatable | builtin:kubernetes.node.cpu_allocatable |
dt.kubernetes.node.memory_allocatable | builtin:kubernetes.node.memory_allocatable |
dt.kubernetes.node.pods_allocatable | builtin:kubernetes.node.pods_allocatable |
dt.kubernetes.nodes | builtin:kubernetes.nodes |
dt.kubernetes.persistentvolumeclaim.available | builtin:kubernetes.persistentvolumeclaim.available |
dt.kubernetes.persistentvolumeclaim.capacity | builtin:kubernetes.persistentvolumeclaim.capacity |
dt.kubernetes.persistentvolumeclaim.used | builtin:kubernetes.persistentvolumeclaim.used |
dt.kubernetes.resourcequota.limits_cpu | builtin:kubernetes.resourcequota.limits_cpu |
dt.kubernetes.resourcequota.limits_cpu_used | builtin:kubernetes.resourcequota.limits_cpu_used |
dt.kubernetes.resourcequota.limits_memory | builtin:kubernetes.resourcequota.limits_memory |
dt.kubernetes.resourcequota.limits_memory_used | builtin:kubernetes.resourcequota.limits_memory_used |
dt.kubernetes.resourcequota.pods | builtin:kubernetes.resourcequota.pods |
dt.kubernetes.resourcequota.pods_used | builtin:kubernetes.resourcequota.pods_used |
dt.kubernetes.resourcequota.requests_cpu | builtin:kubernetes.resourcequota.requests_cpu |
dt.kubernetes.resourcequota.requests_cpu_used | builtin:kubernetes.resourcequota.requests_cpu_used |
dt.kubernetes.resourcequota.requests_memory | builtin:kubernetes.resourcequota.requests_memory |
dt.kubernetes.resourcequota.requests_memory_used | builtin:kubernetes.resourcequota.requests_memory_used |
dt.kubernetes.workload.conditions | builtin:kubernetes.workload.conditions |
dt.kubernetes.workload.pods_desired | builtin:kubernetes.workload.pods_desired |
dt.kubernetes.workloads | builtin:kubernetes.workloads |
The following metrics have been consolidated. The Grail metrics that supersede them offer a higher level of detail than their Classic counterparts.
To reproduce the lower level of detail of a Classic metric, the Grail metric is first aggregated to the granularity of the Classic metric. From there, the same set of filters can be applied, and the output of the Classic and Grail metrics is identical.
The following list contains the pod and container count metrics and the Kubernetes event count metric, all of which were available at a lower level of detail as Classic metrics.
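Example: The following DQL query is a minimal sketch of this approach for the pod count. It aggregates dt.kubernetes.pods to the node level, which matches the granularity of the Classic builtin:kubernetes.node.pods metric; the k8s.cluster.name and k8s.node.name dimensions are assumed to be available on the Grail metric.
// dimension names are assumptions; adjust to the dimensions available in your environment
timeseries pod_count = sum(dt.kubernetes.pods), by: { k8s.cluster.name, k8s.node.name }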
Kubernetes events and container/pod count metrics
Metric key (Grail) | Metric key (Classic) |
---|---|
dt.kubernetes.containers | builtin:kubernetes.containers |
dt.kubernetes.pod.containers_desired | builtin:kubernetes.workload.containers_desired |
dt.kubernetes.events | builtin:kubernetes.events |
dt.kubernetes.pods | builtin:kubernetes.node.pods |
The following table contains the workload and node resource metrics that were available as separate workload-level and node-level Classic metrics. With Grail, there is a single metric at the container level.
Example: The following DQL query returns the amount of memory consumed on the workload level based on aggregated container-level data.
timeseries memory_working_set = sum(dt.kubernetes.container.memory_working_set), by: { k8s.cluster.name, k8s.namespace.name, k8s.workload.name }
Workload- and node-level resource consumption metrics
Metric key (Grail) | Metric key (Classic) |
---|---|
dt.kubernetes.container.cpu_usage | builtin:kubernetes.node.cpu_usage |
dt.kubernetes.container.cpu_throttled | builtin:kubernetes.node.cpu_throttled |
dt.kubernetes.container.requests_cpu | builtin:kubernetes.node.requests_cpu |
dt.kubernetes.container.limits_cpu | builtin:kubernetes.node.limits_cpu |
dt.kubernetes.container.memory_working_set | builtin:kubernetes.node.memory_working_set |
dt.kubernetes.container.requests_memory | builtin:kubernetes.node.requests_memory |
dt.kubernetes.container.limits_memory | builtin:kubernetes.node.limits_memory |
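The node-level Classic metrics in the table above can be reproduced in the same way by aggregating the container-level Grail metric to the node level. The following DQL query is a minimal sketch for CPU usage; the k8s.cluster.name and k8s.node.name dimensions are assumed to be available on dt.kubernetes.container.cpu_usage.
// dimension names are assumptions; adjust to the dimensions available in your environment
timeseries cpu_usage = sum(dt.kubernetes.container.cpu_usage), by: { k8s.cluster.name, k8s.node.name }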
This group consists of Classic metric keys that were never made available as Grail metrics. Instead, the most similar Classic metric is used to determine the Grail replacement for these deprecated metrics. The metrics were deprecated to clean up duplicate metric keys. For the following metrics, the values of the Classic metric and the Grail metric cannot be fully identical, but they are closely related and deviate only slightly.
Metric key (Grail) | Metric key (Classic) | Superseding Classic Metric |
---|---|---|
dt.kubernetes.container.limits_cpu | builtin:containers.cpu.limit | n.a. |
dt.kubernetes.container.oom_kills | builtin:kubernetes.container.outOfMemoryKills | builtin:kubernetes.container.oom_kills |
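Example: The following DQL query is a minimal sketch of how the deprecated builtin:kubernetes.container.outOfMemoryKills metric can be replaced with the Grail metric listed above; the k8s.cluster.name, k8s.namespace.name, and k8s.workload.name dimensions are assumed to be available on dt.kubernetes.container.oom_kills.
// dimension names are assumptions; adjust to the dimensions available in your environment
timeseries oom_kills = sum(dt.kubernetes.container.oom_kills), by: { k8s.cluster.name, k8s.namespace.name, k8s.workload.name }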
The following set of Classic container metrics is superseded by Grail container metrics. For most of the CPU metrics in this section, the Classic metrics use the unit millicores, while the Grail metrics use the unit nanoseconds per minute. To get the same values, divide the Grail metric by the number of nanoseconds in a minute (60 * 1000 * 1000 * 1000) and multiply by 1,000 millicores per core. For example, a Grail value of 30,000,000,000 ns/min corresponds to 0.5 cores, that is, 500 millicores.
The queries below show how to reproduce each of these Classic metrics from the corresponding Grail metrics.
builtin:containers.cpu.throttledMilliCores
timeseries { throttled_time = avg(dt.containers.cpu.throttled_time, rollup: sum, rate: 1m) }
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd throttled_milli_cores = throttled_time[] * milli_core_per_core / ns_per_min
| summarize { throttled_milli_cores = sum(throttled_milli_cores[]) }, by: { timeframe, interval }
builtin:containers.cpu.usageUserMilliCores
timeseries { usage_user_time = avg(dt.containers.cpu.usage_user_time) }
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd usage_user_milli_cores = usage_user_time[] * milli_core_per_core / ns_per_min
| summarize { usage_user_milli_cores = sum(usage_user_milli_cores[]) }, by: { timeframe, interval }
builtin:containers.cpu.usageSystemMilliCores
timeseries { usage_system_time = avg(dt.containers.cpu.usage_system_time) }
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd usage_system_milli_cores = usage_system_time[] * milli_core_per_core / ns_per_min
| summarize { usage_system_milli_cores = sum(usage_system_milli_cores[]) }, by: { timeframe, interval }
builtin:containers.cpu.usageMilliCores
timeseries { usage_user_time = avg(dt.containers.cpu.usage_user_time), usage_system_time = avg(dt.containers.cpu.usage_system_time) }
| fieldsAdd ns_per_min = 60 * 1000 * 1000 * 1000, milli_core_per_core = 1000
| fieldsAdd usage_milli_cores = (usage_user_time[] + usage_system_time[]) * milli_core_per_core / ns_per_min
| summarize { usage_milli_cores = sum(usage_milli_cores[]) }, by: { timeframe, interval }
builtin:containers.cpu.usagePercent
timeseries {
  // for total usage, user and system cpu usage are added
  userCpuUsage = avg(dt.containers.cpu.usage_user_time),
  systemCpuUsage = avg(dt.containers.cpu.usage_system_time),
  // cpu logical counts are the fallback, if the throttling ratio doesn't exist
  cpuLogicalCount = avg(dt.containers.cpu.logical_cores)
}
// filter statement ...
// leftOuter join allows the throttling ratio to be null
| join [
    timeseries { throttlingRatio = avg(dt.containers.cpu.throttling_ratio) }
    // same filter statement as above ...
  ], on: { interval, timeframe }, fields: { throttlingRatio }, kind: leftOuter
| fieldsAdd
  // sum of system and user cpu usage
  numerator = userCpuUsage[] + systemCpuUsage[],
  // throttling ratio, or as a fallback cpu logical count
  denominator = coalesce(throttlingRatio, cpuLogicalCount),
  nanoseconds_per_minute = 60 * 1000 * 1000 * 1000
| fields interval, timeframe, cpuUsagePercent = 100.0 * numerator[] / (denominator[] * nanoseconds_per_minute)
builtin:containers.cpu.usageTime
timeseries { usageUserTime = avg(dt.containers.cpu.usage_user_time), usageSystemTime = avg(dt.containers.cpu.usage_system_time) }, by: { dt.entity.container_group_instance }
| fields interval, timeframe, usageTime = usageSystemTime[] + usageUserTime[]
builtin:containers.memory.limitPercent
timeseries { limit_bytes = avg(dt.containers.memory.limit_bytes), physical_total_bytes = avg(dt.containers.memory.physical_total_bytes) }
| fieldsAdd limit_percent = (limit_bytes[] / physical_total_bytes[]) * 100
| summarize { limit_percent = sum(limit_percent[]) }, by: { timeframe, interval }
builtin:containers.memory.usagePercent
timeseries { memoryLimits = avg(dt.containers.memory.limit_bytes), totalPhysicalMemory = avg(dt.containers.memory.physical_total_bytes), residentSetBytes = avg(dt.containers.memory.resident_set_bytes) }, by: { dt.entity.container_group_instance }
| fieldsAdd denominator = if(arrayFirst(memoryLimits) > 0, then: memoryLimits, else: totalPhysicalMemory)
| fields dt.entity.container_group_instance, interval, timeframe, memoryUsagePercent = 100 * residentSetBytes[] / denominator[]