Trace a cost spike to its root cause

Latest Dynatrace
Tutorial
8-min read
Published May 26, 2026

A cost alert fired, or your FinOps team has escalated. You know something is up, but not what, where, or who owns it.

This tutorial walks you through a repeatable investigation: from the account-level overview in Account Management down to the specific entity, dashboard, workflow, or detector responsible.

Cost estimates use public list prices as of March 2026. Actual costs depend on your negotiated DPS contract. Use them for relative comparisons and directional analysis only, not invoice reconciliation.

Who is this for?

This tutorial is for FinOps practitioners, engineers, platform team leads, and anyone who receives a DPS cost alert and needs to pinpoint the source before it becomes a budget problem.

You'll likely be here for one of two reasons:

You received a DPS threshold email. Costs are up, but you don't yet know which capability is responsible. Start at Step 1, Find the origin of the cost spike.

Someone on your team escalated the cost alert. You already know the capability (for example, "Log Ingest is 40% over budget" points to Log Management & Analytics - Ingest & Process). Skip to Step 3, Pin the start time, and filter by that capability.

What will you learn?

In this tutorial, you'll learn how to:

Identify which DPS capability is driving a cost spike, using either the ready-made dashboards or Account Management.
Confirm and quantify the spike with DQL billing event queries.
Pin the exact start time of a cost increase to a specific hour.
Attribute the spike to the responsible entity, dashboard, workflow, or detector.
Assemble an escalation package with evidence queries and a root cause summary.

Before you begin

Prerequisites

Access to Account Management with Subscription viewer permissions.
Permission to run DQL queries on the dt.system.events table in at least one environment.
Access to the ready-made Usage - Overview dashboard, and telemetry-specific drill-down dashboards. For more information, see View DPS consumption with ready-made usage dashboards.

Prior knowledge

Basic familiarity with DQL. If you're new to billing events, see Where to view your costs first.

Use DQL to trace a DPS cost spike to its root cause

1. Find the origin of the cost spike

To start investigating a cost spike, figure out the DPS capability that's driving the consumption.

Use ready-made dashboards

Use the ready-made usage dashboards to view consumption by capability. These dashboards let you get a quick understanding of capabilities driving usage, and are especially useful for telemetry-related capabilities such as traces, logs, and Full-Stack Monitoring.

The following dashboards are available:

The Usage - Overview dashboard gives an overview of consumption related to all DPS capabilities, so you can spot trends and identify high-volume capabilities for closer analysis.
The Usage - Traces, Usage - Logs, and Usage - Full-Stack dashboards provide drill-downs into the specific category, so you can differentiate between usage of Ingest & Process, Query, and Retain.

Use Account Management

Use Account Management to investigate. Go to Account Management > Subscription > Overview > Cost and usage details, where the chart shows aggregate DPS consumption broken down by capability with a trend line over time.

Account Management can tell you:

Which broad capability is driving the spike.
Roughly when the increase started.

Account Management cannot tell you:

Which specific entity (host, cluster, or application) is responsible.
Which user, dashboard, workflow, or detector is running the queries.
The actual root cause.

Once you've identified which capability is responsible, use DQL to trace the spike to its root cause. This is also useful if you've used the Usage - Overview dashboard to identify the responsible capability, and then need to dig deeper into what's actually driving the usage.

2. Confirm the spike with DQL

Run this week-over-week comparison to confirm which capability has grown the most:

fetch dt.system.events, from: -14d
| filter event.kind == "BILLING_USAGE_EVENT"
| dedup event.id
| fieldsAdd week = if(timestamp >= now() - 7d, "current", else: "previous")
| summarize event_count = count(), by: {event.type, week}
| sort event.type asc

The capability with the largest gap between current and previous is your target. Note it down. You'll use it to filter in the next steps.
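
As a sketch, the same comparison can be pivoted so that each capability appears on one row with its own delta, which can make the largest gap easier to spot than scanning paired rows. The column names current, previous, and delta are illustrative, and the query assumes the countIf aggregation is available in your environment.

```dql
fetch dt.system.events, from: -14d
| filter event.kind == "BILLING_USAGE_EVENT"
| dedup event.id
// One row per capability: events in the last 7 days vs. the 7 days before
| summarize current = countIf(timestamp >= now() - 7d),
    previous = countIf(timestamp < now() - 7d),
    by: {event.type}
// A positive delta means the capability grew week over week
| fieldsAdd delta = current - previous
| sort delta desc
```

The capability at the top of the result is the one to carry into the next steps.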

The dedup event.id step is mandatory in every billing query. Dynatrace refreshes metering records when it corrects measurements. Without deduplication, the same consumption period is counted multiple times and costs appear 10–30% higher than they are.
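
To see how many refreshed records exist in your own environment, one option is to compare raw record counts with distinct event IDs. This is only a sketch: raw_records, unique_events, and duplicate_records are illustrative names, and it assumes the countDistinct aggregation is available.

```dql
fetch dt.system.events, from: -7d
| filter event.kind == "BILLING_USAGE_EVENT"
// Deliberately no dedup here: the goal is to measure the duplicates
| summarize raw_records = count(),
    unique_events = countDistinct(event.id),
    by: {event.type}
// Records that share an event.id with another record
| fieldsAdd duplicate_records = raw_records - unique_events
| sort duplicate_records desc
```

A non-zero duplicate_records value for a capability indicates how much an undeduplicated query would overcount it.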

3. Pin the start time

Narrow the spike to a specific hour using the capability from Step 2:

fetch dt.system.events, from: -7d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "<capability from Step 2>"
| dedup event.id
| fieldsAdd hour = bin(timestamp, 1h)
| summarize event_count = count(), by: {hour}
| sort hour asc

Look for the first hour where event_count jumps sharply. That timestamp is your spike anchor: it goes into the post-mortem and narrows the attribution search in the next step.
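
Event counts are a proxy for consumption. For capabilities billed by volume, a variant that also sums billed_bytes per hour may make the jump clearer. The sketch below uses log ingest as an example and assumes billed_bytes is populated for that capability, as in the Step 4 queries; total_gib is an illustrative column name.

```dql
fetch dt.system.events, from: -7d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Log Management & Analytics - Ingest & Process"
| dedup event.id
| fieldsAdd hour = bin(timestamp, 1h)
// Count and billed volume side by side, one row per hour
| summarize event_count = count(),
    total_gib = sum(toDouble(billed_bytes) / 1073741824),
    by: {hour}
| sort hour asc
```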

If the increase is gradual rather than a sharp jump, run a 30-day day-level view to find which day it tipped over:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "<capability from Step 2>"
| dedup event.id
| fieldsAdd day = bin(timestamp, 1d)
| summarize event_count = count(), by: {day}
| sort day asc

4. Find the attribution source

Match the capability you identified to one of the spike types below, then run the queries in that subsection.

Ingest spike: Logs, Traces, Events, Metrics

Ingest billing events carry entity reference fields that identify the source sending data. Query for the top sources by GiB:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter in(event.type,
    "Log Management & Analytics - Ingest & Process",
    "Events - Ingest & Process",
    "Traces - Ingest & Process")
| dedup event.id
| summarize total_gib = sum(toDouble(billed_bytes) / 1073741824),
    by: {event.type, dt.entity.host, dt.entity.application,
         dt.entity.kubernetes_cluster}
| sort total_gib desc
| limit 20

Look for a single entity or cluster with a disproportionate share. Compare against your spike timestamp from Step 3. Did this entity appear for the first time, or did its volume increase?
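
One way to tell a newly appearing source from a growing one is to record when each source first shows up in the billing events. A minimal sketch for log ingest, assuming the same entity fields as the query above; first_seen is an illustrative name, and a source can only look "new" relative to the 30-day window that is queried.

```dql
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Log Management & Analytics - Ingest & Process"
| dedup event.id
// Earliest billing event and total volume per source
| summarize first_seen = min(timestamp),
    total_gib = sum(toDouble(billed_bytes) / 1073741824),
    by: {dt.entity.host, dt.entity.kubernetes_cluster}
// Most recently appearing sources first
| sort first_seen desc
| limit 20
```

A first_seen value close to the spike anchor from Step 3 suggests a newly onboarded source. A much older first_seen with a high total suggests an existing source whose volume increased.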

Typical causes of ingest spikes include a new Kubernetes cluster onboarded without log filtering configured, debug-level logging left on in production, and a new data pipeline sending raw events to Dynatrace.

Query cost spike: Dashboards, detectors, and workflows

For query billing events, client.source identifies who is scanning data. Query for the top sources:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter in(event.type,
    "Log Management & Analytics - Query",
    "Events - Query",
    "Traces - Query",
    "Files - Query")
| dedup event.id
| filter isNotNull(client.source)
| summarize total_billed_gib = sum(toDouble(billed_bytes) / 1073741824),
    by: {client.source, event.type}
| sort total_billed_gib desc
| limit 20

Decode client.source:

Pattern	Source type	Next step
`https://.../document/...`	Dashboard URL	Open the dashboard and inspect tile queries
UUID without `dt.` prefix	Anomaly detector `objectId`	Cross-reference with the ALERTING pool (see below)
`dynatrace.automations:...`	Automation workflow	See the Automation spike section
`dt.*` service name	Internal platform service	Review with your platform team
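
The pattern table can be approximated directly in the query, so billed volume is grouped by source type before you drill into individual sources. This is a rough sketch: the prefix checks are simplified assumptions based on the patterns above, the source_type labels are illustrative, and anything that matches no prefix (including detector UUIDs) falls into a catch-all bucket.

```dql
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter in(event.type,
    "Log Management & Analytics - Query",
    "Events - Query",
    "Traces - Query",
    "Files - Query")
| dedup event.id
| filter isNotNull(client.source)
// Simplified prefix matching; checked in order, first match wins
| fieldsAdd source_type = if(startsWith(client.source, "https://"), "dashboard",
    else: if(startsWith(client.source, "dynatrace.automations"), "workflow",
    else: if(startsWith(client.source, "dt."), "platform service",
    else: "detector or other")))
| summarize total_billed_gib = sum(toDouble(billed_bytes) / 1073741824),
    by: {source_type}
| sort total_billed_gib desc
```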

If client.source is a UUID, resolve the detector name via the Query Execution Event pool:

fetch dt.system.events, from: -7d
| filter event.kind == "QUERY_EXECUTION_EVENT"
| filter query_pool == "ALERTING"
| parse client.client_context, "JSON:ctx"
| summarize queries = count(),
    total_scanned_gib = sum(scanned_bytes) / 1073741824,
    by: {task_id = ctx[`dt.task.id`], task_group = ctx[`dt.task.group`]}
| sort total_scanned_gib desc
| limit 20

Then look up by task_id:

fetch dt.system.events, from: -7d
| filter event.kind == "ANALYZER_EXECUTION_EVENT"
| filter dt.task.id == "<task_id from above>"
| summarize count(), by: {dt.task.name, dt.task.id, dt.task.result_status}

Typical causes of a query spike include a dashboard on auto-refresh with wide time ranges, a new anomaly detector scanning the full data history, and a workflow that recently gained a large query step.

Automation spike: Workflows and AppEngine Functions

Workflows generate costs across three separate billing signals: query scans, AppEngine invocations, and workflow-hours. Check all three. A query spike from the previous section may actually be workflow-driven.

Find the workflow driving the most query scans:

fetch dt.system.events, from: -30d
| filter event.kind == "QUERY_EXECUTION_EVENT"
| filter query_pool == "AUTOMATION"
| filter isNotNull(client.workflow_context)
| summarize total_scanned_gib = sum(scanned_bytes) / 1073741824,
    queries = count(),
    by: {client.workflow_context}
| sort total_scanned_gib desc
| limit 20

Resolve the UUID to a workflow name and owner:

fetch dt.system.events, from: -7d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Automation Workflow"
| dedup event.id
| filter workflow.id == "<workflow UUID from the previous query>"
| filter isNotNull(workflow.owner)
| summarize count(), by: {workflow.id, workflow.title, workflow.owner}
| limit 1

Check AppEngine invocations for the same workflow:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "AppEngine Functions - Small"
| dedup event.id
| filter workflow.id == "<workflow UUID from the query-scan query>"
| summarize total_invocations = sum(billed_invocations)

Typical causes of an automation spike include a trigger frequency changed from hourly to every 2 minutes, a new workflow with an unscoped table scan, and an event-triggered workflow firing on a high-volume event stream.
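
A change in trigger frequency tends to show up as a step change in queries per hour for a single workflow. The following sketch charts that cadence, assuming client.workflow_context holds the workflow UUID returned by the query-scan query in this section; the placeholder value is hypothetical.

```dql
fetch dt.system.events, from: -7d
| filter event.kind == "QUERY_EXECUTION_EVENT"
| filter query_pool == "AUTOMATION"
| filter client.workflow_context == "<workflow UUID>"
| fieldsAdd hour = bin(timestamp, 1h)
// Queries and scanned volume per hour for one workflow
| summarize queries = count(),
    total_scanned_gib = sum(scanned_bytes) / 1073741824,
    by: {hour}
| sort hour asc
```

These are query execution events, so the numbers are diagnostic rather than billed totals.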

Infrastructure spike: Full-Stack, Kubernetes, Code Monitoring

For infrastructure monitoring, query by entity reference and look for new hosts or namespaces that appeared around the spike timestamp from Step 3:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Full-Stack Monitoring"
| dedup event.id
| summarize total_gib_hours = sum(billed_gibibyte_hours), by: {dt.entity.host}
| sort total_gib_hours desc
| limit 20
| lookup [fetch dt.entity.host],
    sourceField: dt.entity.host, lookupField: id, fields: {entity.name}

For container or Kubernetes spikes, break down by namespace:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Code Monitoring"
| dedup event.id
| summarize total_container_hours = sum(billed_container_hours),
    by: {k8s.namespace.name}
| sort total_container_hours desc
| limit 20

Typical causes of an infrastructure spike include a production cluster newly onboarded to Full-Stack Monitoring, auto-scaling that added many short-lived containers, and a monitoring mode upgrade from Infrastructure Monitoring to Full-Stack across a large fleet.
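
To check whether the fleet itself grew, a day-level count of distinct billed hosts can be set against the spike timestamp. A minimal sketch, assuming the countDistinct aggregation is available; billed_hosts is an illustrative name.

```dql
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Full-Stack Monitoring"
| dedup event.id
| fieldsAdd day = bin(timestamp, 1d)
// Number of distinct hosts billed each day, with that day's usage
| summarize billed_hosts = countDistinct(dt.entity.host),
    total_gib_hours = sum(billed_gibibyte_hours),
    by: {day}
| sort day asc
```

If billed_hosts steps up on the spike day, newly monitored hosts are the likely driver. If it stays flat while total_gib_hours rises, existing hosts are consuming more.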

5. Assemble the escalation package

Hand off the following to Account Management, your FinOps team, or the owning team:

Item	Content
Root cause summary	Capability, entity or source, and spike start timestamp
Evidence queries	The DQL queries from Steps 3–4 with the exact time window pinned
Estimated delta	Cost difference between current and previous week
Remediation pointer	Link to optimization guidance for the relevant capability
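
For the estimated delta, a usage-level comparison is often a reasonable starting point for capabilities billed by volume. The sketch below computes the week-over-week difference in billed GiB for one capability, assuming billed_bytes is populated for it; current_gib, previous_gib, and delta_gib are illustrative names. It contains no prices, so converting the delta to a cost is left to your own contract rates.

```dql
fetch dt.system.events, from: -14d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "<capability from Step 2>"
| dedup event.id
// Billed volume in the last 7 days vs. the 7 days before
| summarize
    current_gib = sum(if(timestamp >= now() - 7d, toDouble(billed_bytes), else: 0.0)) / 1073741824,
    previous_gib = sum(if(timestamp < now() - 7d, toDouble(billed_bytes), else: 0.0)) / 1073741824
| fieldsAdd delta_gib = current_gib - previous_gib
```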

Common investigation mistakes

Mistake	Impact	Fix
Missing `dedup event.id`	Costs appear 10–30% higher	Apply before every `summarize`
Using `scanned_bytes` (query execution events) as the billed total	Query execution events are diagnostic, not billable	Use `billed_bytes` (billing usage events) for totals; use query execution events only for per-source drill-down
Checking only one automation billing signal	Misses the spend recorded by the other two signals	Check query scans (query execution events), AppEngine invocations (billing usage events), and workflow-hours (billing usage events)
Looking up `workflow.id` on business events	Field not present. Empty results.	Use the query execution events automation pool with `client.workflow_context` instead

Congratulations!

You've completed the full investigation flow, from a cost alert to a named root cause. You can now:

Reproduce the spike with a confirmed DQL query and an exact timestamp.
Identify the entity, dashboard, workflow, or detector responsible.
Hand off a complete escalation package to the owning team.

Next steps

Apply the optimization guidance for the capability you identified. For more information, see Optimize.
Set up a cost monitor so the same spike triggers an alert earlier next time. For more information, see Control.
