Trace a cost spike to its root cause

  • Latest Dynatrace
  • Tutorial
  • 8-min read
  • Published May 26, 2026

A cost alert fired, or your FinOps team has escalated. You know something is up, but not what, where, or who owns it.

This tutorial walks you through a repeatable investigation: from the account-level overview in Account Management down to the specific entity, dashboard, workflow, or detector responsible.

Cost estimates use public list prices as of March 2026. Actual costs depend on your negotiated DPS contract. Use them for relative comparisons and directional analysis only, not invoice reconciliation.

Who is this for?

This tutorial is for FinOps practitioners, engineers, platform team leads, and anyone who receives a DPS cost alert and needs to pinpoint the source before it becomes a budget problem.

You'll likely be here because either:

  • You received a DPS threshold email. In this case, costs are up but you don't yet know which capability is responsible.

    Start at Step 1, See which capability is driving the spike.

  • Someone on your team escalated the cost alert. In this case, you already know the capability. For example, "Log Ingest is 40% over budget" indicates that you should look at Log Analytics - Ingest.

    Skip to Step 3, Pin the start time, and pre-filter by that capability.

What will you learn?

In this tutorial, you'll learn how to:

  • Identify which DPS capability is driving a cost spike using Account Management.
  • Confirm and quantify the spike with DQL billing event queries.
  • Pin the exact start time of a cost increase to a specific hour.
  • Attribute the spike to the responsible entity, dashboard, workflow, or detector.
  • Assemble an escalation package with evidence queries and a root cause summary.

Before you begin

Prerequisites

  • Access to Account Management with Subscription viewer permissions.
  • Permission to run DQL queries on the dt.system.events table in at least one environment.

Prior knowledge

Basic familiarity with DQL. If you're new to billing events, see Where to view your costs first.

How to trace a DPS cost spike to its root cause

1. See which capability is driving the spike

If you don't know which capability is responsible for the spike, you can use Account Management to investigate. Go to Account Management > Subscription > Overview > Cost and usage details.

The chart shows aggregate DPS consumption broken down by capability with a trend line over time.

Account Management can tell you:

  • Which broad capability is driving the spike.
  • Roughly when the increase started.

Account Management cannot tell you:

  • Which specific entity (host, cluster, or application) is responsible.
  • Which user, dashboard, workflow, or detector is running the queries.
  • The actual root cause.

To further attribute the spike, continue to the next step.

2. Confirm the spike with DQL

Run this week-over-week comparison to confirm which capability has grown the most:

fetch dt.system.events, from: -14d
| filter event.kind == "BILLING_USAGE_EVENT"
| dedup event.id
| fieldsAdd week = if(timestamp >= now() - 7d, "current", else: "previous")
| summarize event_count = count(), by: {event.type, week}
| sort event.type asc

The capability with the largest gap between current and previous is your target. Note it down. You'll use it to filter in the next steps.
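If you export the query result, the gap can also be ranked offline. The following is a minimal Python sketch, assuming the rows are exported as dictionaries with the same columns as the query output (event.type, week, event_count). The function name and the sample numbers are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def rank_capabilities_by_growth(rows):
    """Rank capabilities by the gap between current and previous week.

    rows: iterable of dicts with keys "event.type", "week", "event_count",
    mirroring the columns of the week-over-week query (assumed export shape).
    """
    counts = defaultdict(lambda: {"current": 0, "previous": 0})
    for row in rows:
        counts[row["event.type"]][row["week"]] += row["event_count"]

    ranked = [
        (capability, weeks["current"] - weeks["previous"])
        for capability, weeks in counts.items()
    ]
    # Largest absolute increase first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical sample data, for illustration only.
sample_rows = [
    {"event.type": "Events - Query", "week": "previous", "event_count": 400},
    {"event.type": "Events - Query", "week": "current", "event_count": 420},
    {"event.type": "Traces - Query", "week": "previous", "event_count": 100},
    {"event.type": "Traces - Query", "week": "current", "event_count": 900},
]
print(rank_capabilities_by_growth(sample_rows))
```

The first entry of the returned list is the capability to carry into the following steps.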

The dedup event.id command is mandatory in every billing query. Dynatrace refreshes metering records when it corrects measurements, so the same consumption period can appear more than once. Without deduplication, that period is counted multiple times and costs appear 10–30% higher than they are.
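To see why deduplication matters, consider this small Python sketch. It assumes billing records exported as dictionaries with an event.id and a billed_bytes field, and it keeps the first record seen per event.id. Which record the DQL dedup command keeps is not modeled here; the function name and sample values are hypothetical.

```python
def total_billed_bytes(records, dedup=True):
    """Sum billed_bytes, optionally keeping one record per event.id."""
    if not dedup:
        return sum(record["billed_bytes"] for record in records)

    first_seen = {}
    for record in records:
        # Keep only the first record for each event.id (illustrative choice).
        first_seen.setdefault(record["event.id"], record["billed_bytes"])
    return sum(first_seen.values())


# Hypothetical sample: event "a" was refreshed, so it appears twice.
sample_records = [
    {"event.id": "a", "billed_bytes": 100},
    {"event.id": "a", "billed_bytes": 100},
    {"event.id": "b", "billed_bytes": 50},
]
print(total_billed_bytes(sample_records, dedup=False))  # 250
print(total_billed_bytes(sample_records, dedup=True))   # 150
```

In this sample, the refreshed record inflates the total from 150 to 250 bytes when deduplication is skipped.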

3. Pin the start time

Narrow the spike to a specific hour using the capability from Step 2:

fetch dt.system.events, from: -7d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "<capability from Step 2>"
| dedup event.id
| fieldsAdd hour = bin(timestamp, 1h)
| summarize event_count = count(), by: {hour}
| sort hour asc

Look for the first hour where event_count jumps sharply. That timestamp is your spike anchor: it goes into the post-mortem and narrows the attribution search in the next step.
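If the hourly result is long, the first sharp jump can be located programmatically. The sketch below is one possible heuristic, not a Dynatrace feature: it flags the first hour whose count exceeds a multiple of the median of the preceding hours. The function name, the factor of 3, and the 24-hour baseline are assumptions chosen for illustration.

```python
from statistics import median


def find_spike_anchor(hourly_counts, factor=3.0, baseline_hours=24):
    """Return the first hour whose count exceeds factor x the trailing median.

    hourly_counts: list of (hour_label, event_count) tuples sorted ascending,
    as returned by the hourly query (assumed export shape).
    factor and baseline_hours are illustrative thresholds.
    """
    for i in range(baseline_hours, len(hourly_counts)):
        window = [count for _, count in hourly_counts[i - baseline_hours:i]]
        baseline = median(window)
        hour, count = hourly_counts[i]
        if baseline > 0 and count > factor * baseline:
            return hour
    return None


# Hypothetical sample: 24 quiet hours, then a sharp jump.
series = [(f"hour-{i:02d}", 100) for i in range(24)]
series += [("hour-24", 500), ("hour-25", 520)]
print(find_spike_anchor(series))  # hour-24
```

A gradual increase may never cross the threshold, in which case the function returns None and the day-level view is the better tool.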

If the increase is gradual rather than a sharp jump, run a 30-day day-level view to find which day it tipped over:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "<capability from Step 2>"
| dedup event.id
| fieldsAdd day = bin(timestamp, 1d)
| summarize event_count = count(), by: {day}
| sort day asc

4. Find the attribution source

Match the capability you identified in Step 2 to one of the spike types in this section (ingest, query, automation, or infrastructure), then run the queries for that type.

Ingest spike: Logs, Traces, Events, Metrics

Ingest billing events carry entity reference fields that identify the source sending data. Query for the top sources by GiB:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter in(event.type,
    "Log Management & Analytics - Ingest & Process",
    "Events - Ingest & Process",
    "Traces - Ingest & Process")
| dedup event.id
| summarize total_gib = sum(toDouble(billed_bytes) / 1073741824),
    by: {event.type, dt.entity.host, dt.entity.application,
    dt.entity.kubernetes_cluster}
| sort total_gib desc
| limit 20

Look for a single entity or cluster with a disproportionate share. Compare against your spike timestamp from Step 3. Did this entity appear for the first time, or did its volume increase?
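A disproportionate share is easier to spot as a percentage of the total. The following Python sketch assumes you have exported billed bytes per source into a dictionary. It converts bytes to GiB with the same divisor as the queries (1073741824, which is 1024 cubed) and computes each source's share. The function name and the source labels are hypothetical.

```python
GIB = 1024 ** 3  # 1073741824 bytes, the divisor used in the queries


def rank_sources_by_share(billed_bytes_by_source):
    """Return (source, GiB, share of total) tuples, largest share first."""
    total = sum(billed_bytes_by_source.values())
    if total == 0:
        return []
    ranked = [
        (source, billed / GIB, billed / total)
        for source, billed in billed_bytes_by_source.items()
    ]
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


# Hypothetical sample: one source dominates the ingest volume.
sample = {"HOST-A": 8 * GIB, "HOST-B": 1 * GIB, "HOST-C": 1 * GIB}
for source, gib, share in rank_sources_by_share(sample):
    print(f"{source}: {gib:.1f} GiB ({share:.0%})")
```

In this sample, HOST-A accounts for 80% of the volume, which makes it the first candidate to compare against the spike timestamp.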

Typical causes for ingest spikes include:

  • A new Kubernetes cluster onboarded without log filtering configured.
  • Debug-level logging left on in production.
  • A new data pipeline sending raw events to Dynatrace.

Query cost spike: Dashboards, detectors, and workflows

For query billing events, client.source identifies who is scanning data. Query for the top sources:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter in(event.type,
    "Log Management & Analytics - Query",
    "Events - Query",
    "Traces - Query",
    "Files - Query")
| dedup event.id
| filter isNotNull(client.source)
| summarize total_billed_gib = sum(toDouble(billed_bytes) / 1073741824),
    by: {client.source, event.type}
| sort total_billed_gib desc
| limit 20

Decode client.source:

Pattern | Source type | Next step
https://.../document/... | Dashboard URL | Open the dashboard and inspect tile queries
UUID without dt. prefix | Anomaly detector objectId | Cross-reference with the ALERTING pool (see below)
dynatrace.automations:... | Automation workflow | See the Automation spike section
dt.* service name | Internal platform service | Review with your platform team
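When the query returns many rows, the decoding rules above can be applied in bulk. This Python sketch is a rough classifier built only from the listed patterns. It assumes UUIDs appear in the canonical 8-4-4-4-12 hexadecimal form; the function name, the returned labels, and the sample values are hypothetical.

```python
import re

# Assumption: detector IDs use the canonical 8-4-4-4-12 hexadecimal UUID form.
UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def classify_client_source(value):
    """Map a client.source value to the source type from the decoding rules."""
    if value.startswith("https://") and "/document/" in value:
        return "dashboard"
    if value.startswith("dynatrace.automations:"):
        return "automation workflow"
    if value.startswith("dt."):
        return "internal platform service"
    if UUID_RE.match(value):
        return "anomaly detector"
    return "unknown"


# Hypothetical sample values, for illustration only.
samples = [
    "https://example.invalid/ui/document/abc123",
    "123e4567-e89b-12d3-a456-426614174000",
    "dynatrace.automations:example",
    "dt.example.service",
]
for sample in samples:
    print(sample, "->", classify_client_source(sample))
```

Values that match none of the patterns are returned as "unknown" so that they can be reviewed manually rather than silently misattributed.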

If client.source is a UUID, resolve the detector name via the Query Execution Event pool:

fetch dt.system.events, from: -7d
| filter event.kind == "QUERY_EXECUTION_EVENT"
| filter query_pool == "ALERTING"
| parse client.client_context, "JSON:ctx"
| summarize queries = count(),
    total_scanned_gib = sum(scanned_bytes) / 1073741824,
    by: {task_id = ctx[`dt.task.id`], task_group = ctx[`dt.task.group`]}
| sort total_scanned_gib desc
| limit 20

Then look up by task_id:

fetch dt.system.events, from: -7d
| filter event.kind == "ANALYZER_EXECUTION_EVENT"
| filter dt.task.id == "<task_id from above>"
| summarize count(), by: {dt.task.name, dt.task.id, dt.task.result_status}

Typical causes for a query spike are:

  • A dashboard on auto-refresh with wide time ranges.
  • A new anomaly detector scanning the full data history.
  • A workflow that recently added a large query step.

Automation spike: Workflows and AppEngine Functions

Workflows generate costs across three separate billing signals: query scans, AppEngine invocations, and workflow-hours. Check all three. A query spike from the previous section may actually be workflow-driven.

  1. Find the workflow driving the most query scans:

    fetch dt.system.events, from: -30d
    | filter event.kind == "QUERY_EXECUTION_EVENT"
    | filter query_pool == "AUTOMATION"
    | filter isNotNull(client.workflow_context)
    | summarize total_scanned_gib = sum(scanned_bytes) / 1073741824,
        queries = count(),
        by: {client.workflow_context}
    | sort total_scanned_gib desc
    | limit 20
  2. Resolve the UUID to a workflow name and owner:

    fetch dt.system.events, from: -7d
    | filter event.kind == "BILLING_USAGE_EVENT"
    | filter event.type == "Automation Workflow"
    | dedup event.id
    | filter workflow.id == "<uuid from step 1 of this list>"
    | filter isNotNull(workflow.owner)
    | summarize count(), by: {workflow.id, workflow.title, workflow.owner}
    | limit 1
  3. Check AppEngine invocations for the same workflow:

    fetch dt.system.events, from: -30d
    | filter event.kind == "BILLING_USAGE_EVENT"
    | filter event.type == "AppEngine Functions - Small"
    | dedup event.id
    | filter workflow.id == "<uuid from step 1 of this list>"
    | summarize total_invocations = sum(billed_invocations)

Typical causes for an automation spike are:

  • The trigger frequency changed from hourly to every 2 minutes.
  • A new workflow with an unscoped table scan.
  • An event-triggered workflow firing on a high-volume event stream.
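The effect of the trigger frequency change mentioned above is easy to underestimate. The following Python sketch works through the arithmetic, assuming a fixed schedule and that each run scans roughly the same amount of data. The function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes):
    """Number of scheduled runs per day for a fixed trigger interval."""
    return MINUTES_PER_DAY // interval_minutes


def frequency_multiplier(old_interval_minutes, new_interval_minutes):
    """How many times more often the workflow runs after the change."""
    return runs_per_day(new_interval_minutes) / runs_per_day(old_interval_minutes)


print(runs_per_day(60))             # 24 runs per day when hourly
print(runs_per_day(2))              # 720 runs per day every 2 minutes
print(frequency_multiplier(60, 2))  # 30.0
```

Under these assumptions, moving from hourly to every 2 minutes means 30 times as many runs, and a comparable increase in query scans and invocations.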

Infrastructure spike: Full-Stack, Kubernetes, Code Monitoring

For infrastructure monitoring, query by entity reference and look for new hosts or namespaces that appeared around the spike timestamp from Step 3:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Full-Stack Monitoring"
| dedup event.id
| summarize total_gib_hours = sum(billed_gibibyte_hours), by: {dt.entity.host}
| sort total_gib_hours desc
| limit 20
| lookup [fetch dt.entity.host],
    sourceField: dt.entity.host, lookupField: id, fields: {entity.name}

For container or Kubernetes spikes, break down by namespace:

fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Code Monitoring"
| dedup event.id
| summarize total_container_hours = sum(billed_container_hours),
    by: {k8s.namespace.name}
| sort total_container_hours desc
| limit 20

Typical causes for an infrastructure spike are:

  • A production cluster was onboarded to Full-Stack Monitoring.
  • Auto-scaling added many short-lived containers.
  • The monitoring mode was upgraded from Infrastructure Monitoring to Full-Stack across a large fleet.

5. Assemble the escalation package

Hand off the following to Account Management, your FinOps team, or the owning team:

Item | Content
Root cause summary | Capability; entity or source; spike start timestamp
Evidence queries | The DQL queries from Steps 3–4 with the exact time window pinned
Estimated delta | Cost difference between current and previous week
Remediation pointer | Link to optimization guidance for the relevant capability

Common investigation mistakes

Mistake | Impact | Fix
Missing dedup event.id | Costs appear 10–30% higher | Apply before every summarize
Using scanned_bytes (query events) as the billed total | Query execution events are diagnostic, not billable | Use billed_bytes (business events) for totals; use query execution events only for per-source drill-down
Checking only one automation billing signal | Misses two-thirds of workflow spend | Check query scans (query execution events), AppEngine invocations (business events), and workflow-hours (business events)
Looking up workflow.id on business events | Field not present. Empty results. | Use the query execution events automation pool with client.workflow_context instead

Congratulations!

You've completed the full investigation flow, from a cost alert to a named root cause. You can now:

  • Reproduce the spike with a confirmed DQL query and an exact timestamp.
  • Identify the entity, dashboard, workflow, or detector responsible.
  • Hand off a complete escalation package to the owning team.

Next steps

  • Apply the optimization guidance for the capability you identified. For more information, see Optimize.
  • Set up a cost monitor so the same spike triggers an alert earlier next time. For more information, see Control.

Related topics

  • Budget alerts
  • Cost monitors
  • Optimize
  • DQL best practices