A cost alert fired, or your FinOps team has escalated. You know something is up, but not what, where, or who owns it.
This tutorial walks you through a repeatable investigation: from the account-level overview in Account Management down to the specific entity, dashboard, workflow, or detector responsible.
Cost estimates use public list prices as of March 2026. Actual costs depend on your negotiated DPS contract. Use the estimates for relative comparisons and directional analysis only, not for invoice reconciliation.
This tutorial is for FinOps practitioners, engineers, platform team leads, and anyone who receives a DPS cost alert and needs to pinpoint the source before it becomes a budget problem.
You're likely here for one of two reasons:
- You received a DPS threshold email. Costs are up, but you don't yet know which capability is responsible. Start at Step 1, See which capability is driving the spike.
- Someone on your team escalated the cost alert and you already know the capability (for example, "Log Ingest is 40% over budget" indicates you should look at Log Analytics - Ingest). Skip to Step 3, Pin the start time, and pre-filter by that capability.
In this tutorial, you'll learn how to:
Access to the dt.system.events table in at least one environment.
Basic familiarity with DQL. If you're new to billing events, see Where to view your costs first.
If you don't know which capability is responsible for the spike, you can use Account Management to investigate. Go to Account Management > Subscription > Overview > Cost and usage details.
The chart shows aggregate DPS consumption broken down by capability with a trend line over time.
Account Management can tell you:
Account Management cannot tell you:
To further attribute the spike, continue to the next step.
Run this week-over-week comparison to confirm which capability has grown the most:
fetch dt.system.events, from: -14d
| filter event.kind == "BILLING_USAGE_EVENT"
| dedup event.id
| fieldsAdd week = if(timestamp >= now() - 7d, "current", else: "previous")
| summarize event_count = count(), by: {event.type, week}
| sort event.type asc
The capability with the largest gap between current and previous is your target. Note it down. You'll use it to filter in the next steps.
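If you export the query results (for example, as JSON records), the gap per capability can also be computed offline. The following is a minimal Python sketch, assuming each row is a dict with the `event.type`, `week`, and `event_count` fields produced by the query above. The function name `rank_capabilities_by_growth` and the sample figures are hypothetical.

```python
from collections import defaultdict


def rank_capabilities_by_growth(rows):
    """Return (capability, previous, current, delta) tuples, largest delta first.

    `rows` is assumed to mirror the query output: one dict per
    (event.type, week) pair with an `event_count` value.
    """
    counts = defaultdict(lambda: {"previous": 0, "current": 0})
    for row in rows:
        counts[row["event.type"]][row["week"]] += int(row["event_count"])

    ranked = [
        (capability, c["previous"], c["current"], c["current"] - c["previous"])
        for capability, c in counts.items()
    ]
    # Largest absolute growth first; ties broken by name for stable output.
    ranked.sort(key=lambda item: (-item[3], item[0]))
    return ranked


# Hypothetical sample rows, not real billing data.
sample_rows = [
    {"event.type": "Events - Query", "week": "previous", "event_count": 900},
    {"event.type": "Events - Query", "week": "current", "event_count": 950},
    {"event.type": "Traces - Ingest & Process", "week": "previous", "event_count": 400},
    {"event.type": "Traces - Ingest & Process", "week": "current", "event_count": 1600},
]

top_growth = rank_capabilities_by_growth(sample_rows)
```

The first entry of `top_growth` is the capability to carry into the next steps. Ranking by absolute delta rather than percentage is a design choice here; a small capability that doubled may matter less than a large one that grew by a fifth.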
| dedup event.id is mandatory in every billing query. Dynatrace refreshes metering records when correcting measurements. Without it, the same consumption period is counted multiple times and costs appear 10–30% higher than they are.
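To illustrate why deduplication matters, here is a small Python sketch of the same idea applied to exported records. It assumes each record carries an `event.id` and a `billed_bytes` field, and that a refreshed record reuses the `event.id` of the record it corrects. Keeping the first record seen per id is a simplification, and the helper name and sample values are illustrative only.

```python
def total_billed_bytes(records, deduplicate=True):
    """Sum `billed_bytes`, optionally counting each `event.id` only once."""
    seen_ids = set()
    total = 0
    for record in records:
        event_id = record["event.id"]
        if deduplicate:
            if event_id in seen_ids:
                continue  # refreshed copy of a record already counted
            seen_ids.add(event_id)
        total += record["billed_bytes"]
    return total


# Hypothetical records: "evt-2" appears twice because it was refreshed.
records = [
    {"event.id": "evt-1", "billed_bytes": 1000},
    {"event.id": "evt-2", "billed_bytes": 500},
    {"event.id": "evt-2", "billed_bytes": 500},
]

raw_total = total_billed_bytes(records, deduplicate=False)
deduped_total = total_billed_bytes(records)
```

In this made-up sample the raw total exceeds the deduplicated total by the size of the refreshed record, which is the kind of inflation the deduplication step removes.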
Narrow the spike to a specific hour using the capability from Step 2:
fetch dt.system.events, from: -7d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "<capability from Step 2>"
| dedup event.id
| fieldsAdd hour = bin(timestamp, 1h)
| summarize event_count = count(), by: {hour}
| sort hour asc
Look for the first hour where event_count jumps sharply. That timestamp is your spike anchor: it goes into the post-mortem and narrows the attribution search in the next step.
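What counts as a sharp jump is a judgment call. One way to make it repeatable is a simple ratio rule, sketched below in Python: flag the first hour whose count is at least `factor` times the median of all preceding hours. The threshold, the minimum history, and the function name `find_spike_anchor` are assumptions for illustration, not a Dynatrace feature.

```python
from statistics import median


def find_spike_anchor(hourly_counts, factor=3.0, min_history=3):
    """Return the first hour whose count is >= factor * median of earlier hours.

    `hourly_counts` is a list of (hour, event_count) pairs sorted by hour,
    matching the shape of the hourly query results. Returns None if no
    hour qualifies.
    """
    for index in range(min_history, len(hourly_counts)):
        hour, count = hourly_counts[index]
        baseline = median(c for _, c in hourly_counts[:index])
        if baseline > 0 and count >= factor * baseline:
            return hour
    return None


# Hypothetical hourly counts, not real billing data.
hourly = [
    ("2026-03-01T00:00", 100),
    ("2026-03-01T01:00", 110),
    ("2026-03-01T02:00", 95),
    ("2026-03-01T03:00", 105),
    ("2026-03-01T04:00", 420),
    ("2026-03-01T05:00", 450),
]

anchor = find_spike_anchor(hourly)
```

Using the median rather than the mean keeps a single earlier outlier from masking the jump. A gradual increase will not trip this rule, which is when the day-level view becomes the better tool.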
If the increase is gradual rather than a sharp jump, run a 30-day day-level view to find which day it tipped over:
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "<capability from Step 2>"
| dedup event.id
| fieldsAdd day = bin(timestamp, 1d)
| summarize event_count = count(), by: {day}
| sort day asc
Match the capability you identified in Step 2 to one of the spike types in this section (ingest, query, automation, or infrastructure), then run the queries for that type.
Ingest billing events carry entity reference fields that identify the source sending data. Query for the top sources by GiB:
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter in(event.type, "Log Management & Analytics - Ingest & Process", "Events - Ingest & Process", "Traces - Ingest & Process")
| dedup event.id
| summarize total_gib = sum(toDouble(billed_bytes) / 1073741824), by: {event.type, dt.entity.host, dt.entity.application, dt.entity.kubernetes_cluster}
| sort total_gib desc
| limit 20
Look for a single entity or cluster with a disproportionate share. Compare against your spike timestamp from Step 3. Did this entity appear for the first time, or did its volume increase?
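To put a number on "disproportionate", you can express each source as a share of the total. The Python sketch below assumes you have summed `billed_bytes` per source from exported results. The host identifiers, byte counts, and the function name `shares_by_source` are hypothetical.

```python
BYTES_PER_GIB = 1024 ** 3  # 1073741824, the divisor used in the ingest query


def shares_by_source(billed_bytes_by_source):
    """Return [(source, gib, share_of_total)] sorted by GiB, largest first."""
    total_bytes = sum(billed_bytes_by_source.values())
    if total_bytes == 0:
        return []
    rows = [
        (source, b / BYTES_PER_GIB, b / total_bytes)
        for source, b in billed_bytes_by_source.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical host identifiers and byte counts.
sample = {
    "HOST-A": 6 * BYTES_PER_GIB,
    "HOST-B": 3 * BYTES_PER_GIB,
    "HOST-C": 1 * BYTES_PER_GIB,
}
shares = shares_by_source(sample)
```

What share counts as disproportionate depends on your environment. Comparing shares before and after the spike timestamp is usually more telling than any fixed cutoff.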
Typical causes for ingest spikes include a new Kubernetes cluster onboarded without log filtering configured, debug-level logging left on in production, and a new data pipeline sending raw events to Dynatrace.
For query billing events, client.source identifies who is scanning data. Query for the top sources:
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter in(event.type, "Log Management & Analytics - Query", "Events - Query", "Traces - Query", "Files - Query")
| dedup event.id
| filter isNotNull(client.source)
| summarize total_billed_gib = sum(toDouble(billed_bytes) / 1073741824), by: {client.source, event.type}
| sort total_billed_gib desc
| limit 20
Decode client.source:
| Pattern | Source type | Next step |
|---|---|---|
| | Dashboard URL | Open the dashboard and inspect tile queries |
| UUID without | Anomaly detector | Cross-reference with the ALERTING pool (see below) |
| | Automation workflow | See the Automation spike section |
| | Internal platform service | Review with your platform team |
If client.source is a UUID, resolve the detector name via the Query Execution Event pool:
fetch dt.system.events, from: -7d
| filter event.kind == "QUERY_EXECUTION_EVENT"
| filter query_pool == "ALERTING"
| parse client.client_context, "JSON:ctx"
| summarize queries = count(), total_scanned_gib = sum(scanned_bytes) / 1073741824, by: {task_id = ctx[`dt.task.id`], task_group = ctx[`dt.task.group`]}
| sort total_scanned_gib desc
| limit 20
Then look up by task_id:
fetch dt.system.events, from: -7d
| filter event.kind == "ANALYZER_EXECUTION_EVENT"
| filter dt.task.id == "<task_id from above>"
| summarize count(), by: {dt.task.name, dt.task.id, dt.task.result_status}
Typical causes for a query spike are a dashboard on auto-refresh with wide time ranges, a new anomaly detector scanning the full data history, and a workflow that recently added a large query step.
Workflows generate costs across three separate billing signals: query scans, AppEngine invocations, and workflow-hours. Check all three. A query spike from the previous section may actually be workflow-driven.
Find the workflow driving the most query scans:
fetch dt.system.events, from: -30d
| filter event.kind == "QUERY_EXECUTION_EVENT"
| filter query_pool == "AUTOMATION"
| filter isNotNull(client.workflow_context)
| summarize total_scanned_gib = sum(scanned_bytes) / 1073741824, queries = count(), by: {client.workflow_context}
| sort total_scanned_gib desc
| limit 20
Resolve the UUID to a workflow name and owner:
fetch dt.system.events, from: -7d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Automation Workflow"
| dedup event.id
| filter workflow.id == "<uuid from Step A>"
| filter isNotNull(workflow.owner)
| summarize count(), by: {workflow.id, workflow.title, workflow.owner}
| limit 1
Check AppEngine invocations for the same workflow:
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "AppEngine Functions - Small"
| dedup event.id
| filter workflow.id == "<uuid from Step A>"
| summarize total_invocations = sum(billed_invocations)
Typical causes for an automation spike are a trigger frequency that changed from hourly to every 2 minutes, a new workflow with an unscoped table scan, and an event-triggered workflow firing on a high-volume event stream.
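A trigger-frequency change is easy to underestimate. As a rough Python sketch, assuming every run executes the same queries and scans a similar amount of data (an assumption that may not hold for your workflows), the run count alone scales as follows:

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes):
    """Number of scheduled runs per day for a fixed trigger interval."""
    return MINUTES_PER_DAY // interval_minutes


hourly_runs = runs_per_day(60)
two_minute_runs = runs_per_day(2)
multiplier = two_minute_runs / hourly_runs
```

Here the schedule goes from 24 to 720 runs per day, a 30-fold increase. If scan volume per run stays constant, query scans and invocations for that workflow grow by roughly the same factor.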
For infrastructure monitoring, query by entity reference and look for new hosts or namespaces that appeared around the spike timestamp from Step 3:
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Full-Stack Monitoring"
| dedup event.id
| summarize total_gib_hours = sum(billed_gibibyte_hours), by: {dt.entity.host}
| sort total_gib_hours desc
| limit 20
| lookup [fetch dt.entity.host], sourceField: dt.entity.host, lookupField: id, fields: {entity.name}
For container or Kubernetes spikes, break down by namespace:
fetch dt.system.events, from: -30d
| filter event.kind == "BILLING_USAGE_EVENT"
| filter event.type == "Code Monitoring"
| dedup event.id
| summarize total_container_hours = sum(billed_container_hours), by: {k8s.namespace.name}
| sort total_container_hours desc
| limit 20
Typical causes for an infrastructure spike are a production cluster onboarded to Full-Stack Monitoring, auto-scaling that added many short-lived containers, and a monitoring mode upgrade from Infrastructure Monitoring to Full-Stack across a large fleet.
Hand off the following to Account Management, your FinOps team, or the owning team:
| Item | Content |
|---|---|
| Root cause summary | |
| Evidence queries | The DQL queries from Steps 3–4 with the exact time window pinned |
| Estimated delta | Cost difference between current and previous week |
| Remediation pointer | Link to optimization guidance for the relevant capability |
| Mistake | Impact | Fix |
|---|---|---|
| Missing | Costs appear 10–30% higher | Apply before every |
| Using | Query execution events are diagnostic, not billable | Use |
| Checking only one automation billing signal | Misses the other two workflow billing signals | Check query scans (query execution events), AppEngine invocations (business events), and workflow-hours (business events) |
| Looking up | Field not present. Empty results. | Use the query execution events automation pool with |
You've completed the full investigation flow, from a cost alert to a named root cause. You can now: