DQL best practices

This page describes actions you can take to improve query performance.

Narrow the query time range

For identical data sets, a shorter analysis window yields better performance. Use the timeframe selectors provided by the user interface, or specify the query time range directly in the fetch command.

fetch bizevents, from:-10m

Use the available sampling options

Currently, sampling is applicable only to log data, and only within the initial fetch pipeline statement. Depending on the specified value, a fraction (1/<samplingRatio>) of all available raw log records is returned.

The applicable value ranges for sampling are:

  • 1: the default value; no sampling is applied.
  • 10
  • 100
  • 1000
  • 10000

The following query uses sampling to improve performance while observing an approximation of the number of error logs over time.

fetch logs, from:-7d, samplingRatio:100
| summarize c = countIf(loglevel == "ERROR"), by:bin(timestamp, 3h)
| fieldsAdd c = c*100 // extrapolate the sampled count by the sampling ratio

Use options to limit the amount of scanned data

Particularly when dealing with logs, a large amount of data is ingested in short periods. Even with a narrow query time range, a query can take considerable time to complete.

To stop reading data after a specified amount has been scanned, use the optional scanLimitGBytes parameter of the fetch command.

fetch logs, from:-2h, scanLimitGBytes:100
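These fetch parameters can be combined. As a sketch, assuming the time range, sampling, and scan-limit parameters are accepted together in one statement, a query applying all three could look like this:

fetch logs, from:-24h, samplingRatio:10, scanLimitGBytes:50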

Recommended order of commands for non-aggregating queries:

  1. Reduce the number of processed records by filtering the data, using, for example, the filter or filterOut commands.
  2. Reduce the amount of processed data by selecting the required fields early, using the fields, fieldsKeep, or fieldsRemove commands.
  3. Process the resulting dataset to achieve the required result set, typically with commands such as fieldsAdd, parse, or append.
  4. Sort the resulting records according to your priority.
  5. Limit the final result set if applicable.
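Applied to logs, a non-aggregating query following this order could look like the sketch below. The host ID and the search phrase are illustrative assumptions, not values from your environment:

fetch logs
| filter dt.entity.host == "HOST-76FAA6DC0347DA12" // 1. filter early
| filter matchesPhrase(content, "connection timeout")
| fields timestamp, loglevel, content // 2. select fields early
| fieldsAdd loglevel = upper(loglevel) // 3. process the smaller dataset
| sort timestamp desc // 4. sort
| limit 20 // 5. limit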

Recommended order of commands for aggregating queries:

  1. Reduce the number of processed records by filtering the data, using, for example, the filter or filterOut commands.
  2. Reduce the amount of processed data by selecting the required fields early, using the fields, fieldsKeep, or fieldsRemove commands.
  3. Process the resulting dataset to achieve the required result set, typically with commands such as fieldsAdd, parse, or append.
  4. Aggregate your data set using the summarize command to create a tabular result, or the maketimeseries command if a time chart is required. Unless intended, don't use limit before aggregating the data, as it can produce wrong aggregates.
  5. Use field-selection commands such as fields, fieldsKeep, or fieldsRemove, or continue filtering.
  6. Sort and limit the aggregated data set after the aggregation stages.

Example

fetch logs
| filter dt.entity.host == "HOST-76FAA6DC0347DA12" and matchesPhrase(content,"failed to load")
| filterOut loglevel == "NONE"
| fields timestamp, loglevel, content, dt.entity.process_group_instance
| summarize count = countIf(loglevel=="ERROR"), by:{dt.entity.process_group_instance}
| sort count desc
| limit 5

It is recommended to place sort at the end of the query. Sorting right after fetch and then continuing the query reduces query performance.

Examples

The following query places sort right after fetch, which is not recommended:

fetch logs
| sort timestamp desc
| filter contains(content, "Error")

The following query follows the recommended order, with sort at the end:

fetch logs
| filter contains(content, "Error")
| sort timestamp desc

You can repeat the same command within one query and still follow the recommended order. In the example below, you first filter the fetched records and then filter again on the parsed content, while the sort command and the summarize function retain their positions:

fetch logs
| filter dt.entity.host == "HOST-1"
and dt.entity.process_group_instance == "PROCESS_GROUP_INSTANCE-123AB35"
| parse content, "ipaddr:ip ld ' POST ' ld:action ' HTTP/1.1 ' long:status ld"
| filter action == "/cart" or action == "/cart/checkout"
| summarize count = count(), by:{ ip, log.source }
| sort count desc

Filter early

Overall query performance is greatly improved by narrowing the dataset before further processing. If applicable, segmenting data into dedicated buckets can be particularly effective.

  • Valid for all tables, filter on bucket.

    fetch logs
    | filter dt.system.bucket == "custom_audit_log"
  • For business event queries, filter on any ingested field.

    fetch bizevents
    | filter event.provider == "www.easytrade.com"
  • For log queries, filter on fields representing the topological context of your data.

    fetch logs
    | filter k8s.container.name == "coredns"
    and loglevel != "NONE" and loglevel != "INFO"

    Commonly used fields within logs:

    • log.source
    • loglevel
    • host.name
    • dt.process.name
    • event.type
    • dt.kubernetes.cluster.name
    • k8s.namespace.name
    • k8s.deployment.name
    • k8s.container.name
    • k8s.pod.name
    • dt.host_group.id
    • dt.event.group_label
    • dt.entity.host
    • aws-log_group

Use string comparisons with care

  • Use == or != whenever the value of a field is known.

    fetch logs
    | filter k8s.container.name == "coredns"
    and loglevel != "NONE" and loglevel != "INFO"
  • Use matchesPhrase() instead of contains() whenever the value of a field is only partly known or unknown.

    fetch logs
    | filter matchesPhrase(content,"No files matching import")

Field names to be avoided or used in backticks

Avoid using the following eight reserved keywords as field identifiers (field names) or dimensions:

  • true
  • false
  • null
  • mod
  • and
  • or
  • xor
  • not

However, you can still use these words as field names, identifiers, and dimensions if you enclose them in backticks (`).

For example, if you have a dimension named 'true':

...
| fields x = true // creates a boolean field that is always true
...
| fields x = `true` // accesses the custom dimension named 'true'

Similarly, if you need to sort by a field named 'not':

...
| sort not desc // sorts by the boolean negation of a field named 'desc'
...
| sort `not` desc // sorts descending by a field named `not`

Entity queries

This section compares the classicEntitySelector function with native DQL filters. To write efficient entity queries, follow these best practices.

Simple filtering

In simple queries that only filter entities, you can use classicEntitySelector or native DQL filters.

The following two queries will be equally fast:

// fetch all K8s pods with the name "pod*"
// using native DQL
fetch dt.entity.cloud_application_instance
| filter startsWith(entity.name, "pod")
// fetch all K8s pods with the name "pod*"
// using classicEntitySelector()
fetch dt.entity.cloud_application_instance
| filter in(id, classicEntitySelector("type(cloud_application_instance), entityName.startsWith(pod)"))

Filtering along relationships

When your query evaluates relationships, we recommend using the classicEntitySelector function instead of native DQL queries.

In the following examples, the native DQL query will be slower and might yield incomplete results compared to the classicEntitySelector query:

// fetch all hosts that run Java processes
// using native DQL
fetch dt.entity.host
| expand pgi = contains[dt.entity.process_group_instance]
| filter pgi in [
    fetch dt.entity.process_group_instance
    | filter matchesValue(softwareTechnologies, "*JAVA*")
    | fields id
  ]
// fetch all hosts that run Java processes
// using classicEntitySelector()
fetch dt.entity.host
| filter in (id, classicEntitySelector("type(host),toRelationship.isProcessOf(type(PROCESS_GROUP_INSTANCE),softwareTechnologies(JAVA))"))

Combine classicEntitySelector with native DQL filters

If you already use the classicEntitySelector function, it is better to add all filter criteria to the function call rather than adding separate native filter statements. A mixed query is slower than a query that contains all filter conditions in the entity selector.

// fetch all LINUX hosts that run Java processes
// using a mix of classicEntitySelector and native DQL filters
fetch dt.entity.host
| filter in (id, classicEntitySelector("type(host),toRelationship.isProcessOf(type(PROCESS_GROUP_INSTANCE),softwareTechnologies(JAVA))"))
| fieldsAdd osType
| filter osType == "LINUX"
// fetch all LINUX hosts that run Java processes
// using only classicEntitySelector
fetch dt.entity.host
| filter in (id, classicEntitySelector("type(host),osType(LINUX),toRelationship.isProcessOf(type(PROCESS_GROUP_INSTANCE),softwareTechnologies(JAVA))"))
| fieldsAdd osType