Logs are an essential monitoring signal. Many teams build their own dashboards that rely on logs.
Let’s look at the following dashboard that analyzes error logs generated by Kubernetes workloads running in a namespace.
This dashboard uses two types of log queries:
Queries that aggregate log records:
Using the makeTimeseries command:
fetch logs
| filter k8s.namespace.name == "obslab-log-problem-detection"
| filter status == "ERROR"
| makeTimeseries count = count()
Using the summarize command:
fetch logs
| filter k8s.namespace.name == "obslab-log-problem-detection"
| filter status == "ERROR"
| summarize count = count(), by: { k8s.deployment.name }
| sort count desc
A query that fetches logs to get log records in raw form.
fetch logs
| filter k8s.namespace.name == "obslab-log-problem-detection"
| filter status == "ERROR"
The team wants to know how the number of errors develops over time, how the errors are distributed across deployments, and what the raw error records contain.
In this guide you will learn different techniques to get this information while optimizing your dashboard for performance and costs.
Let’s look at what you can do to optimize the different types of queries.
When you are interested in aggregated values such as sums or counts split by a low-cardinality dimension, follow the guide to Parse log lines and extract a metric.
In our example this is the required configuration for your metric extraction:
Name: errors in obslab-log-problem-detection
Matcher: k8s.namespace.name == "obslab-log-problem-detection" AND status == "ERROR"
Metric key: log.k8s.obslab.errors
Dimension: k8s.workload.name
Then you have to adjust the DQL statements on the dashboard tiles.
timeseries count = sum(log.k8s.obslab.errors)
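The per-deployment tile can be adjusted in the same way. A minimal sketch, assuming you chart the extracted metric split by its k8s.workload.name dimension (the dimension configured above) instead of the original k8s.deployment.name grouping:
// Sketch: error metric split by the workload dimension defined in the extraction rule
timeseries count = sum(log.k8s.obslab.errors), by: { k8s.workload.name }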
Once you have adjusted both tiles, rendering them no longer generates any query costs. Log-based metrics are licensed like any other Custom Metric powered by Grail.
It’s not possible to extract metrics from every DQL query.
Be aware of dimensional cardinality and of specific commands such as dedup or join.
If you can't extract metrics, you will need to optimize your DQL query and dashboard configuration.
Our dashboard example presents the log records' content field. This is a high-cardinality field and as such is not suitable as a metric dimension.
The first thing you can do is to set the timeframe, segment, or bucket, if possible. This improves performance and decreases costs right away. To do so, edit the dashboard tile.
Set timeframe: Select the Custom timeframe toggle and choose the timeframe via the drop-down menu.
Set segment: Select the Custom segments toggle and choose the segment via the drop-down menu.
Set bucket: Add the following line to the DQL query.
| filter dt.system.bucket == "default_logs"
More DQL best practices for query optimization are available at DQL Best practices.
Looking at our query, we can also filter early and select only the fields we need.
fetch logs
| filter dt.system.bucket == "default_logs"
| filter k8s.namespace.name == "obslab-log-problem-detection"
| filter status == "ERROR"
| fields timestamp, content, k8s.workload.name
If your use case allows, you can also set query limits. Scroll to Edit tile > Query limits. You can then configure the read data limit, the record limit, and sampling.
When you lower the read data limit and record limit, users who need more results should use the Open with functionality and analyze the logs in Logs.
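If you prefer to express limits in the query itself rather than in the tile settings, DQL offers comparable controls. A minimal sketch, assuming the scanLimitGBytes fetch parameter and the limit command, with illustrative values:
// Sketch: cap the scanned data volume and the number of returned records
fetch logs, scanLimitGBytes: 1
| filter dt.system.bucket == "default_logs"
| filter k8s.namespace.name == "obslab-log-problem-detection"
| filter status == "ERROR"
| limit 1000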
Reducing the sample size reduces the amount of data that is scanned while still returning a representative subset of large data sets. When you count records (for example, with the summarize command), remember to multiply the result by the sampling ratio to get a better approximation of the actual value. Sampling isn’t a good choice when you need accurate results or when you scan small data sets.
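As an illustration, here is a minimal sketch of the per-deployment error count with sampling applied, assuming the samplingRatio fetch parameter (reading roughly 1 in 100 records) and a hypothetical estimatedCount field that scales the sampled count back up:
// Sketch: sampled count, multiplied by the sampling ratio for an approximation
fetch logs, samplingRatio: 100
| filter dt.system.bucket == "default_logs"
| filter k8s.namespace.name == "obslab-log-problem-detection"
| filter status == "ERROR"
| summarize count = count(), by: { k8s.deployment.name }
| fieldsAdd estimatedCount = count * 100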
If many users frequently open a dashboard and query shorter time ranges (data that is less than 35 days old), consider using the Retain with Included Queries pricing model and setting the appropriate IAM permissions.
For detailed instructions on the bucket configuration, see Take control of log query costs using Retain with Included Queries.
If your dashboard is refreshed automatically every minute, all of its queries are executed on each refresh. If you are using log queries, you are charged per execution.
Consider switching off the auto-refresh functionality by opening the drop-down menu and selecting Off.
By using the techniques mentioned above, we reduced the size of the scanned data for a 24-hour timeframe from 12.3 GB to 147 MB. Your results may vary.
Best practices:
If possible, use metrics based on logs as described in Set up Davis alerts based on metrics. Metrics are faster, have longer and cheaper retention, and metric queries are not charged. You can also drop the log records that contribute to a metric to save on retention costs, and metrics can be used for alerting.
To extract metrics, see Parse log lines and extract a metric with OpenPipeline.
If your use case requires that log content (or other high-cardinality data) is presented on dashboards, optimize your DQL queries by setting the timeframe, segments, buckets, and query limits, and by disabling auto-refresh.