Improve OpenTelemetry Collector resilience using persistent file storage

  • Latest Dynatrace
  • How-to guide
  • 15-min read
  • Published Jan 23, 2026

When operating the OpenTelemetry Collector in production environments, data loss prevention is crucial. Network outages, backend unavailability, or Collector restarts can lead to telemetry data loss if not properly handled.

The file storage extension provides persistent storage capabilities that help make the Collector more resilient against these scenarios.

This page describes how to install and configure the file storage extension, with code snippets showing best practices for the configuration. A complete production-ready configuration, which combines resiliency best practices, is provided at Full example configuration.

How the file storage extension works

The file storage extension uses bbolt, an embedded key-value database, to store data in a persistent queue on the local file system.

The persistent queue works by:

  • Always buffering to disk: Data is written to disk as soon as it reaches the exporter. This allows the queue to handle backend outages or slow network connections without data loss.
  • Persisting across restarts: Data remains on disk if the Collector restarts, allowing exports to resume automatically.

Note that the persistent queue works alongside the retry mechanism (retry_on_failure), which is a separate configuration that controls how failed exports are retried.

Both receivers and exporters can use the extension:

  • The filelog receiver uses the file storage extension to persist information about which log files have been read and at what position within each file. This prevents duplicate log ingestion or data loss after Collector restarts.

  • Exporters can use the file storage extension through the sending_queue configuration with the storage parameter.

For detailed information on how the extension manages files, sanitizes component names, and handles database corruption, see the upstream documentation.

When to use the file storage extension

You'll typically use the file storage extension in two scenarios:

  • When receivers need to track their reading position.

  • When exporters need to queue data reliably.

Configuration for receivers

By using the file storage extension, the receiver can resume reading from the exact position where it stopped.

Example configuration for receivers
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
    create_directory: true
    fsync: true
    compaction:
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s

receivers:
  filelog:
    include: [/var/log/app/*.log]
    storage: file_storage

Without persistent storage, if the Collector restarts while reading log files, it would either:

  • Re-read logs from the beginning, causing duplicates.
  • Skip logs that were generated during the restart, causing data loss.

Configuration for exporters

The file storage configuration creates a persistent queue that survives Collector restarts, preventing data loss during network outages or backend unavailability.

Example configuration for exporters
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
    create_directory: true
    fsync: true
    compaction:
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s

exporters:
  otlphttp:
    endpoint: "${env:DT_ENDPOINT}"
    headers:
      Authorization: "Api-Token ${env:DT_API_TOKEN}"
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

The persistent queue backed by file storage is enabled by the sending_queue.storage setting. Without it, the queue exists only in memory and is lost on restart.

When you set sending_queue.storage, the persistent queue replaces the in-memory queue. Note that authentication context from extensions is not preserved in the persistent queue: if your setup uses credentials derived from the incoming request, they won't be available when the data is read from disk and sent later.

Environment considerations

The file storage extension behaves differently depending on the deployment environment. Understanding these differences helps you configure it appropriately for your use case.

Operating system installations

The default storage directories for the file storage extension depend on the operating system.

  • Linux and macOS: /var/lib/otelcol/file_storage
  • Windows: %ProgramData%\Otelcol\FileStorage

These directories must exist before the Collector starts, unless create_directory is enabled.
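
For example, a minimal sketch that lets the Collector create the directory at startup (the directory_permissions value is illustrative; omit it to use the extension's default):

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
    create_directory: true # create the directory at startup if it does not exist
    directory_permissions: "0750" # illustrative octal mode for the created directory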

The directory can reside on a local disk, a cloud-provisioned disk, or a persistent volume in a Kubernetes cluster. Choose the storage location based on:

  • Available disk space: Ensure the filesystem has sufficient space for expected queue sizes.
  • Disk performance: Use fast disks for high-throughput scenarios.
  • Backup policies: Consider whether the storage directory should be included in backups.

Avoid placing the storage directory on network filesystems (NFS, CIFS). The underlying database relies on file locking and memory-mapped files, which are often poorly supported on network drives.

Storage capacity planning

Calculate the required storage capacity based on your expected throughput and potential downtime:

Required Storage = Throughput * Outage Duration * Safety Factor

Example storage capacity calculation
  • Throughput: 100 MB/hour of telemetry data
  • Expected maximum outage: 4 hours
  • Safety factor: 2× (to account for compression and overhead)
  • Required Storage: 100 MB/hour * 4 hours * 2 = 800 MB

Container environments

When running in containers (Docker, Kubernetes), the storage directory must be backed by a persistent volume for data to survive container restarts. Without this, the storage files are lost when the container is recreated.

Avoid sharing the same storage directory across multiple Collector instances. The file storage extension uses file locking and is intended for a single process to own a given storage directory. To achieve this in Kubernetes when running multiple replicas, use a StatefulSet instead of a Deployment: a StatefulSet ensures that each pod gets its own unique PersistentVolumeClaim (PVC) via volumeClaimTemplates (see the StatefulSet sketch after the following example).

Kubernetes example with PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: otel-collector-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 1 # a single replica owns the storage directory
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: ghcr.io/dynatrace/dynatrace-otel-collector/dynatrace-otel-collector:0.47.0
          volumeMounts:
            - name: storage
              mountPath: /var/lib/otelcol/file_storage
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: otel-collector-storage
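
If you run multiple replicas, the following sketch shows the StatefulSet approach. It assumes a matching headless Service named otel-collector exists and reuses the image and mount path from the example above.

Kubernetes sketch with StatefulSet and volumeClaimTemplates
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: otel-collector
spec:
  serviceName: otel-collector # headless Service assumed to exist
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: ghcr.io/dynatrace/dynatrace-otel-collector/dynatrace-otel-collector:0.47.0
          volumeMounts:
            - name: storage
              mountPath: /var/lib/otelcol/file_storage
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 10Gi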

Security considerations

Protect the storage files from unauthorized access, as they may contain sensitive telemetry data.

  • Encryption at rest: The file storage extension does not encrypt data. If required by your security policy, use infrastructure-level encryption:

    • Kubernetes: Use a StorageClass that provisions encrypted volumes (for example, AWS EBS with KMS; see the sketch after this list).
    • VMs: Use cloud provider disk encryption (for example, AWS EBS encryption or Azure Disk Encryption).
    • On-premise: Use filesystem-level encryption (for example, LUKS).
  • Storage reclamation: Configure compaction to reclaim disk space from processed data.

    Note that compaction does not expire data based on time (retention); it only reclaims space from data that has already been successfully processed and removed from the queue.

  • Database corruption recovery: Enable automatic recovery from database corruption:

    extensions:
      file_storage:
        recreate: true # Automatically handle corrupted database files

    When recreate: true is set and corruption is detected, the corrupted file is renamed with a .backup suffix and a fresh database is created. This allows the Collector to continue operating but may result in data duplication or loss of component state.
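
To illustrate the Kubernetes encryption option in the list above, the following sketch defines a StorageClass for encrypted AWS EBS volumes. It assumes the AWS EBS CSI driver is installed; the kmsKeyId is a placeholder:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-east-1:111122223333:key/example # placeholder KMS key ARN
volumeBindingMode: WaitForFirstConsumer

Reference the class from your PersistentVolumeClaim via spec.storageClassName.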

High-throughput considerations

The file storage extension can consume significant memory in high-throughput scenarios. Understanding this behavior helps you configure the Collector appropriately and avoid out-of-memory issues.

Memory usage consists of:

  • Active working set: Data that is currently being written or read.
  • File cache: OS-level caching of recently accessed data.
  • Queue buffer: In-memory buffer before data is flushed to disk.

Memory consumption patterns

The extension uses memory-mapped files for database operations, which affects how memory usage is reported and managed:

  • High reported usage: The operating system may report high memory usage (RSS) because it caches pages of the memory-mapped database file in available RAM.
  • Reclaimable memory: Much of this reported usage is often file-backed page cache. If the system faces memory pressure, the OS can reclaim this memory by evicting pages without crashing the process.
  • Actual pressure: True memory pressure occurs only when the working set (data actively being read/written) exceeds available RAM, or when non-reclaimable Go heap memory (used for serialization buffers and internal pointers) grows too large.

Managing memory usage

To control memory consumption:

  • Use the memory limiter processor.

    Always use the memory limiter processor before file storage-backed components. This prevents the Collector from consuming excessive memory by refusing data when limits are approached. For configuration information, see memory limiter documentation.
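
    A minimal sketch of this ordering (limit values are illustrative):

    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 1024
        spike_limit_mib: 256

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter] # first in the pipeline, before any other processors
          exporters: [otlphttp]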

  • Configure queue sizes appropriately.

    Don't over-provision queue sizes. Calculate based on your actual needs:

    exporters:
      otlphttp:
        sending_queue:
          enabled: true
          queue_size: 5000 # Start conservative, increase if needed
          num_consumers: 10
          storage: file_storage

    Larger queues provide more buffering but consume more memory.

  • When using persistent queues with file storage, consider setting sizer to items or bytes if you need stricter control over queue size.

    Be aware, however, of the performance implications of each setting:

    • requests: The default setting. Best performance. Counts batches/requests. Low overhead (O(1)).
    • items: Moderate overhead. Counts individual items (spans, metrics, logs). Requires iterating through the batch structure. Recommended for better queue capacity control without extreme performance penalty.
    • bytes: Least performant. Counts size of serialized data. Requires serializing (marshalling) every request to calculate size, which significantly increases CPU usage. Use only if strict memory limit enforcement is critical.

    exporters:
      otlphttp:
        sending_queue:
          enabled: true
          queue_size: 5000
          num_consumers: 10
          storage: file_storage
          sizer: items # Counts by number of items (spans/metrics/logs). Moderate CPU overhead.
          # sizer: bytes # Counts by size of serialized data. High CPU overhead (requires serialization).
  • Enable compaction. Compaction reclaims disk space from deleted data and can reduce memory pressure.

    extensions:
      file_storage:
        compaction:
          on_start: true
          on_rebound: true
          directory: /tmp/otel_compaction # Temporary directory for compaction
          rebound_needed_threshold_mib: 100
          rebound_trigger_threshold_mib: 10
          check_interval: 5s
          max_transaction_size: 65536 # Maximum size of compaction transaction
          cleanup_on_start: true # Clean up temporary files on startup

    Compaction modes:

    • on_start: Compacts the database when the Collector starts (reclaims space immediately but adds to startup time).
    • on_rebound: Compacts online when storage usage drops after a spike.

    For technical details on how rebound compaction thresholds are calculated, refer to the compaction documentation.

  • Monitor and tune.

    Monitor these metrics to understand memory behavior (a sketch for exposing the Collector's internal metrics follows this list):

    • otelcol_processor_refused_spans: Indicates memory limiter is activating.
    • process_runtime_total_sys_memory_bytes: Total system memory usage.
    • otelcol_exporter_queue_size: Current queue occupancy.

    If you see frequent memory limiter activations, consider:

    • Increasing memory limits, if resources are available.
    • Reducing queue sizes.
    • Scaling horizontally, by using more Collector instances.
    • Optimizing data processing by sampling and filtering.
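
The queue and memory metrics above come from the Collector's internal telemetry. A minimal sketch that exposes them in Prometheus format on port 8888 (syntax for recent Collector versions; older releases configure service::telemetry::metrics::address instead):

service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888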

Performance tuning

For high-throughput scenarios (>10,000 spans/second), consider these optimizations:

extensions:
  file_storage:
    timeout: 500ms # Reduce timeout for faster operations
    fsync: false # Disable fsync for better performance (less durability)
    compaction:
      on_rebound: true
      max_transaction_size: 65536

exporters:
  otlphttp:
    sending_queue:
      enabled: true
      queue_size: 10000
      num_consumers: 20 # Increase consumers for higher throughput
      storage: file_storage

Disabling fsync improves performance but reduces durability. If the system crashes before data is flushed to disk, you may lose some buffered data. Only disable this if you can tolerate potential data loss.

Full example configuration

Here's a production-ready configuration that combines resiliency best practices:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  file_storage:
    directory: /var/lib/otelcol/file_storage
    create_directory: true
    fsync: true
    compaction:
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  filelog:
    include: [/var/log/app/*.log]
    storage: file_storage

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256

exporters:
  otlphttp:
    endpoint: "${env:DT_ENDPOINT}"
    headers:
      Authorization: "Api-Token ${env:DT_API_TOKEN}"
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  extensions: [health_check, file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [otlphttp]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter]
      exporters: [otlphttp]

Components


Extensions

We enable the file_storage extension to provide the underlying persistence mechanism for our queues and receivers. We also use health_check to allow Kubernetes (or other orchestrators) to monitor the Collector's status.

Receivers

We use the standard otlp receiver for network traffic and the filelog receiver for reading log files from the host. The filelog receiver is configured to use file_storage to persist its reading offsets.

Processors

Crucially, we use the memory_limiter processor to prevent Out of Memory (OOM) errors during traffic spikes.

Exporters

The otlphttp exporter sends data to Dynatrace. We configure its sending_queue to use file_storage, ensuring that any data buffered during network outages is written to disk rather than held in volatile memory.

Common deployment issues

For more information about common issues, see the File Storage extension README.

Issue: Collector won't start, error about file locks

  • Error message: Error: failed to start extension "file_storage": timeout
  • Cause: Another Collector instance is using the storage directory, or the previous instance didn't shut down cleanly.
  • Solution: Ensure only one Collector process accesses the storage directory. If a previous process crashed, the OS should eventually release the lock; restarting the host (or moving the storage directory) can help in rare cases where the lock appears stuck.

Issue: High memory usage

  • Cause: Large queue sizes and memory-mapped file caching.
  • Solution: Enable memory limiter, reduce queue sizes, enable compaction, consider horizontal scaling.

Issue: Data loss after container restart in Kubernetes

  • Cause: Storage directory is not backed by a persistent volume.
  • Solution: Mount a PersistentVolumeClaim to the storage directory in Kubernetes.

Issue: Permission denied errors

  • Cause: Collector process doesn't have access to the storage directory.
  • Solution: Adjust filesystem permissions or use create_directory with appropriate directory_permissions. In Kubernetes, configure securityContext.fsGroup to match the container user (for example, 10001) so the volume is writable; see the sketch below.
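
A sketch of the relevant pod-spec fragment (the group ID is illustrative; match it to the user your container runs as):

spec:
  template:
    spec:
      securityContext:
        fsGroup: 10001 # illustrative GID; Kubernetes makes the volume group-writable for it
      containers:
        - name: otel-collector
          image: ghcr.io/dynatrace/dynatrace-otel-collector/dynatrace-otel-collector:0.47.0
          volumeMounts:
            - name: storage
              mountPath: /var/lib/otelcol/file_storage
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: otel-collector-storage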