When operating the OTel Collector in production environments, data loss prevention is crucial. Network outages, backend unavailability, or Collector restarts can lead to telemetry data loss if not properly handled.
The file storage extension provides persistent storage capabilities that help make the Collector more resilient against these scenarios.
This page describes how to install and configure the file storage extension, with configuration snippets that illustrate best practices. A complete production-ready configuration combining these resiliency practices is provided in Full example configuration.
The file storage extension uses bbolt, an embedded key-value database, to store data in a persistent queue on the local file system.
The persistent queue works by:

1. Writing queued telemetry batches to the on-disk database before they are exported.
2. Deleting each batch from the database once its export succeeds.
3. Reloading any remaining batches from disk after a Collector restart and resuming export from there.
Note that the persistent queue works alongside the retry mechanism (retry_on_failure), which is a separate configuration that controls how failed exports are retried.
Both receivers and exporters can use the extension:
The filelog receiver uses the file storage extension to persist information about which log files have been read and at what position within each file. This prevents duplicate log ingestion or data loss after Collector restarts.
Exporters can use the file storage extension through the sending_queue configuration with the storage parameter.
For detailed information on how the extension manages files, sanitizes component names, and handles database corruption, see the upstream documentation.
There are two scenarios in which you'll use the file storage extension:

- When receivers need to track their reading position.
- When exporters need to queue data reliably.
By using the file storage extension, the receiver can resume reading from the exact position where it stopped.
```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
    create_directory: true
    fsync: true
    compaction:
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s

receivers:
  filelog:
    include: [/var/log/app/*.log]
    storage: file_storage
```
Without persistent storage, if the Collector restarts while reading log files, it would either:

- re-read the files from the beginning, producing duplicate log records, or
- start tailing only new data, losing entries written while it was down.
The file storage configuration creates a persistent queue that survives Collector restarts, preventing data loss during network outages or backend unavailability.
```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
    create_directory: true
    fsync: true
    compaction:
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s

exporters:
  otlphttp:
    endpoint: "${env:DT_ENDPOINT}"
    headers:
      Authorization: "Api-Token ${env:DT_API_TOKEN}"
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```
The persistent queue backed by file storage is enabled by the sending_queue.storage setting.
Without it, the queue exists only in memory and is lost on restart.
When you enable sending_queue.storage, the in-memory queue is disabled.
Note that authentication context from extensions is not preserved in the persistent queue.
If your setup uses credentials derived from the incoming request, they won't be available when the data is read from disk and sent later.
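One way to sidestep this limitation, sketched below, is to configure credentials statically on the exporter (the endpoint and token variables follow the examples on this page). Static headers are attached at send time, so they remain valid for data replayed from the persistent queue:

```yaml
exporters:
  otlphttp:
    endpoint: "${env:DT_ENDPOINT}"
    headers:
      # Applied when the request is sent, not captured from the incoming
      # request, so queued data can still authenticate after a restart.
      Authorization: "Api-Token ${env:DT_API_TOKEN}"
```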
The file storage extension behaves differently depending on the deployment environment. Understanding these differences helps you configure it appropriately for your use case.
Installation directories for the file storage extension depend on the operating system used.
- Linux: `/var/lib/otelcol/file_storage`
- Windows: `%ProgramData%\Otelcol\FileStorage`

These directories must exist before the Collector starts, unless create_directory is enabled.
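Instead of pre-creating the directory, you can let the extension create it at startup. A minimal sketch (the permission mode is an assumption to adapt to your security policy):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
    create_directory: true          # create the directory at startup if missing
    directory_permissions: "0750"   # octal mode applied when creating it
```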
The directory can be located in storage locations such as the cloud, a local disk, or a persistent volume on a Kubernetes cluster. Choose the storage location based on:

- I/O performance and filesystem semantics (the database needs reliable file locking and memory-mapped file support).
- Available capacity for buffering data during outages.
- Durability across restarts (for containers, this means a persistent volume).
Avoid placing the storage directory on network filesystems (NFS, CIFS). The underlying database relies on file locking and memory-mapped files, which are often poorly supported on network drives.
Calculate the required storage capacity based on your expected throughput and potential downtime:
Required Storage = Throughput * Outage Duration * Safety Factor
For example: `100 MB/hour * 4 hours * 2 = 800 MB`

When running in containers (Docker, Kubernetes), the storage directory must be backed by a persistent volume for data to survive container restarts. Without this, the storage files are lost when the container is recreated.
Avoid sharing the same storage directory across multiple Collector instances.
The file storage extension uses file locking and is intended for a single process to own a given storage directory.
To achieve this in Kubernetes when running multiple replicas, use a StatefulSet instead of a Deployment.
A StatefulSet ensures that each pod gets its own unique PersistentVolumeClaim (PVC) via volumeClaimTemplates.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: otel-collector-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  template:
    spec:
      containers:
        - name: otel-collector
          image: ghcr.io/dynatrace/dynatrace-otel-collector/dynatrace-otel-collector:0.47.0
          volumeMounts:
            - name: storage
              mountPath: /var/lib/otelcol/file_storage
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: otel-collector-storage
```
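With more than one replica, a Deployment sharing a single PVC would violate the one-process-per-directory rule. A StatefulSet sketch (names, labels, and sizes are illustrative) gives each replica its own volume through volumeClaimTemplates:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: otel-collector
spec:
  serviceName: otel-collector
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: ghcr.io/dynatrace/dynatrace-otel-collector/dynatrace-otel-collector:0.47.0
          volumeMounts:
            - name: storage
              mountPath: /var/lib/otelcol/file_storage
  # Each pod gets its own PVC (storage-otel-collector-0, -1, -2, ...)
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 10Gi
```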
Protect the storage files from unauthorized access, as they may contain sensitive telemetry data.
Encryption at rest: The file storage extension does not encrypt data. If required by your security policy, use infrastructure-level encryption, for example:

- Encrypted block devices or filesystems (for example, LUKS/dm-crypt).
- Cloud provider disk encryption (for example, encrypted EBS volumes or an encrypted Kubernetes StorageClass).
Storage reclamation: Configure compaction to reclaim disk space from processed data.
Note that compaction does not expire data based on time (retention); it only reclaims space from data that has already been successfully processed and removed from the queue.
Database corruption recovery: Enable automatic recovery from database corruption:
```yaml
extensions:
  file_storage:
    recreate: true # Automatically handle corrupted database files
```
When recreate: true is set and corruption is detected, the corrupted file is renamed with a .backup suffix and a fresh database is created.
This allows the Collector to continue operating but may result in data duplication or loss of component state.
The file storage extension can consume significant memory in high-throughput scenarios. Understanding this behavior helps you configure the Collector appropriately and avoid out-of-memory issues.
Memory usage consists of:

- The in-memory portion of each persistent queue (requests currently being dispatched by exporter consumers).
- The memory-mapped database file, whose mapped size grows with the on-disk queue.
- Temporary allocations during compaction.

The extension uses memory-mapped files for database operations, which affects how memory usage is reported and managed:

- Mapped file pages appear in the process's virtual memory size and, once touched, in its resident set size (RSS).
- The operating system can reclaim clean mapped pages under memory pressure, so reported RSS can overstate the memory the Collector actually needs.
To control memory consumption:
Use the memory limiter processor.
Always use the memory limiter processor before file storage-backed components. This prevents the Collector from consuming excessive memory by refusing data when limits are approached. For configuration information, see memory limiter documentation.
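A minimal memory_limiter sketch (the limits are placeholders to tune for your host), placed first in each pipeline:

```yaml
processors:
  memory_limiter:
    check_interval: 1s      # how often memory usage is measured
    limit_mib: 1024         # hard limit; data is refused above this
    spike_limit_mib: 256    # headroom subtracted to form the soft limit
```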
Configure queue sizes appropriately.
Don't over-provision queue sizes. Calculate based on your actual needs:
```yaml
exporters:
  otlphttp:
    sending_queue:
      enabled: true
      queue_size: 5000 # Start conservative, increase if needed
      num_consumers: 10
      storage: file_storage
```
Larger queues provide more buffering but consume more memory.
When using persistent queues with file storage, consider setting sizer to items or bytes if you need stricter control over queue size, but be aware of the performance implications of each setting:
- `requests`: The default setting. Best performance. Counts batches/requests. Low overhead (O(1)).
- `items`: Moderate overhead. Counts individual items (spans, metrics, logs). Requires iterating through the batch structure. Recommended for better queue capacity control without an extreme performance penalty.
- `bytes`: Least performant. Counts the size of serialized data. Requires serializing (marshalling) every request to calculate size, which significantly increases CPU usage. Use only if strict memory limit enforcement is critical.

```yaml
exporters:
  otlphttp:
    sending_queue:
      enabled: true
      queue_size: 5000
      num_consumers: 10
      storage: file_storage
      sizer: items # Counts by number of items (spans/metrics/logs). Moderate CPU overhead.
      # sizer: bytes # Counts by size of serialized data. High CPU overhead (requires serialization).
```
Enable compaction. Compaction reclaims disk space from deleted data and can reduce memory pressure.
```yaml
extensions:
  file_storage:
    compaction:
      on_start: true
      on_rebound: true
      directory: /tmp/otel_compaction # Temporary directory for compaction
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s
      max_transaction_size: 65536 # Maximum size of compaction transaction
      cleanup_on_start: true # Clean up temporary files on startup
```
Compaction modes:
- `on_start`: Compacts the database when the Collector starts (reclaims space immediately but adds to startup time).
- `on_rebound`: Compacts online when storage usage drops after a spike.

For technical details on how rebound compaction thresholds are calculated, refer to the compaction documentation.
Monitor and tune.
Monitor these metrics to understand memory behavior:
- `otelcol_processor_refused_spans`: Indicates the memory limiter is activating.
- `process_runtime_total_sys_memory_bytes`: Total system memory usage.
- `otelcol_exporter_queue_size`: Current queue occupancy.

If you see frequent memory limiter activations, consider:

- Increasing limit_mib if the host has spare memory.
- Reducing queue_size so less data is held in flight.
- Scaling out to additional Collector instances.
For high-throughput scenarios (>10,000 spans/second), consider these optimizations:
```yaml
extensions:
  file_storage:
    timeout: 500ms # Reduce timeout for faster operations
    fsync: false # Disable fsync for better performance (less durability)
    compaction:
      on_rebound: true
      max_transaction_size: 65536

exporters:
  otlphttp:
    sending_queue:
      enabled: true
      queue_size: 10000
      num_consumers: 20 # Increase consumers for higher throughput
      storage: file_storage
```
Disabling fsync improves performance but reduces durability.
If the system crashes before data is flushed to disk, you may lose some buffered data.
Only disable this if you can tolerate potential data loss.
Here's a production-ready configuration that combines resiliency best practices:
```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  file_storage:
    directory: /var/lib/otelcol/file_storage
    create_directory: true
    fsync: true
    compaction:
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10
      check_interval: 5s

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  filelog:
    include: [/var/log/app/*.log]
    storage: file_storage

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256

exporters:
  otlphttp:
    endpoint: "${env:DT_ENDPOINT}"
    headers:
      Authorization: "Api-Token ${env:DT_API_TOKEN}"
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  extensions: [health_check, file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [otlphttp]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter]
      exporters: [otlphttp]
```
Validate your settings to avoid any configuration issues.
We enable the file_storage extension to provide the underlying persistence mechanism for our queues and receivers.
We also use health_check to allow Kubernetes (or other orchestrators) to monitor the Collector's status.
We use the standard otlp receiver for network traffic and the filelog receiver for reading log files from the host.
The filelog receiver is configured to use file_storage to persist its reading offsets.
Crucially, we use the memory_limiter processor to prevent Out of Memory (OOM) errors during traffic spikes.
The otlphttp exporter sends data to Dynatrace.
We configure its sending_queue to use file_storage, ensuring that any data buffered during network outages is written to disk rather than held in volatile memory.
For more information about common issues, see the File Storage extension README.
`Error: failed to start extension "file_storage": timeout`

Enable create_directory with appropriate directory_permissions so the storage directory exists and is writable before the extension starts.
In Kubernetes, configure securityContext.fsGroup to match the container user (for example, 10001) to ensure the volume is writable.
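In a pod spec, that might look like the following sketch (the group ID must match your container image's user, and the volume name is illustrative):

```yaml
spec:
  template:
    spec:
      securityContext:
        fsGroup: 10001   # files on the mounted volume become group-owned by 10001
      containers:
        - name: otel-collector
          volumeMounts:
            - name: storage
              mountPath: /var/lib/otelcol/file_storage
```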