How to recover from a corrupt Keeper snapshot
Corrupt ClickHouse Keeper snapshots can cause significant system instability, including metadata inconsistencies, read-only tables, resource exhaustion, and failed backups. This article covers:
- What snapshots are and where to find them
- How the problem manifests
- Possible strategies for recovery and what each of them means
Overview of Keeper snapshots
What is a snapshot?
A snapshot is a serialized state of Keeper's internal data (such as metadata about clusters, table coordination paths, and configurations) at a specific point in time. Snapshots are vital for resynchronizing Keeper nodes within a cluster, recovering metadata during failures, and supporting start-up or restart processes that rely on a known-good Keeper state.
Where can I find snapshots?
Snapshots are stored as files on the local filesystem of Keeper nodes. By default, they live in `/var/lib/clickhouse/coordination/snapshots/`, or at the custom path specified by `snapshot_storage_path` in your `keeper_server.xml` file. Snapshots are named incrementally (e.g., `snapshot.23`), with newer ones having higher numbers.
For multi-node clusters, each Keeper node has its own snapshot directory.
Consistency of snapshots across nodes is critical for recovery.
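As a quick check, you can list the snapshot (and log) directories on each node. The paths below are the defaults; adjust them if `snapshot_storage_path` (or the corresponding log setting) is customized in your Keeper configuration.

```bash
# Run on each Keeper node; newest files appear first.
ls -lt /var/lib/clickhouse/coordination/snapshots/
ls -lt /var/lib/clickhouse/coordination/log/
```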
Key symptoms and manifestations of corrupt Keeper snapshots
The table below details some common symptoms and manifestations of corrupt Keeper snapshots:
| Category | Issue Type | What to look for |
|---|---|---|
| Operational Issues | Read-Only Mode | Tables unexpectedly switch to read-only mode |
| Operational Issues | Query Failures | Persistent query failures with Coordination::Exception errors |
| Metadata Corruption | Outdated Metadata | Dropped tables not reflected; operation failures due to stale metadata |
| Resource Overload | System Resource Exhaustion | Keeper nodes consume excessive CPU, memory, or disk space; potential downtime |
| Resource Overload | Disk Full | Disk fills up during snapshot creation |
| Backup & Restore | Backup Failures | Backups fail due to missing or inconsistent Keeper metadata |
| Snapshot Creation/Transfer | Keeper Crash | Keeper crashes mid-snapshot (look for "SEGFAULT" errors) |
| Snapshot Creation/Transfer | Snapshot Transfer Corruption | Corruption during snapshot transfer between replicas |
| Snapshot Creation/Transfer | Race Condition | Race condition during log compaction: background commit thread accessing deleted logs |
| Snapshot Creation/Transfer | Network Synchronization | Network issues preventing snapshot sync from leader to followers |
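If you suspect the read-only symptom, a quick way to confirm it is to query `system.replicas` on the affected ClickHouse servers. This is a minimal sketch; the exact set of columns may differ slightly between ClickHouse versions.

```sql
-- Replicated tables currently in read-only mode: a common symptom of
-- lost or corrupt Keeper metadata. Run on each ClickHouse server.
SELECT database, table, is_readonly, zookeeper_exception
FROM system.replicas
WHERE is_readonly = 1;
```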
Log Indicators:
Before diagnosing snapshot corruption, check Keeper logs for specific error patterns:
| Log Type | What to Look For |
|---|---|
| Snapshot corruption errors | • Aborting because of failure to load from latest snapshot with index<br>• Failure to load from latest snapshot with index {}: {}. Manual intervention is necessary for recovery<br>• Failed to preprocess stored log at index {}, aborting to avoid inconsistent state<br>• Snapshot serialization/loading failures during startup |
| Other Keeper issues | • Coordination::Exception<br>• Zookeeper::Session Timeout<br>• Synchronization or election issues<br>• Log compaction race conditions |
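A simple way to scan for these patterns is to grep the Keeper log. The log path below is an assumption for a standalone `clickhouse-keeper` installation; for Keeper embedded in `clickhouse-server`, check the server log instead.

```bash
# Scan the Keeper log for the corruption signatures listed above;
# adjust the path to wherever your Keeper logging is configured.
grep -iE "failure to load from latest snapshot|Failed to preprocess stored log|Coordination::Exception" \
    /var/log/clickhouse-keeper/clickhouse-keeper.log | tail -n 50
```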
Recovering from corrupt Keeper snapshots
Before touching any files, always:
- Stop all Keeper nodes to prevent further corruption
- Backup everything by copying the entire coordination directory to a safe location
- Verify cluster quorum to ensure at least one node has good data
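A minimal shell sketch of the stop-and-backup steps, assuming the default coordination directory and systemd service names; run it on every Keeper node before changing anything.

```bash
# Service name depends on your deployment: clickhouse-keeper for standalone
# Keeper, clickhouse-server when Keeper runs embedded in the server.
sudo systemctl stop clickhouse-keeper

# Copy the entire coordination directory (snapshots and logs) to a safe location.
sudo cp -a /var/lib/clickhouse/coordination \
    "/var/backups/keeper-coordination-$(date +%Y%m%d-%H%M%S)"
```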
1. Restore from an existing backup
You should follow this process if:
- The Keeper metadata or snapshot corruption makes current data unsalvageable.
- A backup exists with a known-good Keeper state.
Follow the steps below to restore an existing backup:
- Locate and validate the newest backup for metadata consistency.
- Shut down the ClickHouse and Keeper services.
- Replace the faulty snapshots and logs with those from the backup directory.
- Restart the Keeper cluster and validate metadata synchronization.
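The shell sketch below illustrates steps 2–4 on a single Keeper node. The backup location, paths, and service names are placeholders; repeat the procedure on each node with that node's own backup.

```bash
BACKUP_DIR=/var/backups/keeper-coordination-20240101-120000   # hypothetical backup
COORD_DIR=/var/lib/clickhouse/coordination

sudo systemctl stop clickhouse-server clickhouse-keeper

# Keep the faulty state around instead of deleting it outright.
sudo mv "$COORD_DIR" "$COORD_DIR.corrupt.$(date +%s)"

# Restore snapshots and logs from the known-good backup.
sudo cp -a "$BACKUP_DIR" "$COORD_DIR"
sudo chown -R clickhouse:clickhouse "$COORD_DIR"

sudo systemctl start clickhouse-keeper clickhouse-server
```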
If backups are outdated, you may lose recent metadata changes. For this reason, we recommend backing up regularly.
2. Rollback to an older snapshot
You should follow this process when:
- Recent snapshots are corrupt, but older ones remain usable.
- Incremental logs are intact for consistent recovery.
Follow the steps below to roll back to an older snapshot:
- Identify and select a valid older snapshot (e.g., snapshot.19) from the Keeper directory.
- Remove newer snapshots and logs.
- Restart Keeper so it replays logs to rebuild the metadata state.
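A shell sketch of the rollback, assuming the default coordination paths; `snapshot.23` and `snapshot.19` are the illustrative file names used in this article, so substitute the real names from your snapshots directory.

```bash
# Assumes Keeper is already stopped and the coordination directory is backed up.
SNAP_DIR=/var/lib/clickhouse/coordination/snapshots

# Move every snapshot newer than the chosen good one (snapshot.19) out of the way.
sudo mkdir -p /var/backups/keeper-rolled-back
sudo mv "$SNAP_DIR/snapshot.23" /var/backups/keeper-rolled-back/

# If the newest coordination logs are also suspect, move them aside from the
# adjacent log directory the same way before restarting.

# Restart Keeper; it loads snapshot.19 and replays the remaining logs.
sudo systemctl start clickhouse-keeper
```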
There is a risk of metadata desynchronization if snapshots and logs are missing or incomplete.
3. Restore metadata using SYSTEM RESTORE REPLICA
You should follow this process when:
- Keeper metadata is lost or corrupted but table data still exists on disk
- Tables have switched to read-only mode due to missing ZooKeeper/Keeper metadata
- You need to recreate metadata in Keeper based on locally available data parts
Follow the steps below to restore metadata:
- Verify that table data exists locally in your ClickHouse server data path, set by `<path>` in your config (`/var/lib/clickhouse/data/` by default).
- For each affected table, execute `SYSTEM RESTORE REPLICA` (see the sketch after this list).
- For database-level recovery (if using the Replicated database engine), restore the database replica.
- Wait for synchronization to complete.
- Verify recovery by checking `system.replicas` for `is_readonly = 0` and monitoring `system.detached_parts`.
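The SQL below sketches these steps for a hypothetical table `db.my_table` in a Replicated database `db`; replace the names with your own, and note that statement availability (in particular database-level restore) depends on your ClickHouse version.

```sql
-- Per-table recovery: recreate the Keeper metadata from the parts on local disk.
SYSTEM RESTORE REPLICA db.my_table;

-- Database-level recovery for the Replicated database engine
-- (check that your ClickHouse version supports this statement).
SYSTEM RESTORE DATABASE REPLICA db;

-- Wait for the replica to catch up with the rest of the cluster.
SYSTEM SYNC REPLICA db.my_table;

-- Verify recovery: tables should leave read-only mode, and detached parts
-- should shrink as they are reattached.
SELECT database, table, is_readonly FROM system.replicas WHERE database = 'db';
SELECT count() FROM system.detached_parts WHERE database = 'db';
```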
SYSTEM RESTORE REPLICA detaches all existing parts, recreates metadata in Keeper (as if it's a new empty table), then reattaches all parts. This avoids re-downloading data over the network.
This only works if local data parts are intact. If data is also corrupted, use strategy #5 (rebuild cluster) instead.
4. Drop and recreate replica metadata in Keeper
You should follow this process when:
- A single replica of the cluster has corrupt or inconsistent metadata in Keeper
- You encounter errors like "Part XXXXX intersects previous part YYYYY"
- You need to completely reset a replica's Keeper metadata while preserving local data
Follow the steps below to drop and recreate metadata (a combined SQL sketch follows the list):
- On the affected replica, detach the table with `DETACH TABLE`.
- Remove the replica's metadata from Keeper with `SYSTEM DROP REPLICA` (execute on any replica). To find the correct ZooKeeper path, query `zookeeper_path` in `system.replicas`.
- Reattach the table with `ATTACH TABLE` (it will be in read-only mode).
- Restore the replica metadata with `SYSTEM RESTORE REPLICA`.
- Synchronize with other replicas using `SYSTEM SYNC REPLICA`.
- Check `system.detached_parts` on all replicas after recovery.
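A combined SQL sketch of these steps for a hypothetical table `db.my_table` whose broken replica is named `replica_1`; both names are placeholders, which is why the lookup against `system.replicas` comes first.

```sql
-- Find the replica name and ZooKeeper path of the affected table
-- (run on the affected replica before detaching it).
SELECT zookeeper_path, replica_name
FROM system.replicas
WHERE database = 'db' AND table = 'my_table';

-- 1. On the affected replica, detach the table.
DETACH TABLE db.my_table;

-- 2. Remove the broken replica's metadata from Keeper.
SYSTEM DROP REPLICA 'replica_1' FROM TABLE db.my_table;

-- 3. Reattach the table on the affected replica (it comes back read-only).
ATTACH TABLE db.my_table;

-- 4. Recreate the replica's metadata in Keeper from the local parts.
SYSTEM RESTORE REPLICA db.my_table;

-- 5. Synchronize with the other replicas, then check system.detached_parts.
SYSTEM SYNC REPLICA db.my_table;
```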
If the corruption affects multiple replicas, repeat these steps on each one sequentially.
If using a Replicated database, you can use SYSTEM DROP REPLICA ... FROM DATABASE db_name instead.
Alternative: Using force_restore_data flag
For automatic recovery of all replicated tables at server startup:
- Stop ClickHouse server
- Create the recovery flag (see the command sketch after this list)
- Start ClickHouse server
- The server will automatically delete the flag and restore all replicated tables
- Monitor logs for recovery progress
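A shell sketch of the flag-based recovery, assuming the default `<path>` of `/var/lib/clickhouse/`; adjust the flag location if your data path differs.

```bash
sudo systemctl stop clickhouse-server

# Create the force_restore_data flag; it should be owned by the clickhouse user.
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data

sudo systemctl start clickhouse-server

# Follow recovery progress in the server log; the flag is removed automatically.
sudo tail -f /var/log/clickhouse-server/clickhouse-server.log
```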
This approach is useful when multiple tables need recovery simultaneously.
5. Rebuild Keeper cluster
You should follow this process when:
- No valid snapshots, logs, or backups are available for recovery.
- You need to recreate the entire Keeper cluster and its metadata.
Follow the steps below to rebuild the Keeper cluster:
- Fully stop the ClickHouse and Keeper clusters.
- Reset each Keeper node by cleaning the snapshot and log directories.
- Initialize one Keeper node as the leader and add other nodes incrementally.
- Re-import metadata if available from external records.
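A shell sketch of the per-node reset, assuming the default coordination paths and systemd service names; which node becomes leader is determined by your `keeper_server` Raft configuration rather than by these commands.

```bash
# On every Keeper node, with ClickHouse and Keeper fully stopped.
COORD_DIR=/var/lib/clickhouse/coordination

# Keep a copy of the old state instead of deleting it outright.
sudo mv "$COORD_DIR" "$COORD_DIR.old.$(date +%s)"
sudo -u clickhouse mkdir -p "$COORD_DIR/log" "$COORD_DIR/snapshots"

# Start the designated leader node first, confirm it is serving requests,
# then start the remaining nodes one at a time.
sudo systemctl start clickhouse-keeper
echo ruok | nc localhost 9181    # default client port; a healthy node replies "imok"
```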
This process is time-intensive and carries a risk of prolonged outage, as the entire Keeper state must be reconstructed from scratch.