1. Background & Why
In Grafana Mimir, hash rings are the critical infrastructure responsible for sharding, replication, and service discovery. The hash ring maps data tokens to specific instances — without it, distributors wouldn’t know which ingester to send writes to, and queriers wouldn’t know where to fetch data.
Historically, Mimir stored ring state in external databases like Consul or etcd. This creates operational overhead:
- Additional infrastructure to operate, monitor, and scale
- Separate clusters required for high availability
- Single-point-of-failure risk if the backend becomes unavailable
- Every ring change triggers expensive database transactions
Memberlist is a peer-to-peer gossip protocol built into Mimir that:
- Requires no external service — embedded in every Mimir pod
- Uses SWIM gossip over port 7946 (TCP and UDP)
- Survives partial cluster failures with eventual consistency
- Is the Grafana-recommended backend for modern Mimir deployments
This tutorial documents the zero-downtime migration from Consul (or etcd) to Memberlist for all Mimir rings using Mimir’s multi KV store feature.
Trade-Off: Consistency vs. Simplicity
| Feature | Memberlist (Recommended) | Consul |
|---|---|---|
| Operational Overhead | Minimal: Embedded in every Mimir pod; no separate service to operate. | High: Requires a separate Consul cluster, monitoring, and lifecycle management. |
| Consistency Model | Eventual: Changes propagate within ~5-10 seconds via gossip. | Strong: Immediate consistency via CAS (compare-and-swap) operations. |
| Failure Tolerance | Good: Survives network partitions gracefully; gossip self-heals. | Critical: Loss of quorum = cluster halt; requires careful bootstrap. |
| Network Calls | ~20-50 gossip messages per pod per second, spread peer-to-peer across the cluster. | Dozens of CAS operations per second, all concentrated on the central Consul servers. |
| Best For | All modern Mimir deployments; especially Kubernetes. | Existing Consul environments; strong-consistency requirements. |
Eventual consistency in Mimir: A crashed ingester stays visible to distributors for up to ~10 seconds while gossip propagates. Writes to it fail and retry — Mimir’s write path tolerates this. It does not tolerate waiting on a centralized CAS operation for every write.
2. Architecture Overview
Rings and their KV keys
Each Mimir component maintains a ring in a separate KV namespace:
| Component | Key Prefix | KV Key Pattern |
|---|---|---|
| ingester | ingester/ | ingester/ingester-<zone>-<i>/ |
| distributor | distributor/ | distributor/distributor-<i>/ |
| compactor | compactor/ | compactor/compactor-<i>/ |
| store_gateway | store_gateway/ | store_gateway/store-gateway-<i>/ |
| alertmanager | alertmanager/ | alertmanager/alertmanager-<i>/ |
| ruler | rulers/ | rulers/ruler-<i>/ |
How structuredConfig overlays base config
structuredConfig is deep-merged on top of the base config — it wins on conflict. This is the mechanism used in Phase 1 and Phase 3 to override ring KV settings without rewriting the entire base config:
base config + structuredConfig = final config passed to Mimir binary
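As an illustrative sketch (the component and values are arbitrary examples), the merge works key-by-key, with the overlay winning wherever both define the same key:

```yaml
# Base config (rendered by the chart)
ingester:
  ring:
    kvstore:
      store: consul

# structuredConfig overlay
ingester:
  ring:
    kvstore:
      store: multi

# Final config handed to the Mimir binary: the overlay key wins
ingester:
  ring:
    kvstore:
      store: multi
```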
How runtimeConfig hot-reloads
Mimir polls runtimeConfig every ~10 seconds. The multi_kv_config key overrides the multi: section of all rings simultaneously — no pod restart required:
# runtimeConfig (hot-loaded)
multi_kv_config:
  primary: memberlist   # overrides primary for ALL multi-configured rings
  mirror_enabled: false # stop writing to secondary
This is the zero-restart mechanism used in Phase 2.
Cluster Label Security
In Kubernetes environments (especially AWS EKS with Karpenter or similar auto-scaling), pod IPs are frequently recycled. If a Mimir pod dies and another pod is spawned in its place with the same IP, the memberlist gossip will view them as the same logical node.
Critical risk: If different gossip-based systems (e.g., Mimir and Loki) use memberlist on the same IP range, their gossip traffic will merge the rings, causing traffic to be misrouted between systems. This is catastrophic.
Solution: Cluster labels. Each cluster gets a unique identifier, and memberlist only accepts gossip traffic from nodes with matching labels. During this migration, you’ll set a cluster label in Phase 1, and enforce verification in Phase 3 once all pods share the label.
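As a sketch (the label values are examples), co-located gossip users each carry a distinct label so their gossip planes can never merge:

```yaml
# Mimir deployment values
memberlist:
  cluster_label: "mimir-prod-us-east-1"

# A co-located Loki deployment carries its own, different label
memberlist:
  cluster_label: "loki-prod-us-east-1"
```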
3. Prerequisites
Before starting the migration, verify:
Mimir version
- Mimir 2.2 or later (memberlist is default since 2.2.0)
- Mimir 2.17+ required if using HA tracker with memberlist for distributors
Memberlist is configured
In your Mimir config, confirm memberlist.join_members is set to the gossip ring service:
memberlist:
  abort_if_cluster_join_fails: false
  compression_enabled: false
  join_members:
    - mimir-gossip-ring.<namespace>.svc.cluster.local:7946
If join_members is missing, pods won’t form a cluster and migration will fail.
Network connectivity
- Port 7946 (TCP and UDP) is open between all Mimir instances
Current KV backend is healthy
Check all Consul or etcd instances are healthy before starting:
# For Consul
kubectl get pods -l app=consul,component=server -n <namespace>
# For etcd
kubectl get pods -l app=etcd -n <namespace>
Configuration access
You can edit and reload Mimir runtime configuration without restarting pods (via ConfigMap or API).
Monitoring access
Access to Prometheus to run queries and verify metrics during migration.
Pod metrics annotations
Check your Mimir deployment for pod annotation port misconfiguration. Prometheus scrape annotations must use numeric port values, not named ports:
# CORRECT
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080" # numeric string

# WRONG — will cause scraping on wrong port
podAnnotations:
  prometheus.io/port: "http-metrics"
If any component uses named port annotations, fix them before Phase 1. See Issue 1 for why.
Helm chart version
Confirm your Mimir Helm chart supports structuredConfig. It was introduced in mimir-distributed ~4.x.
4. Component Ring Overview
Each Mimir component maintains its own independent ring. Not all rings need to be migrated simultaneously, but in practice, the entire cluster must use the same KV backend — you cannot run memberlist on some components and Consul on others.
Here are the main rings and their migration priority:
| Component | Ring Configuration Key | Migration Priority | Notes |
|---|---|---|---|
| Ingesters | ingester.ring.* | 1 (Migrate first) | Most critical for write path. |
| Ingest Storage Partitions | ingester.partition_ring.* | 1 (Migrate first) | If using ingest storage. |
| Distributors | distributor.ring.* | 2 (Migrate second) | Critical for request routing. |
| Compactors | compactor.sharding_ring.* | 3 (Migrate third) | Less critical; benefits from stability. |
| Store-gateways | store_gateway.sharding_ring.* | 3 (Migrate third) | Read-path sharding; less disruptive. |
| Rulers (optional) | ruler.ring.* | Optional | Only if using Mimir Ruler. |
| Alertmanagers (optional) | alertmanager.sharding_ring.* | Optional | Only if using Mimir Alertmanager. |
| Query-schedulers (optional) | query_scheduler.ring.* | Optional | Only if using query scheduling. |
| Overrides-exporters (optional) | overrides_exporter.ring.* | Optional | Rarely used. |
5. Migration Strategy — Multi KV Approach
Why You Need Multi KV
A direct cutover loses ring history — memberlist starts empty, so all components suddenly appear unregistered, causing query failures and ingestion drops. Instead, use the multi KV store: it writes to both Consul and memberlist simultaneously, letting memberlist shadow the primary until it has a full copy of ring state. Only then do you flip reads over.
The Three-Phase Migration
Phase 1: [Consul PRIMARY] ←→ [Memberlist SECONDARY mirror]
↓ (runtime hot-reload, zero restart)
Phase 2: [Memberlist PRIMARY] ←→ [Consul SECONDARY, no mirror]
↓ (Helm deploy, rolling restart)
Phase 3: [Memberlist ONLY] (Consul deleted)
How Multi KV Works
The multi KV store uses these configuration parameters:
- primary: the store that handles both reads and writes (e.g., consul)
- secondary: the store that receives copies of writes only (e.g., memberlist)
- mirror_enabled: whether to send writes to the secondary (should be true during Phase 1)
- mirror_timeout: how long to wait for a secondary write before timing out (e.g., 2s)
If a secondary write fails, the primary write still succeeds — you’ll see the error in metrics, but the system doesn’t block.
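Put together, a single ring's multi KV block might look like the following sketch (the 2s mirror_timeout is an illustrative choice, not a required value):

```yaml
ingester:
  ring:
    kvstore:
      store: multi
      multi:
        primary: consul        # serves all reads and writes
        secondary: memberlist  # receives mirrored writes only
        mirror_enabled: true   # Phase 1: mirror writes to the secondary
        mirror_timeout: 2s     # give up on a slow secondary write; the primary write still succeeds
```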
6. Phase 1 — Dual-Write Setup
What changes
Add a configuration overlay that modifies all 6 ring kvstore blocks to use multi KV store with your current backend (Consul/etcd) as primary (reads + writes) and memberlist as secondary (writes only, mirrored).
Also set:
- memberlist.cluster_label_verification_disabled: true, which prevents accidental ring partitions while cluster labels are rolling out
- memberlist.cluster_label, a unique identifier for your cluster (e.g., mimir-prod-us-east-1), set now and enforced in Phase 3
See Cluster Label Security in the Architecture Overview for background.
Configuration changes
Using Helm, add a structuredConfig: block to your values:
mimir:
  structuredConfig:
    memberlist:
      cluster_label_verification_disabled: true # Disable enforcement during rollout
      cluster_label: "mimir-prod-us-east-1" # Set unique label (customize for your cluster)
    ingester:
      ring:
        kvstore: &kvstore
          store: multi
          multi:
            primary: consul # Your current backend (consul or etcd)
            secondary: memberlist
            mirror_enabled: true
    distributor:
      ring:
        kvstore: *kvstore
    compactor:
      sharding_ring:
        kvstore: *kvstore
    store_gateway:
      sharding_ring:
        kvstore: *kvstore
    alertmanager:
      sharding_ring:
        kvstore: *kvstore
    ruler:
      ring:
        kvstore: *kvstore
Note: If you’re using etcd instead of Consul, replace primary: consul with primary: etcd.
Note: Your base config may still reference the old backend directly. The structuredConfig deep-merge will override those settings at Helm render time. Leave your base config as-is for now.
Deploying Phase 1
Apply your Helm values:
$ helm upgrade mimir mimir-distributed -f mimir/values.yaml -n <namespace>
Watch for the rolling restart to complete:
$ kubectl rollout status deployment/mimir-ingester -n <namespace>
# Wait for all components
Verifying Phase 1
Check that multi KV is initialized:
$ kubectl logs -l app=mimir-ingester -n <namespace> | grep -i "Starting KV client.*multi"
Expected: One entry per pod showing store=multi.
Check that memberlist cluster formed:
$ kubectl logs -l app=mimir-ingester -n <namespace> | grep -i "joined memberlist cluster"
Expected: Entries from all pods joining the gossip ring.
Check that cluster label is set:
$ kubectl exec -it <ingester-pod> -n <namespace> -- \
curl -s localhost:9009/config | jq '.memberlist.cluster_label'
Expected: "mimir-prod-us-east-1" (or whatever label you set).
Monitor mirror health using PromQL. Watch these metrics for the first 10-15 minutes:
# Should increase steadily (mirror writes happening)
rate(cortex_multikv_mirror_writes_total[5m])
# Should be zero or very low (errors are normal during startup, will resolve)
cortex_multikv_mirror_write_errors_total
Secondary write errors are expected in the first 10-15 minutes — they resolve automatically as the gossip cluster converges. The primary (Consul/etcd) is still serving all reads and writes correctly.
Verify no scrape errors: → Run V2: Gossip Scrape Errors
Wait 15 minutes after all pods are running before proceeding to Phase 2. This ensures memberlist rings are fully synchronized with the primary backend.
7. Phase 2 — Flip Primary via Runtime Config
What changes
Update runtimeConfig in your Helm values to flip the multi KV primary from Consul to Memberlist. This is a hot-reload — no pod restart required.
Critical: Only change runtimeConfig during Phase 2. Do NOT touch structuredConfig.
Why: runtimeConfig is re-read every ~10 seconds and overrides multi: settings for all rings. structuredConfig requires a Helm deploy and pod rollout, which risks disruption. The runtime config is the correct, zero-restart lever.
Configuration changes
runtimeConfig:
  multi_kv_config:        # ← ADD this block
    primary: memberlist   # flip reads+writes to memberlist
    mirror_enabled: false # stop writing to Consul secondary
  # ... existing runtimeConfig content ...
Note: The cluster label set in Phase 1 stays in place. Verification remains disabled (in structuredConfig) until Phase 3, when all pods have rolled out with the label.
Deploying Phase 2
Apply your Helm values (this updates the ConfigMap only):
$ helm upgrade mimir mimir-distributed -f mimir/values.yaml -n <namespace>
Mimir pods will pick up the change in ~10 seconds without restarting.
Verifying Phase 2
Check that all components switched primary:
# Query a pod to verify active config
$ kubectl exec -it <ingester-pod> -n <namespace> -- \
curl -s localhost:9009/config | jq '.ingester_ring.multi_kv_config.primary'
# Should return: "memberlist"
Check ring health: → Run V3: Ring Member Count
Verify ring convergence: → Run V1: Ring Convergence
Verify no new Consul errors (Consul is now secondary and unused; stale heartbeat timeouts are normal):
$ kubectl logs -l app=mimir-ingester -n <namespace> | \
grep -i "error.*consul" | wc -l
Expected: Zero to very low.
Wait 15 minutes for stability before proceeding to Phase 3.
8. Phase 3 — Full Cutover & KV Backend Decommission
What changes
- Remove multi KV configuration — switch all rings to use memberlist directly
- Remove memberlist.cluster_label_verification_disabled (enforcement now active)
- Keep the memberlist.cluster_label set in Phase 1
- Remove multi_kv_config from runtimeConfig
- Decommission your Consul/etcd backend (delete pods, remove from infrastructure config)
The cluster label verification is now enforced: memberlist will only accept gossip traffic from nodes with matching labels. This prevents ring merging with other gossip clusters (e.g., Loki, Prometheus) that may be running on the same Kubernetes cluster.
Phase 3 is irreversible. Once your KV backend pods are deleted, recovering requires restoring infrastructure and re-deploying from backups. Plan accordingly.
Configuration changes
In your structuredConfig, replace all 6 multi KV blocks with simple memberlist, and remove the cluster label disable flag:
# BEFORE (Phase 1/2)
mimir:
  structuredConfig:
    memberlist:
      cluster_label_verification_disabled: true
      cluster_label: "mimir-prod-us-east-1"
    ingester:
      ring:
        kvstore: &kvstore
          store: multi
          multi:
            primary: memberlist
            secondary: consul
            mirror_enabled: false

# AFTER (Phase 3)
mimir:
  structuredConfig:
    memberlist:
      cluster_label: "mimir-prod-us-east-1" # Keep label; enforcement is now re-enabled
    ingester:
      ring:
        kvstore: &kvstore
          store: memberlist # Simple, clean
In your runtimeConfig, remove the multi KV config block entirely:
# BEFORE
runtimeConfig:
  multi_kv_config:
    primary: memberlist
    mirror_enabled: false
  # ... other runtimeConfig ...

# AFTER
runtimeConfig:
  # multi_kv_config section removed entirely
  # ... other runtimeConfig ...
In your infrastructure config (Helm values, Terraform, or whatever deploys your KV backend), remove or disable the Consul/etcd deployment.
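What this looks like depends entirely on how the backend is deployed; for a Helm-managed Consul it might be as simple as the following sketch (the key name is hypothetical and chart-specific, so check your chart's actual values schema):

```yaml
# Example only; the exact key depends on your chart or Terraform layout
consul:
  enabled: false
```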
Deploying Phase 3
Apply your Helm values:
$ helm upgrade mimir mimir-distributed -f mimir/values.yaml -n <namespace>
Mimir pods will do a rolling restart (a structuredConfig change requires pod restart). The ring will remain stable throughout because memberlist is peer-to-peer and doesn’t depend on a central service. Once all pods have restarted, delete your Consul/etcd KV backend pods.
Verifying Phase 3
Run these checks at 5-minute intervals, 3 times minimum (15 min total):
- V4: KV Backend Errors — should be zero
- V5: Ring/Memberlist Errors — should be zero
- V2: Gossip Scrape Errors — should be zero
Confirm KV store is memberlist in logs:
$ kubectl logs -l app=mimir-ingester -n <namespace> -c mimir | grep "Starting KV client" | head -3
Expected: All entries show store=memberlist.
Check ring convergence: → Run V1: Ring Convergence
Verify cluster label enforcement is enabled:
kubectl exec -it <ingester-pod> -n <namespace> -- \
curl -s localhost:9009/config | jq '.memberlist.cluster_label_verification_disabled'
Expected: false (or the key is absent — meaning enforcement is active). Verification is now enforced; memberlist will reject gossip packets from nodes with mismatched labels.
9. Verification Reference
The following verification checks are referenced throughout the migration phases. Run them as needed to confirm health at each stage.
V1: Ring Convergence
Purpose: Verify that ring state is propagating correctly across all pods via gossip.
time() - cortex_ring_oldest_member_timestamp
Expected: < 30 seconds (ideally < 15 seconds). If this value is consistently > 30s, gossip propagation is laggy; see Issue 6 for tuning.
V2: Gossip Scrape Errors
Purpose: Verify Prometheus isn’t accidentally scraping the memberlist gossip port (symptom of misconfigured pod annotations).
kubectl logs -l app=mimir-ingester -n <namespace> | \
grep -E "unknown message type|TCPTransport"
Expected: Zero results. If non-zero, see Issue 1 — your pod annotations likely use named ports instead of numeric port strings.
V3: Ring Member Count
Purpose: Verify all expected ring members are present and healthy.
kubectl exec -it <mimir-distributor-pod> -n <namespace> -- \
curl -s localhost:9009/ingester/ring | jq '.members | length'
Expected: Matches your expected replica count (typically 3 for ingesters, varies by setup).
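If jq isn’t available in the pod image, a grep-based count over the saved ring JSON works as a fallback. A minimal sketch against an abbreviated sample payload (real output carries more fields per member):

```shell
# Abbreviated sample of /ingester/ring output
cat > /tmp/ring.json <<'EOF'
{"members":[{"addr":"10.0.0.1","state":"ALIVE"},{"addr":"10.0.0.2","state":"ALIVE"},{"addr":"10.0.0.3","state":"LEAVING"}]}
EOF

# Count members and how many report ALIVE
total=$(grep -o '"state"' /tmp/ring.json | wc -l)
alive=$(grep -o '"state":"ALIVE"' /tmp/ring.json | wc -l)
echo "alive=$alive total=$total"
```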
V4: KV Backend Errors
Purpose: Verify no errors are occurring with your KV backend (Consul or etcd).
kubectl logs -l app=mimir-ingester -n <namespace> | \
grep -iE "error.*consul|error.*etcd"
Expected: Zero results (in Phase 2 and later, some stale heartbeat warnings are normal and not blocking).
V5: Ring/Memberlist Errors
Purpose: Verify no internal ring or memberlist errors.
kubectl logs -l app=mimir-ingester -n <namespace> | \
grep -iE "error.*(ring|kvstore|memberlist)"
Expected: Zero results.
10. Known Issues & Mitigations
Issue 1: Prometheus Scrape Port Mismatch
Symptom: Memberlist TCPTransport "unknown message type G" errors in logs, originating from Prometheus IP addresses.
Root Cause: Pod annotation prometheus.io/port: "http-metrics" is a string name, not a port number. Prometheus’s scrape config regex captures the numeric port (\d+); when it can’t resolve a named port to a number, it falls back to the first open port it finds — which is the memberlist gossip port 7946. Mimir sees the Prometheus HTTP scrape as a gossip packet and logs the error.
Fix: Change all component podAnnotations to use a numeric prometheus.io/port value such as "8080".
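As in the Prerequisites section, the annotation must carry a numeric string (8080 here assumes the default Mimir HTTP port; adjust to your deployment):

```yaml
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"   # numeric string, never a named port
```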
When to fix: Before Phase 1 deploy. This is a prerequisite, not a post-Phase 1 fix.
Verification: No "unknown message type" in logs within 5 min of deploy.
Issue 2: Secondary KV Write Timeouts During Phase 1 Startup
Symptom: "error writing to secondary KV store" or write timeout errors in logs during the first 10-15 minutes of Phase 1.
Root Cause: Phase 1 enables mirror_enabled: true, so every ring write goes to both your primary backend and memberlist. However, the memberlist cluster takes 1-2 minutes to fully form after pod restart — during this window, secondary writes fail because the gossip ring hasn’t converged yet.
Mitigation: This is expected and not a blocking issue. Secondary writes resume automatically once the gossip cluster is healthy. The primary is still serving all reads and writes correctly.
Action: Do not roll back. Monitor for 15 minutes. Errors should drop to zero.
Issue 3: structuredConfig vs runtimeConfig Precedence
Symptom: Confusion about which config layer to change when flipping from Consul primary to Memberlist primary.
Rule:
- Phase 1: Change structuredConfig (Helm values; requires pod restart). Configures the multi KV backend.
- Phase 2: Change only runtimeConfig (ConfigMap hot-reload; no pod restart). Flips the primary store.
- Phase 3: Change structuredConfig again (Helm deploy; rolling restart). Removes multi KV entirely.
Precedence chain (highest wins):
runtimeConfig (hot-reload, every ~10s)
> structuredConfig (Helm deep-merge)
> base config
Changing structuredConfig for Phase 2 triggers an unnecessary rolling restart. runtimeConfig.multi_kv_config is the correct zero-restart lever.
Issue 4: No Private IP Found
Symptom: Memberlist logs show “No Private IP Found” errors.
Cause: Kubernetes VPC CNI has ENABLE_PREFIX_DELEGATION enabled; memberlist can’t determine which interface to bind to.
Fix: Set memberlist.bind_addr to the pod IP using the Kubernetes Downward API.
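A sketch of the fix, assuming your chart lets you inject env vars into the pod spec and that config env expansion (-config.expand-env=true) is enabled; the field names follow the standard Kubernetes Downward API:

```yaml
# Pod spec: expose the pod IP as an environment variable
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP

# Mimir config: bind memberlist explicitly to that IP
mimir:
  structuredConfig:
    memberlist:
      bind_addr: ["${POD_IP}"]
```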
Issue 5: Too Many Unhealthy Instances
Symptom: Ring shows many instances marked UNHEALTHY or LEAVING; queries fail intermittently.
Cause: Cluster merged with another system via IP reuse, or ingester pods were force-deleted without deregistration.
Fix: Use the ring admin API to manually “forget” bad instances, or restart all ingester pods simultaneously to reset the in-memory ring.
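For example, the ring status page’s Forget action can be driven from the CLI. A sketch (the form-field shape mirrors the page’s Forget button and the instance ID is hypothetical; confirm against your Mimir version):

```
kubectl exec -it <distributor-pod> -n <namespace> -- \
  curl -s -X POST -d 'forget=ingester-zone-a-0' localhost:9009/ingester/ring
```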
Issue 6: Slow Ring Updates
Symptom: Ring changes take 30+ seconds to propagate across the cluster.
Cause: Gossip interval is too large, or gossip-nodes count is too low.
Fix: Tune gossip parameters (rarely needed):
mimir:
  structuredConfig:
    memberlist:
      gossip_interval: 100ms # Gossip more frequently (default: 200ms)
      gossip_nodes: 4 # Gossip to more peers per interval (default: 3)
      retransmit_factor: 5 # Retry messages more (default: 4)
      pullpush_interval: 20s # Full state sync more often (default: 30s)
Troubleshooting Quick Reference
| Symptom | Likely Cause | Resolution |
|---|---|---|
| ”unknown message type G” in logs | Prometheus scraping gossip port 7946 | Fix pod annotation to use numeric port "8080" instead of "http-metrics" |
| Secondary KV write timeout errors | Memberlist gossip cluster not converged yet | Expected during Phase 1 startup; wait 15 min, errors will resolve |
| ”Too Many Unhealthy Instances” | Ring merged with another system via IP reuse | Restart all pods or use /ring/forget/<instance-id> API to deregister bad entries |
| ”No Private IP Found” | Kubernetes CNI can’t resolve pod IP | Set memberlist.bind_addr to pod IP via Downward API |
| Ring updates take 30+ seconds | Gossip interval too large | Increase gossip_nodes from 3 to 4-5; decrease gossip_interval if needed (rarely required) |
| High CPU on memberlist reconciliation | Ring state comparison overhead with many instances | Increase pullpush_interval above the 30s default (e.g., 60s) to run full state syncs less often |
11. Rollback Procedures
Rollback Phase 1 (Low Risk)
Remove the structuredConfig block added in Phase 1 and re-deploy to return to your original KV backend only.
Rollback Phase 2 (Medium Risk — Hot Reload)
Phase 2 is a runtimeConfig-only change. Remove multi_kv_config from runtimeConfig:
runtimeConfig:
  # multi_kv_config: ← delete this block
  # ... rest of runtimeConfig ...
Deploy. Within 10 seconds, all rings will re-read the config and switch back to primary: <your-original-backend>. No pod restart needed.
Rollback Phase 3 (High Risk — KV Backend Deleted)
Phase 3 deletes your KV backend. If issues arise after Phase 3 deploy:
1. Restore your KV backend infrastructure.
2. Restore Mimir to Phase 1 multi KV state (original backend primary, memberlist secondary):
   mimir:
     structuredConfig:
       memberlist:
         cluster_label_verification_disabled: true
       ingester:
         ring:
           kvstore:
             store: multi
             multi:
               primary: <original-backend> # consul or etcd
               secondary: memberlist
               mirror_enabled: true
       # ... repeat for all 6 components ...
3. Deploy Mimir; the rolling restart will reconnect rings to the restored KV backend.
4. Remove multi_kv_config from runtimeConfig (if still present).
5. Once stable, re-plan the migration with the root cause fixed.
12. Post-Migration Verification
Once all components have been migrated to memberlist, verify the entire deployment.
Run the following reference checks:
- V1: Ring Convergence — should be < 30 seconds
- V3: Ring Member Count — should match expected replicas
- V5: Ring/Memberlist Errors — should be zero
Additionally verify:
- All Mimir components show store: memberlist in config checks:
  kubectl exec -it <any-mimir-pod> -n <namespace> -- \
    curl -s localhost:9009/config | jq '.ingester_ring.kvstore.store'
  # Should return: "memberlist"
- No cortex_multikv_* metrics being recorded (mirroring is fully disabled):
  cortex_multikv_mirror_writes_total
  # Should have no values
- Ring convergence is healthy (V1: Ring Convergence should be < 30 seconds)
- All pods in all rings are marked as ALIVE (not LEAVING, UNHEALTHY, or JOINING):
  kubectl exec -it <distributor-pod> -n <namespace> -- \
    curl -s localhost:9009/ingester/ring | jq '.members[] | select(.state != "ALIVE") | .addr'
  # Should return empty (all pods ALIVE)
- No errors in logs (V5: Ring/Memberlist Errors should be zero)
- Data ingestion and query latencies are stable (no regression)
  - Check your monitoring dashboard for any latency increases
  - Verify query success rates haven’t dropped
- Your original KV backend cluster can be decommissioned
  - Verify no remaining Consul/etcd instances in your infrastructure config
  - Update your monitoring/alerting to remove KV backend checks (no longer needed)
Quick Reference Checklist
Pre-Migration
- Mimir 2.2+
- Port 7946 (TCP+UDP) open between all pods
- memberlist.join_members configured
- Current KV backend healthy
- Pod metrics annotations use numeric port (not named port)
- Prometheus access for metric verification
Phase 1
- structuredConfig added with all 6 rings as store: multi, primary: <backend>
- memberlist.cluster_label_verification_disabled: true set
- memberlist.cluster_label set to a unique identifier (e.g., mimir-prod-us-east-1)
- Deploy and wait 15 min
- No "unknown message type" errors in logs
- cortex_multikv_mirror_write_errors_total is zero (after 15 min)
- All pods joined memberlist gossip cluster
Phase 2
- multi_kv_config with primary: memberlist, mirror_enabled: false added to runtimeConfig only
- structuredConfig left unchanged
- Deploy; no pod restart triggered
- All 6 components switched to memberlist within 30s
- Wait 15 min — no errors, ring stable
- Ring convergence healthy
Phase 3
- Remove cluster_label_verification_disabled: true from structuredConfig (enforcement re-enabled)
- Cluster label kept in place (verification now enforced)
- All 6 rings changed to store: memberlist (remove multi KV)
- multi_kv_config removed from runtimeConfig
- KV backend deployment disabled/removed from infra config
- Deploy — rolling restart
- Verify cluster_label_verification_disabled: false (enforcement active)
- Zero KV backend errors in logs
- KV backend pods deleted
- Logs confirm store=memberlist
Post-Migration
- All rings showing ALIVE instances
- No memberlist/ring/kvstore errors
- Ring convergence < 30 seconds
- Ingestion and query latencies normal
- KV backend cluster decommissioned