• Incident summary: Faulty NVMe has to be replaced
• Start time: 14:30 CET 16/10/2017
• Impact: Public Cloud
• Impact type: Degraded performance
Update(s):
Date: 2017-10-17 12:57:40 UTC Data rebalanced.
Date: 2017-10-17 07:52:53 UTC 95% of the data is rebalanced.
Date: 2017-10-16 22:32:44 UTC 45% of the data are rebalanced.
Date: 2017-10-16 16:34:12 UTC No issue on the new host.
We'll check rebalance progress during the evening.
Date: 2017-10-16 16:00:08 UTC Disks from the new host are in the cluster; data is rebalancing.
Date: 2017-10-16 15:38:40 UTC Host crashed a few minutes after boot.
Disk removed from the cluster; we are growing the cluster with another host.
Date: 2017-10-16 15:22:02 UTC The host whose NVMe was replaced just crashed. If the issue is not linked to the NVMe, we'll use another host to grow the cluster.
Date: 2017-10-16 15:01:56 UTC Disks added to the cluster; data is rebalancing.
Date: 2017-10-16 14:42:37 UTC Task started; the disks will be in the cluster within a few minutes.
The data will then have to rebalance.
Date: 2017-10-16 14:40:51 UTC Growing the cluster by re-adding the "missing" disks.
Date: 2017-10-16 14:03:03 UTC The NVMe is being replaced; we'll then be able to re-add the disks to the cluster.
Date: 2017-10-16 13:07:22 UTC Data that exists in only 2 replicas is now being replicated a third time.
Date: 2017-10-16 13:04:42 UTC 8 of the 12 disks were still running, but 4 OSDs that were still running seemed to slow down the whole cluster.
We removed all the disks from the cluster.