OVHcloud Bare Metal Cloud Status

Current status
Legend
  • Operational
  • Degraded performance
  • Partial Outage
  • Major Outage
  • Under maintenance
Storage SBG2
Scheduled Maintenance Report for Bare Metal Cloud
Completed
• Intervention summary: Faulty disk
• Start time: 2017-10-24 11:15:00
• Impact: VPS Cloud 2016 SBG2
• Impact type: Lowered performance


Update(s):

Date: 2017-11-08 10:40:13 UTC
We reduced latency, but it is still higher than before. We are still investigating.

Date: 2017-10-31 22:17:45 UTC
We are still investigating; latency is higher than usual.

Date: 2017-10-30 22:39:47 UTC
We'll compare tonight's metrics tomorrow morning.

Date: 2017-10-30 21:56:52 UTC
Operation done, checking metrics.

Date: 2017-10-30 21:06:38 UTC
Starting operation.

Date: 2017-10-30 15:05:02 UTC
Intervention planned for tonight, 30/10 22:00 CET

Date: 2017-10-27 23:49:07 UTC
We completed one third of the operation, but the service was impacted with lower performance for five minutes.
We need to improve our procedure. ETA: Monday evening.

Date: 2017-10-27 23:33:41 UTC
Operation started.

Date: 2017-10-26 13:19:03 UTC
An operation will be performed during the night of 27/10 to 28/10; no impact is expected.
Cluster performance will be checked manually, and other operations may be performed during the evening if required.

Date: 2017-10-26 08:52:11 UTC
Data was rebalanced during the evening of 24 October.

The cluster still has high latency; we are investigating.

Date: 2017-10-24 14:37:54 UTC

At 11:00 CET, a disk showing signs of weakness was evacuated from the cluster. However, a disk sharing some mutual data with it had been evacuated half an hour earlier. As a result, only one copy of some data remained on the cluster. This operation should have waited for the first evacuation to finish, as the cluster was in a degraded state.

Ceph, the technology we use for our storage, always prioritises data integrity. When data exists in a single copy, Ceph prioritises replication and migration traffic over user traffic. Although storage availability was heavily impacted, at no time was any data in danger.

To speed up this replication and migration, we had to partially block access to the data.

At 14:12 CET, all data existed in at least two copies, so we reopened network access to restore the service. Since some data still had to be copied or migrated, performance remained lowered.

Our safety checks failed to detect this kind of dangerous action.
As soon as the cluster is fully healthy, we will focus on improving our automation so that a similar situation cannot happen again.
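The missing safeguard described above amounts to a precondition: never start a second disk (OSD) evacuation while the cluster is degraded. A minimal sketch of such a check, assuming the health report comes from `ceph status --format json` (simulated here with a plain dict; the function name `may_evacuate` is hypothetical):

```python
import json

def may_evacuate(status_json: str) -> bool:
    """Allow an OSD evacuation only if the cluster reports HEALTH_OK
    and no placement groups have objects with missing replicas."""
    status = json.loads(status_json)
    healthy = status["health"]["status"] == "HEALTH_OK"
    degraded = status["pgmap"].get("degraded_objects", 0) > 0
    return healthy and not degraded

# Simulated cluster state while a first evacuation is still running:
mid_evacuation = json.dumps({
    "health": {"status": "HEALTH_WARN"},
    "pgmap": {"degraded_objects": 12345},
})
print(may_evacuate(mid_evacuation))  # False: the second evacuation must wait
```

With this guard in place, the second evacuation would have been deferred until the first one finished and the cluster returned to a healthy state.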


Date: 2017-10-24 13:38:48 UTC
97% of the data rebalanced, still in progress.

Date: 2017-10-24 12:05:25 UTC
80% of the data rebalanced, still in progress.

Date: 2017-10-24 11:18:32 UTC
Rebalance will take at least two hours.

Date: 2017-10-24 10:48:40 UTC
Data is rebalancing; two disks from two different racks were removed from the cluster.
For safety reasons, the cluster blocks some IO to replicate data.
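The rebalance percentages reported in the updates above can be derived from Ceph's placement-group statistics. A minimal sketch, assuming the field names reported by `ceph status --format json` (`num_objects`, `misplaced_objects`, `degraded_objects`), simulated here rather than read from a live cluster:

```python
import json

def rebalance_progress(status_json: str) -> float:
    """Fraction of objects already back in place: one minus the share of
    objects still misplaced (wrong location) or degraded (missing copies)."""
    pg = json.loads(status_json)["pgmap"]
    pending = pg.get("misplaced_objects", 0) + pg.get("degraded_objects", 0)
    return 1.0 - pending / pg["num_objects"]

# Simulated snapshot part-way through the rebalance:
status = json.dumps({"pgmap": {
    "num_objects": 1_000_000,
    "misplaced_objects": 150_000,
    "degraded_objects": 50_000,
}})
print(f"{rebalance_progress(status):.0%} rebalanced")  # 80% rebalanced
```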

Date: 2017-10-24 09:29:58 UTC
We had to evacuate a faulty disk; most of the cluster's performance is being used for data replication.
User operations are slowed down.
Posted Oct 24, 2017 - 09:28 UTC
This scheduled maintenance affected: Virtual Private Servers || Global Infrastructure (ERI, GRA, SBG, LIM, WAW, BHS, SGP, SYD).