
FS#26475 — Announcing PCI WAW1

Attached to Project — Cloud
Improvement
cloud
CLOSED
100%
Today marks the launch of our new Public Cloud region: WAW1!
The WAW1 region is located in Warsaw, Poland, in central Europe.

Dzień dobry!
Warsaw is our 3rd region in Europe, alongside GRA (Gravelines,
France) and SBG (Strasbourg, France). With our upcoming
regions in the UK and Germany, you will be able to run infrastructures
across 5 different Availability Zones in Europe!
Of course, you can enable the vRack private network and
link your infrastructure over a Layer 2 multi-region network.


The region is currently being enabled for all customers,
on existing projects.
WAW1 will support the following services:
- compute instances
- compute images
- compute volumes
- network
- object storage
- archive storage

WAW1 runs on OpenStack Newton.

We are rolling out the services into production progressively.
Compute images, volumes and networking are now live, on C2 instances.

C2 instances run on top-of-the-range Intel Xeon E5 CPUs. They
are perfect for high-performance computing requirements like
video encoding, batch processing, web servers, etc.

In the upcoming days, we will release the R2 (RAM-optimised),
S2 (sandbox) and B2 (general purpose) flavors.

More information here:
https://www.ovh.pl/public-cloud/instances/?waw


From a more technical point of view, WAW1 was a key milestone for us,
as it is our first region to run on the latest OpenStack version: Newton.
Newton means fewer bugs, enhanced performance and easier updates.

The new region should appear soon in your control panel:
https://www.ovh.com/manager
and via OpenStack Horizon & API.
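Once the region is visible through the API, clients discover its endpoints via the standard Keystone v3 service catalog. The sketch below filters a catalog for WAW1 endpoints; the catalog structure is the standard Keystone format, but the entries and URLs shown are illustrative placeholders, not OVH's actual endpoints:

```python
# Illustrative Keystone v3 service catalog fragment.
# The entries and URLs are examples, not OVH's real endpoints.
catalog = [
    {"type": "compute", "endpoints": [
        {"region": "GRA1", "interface": "public", "url": "https://compute.gra1.example/v2.1"},
        {"region": "WAW1", "interface": "public", "url": "https://compute.waw1.example/v2.1"},
    ]},
    {"type": "object-store", "endpoints": [
        {"region": "WAW1", "interface": "public", "url": "https://storage.waw1.example/v1"},
    ]},
]

def regions(catalog):
    """Collect every region advertised in the catalog."""
    return sorted({ep["region"] for svc in catalog for ep in svc["endpoints"]})

def endpoints_in(catalog, region):
    """Return (service type, url) pairs available in a given region."""
    return [(svc["type"], ep["url"])
            for svc in catalog
            for ep in svc["endpoints"]
            if ep["region"] == region]

print(regions(catalog))  # ['GRA1', 'WAW1']
```

A new region appearing in the catalog is exactly what makes it selectable from Horizon or the CLI via `--os-region-name`.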

Do not hesitate to send us your feedback on our public mailing list:
cloud@ml.ovh.net.

Looking forward to announcing the next steps,

The Public Cloud Team

Date:  Thursday, 10 August 2017, 14:46
Reason for closing:  Done
Comment by OVH - Monday, 07 August 2017, 10:25AM

We are now enabling the new WAW1 region on newly created projects.

We will then enable it on all existing projects.


Comment by OVH - Monday, 07 August 2017, 10:42AM

WAW1 is now activated on all new projects.

We are now enabling the region for existing customers.


Everything is going smoothly so far.


Comment by OVH - Monday, 07 August 2017, 10:49AM

Region enabling is in progress.
Within a few hours, all customers will be able to use our new WAW1 (Warsaw, Poland) region.


Comment by OVH - Monday, 07 August 2017, 11:13AM

Keystone response times are slower than usual. We are investigating.


Comment by OVH - Monday, 07 August 2017, 11:43AM

Keystone response times are returning to normal, but token validation is still too slow.


Comment by OVH - Monday, 07 August 2017, 12:50PM

The Keystone issue is not related to the addition of the new WAW1 region.

We are bringing services back one by one, starting with Swift.


Comment by OVH - Monday, 07 August 2017, 12:56PM

Swift services are now back up.


Comment by OVH - Monday, 07 August 2017, 15:08

All services are now up.
As written here:

http://travaux.ovh.net/?do=details&id=26480


Comment by OVH - Monday, 07 August 2017, 17:07

You can now use the new WAW1 region when creating a new project.

Within a few hours, it will also be available on all active projects.

As a reminder, only C2-* instances are available.


Comment by OVH - Thursday, 10 August 2017, 14:45

Summary:
On August 7, 2017 at 11am (CET) we identified an issue on the authentication cluster (OpenStack Keystone) of our Public Cloud solution. This incident had an impact on all OpenStack APIs. The root cause was an optimisation issue in the code in charge of revoked-token verification. The main corrective action is the decentralisation of Keystone, in order to reduce the failure domain when such an incident occurs.

Timeline of events (CET) and Root Cause:
10:00 AM: We started the deployment of the new WAW1 region of the Public Cloud solution.

10:30 AM: All indicators were green and our tests successful, so we decided to open the new WAW1 region to newly created projects for all Public Cloud users.

10:58 AM: Keystone (the authentication component of OpenStack) showed a higher than usual load, and our monitoring tools reported increased token validation times.
We then deactivated the robot in charge of enabling the region on all projects. This robot was performing token invalidations, which degraded response times of the following services:
- Compute API (instances, images, volumes)
- Storage API (archive & object storage)

There was no Public Cloud instance interruption and no data loss.
We started rolling back the deployment of the new region, but the load on Keystone remained unusually high.

12:45 PM: Realising that the issue was not a direct consequence of the activation of the new region, we decided to cut all services linked to Keystone in order to narrow the field of investigation and identify the component at fault.

12:50 PM: Keystone's load came back to normal after cutting all requests from other services, so we restarted the services one after another.

12:52 PM: Storage and Archive were fully operational again.
We restarted the SBG3 region for Compute, and the load on Keystone slightly increased.

1:20 PM: SBG3 is now fully operational.

1:30 PM: We restarted SBG1.

We observed a higher than usual rate of invalid requests from Neutron and restarted it.

The load on Keystone increased exponentially and we decided to stop the ramp-up procedure. That is when we realised that Keystone itself was the component at fault.

2:15 PM: After tracing the Keystone processes, we identified the root cause: the token check against the Keystone revocation table. Keystone compares each token with a list of revoked tokens, and this step was taking much longer than usual. This led to a growing queue of pending requests, which increased latency further and made the situation even worse.
The "revocation_event" table was then purged in the Keystone database; after a few seconds, the load on Keystone decreased drastically and token validation times came back to normal.
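The failure mode described above — validation cost growing with the size of the revocation table, until purging the table restores performance — can be illustrated with a small sketch. The linear scan below is a simplifying assumption to show the shape of the problem; Keystone's actual revocation check is more involved:

```python
import time

def validate(token, revocation_events):
    """A token is valid only if no revocation event matches it.
    Modelled as a linear scan: cost grows with the table size."""
    return all(event != token for event in revocation_events)

# A bloated revocation table slows every single validation...
bloated = [f"revoked-{i}" for i in range(200_000)]
start = time.perf_counter()
validate("good-token", bloated)
slow = time.perf_counter() - start

# ...while a purged table makes the same check near-instant,
# which matches the immediate recovery after the purge.
start = time.perf_counter()
validate("good-token", [])
fast = time.perf_counter() - start

print(slow > fast)  # True
```

Because every API call triggers a token validation, this per-request overhead compounds into the queueing and latency spiral described above.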

2:25 PM: We restarted SBG1 and GRA3.

3:00 PM: All OpenStack APIs are fully operational again.

Corrective actions:
We are going to set up a number of actions to prevent such an incident from happening again:
- Decentralisation of Keystone to reduce the impact of such incidents
- Fix of the root cause in Keystone
- Separation of authenticated and non-authenticated traffic in Swift, so that Keystone cannot impact the latter
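The third corrective action can be pictured as a routing decision: requests that need no authentication never touch Keystone, so a Keystone outage cannot take them down. This is a simplified model of the idea, not OVH's actual architecture (real Swift deployments express this through middleware pipelines):

```python
def route(request, keystone_up):
    """Route a Swift request in a simplified model.
    Anonymous reads bypass Keystone entirely, so they keep
    working even while the authenticated path is degraded."""
    if request.get("token") is None and request["method"] == "GET":
        return "served-anonymously"      # no Keystone round-trip
    if not keystone_up:
        return "503-auth-unavailable"    # authenticated path degraded
    return "served-authenticated"

# Even with Keystone down, public reads keep working:
print(route({"method": "GET", "token": None}, keystone_up=False))
# served-anonymously
```

Under this separation, an incident like the one above would have left unauthenticated object downloads unaffected.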

Although this maintenance was planned, we will still apply the SLA procedure in accordance with our contractual obligations. All impacted customers will shortly receive an email in order to apply for the procedure.

We apologise for the inconvenience caused and thank you for your understanding.