OVHcloud Network Status

rbx-g1/g2
Incident Report for Network & Infrastructure
Resolved
We have a problem on the ASR9000 routers.

Jul 6 12:58:05 rbx-g1-a9.fr.eu 5919: LC/0/0/CPU0:Jul 6 10:57:46 UTC: fib_mgr[161]: %ROUTING-FIB-4-RSRC_LOW : CEF running low on DATA_TYPE_TABLE_SET resource memory. CEF will now begin resource constrained forwarding. Only route deletes will be handled in this state, which may result in mismatch between RIB/CEF. Traffic loss on certain prefixes can be expected. The CEF will automatically resume normal operation, once the resource utilization returns to normal level
Jul 6 12:57:42 rbx-g2-a9.fr.eu 15654: LC/0/3/CPU0:Jul 6 10:57:23 UTC: fib_mgr[161]: %PLATFORM-PLAT_FIB-6-INFO : PD FIB object LEAF OOR state changed to GREEN
Jul 6 12:57:42 rbx-g2-a9.fr.eu 15655: LC/0/3/CPU0:Jul 6 10:57:23 UTC: fib_mgr[161]: %ROUTING-FIB-6-RSRC_OK : CEF resource state has returned to normal. CEF has exited resource constrained operation and normal forwarding has been restored
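
The log lines above alternate between RSRC_LOW and RSRC_OK per line card. A minimal sketch, assuming syslog lines shaped like the ones above, of how the last known CEF state per card could be extracted. The function name and the returned labels are hypothetical, and the sample messages are shortened copies of the log lines above.

```python
import re

# The message mnemonics (RSRC_LOW, RSRC_OK) come from the log lines above.
# The helper name and the 'constrained'/'normal' labels are assumptions
# made for illustration only.
EVENT_RE = re.compile(
    r"(?P<host>\S+\.fr\.eu) \d+: (?P<card>LC/\d+/\d+/CPU\d+):.*?"
    r"%ROUTING-FIB-\d-(?P<event>RSRC_LOW|RSRC_OK)"
)

def cef_state_by_card(lines):
    """Return {(host, card): state} based on the last CEF event seen per card."""
    state = {}
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue  # not a CEF resource event (e.g. PLAT_FIB info lines)
        key = (m.group("host"), m.group("card"))
        state[key] = "constrained" if m.group("event") == "RSRC_LOW" else "normal"
    return state

# Shortened copies of the three log lines above.
sample = [
    "Jul 6 12:58:05 rbx-g1-a9.fr.eu 5919: LC/0/0/CPU0:Jul 6 10:57:46 UTC: "
    "fib_mgr[161]: %ROUTING-FIB-4-RSRC_LOW : CEF running low on "
    "DATA_TYPE_TABLE_SET resource memory.",
    "Jul 6 12:57:42 rbx-g2-a9.fr.eu 15654: LC/0/3/CPU0:Jul 6 10:57:23 UTC: "
    "fib_mgr[161]: %PLATFORM-PLAT_FIB-6-INFO : PD FIB object LEAF OOR state "
    "changed to GREEN",
    "Jul 6 12:57:42 rbx-g2-a9.fr.eu 15655: LC/0/3/CPU0:Jul 6 10:57:23 UTC: "
    "fib_mgr[161]: %ROUTING-FIB-6-RSRC_OK : CEF resource state has returned "
    "to normal.",
]

for key, value in cef_state_by_card(sample).items():
    print(key, value)
```

Because only the last event per card is kept, a card that flaps between the two states shows up with whichever state was logged most recently in the input.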


Update(s):

Date: 2011-07-07 10:54:06 UTC
We received the spare card from Cisco at 4:00 in the morning.
http://yfrog.com/z/kejb0uj

The old card is still in the router.
First, we are going to disconnect all the optical fibers.
http://yfrog.com/z/kg4rknnj

Done. The card is ready to be pulled out.
http://yfrog.com/z/kl2d5jj

Ready to go? Go... The card is out.
http://yfrog.com/z/kl1aslhj

We check the logs and confirm that everything is fine.
http://yfrog.com/z/kj1kfij

We set the old card aside and unpack the new one.
http://yfrog.com/z/kh47sqj

The card is ready to be inserted.
http://yfrog.com/z/kiz82vtj

The card is inserted and is booting.
http://yfrog.com/z/kjh2dvj

We check the logs: the boot is going well.
http://yfrog.com/z/kl42ttj

We reconnect the optical fibers.
http://yfrog.com/z/h7iialhxj

We check the logs: everything is fine.
http://yfrog.com/z/khd74nj

We check the weathermap and the traffic flow
to Paris and Frankfurt: everything
is fine.
http://weathermap.ovh.net/backbone

The old card has been repacked and will be sent back
to Cisco.

Thanks to the Cisco team for their follow-up during the
night and for fixing the internal bug at 1 a.m.



Date: 2011-07-06 21:06:55 UTC
http://travaux.ovh.net/?do=details&id=5162

[...]
We are going to replace card #6 of g1 with card #4 of g2, on which we have ports that are unused or carry little traffic.
[...]

That is why it does not match in Cisco's databases.

Date: 2011-07-06 21:01:30 UTC
Apparently the card is not in the databases. This may be
because we have already had 2 broken cards and the records
were not updated after the previous RMAs.

We are looking at how to work around the problem.

Date: 2011-07-06 20:43:08 UTC
Well, that tops it all: Cisco's databases are not
up to date with the contract we signed, and we will not
get the card within 2 hours.

Date: 2011-07-06 19:48:01 UTC
Traffic has been restored, everything is going well.

The initial problem is fixed.

The card still needs to be replaced. The RMA is in progress.

Date: 2011-07-06 19:38:13 UTC
Cisco is asking us to restart the card to see
whether it is definitively dead or not.

RP/0/RSP0/CPU0:rbx-g1-a9(admin)#reload location 0/4/CPU0
Wed Jul 6 19:37:06.607 UTC

Preparing system for backup. This may take a few minutes especially for large configurations.
[Done]
Proceed with reload? [confirm]


Date: 2011-07-06 19:36:37 UTC
We are starting the card replacement with Cisco
via the T+2H hardware support contract: in
case of a hardware problem on one of the router's
components, Cisco delivers a replacement for the
failed part in under 2 hours.

We have checked the failed ports and we should
not see any impact on traffic without this
card. All ports are doubled and we should not
saturate anything.

We have put the router back into routing.

We are checking link saturation.

Date: 2011-07-06 19:28:45 UTC
Card 0/4 is dead.

Date: 2011-07-06 19:21:41 UTC
g1 is up. We are checking it.

Date: 2011-07-06 19:13:58 UTC
RP/0/RSP0/CPU0:rbx-g1-a9(admin)#reload location all
Wed Jul 6 19:13:11.504 UTC

Preparing system for backup. This may take a few minutes especially for large configurations.
Status report: node0_RSP0_CPU0: START TO BACKUP
Status report: node0_RSP0_CPU0: BACKUP HAS COMPLETED SUCCESSFULLY
[Done]
Proceed with reload? [confirm]RP/0/RSP0/CPU0::This node received reload command. Reloading in 5 secs

The restart is in progress.

Date: 2011-07-06 19:12:59 UTC
g1 has been taken out of the loop. Everything is routed through g2.

We are ready for the restart.

Date: 2011-07-06 19:11:22 UTC
We are going to take g1 out of routing.

Date: 2011-07-06 19:09:52 UTC
g2 is OK. We are putting it back into routing. It is in the loop.

Date: 2011-07-06 19:07:44 UTC
g2 is UP.

We are checking it.

Date: 2011-07-06 18:59:08 UTC
RP/0/RSP1/CPU0:rbx-g2-a9(admin)#reload location all
Wed Jul 6 18:58:42.597 UTC

Preparing system for backup. This may take a few minutes especially for large configurations.
Status report: node0_RSP1_CPU0: START TO BACKUP
Status report: node0_RSP1_CPU0: BACKUP HAS COMPLETED SUCCESSFULLY
[Done]
Proceed with reload? [confirm]RP/0/RSP1/CPU0::This node received reload command. Reloading in 5 secs


Date: 2011-07-06 18:58:23 UTC
All routing now goes through g1.

We are ready for the reboot of g2.

Date: 2011-07-06 18:54:14 UTC
RP/0/RSP1/CPU0:rbx-g2-a9(admin-config)#hw-module profile scale l3xl
Wed Jul 6 18:50:16.520 UTC
In order to activate this new memory resource profile, you must manually reboot the system.

We will have to restart the router.

Date: 2011-07-06 18:29:19 UTC
This is causing loops in the network for the new IPs.
We are waiting for Cisco.

6 th2-1-6k.fr.eu (213.186.32.181) 55.409 ms * 50.620 ms
7 th1-1-6k.fr.eu (213.186.32.165) 58.132 ms * 50.333 ms
8 rbx-g2-a9.fr.eu (91.121.131.141) 55.075 ms 53.812 ms 54.613 ms
9 gsw-2-6k.fr.eu (91.121.131.214) 77.756 ms * *
10 rbx-g1-a9.fr.eu (91.121.131.33) 57.627 ms 57.028 ms 57.390 ms
11 gsw-2-6k.fr.eu (91.121.131.38) 263.777 ms
gsw-2-6k.fr.eu (91.121.131.34) 205.179 ms
gsw-2-6k.fr.eu (213.251.128.106) 209.499 ms
12 rbx-g1-a9.fr.eu (91.121.131.33) 62.124 ms 59.690 ms 62.422 ms
13 gsw-2-6k.fr.eu (91.121.131.38) 62.392 ms *
gsw-2-6k.fr.eu (213.251.128.106) 61.387 ms
14 rbx-g1-a9.fr.eu (91.121.131.33) 65.804 ms 65.402 ms 65.773 ms
15 gsw-2-6k.fr.eu (91.121.131.38) 65.205 ms *
gsw-2-6k.fr.eu (213.251.128.106) 64.206 ms
16 rbx-g1-a9.fr.eu (91.121.131.33) 69.591 ms 67.366 ms 68.669 ms
17 * * gsw-2-6k.fr.eu (213.251.128.106) 220.553 ms
18 rbx-g1-a9.fr.eu (91.121.131.33) 71.096 ms 73.312 ms 71.266 ms
19 gsw-2-6k.fr.eu (91.121.131.38) 70.817 ms
gsw-2-6k.fr.eu (91.121.131.34) 70.360 ms
gsw-2-6k.fr.eu (213.251.128.106) 71.530 ms
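
The trace above bounces between rbx-g1-a9 and gsw-2-6k from hop 9 onwards. A minimal sketch, assuming traceroute output in the format above, of how such a loop could be flagged automatically. The helper name and the `min_visits` threshold are hypothetical, and the sample is a shortened excerpt of the trace above.

```python
import re
from collections import Counter

# A numbered hop line: TTL, optional "*" timeouts, then the hostname and "(ip)".
# Continuation lines (extra replies for the same TTL) carry no TTL and are skipped.
HOP_RE = re.compile(r"^\s*(\d+)\s+(?:\*\s+)*(\S+)\s+\(")

def looping_hosts(trace_lines, min_visits=3):
    """Return {hostname: visits} for hosts seen at min_visits or more TTLs.

    min_visits is an arbitrary illustrative threshold, not a standard value.
    """
    visits = Counter()
    for line in trace_lines:
        m = HOP_RE.match(line)
        if m:
            visits[m.group(2)] += 1
    return {host: n for host, n in visits.items() if n >= min_visits}

# Shortened excerpt of the trace above.
trace = """\
8 rbx-g2-a9.fr.eu (91.121.131.141) 55.075 ms 53.812 ms 54.613 ms
9 gsw-2-6k.fr.eu (91.121.131.214) 77.756 ms * *
10 rbx-g1-a9.fr.eu (91.121.131.33) 57.627 ms 57.028 ms 57.390 ms
11 gsw-2-6k.fr.eu (91.121.131.38) 263.777 ms
gsw-2-6k.fr.eu (91.121.131.34) 205.179 ms
12 rbx-g1-a9.fr.eu (91.121.131.33) 62.124 ms 59.690 ms 62.422 ms
13 gsw-2-6k.fr.eu (91.121.131.38) 62.392 ms *
14 rbx-g1-a9.fr.eu (91.121.131.33) 65.804 ms 65.402 ms 65.773 ms
""".splitlines()

print(looping_hosts(trace))
```

A host that legitimately appears at two consecutive TTLs would not be flagged with the default threshold; only repeated returns to the same device are reported.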


Date: 2011-07-06 17:07:42 UTC
New IPs are not being registered.

We are discussing and exchanging with Cisco's TAC to fix this
problem.

Date: 2011-07-06 12:35:33 UTC
RP/0/RSP1/CPU0:rbx-g2-a9#sh cef resource detail location 0/0/cpu0
Wed Jul 6 12:35:19.098 UTC
CEF resource availability summary state: YELLOW
CEF will drop route updates
No. of times HW caused oor: 26
CEF entered oor at : Jul 6 12:30:33.573
CEF came out of oor at : Jul 6 12:29:48.370
ipv4 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
ipv6 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
mpls shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
common shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
DATA_TYPE_TABLE_SET hardware resource: YELLOW
DATA_TYPE_TABLE hardware resource: YELLOW
DATA_TYPE_IDB hardware resource: YELLOW
DATA_TYPE_IDB_EXT hardware resource: YELLOW
DATA_TYPE_LEAF hardware resource: YELLOW
DATA_TYPE_LOADINFO hardware resource: YELLOW
DATA_TYPE_PATH_LIST hardware resource: YELLOW
DATA_TYPE_NHINFO hardware resource: YELLOW
DATA_TYPE_LABEL_INFO hardware resource: YELLOW
DATA_TYPE_FRR_NHINFO hardware resource: YELLOW
DATA_TYPE_ECD hardware resource: YELLOW
DATA_TYPE_RECURSIVE_NH hardware resource: YELLOW
DATA_TYPE_TUNNEL_ENDPOINT hardware resource: YELLOW
DATA_TYPE_LOCAL_TUNNEL_INTF hardware resource: YELLOW
DATA_TYPE_ECD_TRACKER hardware resource: YELLOW
DATA_TYPE_ECD_V2 hardware resource: YELLOW
DATA_TYPE_ATTRIBUTE hardware resource: YELLOW
DATA_TYPE_LSPA hardware resource: YELLOW
DATA_TYPE_LDI_LW hardware resource: YELLOW
DATA_TYPE_LDSH_ARRAY hardware resource: YELLOW
DATA_TYPE_TE_TUN_INFO hardware resource: YELLOW
DATA_TYPE_DUMMY hardware resource: YELLOW
DATA_TYPE_IDB_VRF_LCL_CEF hardware resource: YELLOW
DATA_TYPE_TABLE_UNRESOLVED hardware resource: YELLOW
DATA_TYPE_MOL hardware resource: YELLOW
DATA_TYPE_MPI hardware resource: YELLOW
DATA_TYPE_SUBS_INFO hardware resource: YELLOW
DATA_TYPE_GRE_TUNNEL_INFO hardware resource: YELLOW
RP/0/RSP1/CPU0:rbx-g2-a9#
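
In the output above, the shared memory pools are GREEN while every hardware data type is YELLOW. A minimal sketch, assuming output formatted as above, that computes the share of memory still available and lists the hardware resources that are not GREEN. The function name is hypothetical, and the sample is a shortened excerpt of the output above.

```python
import re

MEM_RE = re.compile(r"CurrMode (\w+), CurrAvail (\d+) bytes, MaxAvail (\d+) bytes")
HW_RE = re.compile(r"^(DATA_TYPE_\w+) hardware resource: (\w+)$")

def summarize(output):
    """Return (memory pools as (mode, percent available), non-GREEN hardware types)."""
    mem_available = []
    hw_not_green = []
    for raw in output.splitlines():
        line = raw.strip()
        m = MEM_RE.search(line)
        if m:
            avail, maximum = int(m.group(2)), int(m.group(3))
            mem_available.append((m.group(1), round(100 * avail / maximum, 1)))
            continue
        m = HW_RE.match(line)
        if m and m.group(2) != "GREEN":
            hw_not_green.append(m.group(1))
    return mem_available, hw_not_green

# Shortened excerpt of the CLI output above.
sample = """\
CEF resource availability summary state: YELLOW
ipv4 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
DATA_TYPE_TABLE_SET hardware resource: YELLOW
DATA_TYPE_LEAF hardware resource: YELLOW
"""

mem, hw = summarize(sample)
print(mem)
print(hw)
```

With the figures above, roughly 88% of each shared memory pool is still available, so the YELLOW state is reported on the hardware resources rather than on shared memory.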

Date: 2011-07-06 12:34:43 UTC
RP/0/RSP1/CPU0:rbx-g2-a9# show bgp nexthops statistics
Wed Jul 6 12:34:19.284 UTC
Total Nexthop Processing
Time Spent: 871.632 secs

Maximum Nexthop Processing
Received: 6w3d
Bestpaths Deleted: 0
Bestpaths Changed: 144079
Time Spent: 2.918 secs

Last Notification Processing
Received: 1d14h
Time Spent: 0.021 secs

Gateway Address Family: IPv4 Unicast
Table ID: 0xe0000000
Nexthop Count: 147
Critical Trigger Delay: 3000msec
Non-critical Trigger Delay: 10000msec

Nexthop Version: 1, RIB version: 1

Total Critical Notifications Received: 119
Total Non-critical Notifications Received: 11570
Bestpaths Deleted After Last Walk: 0
Bestpaths Changed After Last Walk: 1961
Nexthop register:
Sync calls: 426747, last sync call: 00:15:14
Async calls: 1697, last async call: 14w6d
Nexthop unregister:
Async calls: 426603, last async call: 00:14:38
Nexthop batch finish:
Calls: 947770, last finish call: 00:14:37
Nexthop flush timer:
Times started: 853358, last time flush timer started: 00:14:38
RIB update: 0 rib update runs, last update: 00:00:00
0 prefixes installed, 0 modified, 0 removed

RP/0/RSP1/CPU0:rbx-g2-a9#show controller np struct 6 summary location 0/0/cpu0
Wed Jul 6 12:34:29.161 UTC

Node: 0/0/CPU0:
----------------------------------------------------------------
NP: 0 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 1 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 2 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 3 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 4 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 5 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 6 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 7 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
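
The buddy allocator figures above can be cross-checked against the reported entry count. A minimal sketch of that arithmetic, using the values printed for each NP (they are identical on all eight). The variable names are hypothetical.

```python
# Values copied from the "Buddy allocator information" lines above.
block_sizes = [1, 2, 4, 8, 16, 32]
free_blocks = [288, 57, 8, 1, 1, 1981]
used_blocks = [1673, 0, 3, 0, 0, 0]
table_size = 65536  # from "1685 of 65536 entries in use"

def entries(blocks):
    """Convert per-size block counts into a number of table entries."""
    return sum(size * count for size, count in zip(block_sizes, blocks))

used = entries(used_blocks)
free = entries(free_blocks)

print("used entries:", used)
print("free entries:", free)
print("used + free:", used + free)
print("utilization: %.2f%%" % (100 * used / table_size))
```

The used blocks add up to the 1685 entries reported, about 2.57% of the table. Used plus free gives 65535, one entry fewer than the table size; the output does not say why. On these numbers the R_LDI structure is far from full.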



Date: 2011-07-06 12:33:33 UTC
We added next-hop-self on IPv6.

Same result.

We have just opened a TAC case with Cisco.

Date: 2011-07-06 12:33:06 UTC
The problem seems similar to this one
http://travaux.ovh.net/?do=details&id=4791
but not quite the same.
Posted Jul 06, 2011 - 11:00 UTC