Networking issues in cloud infrastructure (INC-115)

Date

Monday, February 15 2021 08:40 (GMT+2) - Monday, February 15 2021 19:21 (GMT+2)

Status

Complete

Summary

Customers reported brief losses of connectivity and a partial or momentary inability to access cloud services hosted on interworks.cloud infrastructure.

Impact

Hosted Exchange

Users of the service may have experienced disconnections when connected with a mail client (e.g. Microsoft Outlook) and/or password prompts forcing them to re-enter their credentials in order to use the service.

Infrastructure as a Service

Minor network disconnections between cloud servers residing in different availability zones (clusters) may have been noticed in cases where constant network access is required. This could also have affected IPsec VPN tunnel connectivity to on-premise locations. In some cases, users experienced significant delay or packet loss when connecting to their cloud servers from remote locations (delayed connections).

Platform as a Service

Cloud server communications to interworks.cloud public services (such as Cloud Databases) may have been slightly impacted, causing applications to temporarily lose established connections to the cloud service.

Root Causes

An issue was identified within the interworks.cloud core networking infrastructure: the kernel module of a core routing component was incompatible with a specific network adapter firmware. The issue appeared after a scheduled upgrade of one of the route reflector components, conducted on February 13, 2021. The firmware incompatibility, combined with increased network traffic, led to high CPU usage on the affected route reflector component (FRR routing engine), which in turn caused increased packet loss for traffic destined to that component.

Trigger

The incident was triggered by a scheduled upgrade of one of the route reflector components of the cloud infrastructure, in conjunction with increased incoming network traffic.

Resolution

After receiving increased CPU usage alerts for the affected route reflector component, the interworks.cloud senior engineering team investigated the situation and resolved it by rolling back the upgraded equipment to the previous stable version. Further monitoring, together with continuous feedback from affected users, confirmed the final resolution of the incident.

Detection

The issue was detected by interworks.cloud monitoring systems and was also reported by various customers who experienced loss of connectivity or unexpected behavior of their cloud services.
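
As an illustration of the kind of check that can raise a CPU usage alert such as the one described above, the following is a minimal sketch in Python. It is not the monitoring rule actually used by interworks.cloud; the function name, the 90% threshold and the window size are all hypothetical. It fires only when CPU usage stays above the threshold for several consecutive samples, so that a single short spike does not trigger an alert.

```python
from collections import deque


def sustained_high_cpu(samples, threshold=90.0, window=5):
    """Return the index of the first sample at which the last `window`
    samples all exceed `threshold`, or None if that never happens.

    `threshold` and `window` are illustrative values, not taken from
    the incident report.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` samples
    for index, value in enumerate(samples):
        recent.append(value)
        if len(recent) == window and all(v > threshold for v in recent):
            return index
    return None


if __name__ == "__main__":
    # Hypothetical CPU usage readings (percent), one per polling interval.
    readings = [50, 95, 96, 40, 91, 92, 93]
    print(sustained_high_cpu(readings, threshold=90, window=3))
```

With the hypothetical readings above, the two-sample spike (95, 96) is ignored and the check fires only once three consecutive readings exceed the threshold.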

Action Items

Initial attempt to bypass or eliminate connections/traffic from/to problematic route reflector component
Complete rollback of problematic route reflector firmware to previous stable version

Timeline

Monday, February 15 2021 08:40 (GMT+2)

System alerts were received via the monitoring system regarding increased CPU usage in one of the route reflector components of the core networking infrastructure.

Monday, February 15 2021 08:55 (GMT+2) - 13:00 (GMT+2)

The engineering team was actively engaged in troubleshooting the incident. The root cause was identified and the team started planning immediate actions to remedy the situation.

Monday, February 15 2021 09:00 (GMT+2) - 13:39 (GMT+2)

A number of reports from affected users were received by our Support department.

Monday, February 15 2021 13:00 (GMT+2) - 13:08 (GMT+2)

The engineering team began executing the action plan to resolve the situation.
A first attempt to divert network traffic away from the affected route reflector component failed: during that attempt the OSPF module hung and, as a result, all incoming traffic to that module failed for a total of two (2) minutes. The route reflector was then reset and network traffic was re-prioritized in order to bypass the affected component.

Monday, February 15 2021 13:08 (GMT+2) - 18:30 (GMT+2)

Traffic to/from the cloud infrastructure returned to normal conditions, with only a few affected users still experiencing network delays and/or disconnections.

Monday, February 15 2021 18:30 (GMT+2) - 19:21 (GMT+2)

The engineering team performed a complete rollback of the affected component to the previous stable version.

Monday, February 15 2021 19:21 (GMT+2)

All services and routing equipment were verified as up and running within established parameters. The engineering team maintained close monitoring of the previously affected resources.

Tuesday, February 16 2021 12:53 (GMT+2)

The networking issues were fully resolved by the previous day's actions and all affected systems have been operating normally since then. The incident was marked as resolved.
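
The duration of each phase of the incident can be derived directly from the timestamps in the timeline above. The following Python sketch performs that arithmetic; the phase labels and helper names are introduced here for illustration only, while the times are the ones stated in the timeline (all on Monday, February 15 2021, GMT+2).

```python
from datetime import datetime, timedelta, timezone

# Fixed GMT+2 offset, as used throughout the timeline.
TZ = timezone(timedelta(hours=2))


def at(hhmm):
    """Build a timezone-aware datetime on February 15 2021 from 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2021, 2, 15, hour, minute, tzinfo=TZ)


# Phase labels are illustrative; start and end times come from the timeline.
PHASES = [
    ("alert to start of action plan", "08:40", "13:00"),
    ("traffic bypass attempt and reset", "13:00", "13:08"),
    ("operation with traffic bypassed", "13:08", "18:30"),
    ("rollback to previous stable version", "18:30", "19:21"),
]


def phase_durations():
    """Map each phase label to its duration as a timedelta."""
    return {name: at(end) - at(start) for name, start, end in PHASES}


def total_duration():
    """Time from the first alert to verification of all services."""
    return at("19:21") - at("08:40")


if __name__ == "__main__":
    for name, duration in phase_durations().items():
        print(f"{name}: {duration}")
    print(f"total: {total_duration()}")
```

The four phases add up to the full incident window of 10 hours and 41 minutes, from the first alert at 08:40 to the verification at 19:21.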