Networking issues in cloud infrastructure
Date
September 2019 - October 2019
Status
Complete, some action items in progress
Summary
Various issues have been reported regarding brief loss of connectivity or partial/momentary inability to access cloud services hosted on the interworks.cloud infrastructure.
Impact
Hosted Exchange
Users of the service may have experienced service disconnections when connected to the service with a mail client (e.g. Microsoft Outlook) and/or password prompts forcing them to re-enter their credentials in order to use the service.
Infrastructure as a Service
Minor network disconnections between cloud servers residing in different availability zones (clusters) may have been noticed in cases where constant network access is required. This could also affect IPsec VPN tunnel connectivity to on-premises locations.
Platform as a Service
Cloud server communications to interworks.cloud public services (such as Cloud Databases) may have been slightly impacted, causing applications to temporarily lose established connections to the cloud service.
Root Causes
An issue was identified within the interworks.cloud core networking infrastructure involving bridge protocol data unit (BPDU) filtering capabilities, which prevented correct detection of network topology changes broadcast by some network switches connected to the infrastructure. The issue was caused by an inherent inability of the network infrastructure to correctly filter frames originating from network devices that combine single-port connectivity to the infrastructure (orphaned ports) with use of the Multiple Spanning Tree Protocol (MSTP). As a result, topology changes triggered flush operations of the CAM (MAC address) tables throughout the entire network infrastructure, spanning all devices, causing in turn momentary loss of connectivity until those tables were repopulated. Although these disconnections were very brief and would most of the time go unnoticed, the gradual addition of more devices to the infrastructure over time significantly increased their frequency, amplifying the impact of the symptoms on end users.
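To illustrate the flush-and-relearn cycle in isolation, the following minimal Python sketch (purely illustrative, not part of the actual infrastructure) models a learning switch whose CAM table is cleared by a topology change, forcing traffic to be flooded until the entries are repopulated.

    class ToySwitch:
        """Toy model of a learning switch and its CAM (MAC address) table."""

        def __init__(self, ports):
            self.ports = ports
            self.cam = {}  # MAC address -> port

        def receive(self, src_mac, dst_mac, in_port):
            # Learn (or refresh) the source MAC, then forward the frame:
            # unicast if the destination is known, flood otherwise.
            self.cam[src_mac] = in_port
            out = self.cam.get(dst_mac)
            return [out] if out is not None else [p for p in self.ports if p != in_port]

        def topology_change(self):
            # A spanning tree topology change forces the CAM table to be
            # flushed, so forwarding reverts to flooding until relearned.
            self.cam.clear()

    sw = ToySwitch(ports=[1, 2, 3, 4])
    print(sw.receive("aa:aa", "bb:bb", in_port=1))  # unknown destination -> flood [2, 3, 4]
    print(sw.receive("bb:bb", "aa:aa", in_port=2))  # destination learned -> unicast [1]
    sw.topology_change()                            # a topology change notification arrives
    print(sw.receive("bb:bb", "aa:aa", in_port=2))  # table flushed -> flood again [1, 3, 4]

The more frequently topology changes arrive, the more often every device in the fabric repeats this cycle, which is consistent with the symptoms becoming more frequent as devices were added over time.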
The network devices causing the issue were customer colocated equipment (routers, switches) used as part of the private dedicated cloud infrastructure those customers host in the interworks.cloud environment. Unfortunately, most of this equipment cannot be converted to support the 2-node VLT domain configuration that the rest of the interworks.cloud network infrastructure employs, which led to the aforementioned issues.
Trigger
The situation was triggered when certain types of network switches were connected to the core infrastructure via single-port connectivity and had the Multiple Spanning Tree Protocol (MSTP) configured.
Resolution
The interworks.cloud senior engineering team was able to troubleshoot the situation by turning on debug mode on all networking devices and constantly monitoring all suspicious events recorded by those devices. By correlating certain unexpected events with actual disconnections of users or applications that were promptly reported, the team finally pinpointed the root cause and was able to isolate the network devices causing the network issues. The core networking configuration was modified appropriately and applied to all involved devices. Further monitoring of the situation, as well as constant feedback from affected users, verified the final resolution of the incident.
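As a rough illustration of the correlation step (the timestamps, time window and data below are hypothetical placeholders, not the actual device logs or customer reports), one simple approach is to match suspicious device events against reported disconnection times that fall within a small window of each other:

    from datetime import datetime, timedelta

    WINDOW = timedelta(seconds=30)  # assumed tolerance between event and report

    # Placeholder inputs: timestamps of suspicious events taken from the device
    # debug logs, and times at which users reported disconnections.
    device_events = ["2019-09-25 10:14:03", "2019-09-25 14:02:41"]
    reported_disconnections = ["2019-09-25 10:14:12", "2019-09-26 09:30:00"]

    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

    for event in device_events:
        for report in reported_disconnections:
            if abs(parse(event) - parse(report)) <= WINDOW:
                print(f"possible correlation: device event at {event}, report at {report}")

The actual investigation of course worked on the full debug output and support tickets rather than hard-coded lists.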
Detection
The issue was reported by various customers who experienced loss of connectivity or unexpected behavior of their cloud services.
Action Items
- Modify the network configuration policy template for devices that use single-port connectivity and MSTP
- Investigate additional metrics to be retrieved and assessed by the central network monitoring system in order to be properly alerted (see the sketch after this list)
- Enforce, where applicable, the use of dual customer colocated network devices and their proper attachment to the interworks.cloud core network infrastructure
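As a starting point for the monitoring action item above, the sketch below polls the standard BRIDGE-MIB topology change counter (dot1dStpTopChanges, OID 1.3.6.1.2.1.17.2.4.0) using net-snmp's snmpget and flags unusually high rates of topology changes. The host, community string, interval and threshold are placeholders, and the integration with the central monitoring system is assumed rather than shown.

    import subprocess
    import time

    # dot1dStpTopChanges (BRIDGE-MIB): total topology changes seen by the bridge
    TOP_CHANGES_OID = "1.3.6.1.2.1.17.2.4.0"

    HOST = "192.0.2.10"    # placeholder switch management address
    COMMUNITY = "public"   # placeholder SNMP community string
    INTERVAL = 60          # seconds between polls
    THRESHOLD = 5          # alert if more changes than this per interval

    def poll_top_changes(host, community):
        # Requires net-snmp's snmpget; -Oqv prints only the value.
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host, TOP_CHANGES_OID],
            text=True,
        )
        return int(out.strip())

    previous = poll_top_changes(HOST, COMMUNITY)
    while True:
        time.sleep(INTERVAL)
        current = poll_top_changes(HOST, COMMUNITY)
        if current - previous > THRESHOLD:
            # In practice this would raise an alert in the central monitoring
            # system instead of printing to standard output.
            print(f"ALERT: {current - previous} STP topology changes on {HOST} in the last {INTERVAL}s")
        previous = current

A production check would also account for counter wrap and poll every relevant device, but the metric itself is the point: a sustained rise in topology changes is exactly the condition behind this incident.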
Timeline
Before January 2019
The incident probably existed before January 2019; however, due to its very low frequency of occurrence, there were very few reports of issues that could be associated with it.
January 2019 - September 2019
There were various reports of issues that could potentially be associated with this particular incident, most of them escalating significantly after September 8, 2019.
September 9, 2019 - October 2, 2019
Engineering team actively engaged in incident troubleshooting
October 3, 2019
Engineering team investigation reveals root cause
October 4, 2019
Network configuration modified and applied to all necessary devices. Incident resolved.