Issue 1: On July 24th at 5:05 pm CDT, MQTT disconnect rates were significantly higher than normal. Internally, this issue was identified before the disconnects occurred, through memory pressure alerts from the load balancers.
Issue 2: Several load balancer protections quickly remediated the initial disconnect rate issue. On July 25th at 1:30 am CDT, ClearBlade finalized these protections and identified through log analysis that they were producing long MQTT connect times for some devices. Customers with shorter connect timeouts may have experienced connection issues.
Issue 1: On July 24th, between 1:20 and 2:45 pm CDT, the connected workload in Asia increased by over 300%. The ClearBlade environment scaled effectively and remained active, and ClearBlade's automated monitoring notified support of the scaling event. Several hours later, at 4:25 pm CDT, the Asia region saw another order-of-magnitude increase in connected workload, which was again identified and monitored. This new workload created significant duplicate connection behavior in the load balancer and TLS termination layer of IoT Core Asia. Numerous individual devices each created up to 10 connect and TLS termination events in each load balancer, producing a connect count greater than the total device count in the region. The increased load forced a load balancer restart, disconnecting all the devices connected to that load balancer. The disconnected devices then attempted to reconnect, increasing the overall load on the environment and causing other load balancer containers to restart. This produced a cascading restart effect from 5:05 to 6:00 pm CDT. Throughout this period, devices were still successfully connecting and delivering data to Google Pub/Sub, and registries continued to serve API calls.
Issue 2: ClearBlade introduced lower maximum connection rates and stricter device connect throttles into the load balancers to protect the environment. These new restrictions limited the ability of devices to retry their connections and ensured that the load balancers could not be overrun. The limits fully stabilized the environment, and the cluster remained healthy. However, the continued external load from devices opening duplicate connections caused valid devices to wait a long time before their connections succeeded. Once a connection was successfully established, device behavior was normal. API rates were monitored and remained unchanged. These long connect times persisted until the next available maintenance window, when the resolution was applied.
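For devices affected by the longer connect times, a generous connect timeout combined with backed-off reconnect attempts avoids compounding the load on the environment. The sketch below is illustrative only, using the open-source paho-mqtt 1.x client API; the endpoint, client ID, and timing values are placeholders rather than ClearBlade-prescribed settings.

```python
import time
import paho.mqtt.client as mqtt

# Placeholder endpoint and client ID; substitute your registry's values.
MQTT_HOST = "mqtt.example-endpoint.com"
MQTT_PORT = 8883

def on_connect(client, userdata, flags, rc):
    print(f"Connected with result code {rc}")

client = mqtt.Client(client_id="example-device-01")  # paho-mqtt 1.x style constructor
client.on_connect = on_connect
client.tls_set()  # use the system CA certificates for TLS

# Allow slower connect handshakes instead of giving up early
# (connect_timeout is a property in recent paho-mqtt releases).
client.connect_timeout = 60.0

# Space out reconnect attempts so a busy load balancer is not hit
# with rapid duplicate connects after a disconnect.
client.reconnect_delay_set(min_delay=5, max_delay=300)

backoff = 5
while True:
    try:
        client.connect(MQTT_HOST, MQTT_PORT, keepalive=120)  # keepalive = MQTT ping interval
        break
    except OSError:
        time.sleep(backoff)                # back off on failure
        backoff = min(backoff * 2, 300)    # exponential, capped at five minutes

client.loop_forever()  # reuses the configured reconnect delays on later drops
```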
During the initial incident, ClearBlade responded quickly to put protections in place for the load balancers to ensure uptime and prevent further disconnects. ClearBlade installed maximum connection rate limits and duplicate connection throttles in each load balancer instance. These additions immediately stabilized the environment.
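As a purely illustrative sketch of the kind of protection described above, rather than ClearBlade's actual implementation, a duplicate connection throttle can be expressed as a sliding-window limit on connect attempts per device, combined with a per-instance connection cap; all names and limits below are hypothetical.

```python
import time
from collections import defaultdict, deque

# Hypothetical limits for illustration only.
MAX_CONNECTS_PER_DEVICE = 3      # connect attempts allowed per device per window
THROTTLE_WINDOW_SECONDS = 60.0   # sliding window for the per-device throttle
MAX_TOTAL_CONNECTIONS = 50_000   # maximum connections per load balancer instance

_recent_connects = defaultdict(deque)  # device_id -> timestamps of recent connects
_active_connections = 0                # decremented on disconnect in a real system

def allow_connect(device_id: str) -> bool:
    """Decide whether a new MQTT connect attempt should be accepted."""
    global _active_connections
    now = time.monotonic()

    # Enforce the per-instance maximum connection limit.
    if _active_connections >= MAX_TOTAL_CONNECTIONS:
        return False

    # Drop connect timestamps that have aged out of the sliding window.
    window = _recent_connects[device_id]
    while window and now - window[0] > THROTTLE_WINDOW_SECONDS:
        window.popleft()

    # Throttle devices that keep opening duplicate connections.
    if len(window) >= MAX_CONNECTS_PER_DEVICE:
        return False

    window.append(now)
    _active_connections += 1
    return True
```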
ClearBlade then evaluated the traffic patterns to understand and identify the undesirable connection behavior. Extensive analysis and simulation were performed to recreate the behavior and symptoms in ClearBlade's private environments. After validation, ClearBlade defined a new load balancer configuration to better serve and connect MQTT traffic as rapidly as possible. The new configuration enforces an explicit memory usage limit, a higher maximum connection limit, and a stricter set of connection throttles.
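The sketch below illustrates how an explicit memory usage limit can be combined with a higher connection cap when gating new connects; the threshold values, and the use of the third-party psutil package to read resident memory, are assumptions made for illustration and are not ClearBlade's actual configuration.

```python
import psutil  # third-party package used here to read the process's resident memory

# Hypothetical values chosen for illustration only.
MEMORY_LIMIT_BYTES = 6 * 2**30   # explicit memory usage limit per instance
MAX_CONNECTIONS = 40_000         # higher maximum connection limit

_process = psutil.Process()

def accept_new_connection(active_connections: int) -> bool:
    """Gate new MQTT connects on both memory pressure and total connection count."""
    if _process.memory_info().rss >= MEMORY_LIMIT_BYTES:
        # Shed new connects before memory pressure forces the instance to restart.
        return False
    return active_connections < MAX_CONNECTIONS
```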
In accordance with the regional maintenance windows, the new configuration was deployed to the ClearBlade environment at 10:00 am CDT on July 28th.
Several immediate actions have been taken to prevent this issue from recurring: