Issue 1: On July 24th at 5:05 pm CDT, MQTT disconnect rates were significantly higher than normal. Internally, this issue was identified before the disconnects occurred, through memory pressure alerts from the load balancers.
Issue 2: Several load balancer protections quickly remediated the initial disconnect rate issue. On July 25th at 1:30 am CDT, ClearBlade finalized these protections and identified through log analysis that they were producing long MQTT connect times for some devices. Customers with shorter connect timeouts may have experienced connection issues.
Issue 1: On July 24th, between 1:20 and 2:45 pm CDT, the connected workload in Asia increased by over 300%. The ClearBlade environment scaled effectively and remained active, and ClearBlade's automated monitoring notified support of the scaling event. Several hours later, at 4:25 pm CDT, the Asia region saw another order-of-magnitude increase in connected workload, which was again identified and monitored. This new workload created significant duplicate connection behavior in the load balancer and TLS termination layer of IoT Core Asia. Numerous individual devices each created up to 10 connect and TLS termination events in each load balancer, producing a connect count greater than the total device count in the region. The increased load forced a load balancer restart, disconnecting all the devices connected to that load balancer. The disconnected devices then attempted to reconnect, increasing the overall load on the environment and causing other load balancer containers to restart. This produced a cascading restart effect from 5:05 to 6:00 pm CDT. Throughout this period, devices were still successfully connecting and delivering data to Google Pub/Sub, and registries continued to serve API calls.
Issue 2: ClearBlade introduced lower maximum connection rates and stricter device connect throttles into the load balancers to protect the environment. These new restrictions limited the ability of devices to retry their connections and ensured that the load balancers could not be overrun. The limits fully stabilized the environment, and the cluster remained healthy. However, the continued external load from devices opening duplicate connections caused valid devices to wait a long time before their connections succeeded. Once a connection was successfully established, device behavior was normal. API rates were monitored and remained unchanged. These long connect times persisted until the next available maintenance window, when the resolution was applied.
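For devices affected by the longer connect times, a generous connect timeout combined with backed-off reconnect attempts avoids compounding the load on the environment. The sketch below is illustrative only, using the open-source paho-mqtt 1.x client API; the endpoint, client ID, and timing values are placeholders rather than ClearBlade-prescribed settings.

```python
import time
import paho.mqtt.client as mqtt

# Placeholder endpoint and client ID; substitute your registry's values.
MQTT_HOST = "mqtt.example-endpoint.com"
MQTT_PORT = 8883

def on_connect(client, userdata, flags, rc):
    print(f"Connected with result code {rc}")

client = mqtt.Client(client_id="example-device-01")  # paho-mqtt 1.x style constructor
client.on_connect = on_connect
client.tls_set()  # use the system CA certificates for TLS

# Allow slower connect handshakes instead of giving up early
# (connect_timeout is a property in recent paho-mqtt releases).
client.connect_timeout = 60.0

# Space out reconnect attempts so a busy load balancer is not hit
# with rapid duplicate connects after a disconnect.
client.reconnect_delay_set(min_delay=5, max_delay=300)

backoff = 5
while True:
    try:
        client.connect(MQTT_HOST, MQTT_PORT, keepalive=120)  # keepalive = MQTT ping interval
        break
    except OSError:
        time.sleep(backoff)                # back off on failure
        backoff = min(backoff * 2, 300)    # exponential, capped at five minutes

client.loop_forever()  # reuses the configured reconnect delays on later drops
```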
During the initial incident, ClearBlade responded quickly to put protections in place for the load balancers to ensure uptime and prevent further disconnects. ClearBlade installed maximum connection rate limits and duplicate connection throttles in each load balancer instance. These additions immediately stabilized the environment.
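As a purely illustrative sketch of the kind of protection described above, rather than ClearBlade's actual implementation, a duplicate connection throttle can be expressed as a sliding-window limit on connect attempts per device, combined with a per-instance connection cap; all names and limits below are hypothetical.

```python
import time
from collections import defaultdict, deque

# Hypothetical limits for illustration only.
MAX_CONNECTS_PER_DEVICE = 3      # connect attempts allowed per device per window
THROTTLE_WINDOW_SECONDS = 60.0   # sliding window for the per-device throttle
MAX_TOTAL_CONNECTIONS = 50_000   # maximum connections per load balancer instance

_recent_connects = defaultdict(deque)  # device_id -> timestamps of recent connects
_active_connections = 0                # decremented on disconnect in a real system

def allow_connect(device_id: str) -> bool:
    """Decide whether a new MQTT connect attempt should be accepted."""
    global _active_connections
    now = time.monotonic()

    # Enforce the per-instance maximum connection limit.
    if _active_connections >= MAX_TOTAL_CONNECTIONS:
        return False

    # Drop connect timestamps that have aged out of the sliding window.
    window = _recent_connects[device_id]
    while window and now - window[0] > THROTTLE_WINDOW_SECONDS:
        window.popleft()

    # Throttle devices that keep opening duplicate connections.
    if len(window) >= MAX_CONNECTS_PER_DEVICE:
        return False

    window.append(now)
    _active_connections += 1
    return True
```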
ClearBlade then evaluated the traffic patterns to understand and identify the undesirable connection behavior. Extensive analysis and simulation were performed to recreate the behavior and symptoms in ClearBlade's private environments. After validation, ClearBlade defined a new load balancer configuration to better serve and connect MQTT traffic as rapidly as possible. The new configuration enforces an explicit memory usage limit, a higher maximum connection limit, and a stricter set of connection throttles.
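The sketch below illustrates how an explicit memory usage limit can be combined with a higher connection cap when gating new connects; the threshold values, and the use of the third-party psutil package to read resident memory, are assumptions made for illustration and are not ClearBlade's actual configuration.

```python
import psutil  # third-party package used here to read the process's resident memory

# Hypothetical values chosen for illustration only.
MEMORY_LIMIT_BYTES = 6 * 2**30   # explicit memory usage limit per instance
MAX_CONNECTIONS = 40_000         # higher maximum connection limit

_process = psutil.Process()

def accept_new_connection(active_connections: int) -> bool:
    """Gate new MQTT connects on both memory pressure and total connection count."""
    if _process.memory_info().rss >= MEMORY_LIMIT_BYTES:
        # Shed new connects before memory pressure forces the instance to restart.
        return False
    return active_connections < MAX_CONNECTIONS
```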
In accordance with the regional maintenance windows, the new configuration was deployed to the ClearBlade environment at 10:00 am CDT on July 28th.
Several immediate actions have been taken to prevent this issue from recurring: