Aug 29, 10 AM CDT: ClearBlade ran a scheduled maintenance event in the Asia region. The updates themselves completed successfully. However, as devices reconnected, the environment became unstable and could not accept connections before device retry timeouts expired. This triggered a cascading storm of reconnection attempts, leaving devices unable to connect and APIs unresponsive. ClearBlade became aware of the failing internal communication via monitoring alarms.
Throughout the day, ClearBlade attempted multiple strategies and redeployments to rectify the issue. These included dramatically scaling up infrastructure, rearchitecting infrastructure to separate network traffic, releasing patches to reduce connection times, throttling connection and API rates, configuring the load balancer to prevent duplicate connections by IP, and configuring the load balancer to limit connections. In each case, internode communication eventually destabilized again, leaving devices unable to connect and APIs unresponsive.
Aug 29, 11 PM CDT: ClearBlade rolled back all updates except the Kubernetes version updates. The environment appeared stable, with all test devices connecting and communicating, all APIs responding within desired thresholds, and the user interface working as expected.
Aug 30, 3:30 AM CDT: ClearBlade monitoring alarms again reported internal node communication failures. Test devices could no longer establish connections, and API response times climbed above desired thresholds.
Aug 30, 7:00 AM CDT: ClearBlade stabilized the environment again, although API times remained higher than desired and devices could connect only intermittently.
Aug 30, 11:00 AM CDT: ClearBlade applied a tested and validated configuration update to increase internal communication worker pools and message queue sizes. The environment onboarded all devices and was stable within 10 minutes.
Aug 31, 11 AM CDT: ClearBlade completed 24 hours of monitoring and marked the incident as closed.
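The cascade described in the timeline is a classic thundering-herd pattern: when devices all retry on the same schedule, each recovery attempt is immediately buried under a synchronized wave of reconnects. A common client-side mitigation is full-jitter exponential backoff; the sketch below is illustrative only and does not represent ClearBlade's device SDK:

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full-jitter exponential backoff: the retry window grows with each
    failed attempt, and the actual delay is randomized within it so a
    fleet of devices does not retry in lockstep."""
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)

# Delays for five consecutive failed attempts on one device.
delays = [reconnect_delay(n) for n in range(5)]
```

Randomizing each delay over the full window spreads reconnect attempts across time, giving the broker room to drain its backlog instead of facing them all at once.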
ClearBlade's Aug 29 maintenance event included updates to multiple infrastructure elements intended to reduce the impact on end users in Asia. These updates had been tested at production scale within ClearBlade's sandbox environment. During the update, alerts for failing internal communication accurately described the issue and pointed to the root problem: communication between the ClearBlade brokers and the shared cache infrastructure.
Throughout the day, ClearBlade rolled back all changes except the Kubernetes version updates. Those updates were identified by the ClearBlade security review team as necessary, as documented here. ClearBlade is investigating potential network communication impacts within those version changes. GKE does not support returning to the previous version, so testing requires specific environment management.
The issue was resolved with updates to communication workers and queue sizes within the ClearBlade broker, over which ClearBlade has complete control. This indicates that network bandwidth was not the bottleneck; rather, the limiting factor was how quickly a broker could receive and process internal remote procedure calls.
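The resolution points at a bounded-queue worker-pool pattern: if internal RPCs arrive faster than workers can drain the queue, internode calls time out even though the network has headroom. A minimal sketch of that pattern; the class and parameter names here are hypothetical, since ClearBlade's actual broker configuration is not public:

```python
import queue
import threading

class RpcWorkerPool:
    """Bounded pool: incoming RPCs queue up to `queue_size`, and
    `num_workers` threads drain the queue. If the queue fills, new RPCs
    are rejected immediately (back-pressure) rather than silently
    stalling internode communication."""

    def __init__(self, num_workers: int = 8, queue_size: int = 1024):
        self.tasks: queue.Queue = queue.Queue(maxsize=queue_size)
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        while True:
            fn, args = self.tasks.get()
            try:
                fn(*args)
            finally:
                self.tasks.task_done()

    def submit(self, fn, *args) -> bool:
        """Enqueue an RPC handler; returns False if the queue is full."""
        try:
            self.tasks.put_nowait((fn, args))
            return True
        except queue.Full:
            return False  # caller can retry, or shed load
```

Raising `num_workers` and `queue_size` trades memory for burst headroom during reconnect storms, which matches the kind of configuration change described as stabilizing the environment.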
The following preventative measures are being explored:
Enhanced scenario testing: ClearBlade is expanding test cases to include a customizable mix of device and API behaviors, such as connect rates, connect retries, API calls, and device creation and deletion.
Improved monitoring of internal communication: Bytes communicated and failed calls are already captured. Additional information, including queue sizes, active workers, and messages sent at each node, will allow ClearBlade to better tune performance and settings.
On-the-fly RPC configuration: The ability to update internal communication strategies without a broker restart.
Reduced RPC communication dependencies in bridged brokers.
Review production triage procedures: ClearBlade is reviewing and updating its current production triage procedures.
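The improved-monitoring measure above calls for per-node visibility into queue sizes, active workers, and messages sent, alongside the byte and failure counters already collected. A sketch of what such a per-node snapshot could look like; the field names are illustrative, not ClearBlade's actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RpcNodeMetrics:
    node: str
    bytes_sent: int      # already monitored today
    failed_calls: int    # already monitored today
    queue_depth: int     # proposed: internal RPCs waiting in queue
    active_workers: int  # proposed: workers currently processing
    messages_sent: int   # proposed: per-node send count

def emit(metrics: RpcNodeMetrics) -> str:
    """Serialize one snapshot as JSON for the monitoring pipeline."""
    return json.dumps(asdict(metrics), sort_keys=True)
```

Sampling `queue_depth` against its configured maximum would have shown the broker-side backlog directly, rather than only the downstream symptom of failed calls.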
ClearBlade is committed to providing high-quality IoT Core customer service and avoiding disruptions. ClearBlade fully intends to continue to innovate and improve this solution with new user features and internal optimizations. ClearBlade will continue to keep internal dependencies updated with the latest patches and security updates.
ClearBlade recognizes the opportunity to learn from this event to improve customer service.
ClearBlade also offers an IoT Enterprise product that gives users their own single-tenant hosted IoT Core solution. Please contact firstname.lastname@example.org for more information.
On August 29, ClearBlade upgraded its IoT Core Asia region environment. The update was planned, tested, and validated to require only a reconnect event, with total downtime of less than 1 minute and all devices reconnecting within 10 minutes. The update's purpose included the following:
Priority 1: Resolve previous issues experienced the week of Aug 24
Priority 2: Ensure infrastructure is up-to-date and supported
Priority 3: Improve the deployment process to make future updates have no outage
Priority 4: Fix IoT Core bugs and introduce new beta features requested by users in the region
This update included the following major components:
ClearBlade performs extensive testing and validation of every setting before every release. This includes API-level testing across all IoT Core capabilities, broad IoT Enterprise functional and duration tests, and IoT Core sandbox environment tests scaled to a high volume of device connections with significant API traffic. In environments like IoT Core, scale testing can struggle to reproduce the full variety of behaviors users perform as part of a standard connect, including connect rates, connect retries, API calls, device creation, and device deletion. The IoT sandbox environment had successfully reproduced the issue from Aug 24 and identified that the above configuration could resolve it.
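The behavior mix above (connect rates, connect retries, API calls, device creation and deletion) lends itself to a weighted scenario generator, so scale tests can replay a realistic and reproducible blend of user actions. A minimal sketch; the action names and weights are assumed for illustration:

```python
import random

# Weighted mix of the standard-connect behaviors listed above.
BEHAVIOR_WEIGHTS = {
    "connect": 50,
    "connect_retry": 20,
    "api_call": 20,
    "device_create": 5,
    "device_delete": 5,
}

def scenario(num_events: int, seed: int = 0) -> list:
    """Produce a reproducible randomized sequence of device actions
    for one scale-test run."""
    rng = random.Random(seed)
    actions, weights = zip(*BEHAVIOR_WEIGHTS.items())
    return rng.choices(actions, weights=weights, k=num_events)
```

Seeding the generator makes a problematic mix replayable, which matters when chasing an intermittent failure like the Aug 24 issue.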
The initial ClearBlade update immediately maxed out CPU on the caching layer.
ClearBlade made many modifications to increase infrastructure resources.
ClearBlade built and scale tested a new update to more rapidly grant connections without waiting for disconnect confirmation.
ClearBlade tested procedures and then rolled back all updates to the previous configuration.
At ~3:45 AM CDT, the environment became unstable again, and connected device counts tripled.
ClearBlade stabilized the environment for device connections but continued to see it move into an unhealthy state.
ClearBlade applied new updates at ~12:00 CDT to increase communication queue sizes between cluster pods.
Currently, the environment is stable for device connections, API calls, and the user interface.
ClearBlade continues to triage the environment as its top priority. A post-mortem follow-up will be provided with additional information on how ClearBlade will prevent these issues from happening in the future and ensure stability for all users.