Aug 29, 10 AM CDT: ClearBlade ran a scheduled maintenance event in the Asia region. The updates themselves completed successfully. However, as devices reconnected, the environment became unstable and could not accept connections before device retry timeouts expired. This triggered a cascading storm of reconnection attempts, leaving devices unable to connect and APIs unresponsive. ClearBlade became aware of the failing internal communication via monitoring alarms.
Throughout the day, ClearBlade attempted multiple strategies and redeployments to rectify the issue. These included dramatically scaling up infrastructure, rearchitecting infrastructure to separate network traffic, releasing patches to reduce connection times, throttling connection and API rates, configuring the load balancer to prevent duplicate connections by IP, and configuring the load balancer to limit connections. In each case, internode communication eventually destabilized again, leaving devices unable to connect and APIs unresponsive.
Aug 29, 11 PM CDT: ClearBlade rolled back all updates except the Kubernetes version updates. The environment appeared stable, with all test devices connecting and communicating, all APIs responding within desired thresholds, and the user interface working as expected.
Aug 30, 3:30 AM CDT: ClearBlade monitoring alarms again reported internal node communication failures. Test devices could no longer establish connections, and API response times climbed above desired thresholds.
Aug 30, 7:00 AM CDT: ClearBlade stabilized the environment again, although API times remained higher than desired and devices could connect only intermittently.
Aug 30, 11:00 AM CDT: ClearBlade applied a tested and validated configuration update to increase internal communication worker pools and message queue sizes. The environment onboarded all devices and was stable within 10 minutes.
Aug 31, 11 AM CDT: ClearBlade completed 24 hours of monitoring and marked the incident as closed.
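The cascade described in the timeline is a classic thundering-herd pattern: when devices all retry on the same schedule, each recovery attempt is immediately buried under a synchronized wave of reconnects. A common client-side mitigation is full-jitter exponential backoff; the sketch below is illustrative only and does not represent ClearBlade's device SDK:

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full-jitter exponential backoff: the retry window grows with each
    failed attempt, and the actual delay is randomized within it so a
    fleet of devices does not retry in lockstep."""
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)

# Delays for five consecutive failed attempts on one device.
delays = [reconnect_delay(n) for n in range(5)]
```

Randomizing each delay over the full window spreads reconnect attempts across time, giving the broker room to drain its backlog instead of facing them all at once.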
ClearBlade's Aug 29 maintenance event included updates to multiple infrastructure elements intended to reduce the impact on end users in Asia. These updates had been tested at production scale within ClearBlade's sandbox environment. During the update, alerts for failing internal communication accurately described the issue and pointed to the root problem: communication between the ClearBlade brokers and the shared cache infrastructure.
Throughout the day, ClearBlade rolled back all changes except the Kubernetes version updates. Those updates were identified by the ClearBlade security review team as necessary, as documented here. ClearBlade is investigating potential network communication impacts within those version changes. GKE does not support returning to the previous version, so testing requires specific environment management.
The issue was resolved with updates to communication workers and queue sizes within the ClearBlade broker, over which ClearBlade has complete control. This indicates that network bandwidth was not the bottleneck; rather, the limiting factor was how quickly a broker could receive and process internal remote procedure calls.
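The resolution points at a bounded-queue worker-pool pattern: if internal RPCs arrive faster than workers can drain the queue, internode calls time out even though the network has headroom. A minimal sketch of that pattern; the class and parameter names here are hypothetical, since ClearBlade's actual broker configuration is not public:

```python
import queue
import threading

class RpcWorkerPool:
    """Bounded pool: incoming RPCs queue up to `queue_size`, and
    `num_workers` threads drain the queue. If the queue fills, new RPCs
    are rejected immediately (back-pressure) rather than silently
    stalling internode communication."""

    def __init__(self, num_workers: int = 8, queue_size: int = 1024):
        self.tasks: queue.Queue = queue.Queue(maxsize=queue_size)
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        while True:
            fn, args = self.tasks.get()
            try:
                fn(*args)
            finally:
                self.tasks.task_done()

    def submit(self, fn, *args) -> bool:
        """Enqueue an RPC handler; returns False if the queue is full."""
        try:
            self.tasks.put_nowait((fn, args))
            return True
        except queue.Full:
            return False  # caller can retry, or shed load
```

Raising `num_workers` and `queue_size` trades memory for burst headroom during reconnect storms, which matches the kind of configuration change described as stabilizing the environment.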
The following preventative measures are being explored:
Enhanced scenario testing: ClearBlade is expanding test cases to include a customizable mix of device and API behaviors, such as connect rates, connect retries, API calls, and device creation and deletion.
Improved monitoring of internal communication: Bytes communicated and failed calls are already captured. Additional information, including queue sizes, active workers, and messages sent at each node, will allow ClearBlade to better tune performance and settings.
On-the-fly RPC configuration: The ability to update internal communication strategies without a broker restart.
Reduced RPC communication dependencies in bridged brokers.
Review production triage procedures: ClearBlade is reviewing and updating its current production triage procedures.
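The improved-monitoring measure above calls for per-node visibility into queue sizes, active workers, and messages sent, alongside the byte and failure counters already collected. A sketch of what such a per-node snapshot could look like; the field names are illustrative, not ClearBlade's actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RpcNodeMetrics:
    node: str
    bytes_sent: int      # already monitored today
    failed_calls: int    # already monitored today
    queue_depth: int     # proposed: internal RPCs waiting in queue
    active_workers: int  # proposed: workers currently processing
    messages_sent: int   # proposed: per-node send count

def emit(metrics: RpcNodeMetrics) -> str:
    """Serialize one snapshot as JSON for the monitoring pipeline."""
    return json.dumps(asdict(metrics), sort_keys=True)
```

Sampling `queue_depth` against its configured maximum would have shown the broker-side backlog directly, rather than only the downstream symptom of failed calls.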
ClearBlade is committed to providing high-quality IoT Core customer service and avoiding disruptions. ClearBlade fully intends to continue to innovate and improve this solution with new user features and internal optimizations. ClearBlade will continue to keep internal dependencies updated with the latest patches and security updates.
ClearBlade recognizes the opportunity to learn from this event to improve customer service.
ClearBlade also offers an IoT Enterprise product that gives users their own single-tenant hosted IoT Core solution. Please contact firstname.lastname@example.org for more information.
On August 29, ClearBlade upgraded its IoT Core Asia region environment. The update was planned, tested, and validated to require only a reconnect event, with total downtime of less than 1 minute and all devices reconnecting within 10 minutes. The update's purpose included the following:
Priority 1: Resolve previous issues experienced the week of Aug 24
Priority 2: Ensure infrastructure is up-to-date and supported
Priority 3: Improve the deployment process to make future updates have no outage
Priority 4: Fix IoT Core bugs and introduce new beta features requested by users in the region
This update included the following major components:
ClearBlade performs extensive testing and validation of every setting before every release. This includes API-level testing across all IoT Core capabilities, broad IoT Enterprise functional and duration tests, and IoT Core sandbox environment tests scaled to a high volume of device connections with significant API traffic. In environments like IoT Core, scale testing can struggle to reproduce the full variety of behaviors users perform as part of a standard connect, including connect rates, connect retries, API calls, device creation, and device deletion. The IoT sandbox environment had successfully reproduced the issue from Aug 24 and identified that the above configuration could resolve it.
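The behavior mix above (connect rates, connect retries, API calls, device creation and deletion) lends itself to a weighted scenario generator, so scale tests can replay a realistic and reproducible blend of user actions. A minimal sketch; the action names and weights are assumed for illustration:

```python
import random

# Weighted mix of the standard-connect behaviors listed above.
BEHAVIOR_WEIGHTS = {
    "connect": 50,
    "connect_retry": 20,
    "api_call": 20,
    "device_create": 5,
    "device_delete": 5,
}

def scenario(num_events: int, seed: int = 0) -> list:
    """Produce a reproducible randomized sequence of device actions
    for one scale-test run."""
    rng = random.Random(seed)
    actions, weights = zip(*BEHAVIOR_WEIGHTS.items())
    return rng.choices(actions, weights=weights, k=num_events)
```

Seeding the generator makes a problematic mix replayable, which matters when chasing an intermittent failure like the Aug 24 issue.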
The initial ClearBlade update immediately maxed out CPU on the caching layer.
ClearBlade made many modifications to increase infrastructure resources.
ClearBlade built and scale tested a new update to more rapidly grant connections without waiting for disconnect confirmation.
ClearBlade tested procedures and then rolled back all updates to the previous configuration.
At ~3:45 AM CDT, the environment became unstable again, and connected device counts tripled.
ClearBlade stabilized the environment for device connections but continued to see it move into an unhealthy state.
ClearBlade applied new updates at ~12:00 CDT to increase communication queue sizes between cluster pods.
Currently, the environment is stable for device connections, API calls, and the user interface.
ClearBlade continues to triage the environment as its top priority. A post-mortem follow-up will be provided with additional information on how ClearBlade will prevent these issues from happening in the future and ensure stability for all users.