AWS Tokyo Region Outage: What Infrastructure Configuration Avoids the Impact of an AWS Tokyo Region Failure?

The large-scale outage in the AWS Tokyo region was caused by a bug in the cooling control system

The overheating was due to a control system failure that caused multiple, redundant cooling systems to fail in parts of the affected Availability Zone. This event was caused by a failure of our datacenter control system, which is used to control and optimize the various cooling systems used in our datacenters. The control system runs on multiple hosts for high availability, and it contains third-party code which allows it to communicate with third-party devices such as fans, chillers, and temperature sensors. Just prior to the event, the datacenter control system was in the process of failing away from one of the control hosts. During this kind of failover, the control system has to exchange information with other control systems and the datacenter equipment it controls. Due to a bug in the third-party control system logic, this exchange resulted in excessive interactions between the control system and the devices in the datacenter, which ultimately resulted in the control system becoming unresponsive.

Our datacenters are designed such that if the datacenter control system fails, the cooling systems go into maximum cooling mode until the control system functionality is restored. While this worked correctly in most of the datacenter, in a small portion of the datacenter the cooling system did not correctly transition to this safe cooling configuration and instead shut down. The team attempted to activate purge in the affected areas of the datacenter, but this also failed. At this point, temperatures began to rise in the affected part of the datacenter and servers began to power off when they became too hot.

Because the datacenter control system was unavailable, the operations team had minimum visibility into the health and state of the datacenter cooling systems. To recover, the team had to manually investigate and reset all of the affected pieces of equipment and put them into a maximum cooling configuration; the controllers for this equipment needed to be reset. After these controllers were reset, cooling was restored to the affected area of the datacenter and temperatures began to decrease. As temperatures returned to normal, power was restored to the affected instances.

We continue to work to recover all affected instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power and excessive heat. It took longer to recover these instances and volumes, and some needed to be retired as a result of failures to the underlying hardware. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible. Some of the affected instances may require action from customers, and we will be reaching out to those customers with next steps.

In the interim, we have disabled the failover mode that triggered this bug on our control systems to ensure we do not have a recurrence of this issue. We have also trained our local operations teams to quickly identify and remediate this situation if it were to recur, and we are confident that we could reset the system before seeing any customer impact if a similar situation was to occur for any reason.

Customers that were running their applications thoroughly across multiple Availability Zones were able to maintain availability throughout the event. As we have further investigated this event with our customers, we have discovered a few isolated cases where customers' applications running across multiple Availability Zones saw unexpected impact; we are sharing additional details on these isolated issues directly with impacted customers. For customers that need the highest availability for their applications, we continue to recommend running applications with this multiple Availability Zone architecture; any application component that can create availability issues for customers should run in this fault-tolerant way.

The issue has been resolved and the service is operating normally. We apologize for any inconvenience this event may have caused. We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and drive improvement across our services.

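Two of the recommendations in this summary can be checked programmatically: whether running instances are actually spread across more than one Availability Zone, and whether any instances or EBS volumes still report an impaired status and may need to be replaced. The sketch below is a minimal, read-only illustration of such a check, not an AWS-provided tool; it assumes Python with boto3, configured credentials, and the Tokyo region (ap-northeast-1).

```python
# Illustrative sketch only; assumes boto3 is installed and AWS credentials are configured.
from collections import Counter

import boto3

REGION = "ap-northeast-1"  # AWS Tokyo region
ec2 = boto3.client("ec2", region_name=REGION)

# 1. Count running instances per Availability Zone.
az_counts = Counter()
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            az_counts[instance["Placement"]["AvailabilityZone"]] += 1

print("Running instances per AZ:", dict(az_counts))
if len(az_counts) < 2:
    print("WARNING: all running instances are in a single Availability Zone.")

# 2. Flag instances whose instance or system status checks are not 'ok'.
for page in ec2.get_paginator("describe_instance_status").paginate(
    IncludeAllInstances=True
):
    for status in page["InstanceStatuses"]:
        if (status["InstanceStatus"]["Status"] != "ok"
                or status["SystemStatus"]["Status"] != "ok"):
            print("Check instance:", status["InstanceId"],
                  status["InstanceStatus"]["Status"],
                  status["SystemStatus"]["Status"])

# 3. Flag EBS volumes whose volume status is not 'ok' (e.g. 'impaired').
for page in ec2.get_paginator("describe_volume_status").paginate():
    for vol in page["VolumeStatuses"]:
        if vol["VolumeStatus"]["Status"] != "ok":
            print("Check volume:", vol["VolumeId"], vol["VolumeStatus"]["Status"])
```

Running the script after an event like this gives a quick picture of whether the fleet still depends on a single Availability Zone and which resources may need to be replaced.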

AWS outage countermeasures: from prevention to operational precautions

About the outage that occurred in the AWS Tokyo region (4.20)

api.atmanco.com: Outage in the AWS Tokyo region, with major impact on everything from games to web services

An outage occurred in some Amazon Web Services (AWS) Tokyo region services

Thoughts on the AWS Tokyo region outage
