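The biggest news of the week in the Windows Azure world is, without a doubt, the large-scale outage. (It has since been resolved, and a detailed analysis is due to be published within 10 days.)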
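According to Microsoft's Windows Azure service dashboard, inbound communication (from outside to inside) to Windows Azure virtual machines (Windows Azure Compute) became unavailable in three data centers: the "North Central US" and "South Central US" regions in the United States and the "North Europe" region in Ireland. At peak, 6.7% of the North Central US region, 28% of South Central US, and 37% of the North Europe region were affected. The communication failure began at 10:45 a.m. Japan time on February 29 and was largely resolved by 7:57 p.m. Japan time the same day.
"Leap year" handling error causes a Windows Azure service outage – News: ITpro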
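The article quotes Japan time, while the status updates for this incident are stamped in UTC and the team blog uses Pacific time. As a minimal sketch for lining these up, the snippet below converts the quoted times and computes how long the main disruption lasted. The fixed offsets (JST = UTC+9, PST = UTC-8, no daylight saving in February) and all variable names are my own assumptions for illustration, not something taken from the dashboard.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones (assumption: no DST applies on February 29).
JST = timezone(timedelta(hours=9), "JST")
PST = timezone(timedelta(hours=-8), "PST")

# Times quoted in the ITpro article, in Japan time.
start_jst = datetime(2012, 2, 29, 10, 45, tzinfo=JST)
restored_jst = datetime(2012, 2, 29, 19, 57, tzinfo=JST)

# Convert to the zones used by the dashboard (UTC) and the team blog (PST).
start_utc = start_jst.astimezone(timezone.utc)
restored_utc = restored_jst.astimezone(timezone.utc)
restored_pst = restored_jst.astimezone(PST)

# Time from onset until service was restored for most customers.
outage = restored_jst - start_jst

print("start   :", start_utc.strftime("%Y-%m-%d %H:%M UTC"))
print("restored:", restored_utc.strftime("%Y-%m-%d %H:%M UTC"))
print("restored:", restored_pst.strftime("%Y-%m-%d %H:%M PST"))
print("duration:", outage)
```

Using timezone-aware `datetime` objects means the subtraction and the conversions stay correct even when the two timestamps are expressed in different zones.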
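When news of a large-scale outage breaks while the cloud is still in its infancy and early adoption phase, it can push people into blindly concluding "Windows Azure is no good!" or "the cloud is dangerous." It must have been especially shocking for IT staff who were only cautiously trying out cloud services while wondering whether they are really safe.
With the details of this outage in mind, I considered how Windows Azure actually stands up.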
What the outage reveals about what is built into the Windows Azure service
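Within Microsoft, the Windows Azure outage was probably classified as severity A and a round-the-clock emergency response team was formed; the response was directed by a vice president, Bill, under the supervision of CEO Steve Ballmer.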
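In the end, this is not something to frame as a simple opposition between on-premises and cloud; it is a question of risk control when building a system. When you build a system, you list the conceivable risks, decide how to handle each of them, and only then settle on the system architecture.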
The outage affected Windows Azure Compute and the services that run on Windows Azure Compute. According to the dashboard, the affected services were "Access Control 2.0, Marketplace, Service Bus, Access Control & Caching Portal."
Latest information on the Windows Azure service disruption « S/N Ratio (by SATO Naoki)
- 1:45 AM UTC
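The Windows Azure operations team became aware of an issue affecting compute services in multiple regions. The issue was quickly triaged (assessed for urgency and prioritized) and determined to be caused by a software bug. The final root cause analysis is still in progress, but the issue is believed to have been caused by an incorrect time calculation for the leap year.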
- 10:57 AM UTC
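After discovering the issue, we immediately took steps to protect customer services that were already running and began creating a fix. The fix was successfully deployed to most Windows Azure sub-regions, and availability of Windows Azure services was restored for most customers by 2:57 AM PST on February 29 (7:57 PM Japan time on February 29).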
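The timeline above attributes the incident to an incorrect time calculation for the leap year but does not say what that calculation was. As a purely hypothetical illustration of this class of bug (not Microsoft's actual code), the sketch below shows how "same date, one year later" breaks on February 29, along with one common way to guard against it. The function names `naive_add_one_year` and `safe_add_one_year` are made up for this example.

```python
import calendar
from datetime import date


def naive_add_one_year(d: date) -> date:
    """Hypothetical buggy version: bump the year and keep month/day as-is."""
    # For February 29 this asks for February 29 of a non-leap year,
    # which does not exist, so date() raises ValueError.
    return date(d.year + 1, d.month, d.day)


def safe_add_one_year(d: date) -> date:
    """Add one year, clamping the day to the last valid day of the month."""
    year = d.year + 1
    last_day = calendar.monthrange(year, d.month)[1]
    return d.replace(year=year, day=min(d.day, last_day))


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_add_one_year(leap_day)
    except ValueError as exc:
        print("naive calculation failed:", exc)
    print("safe calculation:", safe_add_one_year(leap_day))
```

Clamping to February 28 is only one possible policy; rolling over to March 1 is equally defensible. What matters is that the choice is made explicitly and that February 29 is part of the test cases.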
- 10:55 AM UTC
We are experiencing an issue with Windows Azure Compute in the South Central US sub-region. Incoming traffic may not go through for a subset of hosted services in this sub-region. Deployed applications will continue to run. There is no impact to storage accounts either. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 12:30 PM UTC
We are still troubleshooting this issue and capturing all the data that will allow us to resolve it. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 1:30 PM UTC
We are still troubleshooting this issue, and verifying the most probable cause. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 2:30 PM UTC
We have determined that about 28% of hosted services in this sub-region are impacted by this issue. The restoration steps to mitigate the issue are underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 3:30 PM UTC
The restoration steps to mitigate the issue are underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 4:30 PM UTC
The restoration steps to mitigate the issue are still underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 5:30 PM UTC
The restoration steps to mitigate the issue are still underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 6:30 PM UTC
The restoration steps to mitigate the issue are still underway. This incident impacts Access Control 2.0, Marketplace, Service Bus and the Access Control & Caching Portal in the same regions where Windows Azure Compute is impacted. As a result affected customers may experience a loss of application functionality. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 7:30 PM UTC
We are actively recovering Windows Azure hosted services in this sub-region. More and more customers applications should be back up-and-running even if service management functionality is not yet restored. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 8:30 PM UTC
We have recovered over half of the Windows Azure hosted services that were impacted. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 9:30 PM UTC
Recovery efforts are still underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 10:30 PM UTC
Recovery efforts are still underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 11:30 PM UTC
Recovery efforts are still underway, we are 70% through the recovery process and we restored full service management functionality for more customers in this region. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 1:00 AM UTC
Recovery efforts are still underway. The majority of Windows Azure customers and services have been restored. However, a few customers and services remain affected and we are working aggressively to restore full functionality. We restored full service management functionality for all customers in this region. We have published a recap of the incident since it started on the Windows Azure team blog (http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/windows-azure-service-disruption-update.aspx). Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 3:00 AM UTC
We are working on stabilizing the Windows Azure Platform as well as following-up with all customers who were impacted by this incident. The recovery process continues. We are also focusing on restoring full service management functionality in this sub-region. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 4:30 AM UTC
Recovery efforts are still underway, we are 85% through the recovery process in terms of impacted hosted services. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
- 7:30 AM UTC
Our recovery efforts to restore compute service to impacted customers in this sub-region are complete. A small number of customers in this sub-region may face long delays during service management operations. We request any customers experiencing compute or service management issues in this sub-region to reach out to us through the support channel described on this site: (http://www.windowsazure.com/en-us/support/contact). We apologize for any inconvenience this incident causes our customers.
- 4:57 PM UTC
We have confirmed that full functionality is restored in the South Central US sub-region. We apologize for any inconvenience this incident caused our customers.