Risk Control Lessons from the Windows Azure Large-Scale Outage: Is the Cloud Dangerous After All? What the Outage Reveals Is Built into the Azure Service

2012-03-04

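The biggest news in the Windows Azure world this week was, without question, the large-scale outage. (It has since been resolved, and a detailed analysis is expected to be published within 10 days.)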

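According to the Windows Azure service dashboard provided by Microsoft, inbound communication (from outside to inside) to Windows Azure virtual machines (Windows Azure Compute) became unavailable in three data centers: the "North Central US" and "South Central US" regions in the United States and the "North Europe" region in Ireland. At its peak, the outage affected 6.7% of the North Central US region, 28% of South Central US, and 37% of the North Europe region. The communication failure began at 10:45 a.m. on February 29, Japan time, and was largely resolved by 7:57 p.m. on the 29th, Japan time.
"Windows Azure service outage caused by mishandling of the leap year" – News: ITpro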
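
As a quick sanity check on the times quoted above, here is a minimal Python sketch that converts the UTC timestamps from the dashboard log in the appendix into Japan time and US Pacific time. The fixed UTC offsets and the helper name to_zone are my own assumptions for illustration; nothing here comes from Azure itself.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are enough for this check: Japan has no daylight saving time,
# and late February falls in US Pacific Standard Time (UTC-8).
JST = timezone(timedelta(hours=9), "JST")
PST = timezone(timedelta(hours=-8), "PST")


def to_zone(utc_dt, tz):
    """Convert a timezone-aware UTC datetime to the given fixed-offset zone."""
    return utc_dt.astimezone(tz)


# Timestamps taken from the dashboard log in the appendix.
detected_utc = datetime(2012, 2, 29, 1, 45, tzinfo=timezone.utc)
mostly_recovered_utc = datetime(2012, 2, 29, 10, 57, tzinfo=timezone.utc)

fmt = "%Y-%m-%d %H:%M %Z"
print(to_zone(detected_utc, JST).strftime(fmt))          # 2012-02-29 10:45 JST
print(to_zone(mostly_recovered_utc, JST).strftime(fmt))  # 2012-02-29 19:57 JST
print(to_zone(mostly_recovered_utc, PST).strftime(fmt))  # 2012-02-29 02:57 PST
```

The converted values line up with the 10:45 a.m. and 7:57 p.m. Japan-time figures in the news report.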

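In the early days of the cloud, and in the first stage of its adoption, news of a large-scale outage can lead people to conclude, without thinking, "Windows Azure is no good!" or "the cloud is dangerous." I imagine the news was especially shocking for IT staff who have been adopting cloud services cautiously, wondering whether they are really safe.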

With the details of this outage in mind, I thought about how Windows Azure actually measures up.

What the Outage Reveals Is Built into the Windows Azure Service

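Within Microsoft, the Windows Azure outage was presumably classified as top severity: a 24-hour emergency response team was formed, and the response proceeded under the supervision of CEO Steve Ballmer and the direction of Bill, a vice president.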

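Then, about nine hours after Microsoft recognized the problem, an emergency patch that Microsoft deemed essential was applied to Azure, which is under Microsoft's control, and service availability was restored for most of the affected users. The outage was fully resolved more than 24 hours after it was recognized.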
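
To put rough numbers on that, the following Python sketch computes the elapsed time between the dashboard entries listed in the appendix. It assumes the 1:45 AM UTC entry marks when the problem was recognized and the final 4:57 PM UTC entry on 1 March (South Central US) marks full recovery; other regions may have differed, and the helper name hours_minutes is only for illustration.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Timestamps taken from the dashboard log in the appendix.
recognized = datetime(2012, 2, 29, 1, 45, tzinfo=UTC)
mostly_recovered = datetime(2012, 2, 29, 10, 57, tzinfo=UTC)
fully_restored = datetime(2012, 3, 1, 16, 57, tzinfo=UTC)  # South Central US


def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


print(hours_minutes(mostly_recovered - recognized))  # (9, 12)
print(hours_minutes(fully_restored - recognized))    # (39, 12)
```

By this reading, most services were back after roughly 9 hours, and full functionality in South Central US was confirmed after roughly 39 hours.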

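If you use Microsoft's on-premises business products, you probably know that to expect this kind of response you need a premium support contract with Microsoft. In some cases you also need to be designated a key customer. Even when an outage is caused by a defect in a Microsoft product, there are not many cases where you get an unconditional, immediate response that resolves the problem completely.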

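Azure, however, is one giant product as a whole, so a product defect qualifies for an immediate response. In other words, Azure users are effectively in the position of having a premium support contract with Microsoft and being ranked as top-tier customers. The Azure team also has the production environment at hand, and because that environment is fairly uniform, they can probably skip some multi-environment testing as well.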

Comparison with OSS Products

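When I make the argument above, the response is that because you are running services whose source code only Microsoft can access, all you can do is wait for Microsoft to act.
With OSS, when a product defect occurs, you can analyze the source code yourself, identify the problem, and apply a fix. With some OSS projects, experts from around the world also come together to work on a solution.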

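But how many organizations have in-house OSS experts who can apply a sound patch when a defect occurs? How many have people who can set aside their time and work exclusively on resolving an OSS defect as their top priority? Some may, and some may not.
Having thought it over after this was pointed out to me, I found that the argument makes a leap and that the passage was an awkward one that benefits no one, so I am correcting it.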

Conclusion

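In the end, this is not a question of a simple opposition between on-premises and cloud; it is a question of risk control when building a system. When you build a system, you list the conceivable risks, consider how to respond to each one, and then decide on the system configuration.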
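
As a rough illustration of that listing-and-responding step, here is a minimal Python sketch of a risk register. Everything in it is hypothetical: the Risk class, the 1-to-5 likelihood and impact scales, the likelihood-times-impact score, and the sample entries are assumptions made up for this example, not figures from the outage.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (frequent)
    impact: int      # hypothetical scale: 1 (minor) to 5 (severe)
    response: str    # planned response to this risk

    @property
    def score(self):
        # One common, simple heuristic: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritize(risks):
    """Return the risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Illustrative entries only; the numbers are invented for the example.
risks = [
    Risk("OS-level defect", 2, 5, "vendor support contract or managed platform"),
    Risk("Critical patch not applied", 3, 4, "patch monitoring or forced patching"),
    Risk("Regional platform outage", 1, 5, "multi-region deployment"),
]

for r in prioritize(risks):
    print(r.score, r.name, "->", r.response)
```

The point is only that each risk gets an explicit response before the configuration is fixed; how you score and rank the risks is up to your own organization.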

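What happens to risk control when you adopt the cloud?
If you want 24/7 support from Microsoft as your response to an OS-level defect, then in an on-premises environment the answer is "sign a premium support contract." With Azure, as this incident shows, you could say that this is built into the Azure service.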

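What happens to risk control when you drop on-premises and adopt the cloud (Azure)?
How does this outage affect the decision on whether to adopt Azure?
Before giving in to blind fear and concluding that "the cloud is no good," take another look at whether the cloud or on-premises is actually the riskier choice.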

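Personally, this reaffirmed my view that Azure is the safer choice for small-scale environments.
The incident showed that patches for critical defects are applied without asking, and it demonstrated that Microsoft responds to critical problems around the clock, 365 days a year.
Even when a patch for a critical problem is released, applying it to a running system takes effort, and I suspect there are cases where people do not even know the patch exists. If so, I cannot help thinking that having truly critical patches applied by force is the easier way.
// So says someone who recently ran into an on-premises defect (one that would have been avoided had the already-released SP1 been applied)...
// That said, this incident may make Azure harder to adopt for, say, mission-critical environments or e-commerce sites.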

Appendix: Affected Services and Timeline

Windows Azure Compute and the services that run on Windows Azure Compute were affected. According to the dashboard, the affected services were "Access Control 2.0, Marketplace, Service Bus, Access Control & Caching Portal."

For information about the outage, see:

Latest information on the Windows Azure service disruption « S/N Ratio (by SATO Naoki)

The entries below are reproduced from the service dashboard and from Sato-san's blog.

29-Feb-12

1:45 AM UTC
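The Windows Azure operations team became aware of an issue affecting compute services in multiple regions. The issue was quickly triaged (its urgency assessed and prioritized) and was determined to have been caused by a software bug. The final root cause analysis is still in progress, but the issue is believed to have been caused by a time calculation that was incorrect for the leap year.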
10:57 AM UTC
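Once the issue was discovered, we immediately took steps to protect customer services that were already up and running and began creating a fix. The fix was successfully deployed to most of the Windows Azure sub-regions, and by 2:57 AM PST on February 29 (7:57 PM Japan time on February 29) we had restored Windows Azure service availability for the majority of customers.
However, some sub-regions and customers are still experiencing this issue and, as a result, may be experiencing a loss of application functionality. We are actively working to resolve these remaining issues.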
10:55 AM UTC
We are experiencing an issue with Windows Azure Compute in the South Central US sub-region. Incoming traffic may not go through for a subset of hosted services in this sub-region. Deployed applications will continue to run. There is no impact to storage accounts either. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
12:30 PM UTC
We are still troubleshooting this issue and capturing all the data that will allow us to resolve it. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
1:30 PM UTC
We are still troubleshooting this issue, and verifying the most probable cause. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
2:30 PM UTC
We have determined that about 28% of hosted services in this sub-region are impacted by this issue. The restoration steps to mitigate the issue are underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
3:30 PM UTC
The restoration steps to mitigate the issue are underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
4:30 PM UTC
The restoration steps to mitigate the issue are still underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
5:30 PM UTC
The restoration steps to mitigate the issue are still underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
6:30 PM UTC
The restoration steps to mitigate the issue are still underway. This incident impacts Access Control 2.0, Marketplace, Service Bus and the Access Control & Caching Portal in the same regions where Windows Azure Compute is impacted. As a result affected customers may experience a loss of application functionality. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
7:30 PM UTC
We are actively recovering Windows Azure hosted services in this sub-region. More and more customers applications should be back up-and-running even if service management functionality is not yet restored. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
8:30 PM UTC
We have recovered over half of the Windows Azure hosted services that were impacted. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
9:30 PM UTC
Recovery efforts are still underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
10:30 PM UTC
Recovery efforts are still underway. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
11:30 PM UTC
Recovery efforts are still underway, we are 70% through the recovery process and we restored full service management functionality for more customers in this region. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.

1-Mar-12

1:00 AM UTC
Recovery efforts are still underway. The majority of Windows Azure customers and services have been restored. However, a few customers and services remain affected and we are working aggressively to restore full functionality. We restored full service management functionality for all customers in this region. We have published a recap of the incident since it started on the Windows Azure team blog (http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/windows-azure-service-disruption-update.aspx). Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
3:00 AM UTC
We are working on stabilizing the Windows Azure Platform as well as following-up with all customers who were impacted by this incident. The recovery process continues. We are also focusing on restoring full service management functionality in this sub-region. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
4:30 AM UTC
Recovery efforts are still underway, we are 85% through the recovery process in terms of impacted hosted services. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
7:30 AM UTC
Our recovery efforts to restore compute service to impacted customers in this sub-region are complete. A small number of customers in this sub-region may face long delays during service management operations. We request any customers experiencing compute or service management issues in this sub-region to reach out to us through the support channel described on this site: (http://www.windowsazure.com/en-us/support/contact). We apologize for any inconvenience this incident causes our customers.
4:57 PM UTC
We have confirmed that full functionality is restored in the South Central US sub-region. We apologize for any inconvenience this incident caused our customers.
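
The 1:45 AM UTC entry above attributes the outage to a time calculation that was incorrect for the leap year, but gives no further detail. Purely as a generic illustration of this class of bug (and not a reconstruction of the actual Azure defect), here is a small Python sketch of a naive "same day next year" calculation that fails on 29 February, next to one possible guard. The function names and the fall-back-to-28-February policy are my own assumptions.

```python
from datetime import date


def naive_one_year_later(d):
    """Naively bump the year; raises ValueError for 29 February,
    because that day does not exist in the following year."""
    return d.replace(year=d.year + 1)


def safe_one_year_later(d):
    """Same calculation with a leap-day guard.
    Assumed policy: fall back to 28 February of the next year."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only reached when d is 29 February.
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_one_year_later(leap_day)
except ValueError as exc:
    print("naive calculation failed:", exc)

print(safe_one_year_later(leap_day))  # 2013-02-28
```

Whether 28 February or 1 March is the right fallback depends on what the date is used for, so treat that choice as a design decision rather than a given.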


Posted by 大和屋貴仁