Lessons from a Cloud Malfunction: An Analysis of a Global System Failure
The paper digs into the true root causes of Skype’s global outage last Christmas, focusing on which mechanisms, tools, and operational and engineering practices could have prevented such a failure from escalating into a global outage, or from happening at all. It derives 11 practical lessons on how to build reliable clouds and other types of large-scale systems (including those built on cloud platforms). Since the first version of this paper was written, another spectacular outage took place: a regional failure of the Amazon EC2 cloud. Judging from Amazon’s postmortem, it appears that following these guidelines would have prevented or contained that failure as well.
- by Alex Maclinovsky
Senior Manager - Architecture Innovation at Accenture
Alex brings over two decades of technology leadership experience. His interests include cloud, Internet-scale computing, dependable systems, and SOA. Alex built his first global cloud last century, delivered Amazon’s Auto Scaling, and launched several Cloud 2.0 initiatives. He is now working on a system designed for 1.2 billion users.
Thank you for your interest in the Second Annual UP 2011 conference. To register for full access to the conference, visit http://up11.eventbrite.com. If you experience any difficulties, please contact us at info@up-con.com.