AWS Broke the Internet Again or, Better, a Typo

An AI-defined infrastructure can help to avoid service disruptions

Amazon Web Services (AWS) broke the Internet again, or, more precisely, a typo did. On February 28, 2017, an Amazon S3 service disruption in US-EAST-1, AWS' oldest region, took down several major websites and services, including Slack, Trello, Quora, Business Insider, Coursera and Time Inc. Other users reported that they could not control devices connected via the Internet of Things because IFTTT was down as well. Disruptions of this kind are becoming more and more business critical for today's digital economy. To prevent these situations, cloud users should always keep the shared responsibility model of the public cloud in mind. However, Artificial Intelligence (AI) can help too. This article describes how an AI-defined Infrastructure, or an AI-powered IT management system, can help to avoid service disruptions caused by public cloud providers.

Amazon S3 Service Disruption - What Happened
After every service disruption, AWS publishes a summary of what went on during the incident. This is what happened on the morning of February 28.

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."

Read more under "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region".

Bottom line: a typo crashed the AWS-powered Internet! AWS outages already have a long history, and the more customers run their web infrastructure on the cloud giant, the more issues end customers will experience in the future. According to SimilarTech, Amazon S3 alone is already used by 152,123 websites and 124,577 unique domains.

However, following the philosophy of "Everything fails all the time" (Werner Vogels, CTO of Amazon.com) means that if you are using AWS, you must "Design for Failure". This is something that cloud role model and video-on-demand provider Netflix does to perfection. To that end, Netflix has developed its Simian Army, an open source toolset anyone can use to run a highly available cloud infrastructure on AWS.

Netflix "simply" uses the two levels of redundancy AWS offers: multiple regions and multiple availability zones (AZs). Using multiple regions is the master class of using AWS. It is very complex and sophisticated, since you must build and manage entirely separate infrastructure environments within AWS' worldwide distributed cloud infrastructure. Multiple AZs are the preferred and "easiest" way to achieve high availability (HA) on AWS. In this case, the infrastructure is built within more than one data center (AZ): a single-region HA architecture is deployed in at least two AZs, and a load balancer in front of it controls the data traffic.
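The idea behind both levels of redundancy can be illustrated with a minimal Python sketch. It is not how Netflix or an AWS load balancer works internally; it only shows the principle of trying a second location when the first one fails. The function names and the two stand-in "clients" (us_east_1, eu_west_1) are hypothetical and make no network calls.

```python
from __future__ import annotations

from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no zone or region could serve the request."""


def fetch_with_failover(
    key: str,
    replicas: Sequence[tuple[str, Callable[[str], bytes]]],
) -> tuple[str, bytes]:
    """Try each replica in order; return (location, data) from the first that answers."""
    errors: list[str] = []
    for location, fetch in replicas:
        try:
            return location, fetch(key)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{location}: {exc}")
    raise AllReplicasFailed("; ".join(errors))


# Hypothetical stand-ins for storage clients in two regions (assumption, no real API).
def us_east_1(key: str) -> bytes:
    raise ConnectionError("service unavailable")


def eu_west_1(key: str) -> bytes:
    return b"contents of " + key.encode()


if __name__ == "__main__":
    location, data = fetch_with_failover(
        "logo.png", [("us-east-1", us_east_1), ("eu-west-1", eu_west_1)]
    )
    print(location, data)
```

In a multi-AZ setup this decision is usually made by the load balancer rather than by the caller; the sketch moves it into the client only to keep the example self-contained.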

However, even if "typos" shouldn't happen, the recent incident shows that human error is still the biggest issue in running IT systems. In addition, you can blame AWS only to a certain extent, since the public cloud is about shared responsibility.
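One common way to limit the damage of a mistyped input is to validate a removal request before executing it. The following Python sketch assumes two hypothetical limits, a maximum batch size and a minimum remaining capacity; it does not describe AWS' actual tooling, and all names in it are illustrative.

```python
from __future__ import annotations


class UnsafeRemoval(Exception):
    """Raised when a removal request would violate a safety limit."""


def check_removal(
    fleet: set[str],
    requested: set[str],
    min_capacity: int,
    max_batch: int,
) -> set[str]:
    """Return the fleet that remains after removal, or raise if the request looks unsafe."""
    unknown = requested - fleet
    if unknown:
        raise UnsafeRemoval(f"unknown servers: {sorted(unknown)}")
    if len(requested) > max_batch:
        raise UnsafeRemoval(
            f"{len(requested)} servers requested, batch limit is {max_batch}"
        )
    remaining = fleet - requested
    if len(remaining) < min_capacity:
        raise UnsafeRemoval(
            f"only {len(remaining)} servers would remain, minimum is {min_capacity}"
        )
    return remaining


if __name__ == "__main__":
    fleet = {f"srv-{i}" for i in range(10)}  # hypothetical server names
    print(len(check_removal(fleet, {"srv-0", "srv-1"}, min_capacity=6, max_batch=3)))
```

With such a check in place, an input that accidentally expands to a large part of the fleet is rejected instead of executed.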

Shared Responsibility in the Public Cloud
An important detail of the public cloud is self-service. Depending on their DNA, providers take responsibility only for specific areas. The customer is responsible for the rest. In the public cloud, it is about sharing responsibilities - this model is called Shared Responsibility. The provider and its customers divide the duties among themselves, and the customer's own responsibility plays a major role. In the context of IaaS utilization, the provider is responsible for the operations and security of the physical environment. It takes care of:

  • Set up and maintenance of the entire data center infrastructure.
  • Deployment of compute power, storage, network and managed services (like databases) and other micro services.
  • Provisioning of the virtualization layer customers use to request virtual resources at any time.
  • Deployment of services and tools customers can use to manage their areas of responsibility.

The customer is responsible for the operations and security of the logical environment. This includes:

  • Set up of the virtual infrastructure.
  • Installation of operating systems.
  • Configuration of networks and firewall settings.
  • Operation of the customer's own applications and self-developed (micro) services.

Thus, the customer is responsible for the operations and security of its own infrastructure environment and for the systems, applications, services and stored data on top of it. However, providers like Amazon Web Services or Microsoft Azure provide comprehensive tools and services that customers can use, for example, to encrypt their data and to enforce identity and access controls. In addition, enablement services (micro services) exist that customers can adopt to develop their own applications more quickly and easily.

Within its area of responsibility, the customer is therefore on its own and must take responsibility itself. However, this part of the shared responsibility can be handled by an AI-defined IT management system or an AI-defined Infrastructure.

An AI-defined Infrastructure can help to avoid Service Disruptions
An AI-defined Infrastructure can help to avoid service disruptions in the public cloud. The basis of this kind of infrastructure is a General AI that combines three major human abilities, which enable enterprises to tackle IT and business process challenges:

  • Understanding: By creating a semantic data map, the General AI understands the world of the company in which its IT and business exist.
  • Learning: By creating Knowledge Items, the General AI learns best practices and reasoning from experts. Knowledge is taught in atomic pieces of information (Knowledge Items) that represent separate steps of a process.
  • Solving: With machine reasoning, problems are solved in ambiguous and changing environments. The General AI dynamically reacts to the ever-changing context, selecting the best course of action. Based on machine learning, the results are optimized through experiments.

To put this into the context of an AWS service disruption:

  • Understanding: The General AI creates a semantic map of the AWS environment as part of the world in which the company exists.
  • Learning: IT experts create Knowledge Items while they are configuring and working with AWS, and the General AI learns best practices from these items. Thus, the experts teach the General AI contextual knowledge that includes what, when, where and why something needs to be done - for example when a specific AWS service is not responding.
  • Solving: The General AI dynamically reacts to incidents based on the learned knowledge. Thus, the AI (probably) knows what to do at this very moment - even if no high availability setup was considered from the beginning.
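How atomic Knowledge Items could be combined into a course of action can be sketched in a few lines of Python. This is a deliberately simple stand-in (greedy forward chaining over facts), not the reasoning engine of any actual product. The class, the fact strings and the three example items are hypothetical, including the assumption that a backup copy of the data is available.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeItem:
    """One atomic step: applicable when `requires` holds, establishes `adds`."""
    name: str
    requires: frozenset[str]
    adds: frozenset[str]
    priority: int = 0


def solve(facts: set[str], goal: str, items: list[KnowledgeItem]) -> list[str]:
    """Apply the highest-priority applicable item until the goal fact holds."""
    known = set(facts)
    plan: list[str] = []
    while goal not in known:
        # Only consider items whose preconditions hold and that add something new.
        applicable = [
            ki for ki in items if ki.requires <= known and not ki.adds <= known
        ]
        if not applicable:
            raise LookupError(f"no knowledge item leads to {goal!r}")
        best = max(applicable, key=lambda ki: ki.priority)
        known |= best.adds
        plan.append(best.name)
    return plan


# Hypothetical knowledge taught by experts (assumption for illustration).
ITEMS = [
    KnowledgeItem("confirm_incident",
                  frozenset({"s3:us-east-1:unresponsive"}),
                  frozenset({"incident:confirmed"}), priority=1),
    KnowledgeItem("notify_on_call",
                  frozenset({"incident:confirmed"}),
                  frozenset({"team:notified"}), priority=1),
    KnowledgeItem("serve_from_backup",
                  frozenset({"incident:confirmed", "backup_copy:available"}),
                  frozenset({"service:restored"}), priority=2),
]

if __name__ == "__main__":
    observed = {"s3:us-east-1:unresponsive", "backup_copy:available"}
    print(solve(observed, "service:restored", ITEMS))
```

The semantic map corresponds to the set of observed facts, the Knowledge Items to the taught steps, and the solver picks a sequence of steps for the current context instead of following one fixed script.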

Frankly speaking, none of what is described above is magic. Like every newborn organism, an AI-defined Infrastructure needs to be trained, but afterwards it can work autonomously, detect anomalies and service disruptions in the public cloud, and resolve them. For this, you need the knowledge of experts who have a deep understanding of AWS and how the cloud works in general. These experts need to teach the General AI their contextual knowledge, which includes not only what, when and where but also why. They have to teach the AI in atomic pieces (Knowledge Items, KIs) that can be indexed and prioritized by the AI. Context and indexing enable these KIs to be combined to form many solutions.

KIs created by various IT experts create pooled expertise that is further optimized by machine selection of best knowledge combinations for problem resolution. This type of collaborative learning improves process time task by task. However, the number of possible permutations grows exponentially with added knowledge. Connected to a knowledge core, the General AI continuously optimizes performance by eliminating unnecessary steps and even changing routes based on other contextual learning. And the bigger the semantic graph and knowledge core gets, the better and more dynamically the infrastructure can act in terms of service disruptions.

On a final note, do not underestimate the "power of we"! Our research at Arago revealed that, with an overlap of 33 percent in basic knowledge, this knowledge can be, and is, used outside a specific organizational environment, i.e. across different client environments. The reuse of knowledge within a client is up to 80 percent. Thus, exchanging basic knowledge within a community becomes imperative from an efficiency perspective and improves the abilities of the General AI.

More Stories By Rene Buest

Rene Buest is Director of Market Research & Technology Evangelism at Arago. Prior to that he was Senior Analyst and Cloud Practice Lead at Crisp Research, Principal Analyst at New Age Disruption and a member of the worldwide Gigaom Research Analyst Network. At that time he was considered a top cloud computing analyst in Germany and one of the worldwide top analysts in this area. In addition, he was one of the world’s top cloud computing influencers and belongs to the top 100 cloud computing experts on Twitter and Google+. Since the mid-90s he has focused on the strategic use of information technology in businesses and the IT impact on our society as well as disruptive technologies.

Rene Buest is the author of numerous professional technology articles. He regularly writes for well-known IT publications like Computerwoche, CIO Magazin, LANline as well as Silicon.de and is cited in German and international media, including New York Times, Forbes Magazin, Handelsblatt, Frankfurter Allgemeine Zeitung, Wirtschaftswoche, Computerwoche, CIO, Manager Magazin and Harvard Business Manager. Furthermore, he is a speaker and participant in expert panels. He is the founder of CloudUser.de and writes about cloud computing, IT infrastructure, technologies, management and strategies. He holds a diploma in computer engineering from the Hochschule Bremen (Dipl.-Informatiker (FH)) as well as an M.Sc. in IT-Management and Information Systems from the FHDW Paderborn.
