
Explicitly-defined failure domains in the datacenter

While the bulk of the networking industry’s focus is on CapEx and automation, the two major trends driving changes in these areas will have a potentially greater impact on what matters most: availability. In fact, even though datacenter purchasing decisions skew towards CapEx, the number one requirement for most datacenters is uptime.

If availability is so important, why is it not already the number one purchasing criterion?

First, it’s not that availability doesn’t matter. It’s more that when everyone is building the same thing, it ceases to be a differentiating point. Most switch vendors have converged on Broadcom silicon and a basic set of features required to run datacenter architectures that have remained essentially unchanged for a decade or more. But are those architectures going to continue unscathed?

SDN and network architecture

For those who believe in the transformative power of SDN, the answer is unequivocally no. If, after all of the SDN work is said and done, we emerge with the same architectures with an extra smidge of automation sprinkled on top, we will have grossly under-delivered on what should be the kind of change that happens once every couple of decades.

The rise of a central controller is more than just pushing provisioning to a single pane of glass. It is about central control over a network using a global perspective to make intelligent resourcing and pathing decisions. While this does not necessarily mean that legacy networking cannot (or should not) co-exist, the model is dramatically different than what exists today.

Switching is just a stop along the way for bare metal

Bare metal switching will also change IT infrastructure in a meaningful way. Again, the scope of public discourse is fairly narrow. The story goes something like this: if someone makes a commodity switch, pricing will come down. But two things are really happening here.

First, what we are really seeing is a shift in monetization from hardware to software. This shift should not be surprising, as software investment on the vendor side has dwarfed hardware R&D for 15 years now. The real change here is that the companies that emerge have a tolerance for lower margins than the behemoths already entrenched in the space. Anyone can drop price; the question is at what margin a business is still attractive. What will play out over the next couple of years is a game of chicken over price.

Second, the objective of bare metal switching is less about the switching and more about the bare metal. Taken to its logical conclusion, the hope has to be that all infrastructure is eventually run on the same set of hardware. Whether something is a server, a storage device, an appliance, or a switch should ultimately be determined by the software that is being run on it. In this case, we see multi-purpose devices whose role depends on the context in which they are deployed. This would eventually allow for the fungibility of resources across what are currently very hard silos.
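
Purely as an illustration of what role-by-software could look like (the inventory, image names, and mapping below are invented for the example, not any real product), the same bare metal box might take on a different personality depending on the image it is handed at provisioning time:

```python
# Hypothetical illustration: the same bare metal SKU becomes a switch, a
# compute node, or a storage node purely based on the software it is handed.
INVENTORY = {
    "rack1-u01": "switch",
    "rack1-u02": "compute",
    "rack1-u03": "storage",
}

IMAGES = {
    "switch": "network-os-4.2.img",
    "compute": "hypervisor-7.1.img",
    "storage": "storage-os-3.0.img",
}

def image_for(device: str) -> str:
    """Pick the software image, and therefore the role, for a given box."""
    return IMAGES[INVENTORY[device]]

for device in INVENTORY:
    print(device, "->", image_for(device))
```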

Domain, Domain, Domain

Both SDN and bare metal lead to architectures that are very different from what exists today. But as architects consider how they will evolve their own instantiations of these technologies, they need to be clear about a couple of facts that get glossed over.

If availability really is the number one requirement for datacenters, then architectures need to explicitly consider how they impact overall resource availability. Consider that there are a number of sources for downtime:

  1. Human error - By far the leading source of downtime in most networks, human error is why there is such momentum around things like ITIL. Put differently, when is uptime the highest for most datacenters? Holidays, when everyone is away from the office.
  2. System issues – After human error, the next biggest cause of downtime is issues in the systems themselves. These are most likely software bugs, but they can include device and link failures as well.
  3. Maintenance – Another major contributor to downtime is overall infrastructure maintenance. When individual systems need to be upgraded or replaced, there is frequently some interruption to service. Of course, if maintenance is planned, then the impact on overall downtime should be low.
  4. Other – The Other category covers things like power outages and backhoes.

Of these, SDN promises to improve the first. By expanding the management domain, it reduces the number of opportunities for pesky humans to make mistakes. Automated workflows executed from a single point of control and orchestrated across disparate elements of the infrastructure should help drive down the number of provisioning mistakes in the datacenter.

Additionally, a central point of control helps improve visibility (or at least it will over time). This helps operators diagnose issues more quickly, which will lower the Mean-Time-to-Repair (MTTR) for network-related issues.
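
As a rough sketch of what that single point of control enables (the controller URL, endpoint, and device names here are hypothetical, not any specific vendor’s API), a centrally driven provisioning workflow might look something like this:

```python
# Hypothetical example: provisioning a VLAN once, via a central controller,
# instead of touching every switch by hand. All names and endpoints are
# illustrative; substitute your controller's actual API.
import requests

CONTROLLER = "https://sdn-controller.example.com/api/v1"

def provision_vlan(vlan_id: int, name: str, switch_ids: list[str]) -> None:
    """Push one intent to the controller; it fans the change out to every switch."""
    payload = {"vlan_id": vlan_id, "name": name, "switches": switch_ids}
    resp = requests.post(f"{CONTROLLER}/vlans", json=payload, timeout=10)
    resp.raise_for_status()  # a single, auditable failure point instead of N CLI sessions

if __name__ == "__main__":
    provision_vlan(42, "web-tier", ["leaf-01", "leaf-02", "leaf-03"])
```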

But the management domain is not the only one that matters. There are at least two others that impact downtime: failure domains and maintenance domains. The impacts on these by SDN and bare metal need to be explicitly understood.

Failure domains

While there are tremendous operational benefits of collapsing domains under a single umbrella, one thing that becomes more difficult is managing the impact of failures when they do occur.

For instance, if a network is under the control of a single SDN controller, what happens if that controller is not reachable? If the controller is an active part of the data path, there is one set of outcomes. If the controller is not an active part of the data path, there is a different set of outcomes.

The point here is not to advocate for one or the other, but rather to point out that architects need to be explicit in defining the failure domain so that they can adjust operations appropriately. For instance, it might be the case that you prefer to balance the control benefits with failure scenarios, opting to create several smaller management domains, each with a correspondingly smaller failure domain. This gives you some benefit over a completely distributed management environment (where management domains are defined by the devices themselves) without putting the entire network under the same failure domain.
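
To make that concrete, here is a minimal sketch, with purely illustrative names and sizes, of carving one fabric into several smaller management domains so that losing any one controller only takes a correspondingly small slice of the network with it:

```python
# Illustrative only: carve the fabric into smaller management domains so the
# failure domain of any one controller stays correspondingly small.
FABRIC = [f"leaf-{i:02d}" for i in range(1, 13)]

def partition_into_domains(switches, domain_size):
    """Group switches into fixed-size management domains, one controller each."""
    return {
        f"controller-{n}": switches[i:i + domain_size]
        for n, i in enumerate(range(0, len(switches), domain_size), start=1)
    }

domains = partition_into_domains(FABRIC, domain_size=4)

def blast_radius(failed_controller):
    """Switches affected if one controller becomes unreachable."""
    return domains.get(failed_controller, [])

print(domains)
print("Lost to controller-2 failure:", blast_radius("controller-2"))
# Only 4 of 12 switches sit inside that failure domain; the rest keep forwarding.
```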

The same is true with bare metal. If bare metal leads to platform convergence, it allows architects to co-host compute, storage, and networking in the same device. Whether you actually group them depends an awful lot on how you view failure domains. Again, any balance is useful so long as it is explicitly chosen. Collapsing everything to a single device creates a different failure domain, which might make sense in some environments and less so in others.

Maintenance domains

The same discussion extends to maintenance domains. Collapsing everything might create a single maintenance domain (depending on architecture, of course). Keeping things separate might enable much smaller maintenance domains. There is no right size, but whatever the architecture that is chosen, the maintenance domain needs to be an explicit requirement.
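
As a sketch of what making the maintenance domain explicit might look like in practice (the domain names and the upgrade step are placeholders), a workflow could walk one maintenance domain at a time so planned work never touches more than one domain at once:

```python
# Illustrative sketch: upgrade one maintenance domain at a time so planned
# work never takes more than one explicitly chosen slice out of service.
MAINTENANCE_DOMAINS = {
    "pod-a": ["leaf-01", "leaf-02", "spine-01"],
    "pod-b": ["leaf-03", "leaf-04", "spine-02"],
}

def upgrade_device(device: str) -> None:
    # Placeholder for the real upgrade step (image push, reboot, health check).
    print(f"upgrading {device} ...")

def rolling_upgrade(domains: dict[str, list[str]]) -> None:
    for domain, devices in domains.items():
        print(f"-- maintenance window for {domain} --")
        for device in devices:
            upgrade_device(device)
        # Verify the domain is healthy before opening the next window.
        print(f"-- {domain} back in service --")

if __name__ == "__main__":
    rolling_upgrade(MAINTENANCE_DOMAINS)
```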

The bottom line

Architectures are changing. There is little doubt that technology advances in IT generally and networking specifically are enabling us to do things we couldn’t really even consider just a few years ago. When deciding how to do those things, though, we need to be explicitly designing for availability. It has always been a requirement, but it has received less and less attention as architectures matured; it should be dominating discussions again. Depending on your own specific requirements, this could lead to some unexpected architectural decisions.

[Today’s fun fact: Playing in a marching band is considered moderate exercise. But lest there be any confusion, this does not make it a sport.]


More Stories By Michael Bushong

The best marketing efforts leverage deep technology understanding with a highly approachable means of communicating. Plexxi's Vice President of Marketing Michael Bushong has acquired these skills having spent 12 years at Juniper Networks, where he led product management, product strategy and product marketing organizations for Juniper's flagship operating system, Junos. Michael spent his last several years at Juniper leading their SDN efforts across both service provider and enterprise markets. Prior to Juniper, Michael spent time at database supplier Sybase, and ASIC design tool companies Synopsys and Magma Design Automation. Michael's undergraduate work at the University of California, Berkeley in advanced fluid mechanics and heat transfer lends new meaning to the marketing phrase "This isn't rocket science."
