Big Data Journal: Blog Feed Post

Scaling Big Data Fabrics

The size of the network might be the least interesting aspect of scaling Big Data fabrics

When people talk about Big Data, the emphasis is usually on the Big. Certainly, Big Data applications are distributed largely because the size of the data on which computations are executed warrants more than a typical application can handle. But scaling the network that provides connectivity between Big Data nodes is not just about creating massive interconnects.

In fact, the size of the network might be the least interesting aspect of scaling Big Data fabrics.

Just how big is Big Data?

Not that long ago, I asked the question: how large is a typical Big Data deployment? I was expecting, as I suspect many people are, that the Big in the title meant that the deployments would be, in a word, big. But the average Big Data deployment is actually far smaller than most people realize. I grabbed a list of deployments from a HadoopWizard article dating back to last year.

What is remarkable about this list is just how unremarkable the sizes of the deployments are. Sure, the list is dated, and deployments have certainly gotten larger. And yes, companies like Yahoo! are pushing scaling limits. But if you take Yahoo! out, the average deployment is a mere 113 nodes. Even if every node is multi-homed to two switches, the average deployment could be handled by 4 access switches.

Even if every deployment quadrupled, you would still only be talking about 16-access-switch deployments. When our industry talks about scaling, we usually think well beyond 16 switches.
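The switch counts above come down to simple port arithmetic. Here is a minimal sketch of that calculation, under one assumption the numbers imply but do not state: that each access switch offers roughly 64 usable server-facing ports. The function name and the `ports_per_switch` parameter are hypothetical, chosen only for illustration.

```python
def access_switches_needed(nodes, links_per_node=2, ports_per_switch=64):
    """Return the minimum number of access switches for a cluster.

    ports_per_switch=64 is an assumed figure, not one taken from the article.
    """
    total_ports = nodes * links_per_node
    # Ceiling division without floating point.
    return -(-total_ports // ports_per_switch)


average = access_switches_needed(113)         # the average deployment
quadrupled = access_switches_needed(113 * 4)  # the same deployment, four times larger
print(average, quadrupled)
```

Under that assumption, 113 dual-homed nodes fit on 4 switches, and a quadrupled cluster needs 15, which is within the 16 quoted above. A lower port count changes the totals but not the conclusion: these are small networks.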

Is scaling an issue?

So if deployments are small, does that mean scaling is a solved issue? The answer is both yes and no. If the end game is building individual networks for each Big Data application, then yes. While the web scale companies will always need more, the vast majority of customers will be well-served by the scaling limits that are around today.

But the issue with Big Data is that it isn’t really just Big Data. When we talk about Big Data, we usually ought to be using a different moniker. For most people, Big Data is less about Hadoop and more about clustered applications (at least so far as the network is concerned). By expanding the definition to clustered applications, you move past Hadoop and into clustered compute and even clustered storage environments. Anything clustered has a dependency on some kind of interconnect.

The challenge in clustered environments

The challenge of all these types of clustered environments is that their requirements vary. For Hadoop, job completion times are dominated by the compute side of things, so the network is really about providing a congestion-free interconnect that is always available. For clustered compute, latency might be more important. And for multi-tenant environments, it might be most important to isolate traffic. Whatever the application, the point is that the requirements are highly contextual.
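To see why context matters, consider what a single shared interconnect would have to deliver if it hosted all three kinds of workload at once. The sketch below is one way to model that, not a prescribed method: the `Requirements` type, its fields, and every number in it are hypothetical placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    min_bandwidth_gbps: float  # per-node bandwidth the workload needs
    max_latency_us: float      # worst tolerable node-to-node latency
    needs_isolation: bool      # whether traffic must be kept separate


def combined(workloads):
    """Return the strictest requirement in each dimension across workloads."""
    return Requirements(
        min_bandwidth_gbps=max(w.min_bandwidth_gbps for w in workloads),
        max_latency_us=min(w.max_latency_us for w in workloads),
        needs_isolation=any(w.needs_isolation for w in workloads),
    )


# Illustrative values only; they are assumptions, not data from the article.
hadoop = Requirements(10.0, 500.0, False)            # bandwidth-hungry
clustered_compute = Requirements(1.0, 10.0, False)   # latency-sensitive
multi_tenant = Requirements(1.0, 500.0, True)        # isolation-driven

print(combined([hadoop, clustered_compute, multi_tenant]))
```

The combined profile is stricter than any single workload's: it inherits the bandwidth of one, the latency bound of another, and the isolation of the third.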

Which brings us back to scaling.

The real issue in scaling Big Data fabrics is less about making a small interconnect larger. Networks are not going to scale along the lines of single applications (or at least they shouldn’t). The actual scaling challenge is plotting a course from a single Big Data application to an environment that hosts multiple clustered applications, each with different requirements.

This might seem dead simple, but it isn’t. When people deploy Big Data applications today, the Big part leads people to purpose-build architecture with massive data workloads in mind. In many cases, this includes building out separate networks aimed at specific workloads.

But even in the best cases, Hadoop makes use of things like rack awareness, which helps provide application resilience while minimizing traffic across the network. Regardless of whether you view this as a benefit to the application or to the network, the result is that proximity and locality are built into the infrastructure. This creates interesting considerations (and potentially limitations) when expanding. If you want to grow a cluster, you can’t just use any available server in the datacenter; some servers are preferable to others based solely on their physical location.
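As a rough illustration of how location narrows the choice, the sketch below ranks spare servers by how many cluster nodes already share their rack. This is one possible heuristic, not Hadoop's own placement logic, and the rack and server names are made up for the example.

```python
from collections import Counter


def rank_candidates(cluster_racks, candidates):
    """Order candidate servers so those in racks the cluster already uses come first.

    cluster_racks: rack ids of servers already in the cluster, one entry per server.
    candidates: mapping of candidate server name -> rack id.
    """
    usage = Counter(cluster_racks)
    # Sort by how many cluster nodes already sit in the candidate's rack
    # (more is better), then by name for a stable, predictable order.
    return sorted(candidates, key=lambda name: (-usage[candidates[name]], name))


# Hypothetical inventory: three existing nodes and three spare servers.
cluster = ["rack-a", "rack-a", "rack-b"]
spare = {"srv-17": "rack-c", "srv-04": "rack-a", "srv-09": "rack-b"}
print(rank_candidates(cluster, spare))
```

Here the server in `rack-a` ranks first and the one in the unused `rack-c` ranks last, even though all three are equally available. A real deployment would also weigh resilience, which can argue for spreading nodes across racks rather than concentrating them.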

Scalability is more than scaling

Making a scalable interconnect for these types of clustered applications is more than just supporting a large (or as I mentioned previously, not so large) number of nodes. The objective for scalability is to provide a graceful path from start to finish. This means architectures need to consider not just what the ending state is but also how to get from here to there.

With Hadoop, this means that things like locality have to be an explicit consideration in architecting the interconnect. Is the right answer a bunch of cross-connects zigzagging across the datacenter? Maybe. Or it might be a different architectural approach to providing interconnect between clustered servers.

Additionally, it isn’t just about one application. Architecting for bandwidth because you have a Hadoop-y application is great, but what if the next clustered application is latency-sensitive? Or if it brings with it a set of auditing and compliance requirements more typical of HIPAA-style applications?

If the architecture doesn’t explicitly consider how to expand beyond a single application, even if it can grow to thousands of switches, it won’t really matter.

The bottom line

The punch line here is that scaling is not only about growing larger. It also means potentially growing more diverse. And if there is one thing that the Hadoop deployment numbers tell me, it’s that people are still experimenting. If you are still experimenting, how can you predict with certainty what the next 5 or 10 years will mean in terms of applications for your business? You can’t. Which means that the most important architectural objective might go well beyond the number of switches in a deployment. Scalability could be about building flexibility into your datacenter. How do you get a bunch of different purpose-built capabilities into a single, general-purpose network? Answering that might be the real key to determining how to scale Big Data fabrics.

[Today’s fun fact: It is against the law to use the Star Spangled Banner as dance music in Massachusetts. There go my party plans!]

The post Scaling Big Data fabrics appeared first on Plexxi.

More Stories By Michael Bushong

The best marketing efforts leverage deep technology understanding with a highly approachable means of communicating. Plexxi's Vice President of Marketing Michael Bushong has acquired these skills having spent 12 years at Juniper Networks, where he led product management, product strategy and product marketing organizations for Juniper's flagship operating system, Junos. Michael spent the last several years at Juniper leading their SDN efforts across both service provider and enterprise markets. Prior to Juniper, Michael spent time at database supplier Sybase, and ASIC design tool companies Synopsys and Magma Design Automation. Michael's undergraduate work at the University of California Berkeley in advanced fluid mechanics and heat transfer lends new meaning to the marketing phrase "This isn't rocket science."
