The Fallacies of Big Data

No software, not even Hadoop, can make sense out of anything

The biggest problem with software is that it doesn’t do us any good at all unless our wetware is working properly – and unfortunately, the wetware which resides between our ears is limited, fallible, and insists on a good Chianti every now and then.

Improving our information technology, alas, only exacerbates this problem. Case in point: Big Data. As we’re able to collect, store, and analyze data sets of ever increasing size, our ability to understand and process the results of such analysis putters along, occasionally falling into hidden traps that we never even see coming.

I’m talking about fallacies: widely held beliefs that are nevertheless quite false. While we like to think of ourselves as creatures of logic and reason, we all fall victim to misperceptions, misjudgments, and miscalculations far more often than we care to admit, often without even realizing we’ve lost touch with reality. Such is the human condition.

Combine our natural proclivity to succumb to popular fallacies with the challenge of getting our wetware around just how big Big Data can be, and you have a recipe for disaster. But the good news is that there is hope. The best way to avoid an unseen trap in your path is to know it’s there. Fallacies are easy to avoid if you recognize them for what they are before they mislead you.

The Lottery Paradox
The first fallacy to recognize – and thus, to avoid – is the lottery paradox. The lottery paradox states that people place an inordinate emphasis on improbable events. Nobody would ever buy a lottery ticket if they based their decision to purchase on the odds of winning. As the probability of winning drops to extraordinarily low levels (for example, the odds of winning the Powerball jackpot are roughly 1 in 175,000,000), people simply lose touch with the reality of the odds.

Furthermore, it’s important to note that the chance that someone will win the jackpot is relatively high, simply because so many tickets are sold. People erroneously conflate these two probabilities as though they were somehow comparable: “someone has to win, so why not me?” we all like to say, as we shell out our $2 per ticket. If tens of millions of people were to read this article (I should be so lucky!), then it would be fairly likely that some member of this impressive audience will win the lottery. But sorry to say, it won’t be you.
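To see just how far apart those two probabilities really are, here is a quick back-of-the-envelope calculation (a sketch in Python; the 1-in-175,000,000 odds come from the Powerball figure above, while the ticket count is a made-up illustrative number and tickets are treated as independent):

p_win = 1 / 175_000_000        # odds that any single ticket hits the jackpot
tickets_sold = 50_000_000      # hypothetical number of tickets in one drawing

# Chance that *you* win with your one ticket: vanishingly small.
print(f"Your chance of winning: {p_win:.1e}")          # about 5.7e-09

# Chance that *someone* wins: 1 - P(every ticket loses). Not small at all.
p_someone = 1 - (1 - p_win) ** tickets_sold
print(f"Chance that someone wins: {p_someone:.0%}")    # roughly 25%

The two numbers differ by many orders of magnitude, yet our wetware happily treats them as comparable.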

The same fallacy can crop up with Big Data. As the size of Big Data sets explodes, the chance of finding any particular analytical result (in other words, a “nugget of wisdom”) becomes increasingly small. However, the chance of finding some interesting result is quite high. Our natural tendency to conflate these two probabilities can lead to excess investment in the expectation of a particular result. And when we don’t get the result we’re looking for, we wonder whether we’ve wasted all the money we sank into our Big Data tools.

Another way of looking at the lottery paradox goes by the name of the law of truly large numbers. Essentially, this law states that if your sample size is very large, then any outrageous thing is likely to happen. And with Big Data, our sample sizes can be truly enormous. In the lottery example, we have a single outrageous event (I win the lottery!), but in a broader context, some outrageous result or other is bound to occur as long as your data sets are large enough. But just because we’re dealing with Big Data doesn’t mean that any particular outrageous event is more likely than before.
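Here is a toy illustration of the law of truly large numbers (a sketch; the coin-flip “dataset” is invented for the purpose): a run of 20 heads in a row is roughly a one-in-a-million event at any given spot in the sequence, yet across ten million flips it is all but guaranteed to appear somewhere.

import random

random.seed(42)
flips = 10_000_000                     # a modest "big" dataset of fair coin flips
longest = current = 0
for _ in range(flips):
    if random.random() < 0.5:          # heads extends the current streak
        current += 1
        longest = max(longest, current)
    else:                              # tails resets the streak
        current = 0

print(f"Longest run of heads in {flips:,} flips: {longest}")
# The longest run typically lands in the low twenties (roughly log2 of the number
# of flips), so a "one-in-a-million" streak of 20 heads is virtually certain to
# show up in a sample this size - even though it is no more likely at any given spot.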

The Fallacy of Statistical Significance
Anybody who’s ever wondered how political pollsters can draw broad conclusions about popular opinion based upon a small handful of people knows that statistical sampling can lead to plenty of monkey business. Small sample sizes lead to large margins of error, which in turn can lead to statistically insignificant results. For example, if candidate A is leading candidate B by 2%, but the margin of error is 5%, then the 2% lead is insignificant – there’s a very good chance it reflects sampling error rather than the population at large. For a lead to be significant, it has to be comfortably larger than the margin of error. So if candidate A is leading by, say, 7%, we can be reasonably sure that lead reflects the true opinion of the population.

So far so good, but if we add Big Data to the mix, we have a different problem. Let’s say we up the sample size from a few hundred to a few million. Now our margins of error are a fraction of a percent. Candidate A may have a statistically significant lead even if it’s 50.1% vs. 49.9%. But while a 7% lead might be difficult to overcome in the few weeks leading up to an election, a 0.2% lead could easily be reversed in a single day. Our outsized sample size has led us to place too much stock in the notion of statistical significance, because it no longer relates to how we define significance in a broader sense.
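The underlying arithmetic is simple: for a proportion, the 95% margin of error shrinks roughly as one over the square root of the sample size. A quick sketch using the standard textbook formula (the sample sizes are arbitrary):

import math

def margin_of_error(n, p=0.5, z=1.96):
    # Approximate 95% margin of error for an estimated proportion p from n responses.
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 10_000, 1_000_000):
    print(f"n = {n:>9,}: +/- {margin_of_error(n):.2%}")

# n =     1,000: +/- 3.10%  -> a 2% lead is statistical noise
# n =    10,000: +/- 0.98%
# n = 1,000,000: +/- 0.10%  -> a 0.2% lead is "significant," yet practically meaningless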

The way to avoid this fallacy is to make proper use of sampling theory: even when you have immense Big Data sets, you may want to take random samples of a manageable size in order to obtain useful results. In other words, fewer data can actually be better than more data. Note that this sampling approach flies in the face of exhaustive processing algorithms like the ones that Hadoop is particularly good at, which are likely to lead you directly into the fallacy of statistical significance.
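One practical way to take such samples when the data won’t fit in memory is reservoir sampling, which picks k items uniformly at random from a stream of unknown length in a single pass. A minimal sketch (the “stream” here is just a stand-in generator, not a real data source):

import random

def reservoir_sample(stream, k, seed=0):
    # Return k items chosen uniformly at random from an iterable of unknown size.
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randint(0, i)          # keep each new item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

big_stream = (x * x % 97 for x in range(1_000_000))   # stand-in for a huge dataset
print(reservoir_sample(big_stream, k=10))

You can then run your statistics on the manageable sample rather than on every record, which keeps both the computation and the significance math honest.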

Playing with Numbers
Just as people struggle to grok astronomically small probabilities, they also struggle to get their heads around very large numbers. Inevitably, they end up resorting to some wacky metaphor, usually an astronomical comparison involving stacks of pancakes to the moon or some such. Such metaphors can help people understand large numbers – or they can simply confuse or mislead. Add Big Data to the mix and you suddenly have the power to sow misinformation far and wide.

Take, for example, the NSA. In a document released on August 9, 2013, the NSA explained:

According to the figures published by a major tech provider, the Internet carries 1,826 Petabytes of information per day. In its foreign intelligence mission, NSA touches about 1.6% of that. However, of the 1.6% of the data, only 0.025% is actually selected for review. The net effect is that NSA analysts look at 0.00004% of the world’s traffic in conducting their mission – that’s less than one part in a million. Put another way, if a standard basketball court represented the global communications environment, NSA’s total collection would be represented by an area smaller than a dime on that basketball court.

Confused yet? Let’s pick apart what this paragraph is actually saying, and you can be the judge. The NSA claims to be touching 1.6% of 1,826 petabytes per day, which works out to about 29 petabytes per day, or roughly 29,000 terabytes. (29 petabytes per day also works out to over 10 exabytes per year. Talk about Big Data!)

When they say they select 0.025% (one fortieth of a percent) of those 29,000 terabytes per day for review, what they’re saying is that their automated Big Data analysis algorithms give them roughly 7.3 terabytes of results to process manually, every day. To put this number into context, assume that those 7.3 terabytes consisted entirely of telephone call detail records, or CDRs. Now, we know that the NSA is analyzing far more than CDRs, but we can use CDRs to do a little counter-spin of our own. Since a rule of thumb is that an average CDR is about 200 bytes long, 7.3 terabytes per day works out to records of roughly 37 billion phone calls, or about five phone calls per day for every man, woman, and child on earth.
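For the skeptical, here is the arithmetic behind both spins, laid out end to end (a sketch; the 200-byte CDR figure is the rule of thumb cited above, and the world-population figure is approximate):

PB = 10**15     # petabyte, in bytes
TB = 10**12     # terabyte, in bytes

internet_per_day = 1_826 * PB             # figure cited in the NSA document
touched  = 0.016   * internet_per_day     # "touches about 1.6%"
selected = 0.00025 * touched              # "0.025% ... selected for review"

print(f"Touched:  {touched / PB:.0f} PB per day")    # ~29 PB/day
print(f"Selected: {selected / TB:.1f} TB per day")   # ~7.3 TB/day

cdr_bytes  = 200                # rough size of one call detail record
population = 7_100_000_000      # world population, circa 2013
cdrs_per_day = selected / cdr_bytes
print(f"CDR equivalents: {cdrs_per_day:,.0f} per day")               # ~36.5 billion
print(f"Per person: {cdrs_per_day / population:.1f} calls per day")  # ~5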

So, which is a more accurate way of looking at the NSA’s data analysis: a dime on a basketball court, or records of tens of billions of phone calls every single day? The answer is that both comparisons are skewed to prove a point. You should take any such explanation of Big Data with a Big Data-sized grain of salt.

The ZapThink Take
Perhaps the most pernicious fallacy afflicting Big Data is the “more is better” fallacy: the false assumption that if a certain quantity of data is good, then more data are necessarily better. In reality, more data can actually be a bad thing. You may be encouraging the creation of duplicate or incorrect data. The chance your data are redundant goes way up. And worst of all, you may be collecting ever-increasing quantities of irrelevant data.

In our old, “small data” world, we were careful what data we collected in the first place, because we knew we were using tools that could only deal with so much data. So if you wanted, say, to understand the pitching stats for the Boston Red Sox, you’d start with only Red Sox data, not data from all of baseball. But now it’s all about Big Data! Let’s collect everything and anything, and let Hadoop make sense of it all!
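The old discipline is easy to express in code: decide what question you’re asking, then collect or keep only the data that bears on it. A trivial sketch (the field names and numbers are made up, not a real baseball dataset):

games = [
    {"team": "BOS", "pitcher": "Lester",   "earned_runs": 2, "innings": 7.0},
    {"team": "NYY", "pitcher": "Sabathia", "earned_runs": 4, "innings": 6.0},
    {"team": "BOS", "pitcher": "Lackey",   "earned_runs": 3, "innings": 6.0},
    # ... imagine millions more rows from every team and every season
]

# Keep only the rows the question actually needs, then analyze.
red_sox = [g for g in games if g["team"] == "BOS"]
era = 9 * sum(g["earned_runs"] for g in red_sox) / sum(g["innings"] for g in red_sox)
print(f"Red Sox staff ERA (toy data): {era:.2f}")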

But no software, not even Hadoop, can make sense out of anything. Only our wetware can do that. As our Big Data sets grow and our tools improve, we must never lose sight of the fact that our ability to understand what the technology tells us is a skill set we must continue to hone. Otherwise, not only are the data fooling us, but we’re actually fooling ourselves.

Image credit: _rockinfree

More Stories By Jason Bloomberg

Jason Bloomberg is Chief Evangelist at EnterpriseWeb, where he drives the message and the community for EnterpriseWeb’s next generation enterprise platform. He is a global thought leader in the areas of Cloud Computing, Enterprise Architecture, and Service-Oriented Architecture. He is a frequent conference speaker and prolific writer, and he also serves as blogger for DevX. His latest book, The Agile Architecture Revolution: How Cloud Computing, REST-based SOA, and Mobile Computing are Changing Enterprise IT (John Wiley & Sons), was published in March 2013. Prior to EnterpriseWeb he was President of ZapThink, where he created the Licensed ZapThink Architect (LZA) SOA course and associated credential, and ran the LZA course as well as his Enterprise Cloud Computing course around the world. He was also the primary contributor to the ZapFlash newsletter and blog for twelve years. Mr. Bloomberg is one of the original Managing Partners of ZapThink LLC, the leading SOA advisory and analysis firm, which was acquired by Dovel Technologies in August 2011. Mr. Bloomberg’s book, Service Orient or Be Doomed! How Service Orientation Will Change Your Business (John Wiley & Sons, 2006, coauthored with Ron Schmelzer), is recognized as the leading business book on Service Orientation. He also co-authored the books XML and Web Services Unleashed (SAMS Publishing, 2002), and Web Page Scripting Techniques (Hayden Books, 1996). He has a diverse background in eBusiness technology management and industry analysis, including serving as a senior analyst in IDC’s eBusiness Advisory group, as well as holding eBusiness management positions at USWeb/CKS (later marchFIRST) and WaveBend Solutions (now Hitachi Consulting).
