Extending and Augmenting Hadoop

Use the right tool for the job

In the last two years, the Apache Hadoop software library has emerged as a veritable Swiss Army knife of data management and analytical infrastructure. The Hadoop toolset has been positioned as a universal platform for all types of commercial and organizational analytical needs.

Hadoop is an ideal solution for use cases in which the data is easily partitioned and distributed. For example, consider keyword searches, a major component of search engine optimization (SEO). Simply identifying and counting distinct words in text data is a central part of the process for keyword-based search. No matter how many pieces of text you have, each document, article, blog post, or other piece of content is distinct from the others.

To enable keyword search, a program computes the number of times each word occurs in every distinct document or item of text. Clearly, this can be done in isolation: to count a word's occurrences across a set of documents, you count it within each document and add up the counts. Moreover, because the documents are distinct, many of them can be processed at the same time. An Internet-scale search engine like Google essentially leverages this concept to distribute such processing across a large number of simple machines (a cluster).

Another example where Hadoop shines is when it comes to counting the number of times a specific user, as represented by an IP address, has visited a particular web page or website. Again, this can be broken up into a series of smaller problems and spread over multiple machines in a Hadoop cluster. Results from the smaller sets can then be aggregated to obtain the total count.

MapReduce, the programming model behind Hadoop, was designed to address problems in which an operation over a very large dataset can easily be broken into the same operation on smaller datasets. The promise of Hadoop is in the ability to use open source software relatively inexpensively to address this whole class of partitionable problems.
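
To make the partitioning concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, a standard Hadoop facility that runs scripts as map and reduce steps. The file names and the local pipeline at the end are illustrative assumptions, not details from a particular deployment.

  # mapper.py - emits one "<word>TAB1" line per word; Hadoop runs copies
  # of this script in parallel, one over each split of the input data.
  import sys

  for line in sys.stdin:
      for word in line.strip().lower().split():
          print(word + "\t1")

  # reducer.py - Hadoop sorts mapper output by key, so all counts for a
  # given word arrive together and can be summed in a single pass.
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rsplit("\t", 1)
      if word != current_word:
          if current_word is not None:
              print(current_word + "\t" + str(current_count))
          current_word, current_count = word, 0
      current_count += int(count)
  if current_word is not None:
      print(current_word + "\t" + str(current_count))

The same logic can be simulated locally with 'cat docs.txt | python3 mapper.py | sort | python3 reducer.py', which makes the split-count-then-aggregate structure of the problem easy to see.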

However, there are a number of analytical use cases for which Hadoop is inadequate. In such cases, the Hadoop toolset may need to be augmented and extended with other technologies to properly resolve these problems.

Understanding Graph Connections
In parallel with the emergence of Hadoop, the world of social media has exploded: as of 2012, the social media powerhouse Facebook had more than one billion registered users, according to CEO Mark Zuckerberg.

Social media networks such as Facebook and LinkedIn are driven by a fundamental focus on relationships and connections. For example, Facebook users can now use the service's Graph Search to find friends of friends who live in the same city or like the same baseball team, and the site frequently suggests "people you may know" based on the mutual connections between two unconnected individuals. LinkedIn helps business professionals grow their social networks by surfacing key contacts or prospects who are connected to existing friends or colleagues, and by letting users leverage those existing relationships to form new connections. Such connection data is becoming ever more useful to individuals for enhancing their personal and business lives.

Likewise, the capacity to comprehend and assess such relationships is a key component driving the world of business analytics. For example, business managers frequently want to know the answers to questions such as:

  • What are all the ways in which a person of interest in a crime database may be related to another person of interest?
  • Based on known patterns of suspicious behavior in a corporate network, how can we identify malicious hacking attacks before they have a financial impact on our company?
  • Which of an organization's partners have a financial exposure to the failure of another company?

Take the question of how two people might be connected on social media. This may seem simple, but as soon as you look closely, it's not quite so clear. The simplest example of such a problem is in looking at how two people may be connected on Facebook. They can be friends - a direct connection that is hard to miss. Or they might be friends of friends, which starts getting a little murkier. The connections can be even more distant and difficult to immediately pinpoint. For instance, Person A may be married to someone whose brother is a friend of Person B. Or perhaps they have a shared affiliation, such as attending the same school, working at the same organization, or attending the same church.

In some cases, two individuals' only connection may be sharing a few Likes. These shared affinities may be valuable information to a business if, for example, those Likes happen to be something your organization addresses. In that case, you may want to drill down to those specific people out of the entire billion users on Facebook, so that you can target your online advertising directly to them.

If you think of all the possible ways that one Facebook user can be connected to another user, it is a very different kind of Big Data problem. You cannot simply break up the problem into smaller segments because, by definition, it involves connections that require link analysis. This makes it a problem that Hadoop isn't ideally suited to address.

Link analysis problems occur in many domains beyond social networks. The network of neurons in the brain and the pathways between these neurons is an example. A group of suspicious people and their connections (as observed by their interactions) is another. The network of genes and proteins and their interactions is yet another.

What do you do to solve problems that involve complex relationship patterns and require detailed link analysis? Enter graph analytics.

Graph Analytics
Essentially, graphs provide a way of organizing data that specifically highlights relationships. On such a foundation, it is possible to apply analytical techniques, from simple to complex, to understand groups of similar related entities, to identify the central influencer in a social network, or to identify complex patterns of behavior indicative of fraud.

In fact, the secret to Google's search engine success is a specific graph analytics technique called PageRank. Rather than focusing on the prevalence of keywords in a web page, Google focused on the relationships between web pages on the World Wide Web and prioritized results from highly authoritative sites - yielding astonishing accuracy in determining relevant results for keyword searches.
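
As a toy illustration of the idea (not Google's production system), the Python networkx library implements PageRank directly; pages with many incoming links from other well-linked pages float to the top:

  import networkx as nx

  # Directed edges model hyperlinks: an edge a -> b means page a links to b.
  web = nx.DiGraph([
      ("home", "blog"), ("home", "docs"),
      ("blog", "docs"), ("forum", "docs"),
      ("docs", "home"),
  ])

  # alpha is the standard damping factor from the PageRank formulation.
  scores = nx.pagerank(web, alpha=0.85)
  for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
      print(page, round(score, 3))

Here "docs" ranks highest because three other pages link to it - authority flows along the graph's links rather than from keyword frequency.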

A common, standard way of representing data in this relationship-oriented format is the Resource Description Framework (RDF), a W3C standard, which is accompanied by a query language called SPARQL designed specifically to analyze such data. In the life sciences domain, companies and public consortia are increasingly representing data in this form, because it provides a more comprehensive view of the data's relationships - whether gene/protein interactions, or diseases and their genetic characteristics.
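
Here is a minimal sketch of what SPARQL looks like in practice, using the Python rdflib library; the ex: namespace and the facts below are illustrative assumptions, not a real dataset:

  from rdflib import Graph, Namespace

  EX = Namespace("http://example.org/")
  g = Graph()
  g.add((EX.alice, EX.worksAt, EX.acme))
  g.add((EX.carol, EX.worksAt, EX.acme))
  g.add((EX.alice, EX.knows, EX.bob))

  # SPARQL describes a *pattern of relationships* to match: here, pairs
  # of distinct entities that share an employer.
  query = """
      PREFIX ex: <http://example.org/>
      SELECT ?a ?b WHERE {
          ?a ex:worksAt ?org .
          ?b ex:worksAt ?org .
          FILTER (?a != ?b)
      }
  """
  for row in g.query(query):
      print(row.a, "shares an employer with", row.b)

Note that the query asks about relationships directly; there are no tables to join and no schema to design first.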

Requires Secret Sauce
Since the nature of graphs makes them difficult to partition, Hadoop is not well suited to this class of analytical problems.

As a matter of fact, the problem runs even deeper: because of the unpredictable data access patterns involved in following and analyzing relationships, commodity hardware architectures are fundamentally challenged. Merely grouping machines together does not help, because the challenges posed by graph analytics arise at the network level and are not significantly addressed by the computing capacity of any single machine. What, then, is the ideal approach for solving complex problems involving the analysis of relationships in data?

The secret sauce behind the best-performing graph analytics tools is massive un-partitioned memory. One tool, for example, uses a memory pool of up to 512TB (half a petabyte) to perform continuous data and link analysis in real time, even as data continues to pour in. This eliminates latency problems and memory scalability issues, while customized chips speed performance.

Comparison Table: Hadoop vs. Graph Analytics

                    Hadoop                        Graph Analytics
  Operation mode    Batch                         Real-time
  Language          MapReduce                     SPARQL
  Platform          Any commodity hardware        Specialized hardware
  Queries           Must be partitioned           Allows non-partitioned
  Query types       Seek specific data answers    Discover relationships, connections
  Results           Tables of entities            Relationships between entities

Graph Analytics Use Cases
Graph analytics is a new player in the Big Data game (which is itself quite new). Still, pioneers and early adopters are reporting promising results from graph analytics across diverse types of problems. Examples include:

  • Actionable intelligence: QinetiQ North America (QNA) delivers "actionable intelligence" to government customers interested in identifying threats through the detection of non-obvious patterns of relationships in big data. Graph analytics was the obvious approach: QNA uses a purpose-built graph analytics appliance running graph-optimized hardware and a graph database, and interacts with the appliance through the industry-standard RDF/SPARQL interface defined by the World Wide Web Consortium (W3C).
  • Life sciences: Oak Ridge National Laboratory (ORNL) opted for a graph analytics appliance to conduct research in healthcare fraud and analytics for a leading healthcare payer. In addition to the healthcare fraud detection program, researchers and scientists at ORNL will also apply the capabilities of the graph analytics appliance to other areas of research where data discovery is vital. These potential use cases include healthcare treatment efficacy and outcome analysis, analyzing drugs and side effects, and the analysis of proteins and gene pathways.
  • Higher education: The Pittsburgh Supercomputing Center (PSC) turned to a graph analytics appliance called Sherlock (no relation to IBM's Watson) to give researchers the ability to search extremely large and complex bodies of information using a straightforward command similar to ‘find something important.' Sherlock took advantage of specialized graph analytics hardware to run 128 threads per processor and to speed memory access across a terabyte of global shared memory. The appliance helped PSC win public recognition for extending graph analytics techniques to a wide range of scientific research projects.

The potential uses of graph analytics are just beginning to be explored. Already, the technology is being applied across a broad array of industries, including manufacturing, energy and gas exploration, earth sciences and meteorology, and government and defense.

Advantages Offered by Graph Analytics
A key advantage of graphs is the ease with which new sources of data and new relationships can be added. Graph databases using RDF to represent the graph can easily merge and unify diverse datasets without significant upfront investment in data modeling. Such an approach stands in stark contrast to ‘traditional' analytics, in which a great deal of time is spent organizing data, and in which adding a new data source requires time-consuming and error-prone effort by analysts.
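
A small sketch of this on-boarding story with rdflib (the Turtle file names are hypothetical):

  from rdflib import Graph

  g = Graph()
  g.parse("crm_contacts.ttl", format="turtle")     # one source system
  g.parse("support_tickets.ttl", format="turtle")  # an entirely different one

  # Because RDF identifies entities by URI, triples about the same entity
  # from both files simply coexist in one graph - no join keys, no schema
  # migration, no upfront modeling.
  print("merged graph holds", len(g), "triples")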

The easy on-boarding of new data is particularly important when dealing with Big Data. Traditional analytics focus on finding answers to known questions. By contrast, many of the highest value applications, such as those identified above, are focused on discovery, where the questions to be answered are not known in advance. The ability to quickly and easily add new data sources or new relationships within the data when needed to support a new line of questioning is crucial for discovery, and graphs are uniquely well qualified to support these requirements.

Graph analytics also offers sophisticated capabilities for analyzing relationships, whereas traditional analytics focuses on summarizing, aggregating, and reporting on data - use the right tool for the job. Some common graph analytic techniques, illustrated in the sketch after this list, include:

  1. Centrality analysis: To identify the most central entities in your network, a very useful capability for influencer marketing.
  2. Path analysis: To identify all the connections between a pair of entities, useful in understanding risks and exposure.
  3. Community detection: To identify clusters or communities, which is of great importance to understanding issues in sociology and biology.
  4. Sub-graph isomorphism: To search for a pattern of relationships, useful for validating hypotheses and searching for abnormal situations, such as hacker attacks.
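
The sketch below illustrates all four techniques on a small made-up social graph, using the Python networkx library (an illustrative stand-in - the appliances described earlier use their own engines and query interfaces):

  import networkx as nx
  from networkx.algorithms import community, isomorphism

  g = nx.Graph([
      ("ann", "bob"), ("bob", "cam"), ("cam", "ann"),  # one tight cluster
      ("dee", "eli"), ("eli", "fay"), ("fay", "dee"),  # a second cluster
      ("cam", "dee"),                                  # a bridge between them
  ])

  # 1. Centrality: cam and dee score highest - they broker all
  #    traffic between the two clusters.
  print(nx.betweenness_centrality(g))

  # 2. Path analysis: how a pair of entities is connected
  #    (here, one shortest chain of relationships).
  print(nx.shortest_path(g, "ann", "fay"))

  # 3. Community detection: recovers the two clusters.
  print(list(community.greedy_modularity_communities(g)))

  # 4. Sub-graph isomorphism: does a triangle pattern occur anywhere?
  triangle = nx.Graph([("x", "y"), ("y", "z"), ("z", "x")])
  print(isomorphism.GraphMatcher(g, triangle).subgraph_is_isomorphic())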

Complementary to Hadoop
Interestingly, Hadoop and graph analytics complement each other perfectly. Hadoop is a scale-out solution, allowing independent items of work to be parceled out to the computers in a cluster. Graph analytics, on the other hand, excel at looking at the "big picture," analyzing complex networks of relationships that cannot be partitioned.

For example, consider risk analysis within a financial institution. Many documents will need to be independently analyzed and the relationships between organizations extracted. This is a perfect job for Hadoop, since each document is independent of the others. The complex network of relationships between organizations, on the other hand, forms an un-partitionable graph, which is best analyzed as a single entity, in memory.
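
Here is a hedged sketch of that division of labor; extract_org_pairs is a hypothetical stand-in for real entity extraction, and the documents are invented:

  import networkx as nx

  def extract_org_pairs(document):
      """Map step: runs independently per document, so it scales out on Hadoop."""
      orgs = [w for w in document.split() if w.isupper()]  # toy 'extractor'
      return [(a, b) for a in orgs for b in orgs if a < b]

  documents = [
      "ACME guarantees a loan for GLOBEX",
      "GLOBEX supplies parts to INITECH",
  ]

  # The per-document results are then gathered into ONE graph for link analysis.
  g = nx.Graph()
  for doc in documents:  # in production, the output of a Hadoop job
      g.add_edges_from(extract_org_pairs(doc))

  # Exposure question: which organizations are connected, however indirectly?
  print(nx.node_connected_component(g, "ACME"))

The extraction loop is embarrassingly parallel; the final connectivity question is not, which is exactly where the in-memory graph side takes over.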

Relationships and Connections
Analysts today have a tabular, "row-and-column" mindset when it comes to data and analytics - probably a byproduct of the spreadsheet's decades of success.

But don't you often think about problems and data in different ways?

Graph analytics explicitly model and reason about the relationships between different entities, and graph tools also display those relationships visually. The analyst can see all the relationships in which an entity participates, and intuitively assess which elements are close or important.

When it comes to customers, relationships, rather than tabular data, may be the most important element: they are often more predictive of whether you will retain or lose a customer. The more connections customers have to your organization, its products, and its people, the more likely they are to remain customers. Relationships, not tables, are also key to hacker and threat identification, risk and fraud analysis, influencer marketing, and many other high-value applications.

Graph analytics complement Hadoop and provide a level of immediate, deep insights that are not readily obtainable in any other way.

More Stories By Venkat Krishnamurthy

Venkat Krishnamurthy is the Product Management Director at YarcData, driving the direction and definition of YarcData products and solutions and working with customers to make them successful. Krishnamurthy has over a decade of experience in advanced analytics, including as a Director of Product Management at Oracle and as Vice President of Technology at Goldman Sachs. At Goldman, he conducted data analysis to assess risk controls across multiple trading desks and asset classes, including algorithmic trading, market risk model validation, and prime brokerage.
