Big Data Journal: Article

Five Big Data Features in DB2 Databases

Traditional RDBMS and Big Data

Traditional RDBMS & New Data Processing
Over the past two decades, relational databases have been highly successful in serving large-scale OLTP and OLAP applications across enterprises. However, in the past couple of years, with the advent of Big Data processing, and especially the need to process unstructured data in massive quantities, the industry has started to look into non-RDBMS solutions. This has led to the popularity of NoSQL databases as well as massively parallel processing frameworks.

However, traditional RDBMS vendors have been quick to react and have added several Big Data features to their offerings, so that enterprises with heavy investments in traditional RDBMS can have the best of both worlds by properly leveraging these new features.

The following sections give an overview of the Big Data features in the popular DB2 database; a similar analysis of Oracle will follow in a later article. Please refer to my earlier article on Five Big Data Features in SQL Server.

1. DB2 Text Search
DB2 Text Search provides extensive capabilities for searching data in text columns stored in a DB2 table. The search system provides fast query response times and a consolidated, ranked result set that enables you to quickly and easily locate the information that you need. By incorporating the functions of DB2 Text Search in your SQL and XQuery statements, you can create powerful and versatile text-retrieval programs.

DB2 Text Search works by collecting data from diverse sources and indexing it for subsequent fast retrieval. DB2 Text Search uses linguistic analysis to improve search results and supports the following document formats:

  • Unstructured plain text.
  • Structured text such as that in HTML or XML documents.
  • Proprietary document formats such as PDF or Microsoft Office document formats.

We can perform various kinds of searches, such as:

  • Basic Search: Using Boolean operators and modifiers.
  • Fuzzy Search: Using words with spelling similar to the search term.
  • Proximity Search: Retrieves documents that contain search words located within a specified distance of each other.
  • SCORE Search: Uses the SCORE function to find out the extent to which a document matches a search argument.
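
To make the search types above concrete, here is a minimal plain-Python sketch of fuzzy, proximity, and scored matching. It only illustrates the ideas and is not how DB2 Text Search is implemented. The function names, the similarity cutoff, and the term-overlap score are all assumptions made for this example.

```python
import difflib
import re


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def fuzzy_match(term, text, cutoff=0.8):
    """Fuzzy search: return document tokens spelled similarly to `term`."""
    candidates = sorted(set(tokenize(text)))
    return difflib.get_close_matches(term.lower(), candidates, n=5, cutoff=cutoff)


def within_distance(text, word_a, word_b, max_gap):
    """Proximity search: True if both words occur within `max_gap` tokens."""
    tokens = tokenize(text)
    pos_a = [i for i, tok in enumerate(tokens) if tok == word_a.lower()]
    pos_b = [i for i, tok in enumerate(tokens) if tok == word_b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def overlap_score(query, text):
    """Toy relevance score: fraction of query terms found in the document.

    This is an illustrative stand-in, not the ranking used by SCORE.
    """
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(text))) / len(query_terms)


if __name__ == "__main__":
    doc = "DB2 Text Search indexes text columns for fast retrieval"
    print(fuzzy_match("retreival", doc))                # misspelled search term
    print(within_distance(doc, "search", "columns", 3))
    print(overlap_score("text search engine", doc))
```

In a real deployment these checks would be expressed inside SQL or XQuery statements and evaluated by the search engine against its text index rather than by scanning documents in application code.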

DB2® Text Search provides dictionary packs to support the linguistic processing of documents and queries. As an alternative to dictionary-based word segmentation, the search engine provides an option to select n-gram segmentation for languages such as Chinese, Japanese, and Korean. As Big Data use cases and patterns make evident, insights such as sentiment analysis cannot be fruitful without natural language processing features of this kind.
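
As a rough illustration of n-gram segmentation, the sketch below splits text into overlapping character bigrams, which is useful when a language has no spaces between words. The function name and the default of n=2 are assumptions for this example; the segmentation actually used by the search engine may differ.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Text shorter than n yields a single short token (or nothing).
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


if __name__ == "__main__":
    # Four characters produce three overlapping bigrams.
    print(char_ngrams("東京都庁"))
```

Because every adjacent pair of characters becomes a token, no dictionary is needed, at the cost of a larger index and occasional matches on character pairs that straddle two words.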

2. Partitioned Databases
MPP (Massively Parallel Processing) frameworks like Hadoop are well suited for processing large quantities of data because of their Shared Nothing Architecture and their ability to process data in parallel. DB2 on Unix/Windows is a pioneer in implementing this concept through its partitioned database option.

A partitioned database environment is a database installation that supports the distribution of data across database partitions. Because data is distributed across database partitions, you can use the power of multiple processors on multiple physical machines to satisfy requests for information. Data retrieval and update requests are decomposed automatically into sub-requests, and executed in parallel among the applicable database partitions. The fact that databases are split across database partitions is transparent to users issuing SQL statements. Interpartition parallelism refers to the ability to break up a query into multiple parts across multiple partitions of a partitioned database, on one machine or multiple machines. The query is run in parallel. Some DB2 utilities also perform this type of parallelism.
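
The idea of decomposing one request into per-partition sub-requests can be sketched in a few lines of Python. This is a simplified model, not DB2 code: the partition count, the CRC32-based hash, and the row layout are assumptions chosen for illustration, and threads stand in for separate database partitions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # assumed number of database partitions


def partition_for(key):
    """Map a distribution key to a partition number with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_PARTITIONS


def load(rows):
    """Distribute rows across partitions by hashing the 'id' column."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[partition_for(row["id"])].append(row)
    return partitions


def partial_sum(partition):
    """Sub-request: each partition aggregates only its own rows."""
    return sum(row["amount"] for row in partition)


def total_amount(partitions):
    """Run the sub-requests in parallel and combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return sum(pool.map(partial_sum, partitions))


if __name__ == "__main__":
    rows = [{"id": i, "amount": i * 10} for i in range(1, 101)]
    print(total_amount(load(rows)))
```

The caller of `total_amount` never sees which partition holds which row, which mirrors the transparency that users issuing SQL statements get in a partitioned database.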

In support of unstructured Big Data processing, DB2® Text Search, explained earlier, supports full-text search in a partitioned database environment. Text search indexes are distributed in a pattern that matches the base tables on which they are created. For each database partition, a text index partition, also called a collection, is created. This pattern facilitates text search maintenance by allowing text search index updates with parallel execution on all index partitions.

3. pureXML
The pureXML® feature allows you to store well-formed XML documents in database table columns that have the XML data type. By storing XML data in XML columns, the data is kept in its native hierarchical form, rather than stored as text or mapped to a different data model.

There is no architectural limit on the size of an XML value in a database. An index over XML data can be used to improve the efficiency of queries on XML documents that are stored in an XML column. In contrast to traditional relational indexes, where index keys are composed of one or more table columns you specify, an index over XML data uses a particular XML pattern expression to index paths and values in XML documents stored within a single column. The data type of that column must be XML.
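
To show what indexing by an XML pattern means, the following sketch uses Python's standard `xml.etree.ElementTree` to build a lookup from the values found at one path to the documents that contain them. The document shape, the path `./customer/city`, and the function name are hypothetical; a real index over XML data is defined and maintained by the database itself.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_path_index(documents, pattern):
    """Index the values found at `pattern` in each XML document.

    `documents` maps a document id to its XML text. The result maps each
    value found at the path to the set of document ids containing it.
    """
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for node in root.findall(pattern):
            if node.text is not None:
                index[node.text.strip()].add(doc_id)
    return index


if __name__ == "__main__":
    docs = {
        1: "<order><customer><city>Pune</city></customer></order>",
        2: "<order><customer><city>Chennai</city></customer></order>",
        3: "<order><customer><city>Pune</city></customer></order>",
    }
    index = build_path_index(docs, "./customer/city")
    print(sorted(index["Pune"]))
```

A query that filters on the indexed path can then go straight to the matching documents instead of parsing every document in the column.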

Tables containing XML columns can be stored in multi-partition databases: in the latest version of DB2, the pureXML feature is supported in partitioned database environments. With both features tightly integrated, pureXML customers can distribute XML data across multiple database partitions and parallelize XML queries for better performance, while customers of partitioned database environments can deploy pureXML for new business applications.

This combination of processing large XML documents in a parallel environment makes a strong case for using DB2 for Big Data processing.

4. DB2 Federation

One of the important needs of Big Data processing is to connect to multiple disparate data sources and bring the best out of them. Enterprises can no longer afford to have a single common data store for all their data processing needs.

In DB2, a federated system is a type of distributed database management system that you can use to access data sources across your enterprise. As documented on the IBM Documentation site, DB2 federation supports almost all kinds of structured and unstructured data sources. In particular, there is support for flat files, Microsoft Excel files, and VSAM files.
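
The federation idea, one query answered from several kinds of sources, can be sketched with Python's standard library. Here an in-memory SQLite table stands in for a relational source and a CSV string stands in for a flat file. The table, column names, and sample data are all invented for illustration, and this is not how DB2 federation is configured.

```python
import csv
import io
import sqlite3

# Stand-in for a flat file data source (hypothetical sample data).
FLAT_FILE = "customer_id,amount\n1,100\n2,250\n1,50\n"


def relational_source():
    """Create a small relational source holding customer names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
    )
    return conn


def federated_totals(conn, flat_file_text):
    """Join customer names (relational) with order amounts (flat file)."""
    totals = {}
    for record in csv.DictReader(io.StringIO(flat_file_text)):
        customer_id = int(record["customer_id"])
        totals[customer_id] = totals.get(customer_id, 0) + int(record["amount"])
    result = {}
    for customer_id, name in conn.execute(
        "SELECT id, name FROM customers ORDER BY id"
    ):
        result[name] = totals.get(customer_id, 0)
    return result


if __name__ == "__main__":
    print(federated_totals(relational_source(), FLAT_FILE))
```

In a federated database the join would be written once in SQL, and the federated server would decide how to fetch and combine the data from each source.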

One interesting component of DB2 federation is the support for connecting to Netezza. Netezza is a high-performance data warehouse appliance, and IBM® Netezza® Analytics is an embedded, purpose-built, advanced analytics platform.

5. pureScale
While the Shared Nothing Architecture has been a standard for many massively parallel processing environments, there are also successful architectures that use the Shared Disk model. The major examples are IBM's mainframe Parallel Sysplex and Oracle Real Application Clusters.

With the DB2 pureScale Feature, scaling your database solution is simple. Multiple database servers, known as members, process incoming database requests; these members operate in a clustered system and share data. You can transparently add more members to scale out to meet even the most demanding business needs. There are no application changes to make, data to redistribute, or performance tuning to do. The IBM® DB2® pureScale® Feature, much like a multi-partition database environment, provides a scalable and highly available database solution. However, the instance type and data layout of a DB2 pureScale environment and a multi-partition database environment are different.

A DB2 pureScale environment is ideal for short transactions where there is little need to parallelize each query. Queries are automatically routed to different members, based on member workload. While this is not the typical workload in a Big Data processing scenario, Big Data environments do invest in options like HBase and Cassandra to process short transactions.
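
Workload-based routing across members can be modeled very simply. The sketch below sends each new request to the member with the fewest active requests and lets a member be added without touching the callers. The class, the least-loaded policy, and the member names are assumptions for illustration; the routing logic that pureScale actually uses is not described here.

```python
class Cluster:
    """Toy model of members that share data and split incoming requests."""

    def __init__(self, member_names):
        # Active request count per member.
        self.load = {name: 0 for name in member_names}

    def add_member(self, name):
        """Scale out: a new member starts with no active requests."""
        self.load[name] = 0

    def route(self):
        """Send a request to the least-loaded member (ties broken by name)."""
        member = min(self.load, key=lambda m: (self.load[m], m))
        self.load[member] += 1
        return member

    def finish(self, member):
        """Mark one request on `member` as completed."""
        self.load[member] -= 1


if __name__ == "__main__":
    cluster = Cluster(["m0", "m1"])
    print([cluster.route() for _ in range(4)])
    cluster.add_member("m2")
    print(cluster.route())  # the new, idle member picks up the next request
```

Because all members see the same data, adding `m2` requires no redistribution step in this model, which is the property the shared-disk design is meant to provide.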

Summary
Traditional high-performance RDBMS like DB2 have their strengths. They are very strong in maintaining data integrity and quality through constraints, foreign keys, and other validation mechanisms. They are also strong in transactional integrity, providing a superior locking model, automatic deadlock resolution, and so on. Initially, however, they were not found to adapt well to the Big Data processing needs of enterprises.

With the enhancements made by their respective vendors, databases like DB2 now include Big Data processing features. This makes them strong candidates for enterprises looking for best-of-breed features across traditional RDBMS and Big Data processing systems, while making the most of their existing investments.

About the author: Srinivasan Sundara Rajan

I am passionate about ownership and driving things on my own; with my breadth and depth in enterprise technology, I could run any aspect of the IT industry and make it a success. I am a seasoned enterprise IT expert, mainly in the areas of solution, integration, and architecture, across structured and unstructured data sources, especially in the manufacturing domain. My recent work is on Natural Language Processing, Semantic Enrichment of Unstructured Data, Data Mining, and Predictive Analytics. However, I have a strong footing across all tiers of the enterprise IT spectrum. I am geared to handle the massive flow of data from the Internet of Things with the appropriate platform, tools, and processes.
