Welcome!

SDN Journal Authors: Lori MacVittie, Elizabeth White, Roger Strukhoff, Liz McMillan, Pat Romanski

Related Topics: Big Data Journal, SOA & WOA, .NET, Cloud Expo, Apache, SDN Journal

Big Data Journal: Article

Five Big Data Features in DB2 Databases

Traditional RDBMS and Big Data

Traditional RDBMS & New Data Processing
Over the past two decades relational databases have been most successful in serving large-scale OLTP and OLAP applications across enterprises. However, in the past couple of years with the advent of Big Data processing, especially processing unstructured data coupled with the need for processing massive quantities of data, the industry started to look into non RDBMS solutions. This has lead into the popularity of NoSQL databases as well as massively parallel processing frameworks.

However the traditional RDBMS has been quick to react and added several Big Data features as part of their offering such that the enterprises with the heavy investment of traditional RDBMS can have best of both worlds by properly leveraging these new features.

The following sections provide idea about big data features in the popular DB2 Databases, a similar analysis will be performed against Oracle also in a later article. Please refer to my earlier article on Five Big Data Features in SQL Server.

1. DB2 Text Search
DB2 Text Search provides extensive capabilities for searching data in text columns stored in a DB2 table. The search system provides fast query response times and a consolidated, ranked result set that enables you to quickly and easily locate the information that you need. By incorporating the functions of DB2 Text Search in your SQL and XQuery statements, you can create powerful and versatile text-retrieval programs.

DB2 Text Search works by collecting data from diverse sources and indexing it for subsequent fast retrieval. DB2 Text Search uses linguistic analysis to improve search results and supports the following document formats:

  • Unstructured plain text.
  • Structured text such as that in HTML or XML documents
  • Proprietary document formats such as PDF or Microsoft Office document formats.

We can perform various kinds of searches like,

  • Basic Search : Using Boolean Operators and Modifiers
  • Fuzzy Search : Using words with similar spelling to search term
  • Proximity Search : A proximity search retrieves documents that contain search words which are located within a specified distance from each other.
  • SCORE Search : We use the SCORE function to find out the extent to which a document matches a search document.

DB2® Text Search provides dictionary packs to support the linguistic processing of documents and queries. In addition, n-gram segmentation is supported for languages such as Chinese, Japanese, and Korean. As an alternative to dictionary-based word segmentation, the search engine provides an option to select n-gram segmentation for languages such as Chinese, Japanese, and Korean. It is evident from the use cases and patterns on Big Data without such features on natural language processing much of the insights like sentiment analysis cannot be fruitful.

2. Partitioned Databases
MPP (Massively Parallel Processing) frame works like Hadoop are found to be well suited for processing large quantities of data due to their Shared Nothing Architecture and the ability to process data in parallel. DB2 on Unix/Windows is the pioneer in implementing such a concept using the partitioned database option.

A partitioned database environment is a database installation that supports the distribution of data across database partitions. Because data is distributed across database partitions, you can use the power of multiple processors on multiple physical machines to satisfy requests for information. Data retrieval and update requests are decomposed automatically into sub-requests, and executed in parallel among the applicable database partitions. The fact that databases are split across database partitions is transparent to users issuing SQL statements. Interpartition parallelism refers to the ability to break up a query into multiple parts across multiple partitions of a partitioned database, on one machine or multiple machines. The query is run in parallel. Some DB2 utilities also perform this type of parallelism.

In support of Unstructured Big Data processing, DB2 Text Search explained earlier is integrated with the partitioned database environment. DB2® Text Search supports full-text search in a partitioned database environment. Text search indexes are distributed in a pattern that matches the base tables on which they are created. For each database partition, a text index partition, also called a collection, is created. This pattern facilitates text search maintenance by allowing text search index updates with parallel execution on all index partitions.

3. Pure XML
The pureXML® feature allows you to store well-formed XML documents in database table columns that have the XML data type. By storing XML data in XML columns, the data is kept in its native hierarchical form, rather than stored as text or mapped to a different data model.

There is no architectural limit on the size of an XML value in a database. An index over XML data can be used to improve the efficiency of queries on XML documents that are stored in an XML column. In contrast to traditional relational indexes, where index keys are composed of one or more table columns you specify, an index over XML data uses a particular XML pattern expression to index paths and values in XML documents stored within a single column. The data type of that column must be XML.

In partitioned database environments, tables containing XML columns can be stored in multi-partition databases. In DB2 latest version, the pureXML feature is supported in partitioned database environments. With both features tightly integrated, pureXML customers can distribute XML data across multiple database partitions and parallelize XML queries for better performance, while partitioned database environments customers can deploy pureXML for new business applications.

The above combination of processing large XML documents in a parallel environment make a best case for DB2 used for big data processing.

4. DB2 Federation

One of the important needs of big data processing is the need to connect to multiple disparate data sources and bring the best out of them. Enterprises no longer can afford to have a single common data store for all their data processing needs.

In DB2 a federated system is a type of distributed database management system that you can use to access data sources across your enterprise. As documented in the IBM Documentation site, DB2 federation support almost all kinds of structured and unstructured data sources. In particular there is support for flat files, Microsoft Excel and VSAM files.

One interesting component of DB2 federation is, the support for connecting to Netezza DB. Netezza is the high performance data warehouse appliance . IBM® Netezza® Analytics is an embedded, purpose-built, advanced analytics platform .

5. Pure Scale
While the Shared Nothing Architecture has been a standard for many massively parallel processing environments, there are successful architectures using Shared Disk model too. The major examples being the IBM's Mainframe Parallel SYSPLEX and Oracle Real Application Clusters.

With the DB2 pureScale Feature, scaling your database solution is simple. Multiple database servers, known as members, process incoming database requests; these members operate in a clustered system and share data. You can transparently add more members to scale out to meet even the most demanding business needs. There are no application changes to make, data to redistribute, or performance tuning to do. The IBM® DB2® pureScale® Feature, much like a multi-partition database environment, provides a scalable and highly available database solution. However, the instance type and data layout of a DB2 pureScale environment and a multi-partition database environment are different.

A DB2 pureScale environment is ideal for short transactions where there is little need to parallelize each query. Queries are automatically routed to different members, based on member workload. While this is not a ideal work load in a Big Data processing scenario , but Big Data Environments do invest on options like Hbase, Cassandra to process short transactions.

Summary
Traditional high performance RDBMS like DB2 have their strengths. They are very strong in maintaining the data integrity and quality in the form of constraints, foreign keys and other validation mechanisms. They are also strong in transactional integrity by providing superior locking model, automatic dead lock resolution etc.. However initially they are not found to adjust to Big Data processing needs of enterprises.

With the enhancements in the products made by respective vendors, now databases like DB2 have been enhanced with big data processing features and makes them the best candidate for enterprises looking for best of the breed features between traditional RDBMS and Big Data processing systems, and to leverage the best of existing investments.

More Stories By Srinivasan Sundara Rajan

I am passionate about ownership and driving things on my own, with my breadth and depth on Enterprise Technology I could run any aspect of IT Industry and make it a success. I am a Seasoned Enterprise IT Expert, mainly in the areas of Solution,Integration and Architecture, across Structured, Unstructured data sources, especially in manufacturing domain. My recent work is on Natural Language Processing, Semantic Enrichment of Unstructured Data, Data Mining and Predictive Analytics. However I have a strong footing across all tiers of Enterprise IT spectrum. I am geared to handle the massive flow of data by Internet Of Things with appropriate platform, tools and processes.

Cloud Expo Latest Stories
With the explosion of the cloud, more businesses are transitioning to a recurring revenue model to generate reliable sales, grow profits, and open new markets. This opportunity requires businesses to get to market quickly with the pricing and packaging options customers want. In addition, you will want to take advantage of the ensuing tidal wave of data to more effectively upsell, cross-sell and manage your customers. All of this is possible, but only with the right approach. At 15th Cloud Expo, Brendan O'Brien, Co-founder at Aria Systems and the inventor of cloud billing panelists, will lead a panel discussion on what it takes to launch and manage a successful recurring revenue business. The panelists will offer their insights about what each department will need to consider, from financial management to line of business and IT. The panelists will also offer examples from their success in recurring revenue with companies such as Audi, Constant Contact, Experian, Pitney-Bowes, Teleko...
Planning scalable environments isn't terribly difficult, but it does require a change of perspective. In his session at 15th Cloud Expo, Phil Jackson, Development Community Advocate for SoftLayer, will broaden your views to think on an Internet scale by dissecting a video publishing application built with The SoftLayer Platform, Message Queuing, Object Storage, and Drupal. By examining a scalable modular application build that can handle unpredictable traffic, attendees will able to grow your development arsenal and pick up a few strategies to apply to your own projects.
Come learn about what you need to consider when moving your data to the cloud. In her session at 15th Cloud Expo, Skyla Loomis, a Program Director of Cloudant Development at Cloudant, will discuss the security, performance, and operational implications of keeping your data on premise, moving it to the cloud, or taking a hybrid approach. She will use real customer examples to illustrate the tradeoffs, key decision points, and how to be successful with a cloud or hybrid cloud solution.
The cloud provides an easy onramp to building and deploying Big Data solutions. Transitioning from initial deployment to large-scale, highly performant operations may not be as easy. In his session at 15th Cloud Expo, Harold Hannon, Sr. Software Architect at SoftLayer, will discuss the benefits, weaknesses, and performance characteristics of public and bare metal cloud deployments that can help you make the right decisions.
Over the last few years the healthcare ecosystem has revolved around innovations in Electronic Health Record (HER) based systems. This evolution has helped us achieve much desired interoperability. Now the focus is shifting to other equally important aspects – scalability and performance. While applying cloud computing environments to the EHR systems, a special consideration needs to be given to the cloud enablement of Veterans Health Information Systems and Technology Architecture (VistA), i.e., the largest single medical system in the United States.
Cloud and Big Data present unique dilemmas: embracing the benefits of these new technologies while maintaining the security of your organization’s assets. When an outside party owns, controls and manages your infrastructure and computational resources, how can you be assured that sensitive data remains private and secure? How do you best protect data in mixed use cloud and big data infrastructure sets? Can you still satisfy the full range of reporting, compliance and regulatory requirements? In his session at 15th Cloud Expo, Derek Tumulak, Vice President of Product Management at Vormetric, will discuss how to address data security in cloud and Big Data environments so that your organization isn’t next week’s data breach headline.
Scott Jenson leads a project called The Physical Web within the Chrome team at Google. Project members are working to take the scalability and openness of the web and use it to talk to the exponentially exploding range of smart devices. Nearly every company today working on the IoT comes up with the same basic solution: use my server and you'll be fine. But if we really believe there will be trillions of these devices, that just can't scale. We need a system that is open a scalable and by using the URL as a basic building block, we open this up and get the same resilience that the web enjoys.
Is your organization struggling to deal with skyrocketing volumes of digital assets? The amount of data is growing exponentially and organizations are having a hard time managing this growth. In his session at 15th Cloud Expo, Amar Kapadia, Senior Director of Open Cloud Strategy at Seagate, will walk through the essential considerations when developing a cloud storage strategy. In this discussion, you will understand the challenges IT is facing, why companies need to move to cloud, and how the right cloud model can help your business economically overcome the data struggle.
If cloud computing benefits are so clear, why have so few enterprises migrated their mission-critical apps? The answer is often inertia and FUD. No one ever got fired for not moving to the cloud – not yet. In his session at 15th Cloud Expo, Michael Hoch, SVP, Cloud Advisory Service at Virtustream, will discuss the six key steps to justify and execute your MCA cloud migration.
The 16th International Cloud Expo announces that its Call for Papers is now open. 16th International Cloud Expo, to be held June 9–11, 2015, at the Javits Center in New York City brings together Cloud Computing, APM, APIs, Security, Big Data, Internet of Things, DevOps and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal today!
Most of today’s hardware manufacturers are building servers with at least one SATA Port, but not every systems engineer utilizes them. This is considered a loss in the game of maximizing potential storage space in a fixed unit. The SATADOM Series was created by Innodisk as a high-performance, small form factor boot drive with low power consumption to be plugged into the unused SATA port on your server board as an alternative to hard drive or USB boot-up. Built for 1U systems, this powerful device is smaller than a one dollar coin, and frees up otherwise dead space on your motherboard. To meet the requirements of tomorrow’s cloud hardware, Innodisk invested internal R&D resources to develop our SATA III series of products. The SATA III SATADOM boasts 500/180MBs R/W Speeds respectively, or double R/W Speed of SATA II products.
In today's application economy, enterprise organizations realize that it's their applications that are the heart and soul of their business. If their application users have a bad experience, their revenue and reputation are at stake. In his session at 15th Cloud Expo, Anand Akela, Senior Director of Product Marketing for Application Performance Management at CA Technologies, will discuss how a user-centric Application Performance Management solution can help inspire your users with every application transaction.
SYS-CON Events announced today that Gridstore™, the leader in software-defined storage (SDS) purpose-built for Windows Servers and Hyper-V, will exhibit at SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Gridstore™ is the leader in software-defined storage purpose built for virtualization that is designed to accelerate applications in virtualized environments. Using its patented Server-Side Virtual Controller™ Technology (SVCT) to eliminate the I/O blender effect and accelerate applications Gridstore delivers vmOptimized™ Storage that self-optimizes to each application or VM across both virtual and physical environments. Leveraging a grid architecture, Gridstore delivers the first end-to-end storage QoS to ensure the most important App or VM performance is never compromised. The storage grid, that uses Gridstore’s performance optimized nodes or capacity optimized nodes, starts with as few a...
SYS-CON Events announced today that Cloudian, Inc., the leading provider of hybrid cloud storage solutions, has been named “Bronze Sponsor” of SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Cloudian is a Foster City, Calif.-based software company specializing in cloud storage. Cloudian HyperStore® is an S3-compatible cloud object storage platform that enables service providers and enterprises to build reliable, affordable and scalable hybrid cloud storage solutions. Cloudian actively partners with leading cloud computing environments including Amazon Web Services, Citrix Cloud Platform, Apache CloudStack, OpenStack and the vast ecosystem of S3 compatible tools and applications. Cloudian's customers include Vodafone, Nextel, NTT, Nifty, and LunaCloud. The company has additional offices in China and Japan.
SYS-CON Events announced today that TechXtend (formerly Programmer’s Paradise), a leading value-added provider of server and storage virtualization, and r-evolution will exhibit at SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. TechXtend (formerly Programmer’s Paradise) is a leading value-added provider of software, systems and solutions for corporations, government organizations, and academic institutions across the United States and Canada. TechXtend is the Exclusive Reseller in the United States for r-evolution