Welcome!

SDN Journal Authors: Olivier Huynh Van, David Paquette, Elizabeth White, Pat Romanski, Liz McMillan

Related Topics: @BigDataExpo, Microservices Expo, Microsoft Cloud, @CloudExpo, Apache, SDN Journal

@BigDataExpo: Article

Five Big Data Features in DB2 Databases

Traditional RDBMS and Big Data

Traditional RDBMS & New Data Processing
Over the past two decades relational databases have been most successful in serving large-scale OLTP and OLAP applications across enterprises. However, in the past couple of years with the advent of Big Data processing, especially processing unstructured data coupled with the need for processing massive quantities of data, the industry started to look into non RDBMS solutions. This has lead into the popularity of NoSQL databases as well as massively parallel processing frameworks.

However the traditional RDBMS has been quick to react and added several Big Data features as part of their offering such that the enterprises with the heavy investment of traditional RDBMS can have best of both worlds by properly leveraging these new features.

The following sections provide idea about big data features in the popular DB2 Databases, a similar analysis will be performed against Oracle also in a later article. Please refer to my earlier article on Five Big Data Features in SQL Server.

1. DB2 Text Search
DB2 Text Search provides extensive capabilities for searching data in text columns stored in a DB2 table. The search system provides fast query response times and a consolidated, ranked result set that enables you to quickly and easily locate the information that you need. By incorporating the functions of DB2 Text Search in your SQL and XQuery statements, you can create powerful and versatile text-retrieval programs.

DB2 Text Search works by collecting data from diverse sources and indexing it for subsequent fast retrieval. DB2 Text Search uses linguistic analysis to improve search results and supports the following document formats:

  • Unstructured plain text.
  • Structured text such as that in HTML or XML documents
  • Proprietary document formats such as PDF or Microsoft Office document formats.

We can perform various kinds of searches like,

  • Basic Search : Using Boolean Operators and Modifiers
  • Fuzzy Search : Using words with similar spelling to search term
  • Proximity Search : A proximity search retrieves documents that contain search words which are located within a specified distance from each other.
  • SCORE Search : We use the SCORE function to find out the extent to which a document matches a search document.

DB2® Text Search provides dictionary packs to support the linguistic processing of documents and queries. In addition, n-gram segmentation is supported for languages such as Chinese, Japanese, and Korean. As an alternative to dictionary-based word segmentation, the search engine provides an option to select n-gram segmentation for languages such as Chinese, Japanese, and Korean. It is evident from the use cases and patterns on Big Data without such features on natural language processing much of the insights like sentiment analysis cannot be fruitful.

2. Partitioned Databases
MPP (Massively Parallel Processing) frame works like Hadoop are found to be well suited for processing large quantities of data due to their Shared Nothing Architecture and the ability to process data in parallel. DB2 on Unix/Windows is the pioneer in implementing such a concept using the partitioned database option.

A partitioned database environment is a database installation that supports the distribution of data across database partitions. Because data is distributed across database partitions, you can use the power of multiple processors on multiple physical machines to satisfy requests for information. Data retrieval and update requests are decomposed automatically into sub-requests, and executed in parallel among the applicable database partitions. The fact that databases are split across database partitions is transparent to users issuing SQL statements. Interpartition parallelism refers to the ability to break up a query into multiple parts across multiple partitions of a partitioned database, on one machine or multiple machines. The query is run in parallel. Some DB2 utilities also perform this type of parallelism.

In support of Unstructured Big Data processing, DB2 Text Search explained earlier is integrated with the partitioned database environment. DB2® Text Search supports full-text search in a partitioned database environment. Text search indexes are distributed in a pattern that matches the base tables on which they are created. For each database partition, a text index partition, also called a collection, is created. This pattern facilitates text search maintenance by allowing text search index updates with parallel execution on all index partitions.

3. Pure XML
The pureXML® feature allows you to store well-formed XML documents in database table columns that have the XML data type. By storing XML data in XML columns, the data is kept in its native hierarchical form, rather than stored as text or mapped to a different data model.

There is no architectural limit on the size of an XML value in a database. An index over XML data can be used to improve the efficiency of queries on XML documents that are stored in an XML column. In contrast to traditional relational indexes, where index keys are composed of one or more table columns you specify, an index over XML data uses a particular XML pattern expression to index paths and values in XML documents stored within a single column. The data type of that column must be XML.

In partitioned database environments, tables containing XML columns can be stored in multi-partition databases. In DB2 latest version, the pureXML feature is supported in partitioned database environments. With both features tightly integrated, pureXML customers can distribute XML data across multiple database partitions and parallelize XML queries for better performance, while partitioned database environments customers can deploy pureXML for new business applications.

The above combination of processing large XML documents in a parallel environment make a best case for DB2 used for big data processing.

4. DB2 Federation

One of the important needs of big data processing is the need to connect to multiple disparate data sources and bring the best out of them. Enterprises no longer can afford to have a single common data store for all their data processing needs.

In DB2 a federated system is a type of distributed database management system that you can use to access data sources across your enterprise. As documented in the IBM Documentation site, DB2 federation support almost all kinds of structured and unstructured data sources. In particular there is support for flat files, Microsoft Excel and VSAM files.

One interesting component of DB2 federation is, the support for connecting to Netezza DB. Netezza is the high performance data warehouse appliance . IBM® Netezza® Analytics is an embedded, purpose-built, advanced analytics platform .

5. Pure Scale
While the Shared Nothing Architecture has been a standard for many massively parallel processing environments, there are successful architectures using Shared Disk model too. The major examples being the IBM's Mainframe Parallel SYSPLEX and Oracle Real Application Clusters.

With the DB2 pureScale Feature, scaling your database solution is simple. Multiple database servers, known as members, process incoming database requests; these members operate in a clustered system and share data. You can transparently add more members to scale out to meet even the most demanding business needs. There are no application changes to make, data to redistribute, or performance tuning to do. The IBM® DB2® pureScale® Feature, much like a multi-partition database environment, provides a scalable and highly available database solution. However, the instance type and data layout of a DB2 pureScale environment and a multi-partition database environment are different.

A DB2 pureScale environment is ideal for short transactions where there is little need to parallelize each query. Queries are automatically routed to different members, based on member workload. While this is not a ideal work load in a Big Data processing scenario , but Big Data Environments do invest on options like Hbase, Cassandra to process short transactions.

Summary
Traditional high performance RDBMS like DB2 have their strengths. They are very strong in maintaining the data integrity and quality in the form of constraints, foreign keys and other validation mechanisms. They are also strong in transactional integrity by providing superior locking model, automatic dead lock resolution etc.. However initially they are not found to adjust to Big Data processing needs of enterprises.

With the enhancements in the products made by respective vendors, now databases like DB2 have been enhanced with big data processing features and makes them the best candidate for enterprises looking for best of the breed features between traditional RDBMS and Big Data processing systems, and to leverage the best of existing investments.

More Stories By Srinivasan Sundara Rajan

Highly passionate about utilizing Digital Technologies to enable next generation enterprise. Believes in enterprise transformation through the Natives (Cloud Native & Mobile Native).

@CloudExpo Stories
Digital innovation is the next big wave of business transformation based on digital technologies of which IoT and Big Data are key components, For example: Business boundary innovation is a challenge to excavate third-party business value using IoT and BigData, like Nest Business structure innovation may propose re-building business structure from scratch, as Uber does in the taxicab industry The social model innovation is also a big challenge to the new social architecture with the design fr...
Fifty billion connected devices and still no winning protocols standards. HTTP, WebSockets, MQTT, and CoAP seem to be leading in the IoT protocol race at the moment but many more protocols are getting introduced on a regular basis. Each protocol has its pros and cons depending on the nature of the communications. Does there really need to be only one protocol to rule them all? Of course not. In his session at @ThingsExpo, Chris Matthieu, co-founder and CTO of Octoblu, walk you through how Oct...
We’ve been doing it for years, decades for some. How many websites have you created accounts on? Your bank, your credit card companies, social media sites, hotels and travel sites, online shopping sites, and that’s just the start. We do it often without even thinking about it, quickly entering our personal information, our data, in a plethora of systems. Sometimes we’re not even aware of the information we are providing. It could be very personal information (think of the security questions you ...
Complete Internet of Things (IoT) embedded device security is not just about the device but involves the entire product’s identity, data and control integrity, and services traversing the cloud. A device can no longer be looked at as an island; it is a part of a system. In fact, given the cross-domain interactions enabled by IoT it could be a part of many systems. Also, depending on where the device is deployed, for example, in the office building versus a factory floor or oil field, security ha...
Is your aging software platform suffering from technical debt while the market changes and demands new solutions at a faster clip? It’s a bold move, but you might consider walking away from your core platform and starting fresh. ReadyTalk did exactly that. In his General Session at 19th Cloud Expo, Michael Chambliss, Head of Engineering at ReadyTalk, will discuss why and how ReadyTalk diverted from healthy revenue and over a decade of audio conferencing product development to start an innovati...
All clouds are not equal. To succeed in a DevOps context, organizations should plan to develop/deploy apps across a choice of on-premise and public clouds simultaneously depending on the business needs. This is where the concept of the Lean Cloud comes in - resting on the idea that you often need to relocate your app modules over their life cycles for both innovation and operational efficiency in the cloud. In his session at @DevOpsSummit at19th Cloud Expo, Valentin (Val) Bercovici, CTO of So...
SYS-CON Events announced today that Niagara Networks will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Niagara Networks offers the highest port-density systems, and the most complete Next-Generation Network Visibility systems including Network Packet Brokers, Bypass Switches, and Network TAPs.
Data is an unusual currency; it is not restricted by the same transactional limitations as money or people. In fact, the more that you leverage your data across multiple business use cases, the more valuable it becomes to the organization. And the same can be said about the organization’s analytics. In his session at 19th Cloud Expo, Bill Schmarzo, CTO for the Big Data Practice at EMC, will introduce a methodology for capturing, enriching and sharing data (and analytics) across the organizati...
SYS-CON Events announced today that Commvault, a global leader in enterprise data protection and information management, has been named “Bronze Sponsor” of SYS-CON's 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Commvault is a leading provider of data protection and information management solutions, helping companies worldwide activate their data to drive more value and business insight and to transform moder...
There is little doubt that Big Data solutions will have an increasing role in the Enterprise IT mainstream over time. Big Data at Cloud Expo - to be held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA - has announced its Call for Papers is open. Cloud computing is being adopted in one form or another by 94% of enterprises today. Tens of billions of new devices are being connected to The Internet of Things. And Big Data is driving this bus. An exponential increase is...
IoT is fundamentally transforming the auto industry, turning the vehicle into a hub for connected services, including safety, infotainment and usage-based insurance. Auto manufacturers – and businesses across all verticals – have built an entire ecosystem around the Connected Car, creating new customer touch points and revenue streams. In his session at @ThingsExpo, Macario Namie, Head of IoT Strategy at Cisco Jasper, will share real-world examples of how IoT transforms the car from a static p...
SYS-CON Events announced today that Tintri Inc., a leading producer of VM-aware storage (VAS) for virtualization and cloud environments, will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Tintri VM-aware storage is the simplest for virtualized applications and cloud. Organizations including GE, Toyota, United Healthcare, NASA and 6 of the Fortune 15 have said “No to LUNs.” With Tintri they mana...
Creating replica copies to tolerate a certain number of failures is easy, but very expensive at cloud-scale. Conventional RAID has lower overhead, but it is limited in the number of failures it can tolerate. And the management is like herding cats (overseeing capacity, rebuilds, migrations, and degraded performance). Download Slide Deck: ▸ Here In his general session at 18th Cloud Expo, Scott Cleland, Senior Director of Product Marketing for the HGST Cloud Infrastructure Business Unit, discusse...
Whether they’re located in a public, private, or hybrid cloud environment, cloud technologies are constantly evolving. While the innovation is exciting, the end mission of delivering business value and rapidly producing incremental product features is paramount. In his session at @DevOpsSummit at 19th Cloud Expo, Kiran Chitturi, CTO Architect at Sungard AS, will discuss DevOps culture, its evolution of frameworks and technologies, and how it is achieving maturity. He will also cover various st...
If you had a chance to enter on the ground level of the largest e-commerce market in the world – would you? China is the world’s most populated country with the second largest economy and the world’s fastest growing market. It is estimated that by 2018 the Chinese market will be reaching over $30 billion in gaming revenue alone. Admittedly for a foreign company, doing business in China can be challenging. Often changing laws, administrative regulations and the often inscrutable Chinese Interne...
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 19th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world and ThingsExpo Silicon Valley Call for Papers is now open.
SYS-CON Events has announced today that Roger Strukhoff has been named conference chair of Cloud Expo and @ThingsExpo 2016 Silicon Valley. The 19th Cloud Expo and 6th @ThingsExpo will take place on November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. "The Internet of Things brings trillions of dollars of opportunity to developers and enterprise IT, no matter how you measure it," stated Roger Strukhoff. "More importantly, it leverages the power of devices and the Interne...
"We have several customers now running private clouds. They're not as large as they should be but it's getting there. The adoption challenge has been pretty simple. Look at the world today of virtualization vs cloud," stated Nara Rajagopalan, CEO of Accelerite, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
"My role is working with customers, helping them go through this digital transformation. I spend a lot of time talking to banks, big industries, manufacturers working through how they are integrating and transforming their IT platforms and moving them forward," explained William Morrish, General Manager Product Sales at Interoute, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life sett...