Big Data Journal: Article

Five Big Data Features in DB2 Databases

Traditional RDBMS and Big Data

Traditional RDBMS & New Data Processing
Over the past two decades, relational databases have been highly successful in serving large-scale OLTP and OLAP applications across enterprises. In the past couple of years, however, the rise of Big Data processing, in particular the need to process unstructured data in massive quantities, has pushed the industry to look at non-RDBMS solutions. This has led to the popularity of NoSQL databases as well as massively parallel processing frameworks.

However, traditional RDBMS vendors have been quick to react and have added several Big Data features to their offerings, so that enterprises with heavy investments in traditional RDBMS can have the best of both worlds by properly leveraging these new features.

The following sections give an overview of the Big Data features in the popular DB2 database. A similar analysis of Oracle will follow in a later article. Please also refer to my earlier article on Five Big Data Features in SQL Server.

1. DB2 Text Search
DB2 Text Search provides extensive capabilities for searching data in text columns stored in a DB2 table. The search system provides fast query response times and a consolidated, ranked result set that enables you to quickly and easily locate the information that you need. By incorporating the functions of DB2 Text Search in your SQL and XQuery statements, you can create powerful and versatile text-retrieval programs.

DB2 Text Search works by collecting data from diverse sources and indexing it for subsequent fast retrieval. DB2 Text Search uses linguistic analysis to improve search results and supports the following document formats:

  • Unstructured plain text
  • Structured text, such as that in HTML or XML documents
  • Proprietary document formats, such as PDF or Microsoft Office document formats

DB2 Text Search supports several kinds of searches, including:

  • Basic Search: uses Boolean operators and modifiers
  • Fuzzy Search: matches words whose spelling is similar to the search term
  • Proximity Search: retrieves documents that contain search words located within a specified distance of each other
  • SCORE Search: uses the SCORE function to find out the extent to which a document matches the search query
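
To make the fuzzy and proximity search types concrete, here is a minimal Python sketch of the two ideas. It does not use DB2 or reproduce its search engine; the function names, the similarity cutoff and the sample sentence are assumptions chosen only for illustration.

```python
import difflib

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return [w.strip(".,;:!?()\"'") for w in text.lower().split()]

def fuzzy_match(term, text, cutoff=0.8):
    """Return document words whose spelling is similar to the search term."""
    words = sorted(set(tokenize(text)))
    return difflib.get_close_matches(term.lower(), words, n=5, cutoff=cutoff)

def proximity_match(word_a, word_b, text, max_distance):
    """True if both words occur within max_distance token positions of each other."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

doc = "The partitioned database distributes data across several database partitions."

print(fuzzy_match("databse", doc))                     # ['database']
print(proximity_match("data", "partitions", doc, 5))   # True
print(proximity_match("data", "partitions", doc, 2))   # False
```

A real text search engine works against a prebuilt index rather than scanning each document, but the matching rules are the same in spirit.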

DB2® Text Search provides dictionary packs to support the linguistic processing of documents and queries. As an alternative to dictionary-based word segmentation, the search engine provides an option to select n-gram segmentation for languages such as Chinese, Japanese, and Korean. Big Data use cases and patterns make it evident that, without such natural language processing features, insights such as sentiment analysis are hard to obtain.
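
As an illustration of n-gram segmentation, the following Python sketch splits text into overlapping character bigrams. It is a generic example of the technique and not the DB2 Text Search implementation; the function name and the sample string are assumptions.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

# No dictionary is needed: every adjacent pair of characters becomes a token.
print(char_ngrams("東京都庁"))  # ['東京', '京都', '都庁']
```

The trade-off is visible in the output: the segmentation needs no dictionary, but it also produces tokens such as 京都 that a dictionary-based segmenter would not extract from this phrase.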

2. Partitioned Databases
MPP (Massively Parallel Processing) frameworks like Hadoop are well suited to processing large quantities of data because of their Shared Nothing Architecture and their ability to process data in parallel. DB2 on Unix/Windows was a pioneer in implementing this concept with its partitioned database option.

A partitioned database environment is a database installation that supports the distribution of data across database partitions. Because data is distributed across database partitions, you can use the power of multiple processors on multiple physical machines to satisfy requests for information. Data retrieval and update requests are decomposed automatically into sub-requests, and executed in parallel among the applicable database partitions. The fact that databases are split across database partitions is transparent to users issuing SQL statements. Interpartition parallelism refers to the ability to break up a query into multiple parts across multiple partitions of a partitioned database, on one machine or multiple machines. The query is run in parallel. Some DB2 utilities also perform this type of parallelism.
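
The decomposition of a request into parallel sub-requests can be sketched in a few lines of Python. This is a toy model under stated assumptions, not DB2 code: the partition count, the CRC32 hash and the function names are all hypothetical, and threads stand in for separate database partitions.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_for(key):
    """Map a distribution key to a partition with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_PARTITIONS

def load(rows):
    """Distribute (key, amount) rows across partitions by hashed key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, amount in rows:
        partitions[partition_for(key)].append((key, amount))
    return partitions

def partial_sum(partition):
    """Sub-request executed against a single partition."""
    return sum(amount for _, amount in partition)

def total_amount(partitions):
    """Coordinator: run the sub-request on every partition in parallel, then combine."""
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        return sum(pool.map(partial_sum, partitions))

rows = [(order_id, order_id * 10) for order_id in range(1, 101)]
partitions = load(rows)
print(total_amount(partitions))  # 50500
```

The caller of `total_amount` never sees how the rows were split, which mirrors the transparency described above.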

In support of unstructured Big Data processing, the DB2® Text Search feature described earlier is integrated with the partitioned database environment and supports full-text search there. Text search indexes are distributed in a pattern that matches the base tables on which they are created. For each database partition, a text index partition, also called a collection, is created. This pattern facilitates text search maintenance by allowing text search index updates with parallel execution on all index partitions.

3. pureXML
The pureXML® feature allows you to store well-formed XML documents in database table columns that have the XML data type. By storing XML data in XML columns, the data is kept in its native hierarchical form, rather than stored as text or mapped to a different data model.

There is no architectural limit on the size of an XML value in a database. An index over XML data can be used to improve the efficiency of queries on XML documents that are stored in an XML column. In contrast to traditional relational indexes, where index keys are composed of one or more table columns you specify, an index over XML data uses a particular XML pattern expression to index paths and values in XML documents stored within a single column. The data type of that column must be XML.
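
The idea of indexing one path pattern inside the documents of a single XML column can be sketched with Python's standard library. This is only an illustration of the concept and not how DB2 builds its indexes; the sample rows, the path `./address/city` and the function name are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical rows of a table with an XML column: (row_id, xml_document)
rows = [
    (1, "<customer><name>Ann</name><address><city>Toronto</city></address></customer>"),
    (2, "<customer><name>Raj</name><address><city>Chennai</city></address></customer>"),
    (3, "<customer><name>Lee</name><address><city>Toronto</city></address></customer>"),
]

def build_path_index(rows, pattern):
    """Index the text values found at one path pattern, mapping value -> row ids."""
    index = defaultdict(list)
    for row_id, doc in rows:
        root = ET.fromstring(doc)
        for node in root.findall(pattern):
            index[node.text].append(row_id)
    return index

city_index = build_path_index(rows, "./address/city")
print(city_index["Toronto"])  # [1, 3]
```

A lookup by city now touches only the index, not every stored document, which is the benefit an index over XML data is meant to deliver.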

In recent DB2 versions, the pureXML feature is supported in partitioned database environments, so tables containing XML columns can be stored in multi-partition databases. With both features tightly integrated, pureXML customers can distribute XML data across multiple database partitions and parallelize XML queries for better performance, while partitioned database environment customers can deploy pureXML for new business applications.

This combination of large XML documents and parallel processing makes a strong case for using DB2 for Big Data processing.

4. DB2 Federation

An important need in Big Data processing is to connect to multiple disparate data sources and get the best out of each of them. Enterprises can no longer rely on a single common data store for all their data processing needs.

In DB2, a federated system is a type of distributed database management system that you can use to access data sources across your enterprise. According to the IBM documentation, DB2 federation supports a wide range of structured and unstructured data sources, including flat files, Microsoft Excel files and VSAM files.
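
The following Python sketch shows the general idea of answering one question from two different kinds of sources. It does not use DB2 federation or its wrappers: an in-memory SQLite table stands in for a remote relational source, an in-memory CSV stands in for a flat file, and all table, column and function names are hypothetical.

```python
import csv
import io
import sqlite3

# Source 1: a relational table (in-memory SQLite stands in for a remote database).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Raj")])

# Source 2: a flat file (an in-memory CSV stands in for a file on disk).
flat_file = io.StringIO("customer_id,amount\n1,250\n2,100\n1,50\n")

def federated_totals(db, flat_file):
    """Join the flat-file orders to the relational customers and total per name."""
    names = dict(db.execute("SELECT id, name FROM customers"))
    totals = {}
    for record in csv.DictReader(flat_file):
        name = names[int(record["customer_id"])]
        totals[name] = totals.get(name, 0) + int(record["amount"])
    return totals

print(federated_totals(db, flat_file))  # {'Ann': 300, 'Raj': 100}
```

In a federated database the join would be expressed as a single SQL statement and the server would decide where each part runs; here the join is done by hand to keep the example self-contained.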

One interesting capability of DB2 federation is its support for connecting to Netezza databases. Netezza is a high-performance data warehouse appliance, and IBM® Netezza® Analytics is an embedded, purpose-built, advanced analytics platform.

5. pureScale
While the Shared Nothing Architecture has been the standard for many massively parallel processing environments, there are also successful architectures that use the Shared Disk model. Major examples are IBM's mainframe Parallel Sysplex and Oracle Real Application Clusters.

With the DB2 pureScale Feature, scaling your database solution is simple. Multiple database servers, known as members, process incoming database requests; these members operate in a clustered system and share data. You can transparently add more members to scale out to meet even the most demanding business needs. There are no application changes to make, data to redistribute, or performance tuning to do. The IBM® DB2® pureScale® Feature, much like a multi-partition database environment, provides a scalable and highly available database solution. However, the instance type and data layout of a DB2 pureScale environment and a multi-partition database environment are different.

A DB2 pureScale environment is ideal for short transactions where there is little need to parallelize each query. Queries are automatically routed to different members, based on member workload. This is not the typical workload in a Big Data processing scenario, but Big Data environments do invest in options like HBase and Cassandra to process short transactions.
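
Workload-based routing across members can be pictured with a small Python model. This is a toy sketch and not the pureScale algorithm: the least-loaded policy, the class and the member names are assumptions, and completed requests are never released.

```python
import heapq

class Cluster:
    """Toy cluster that sends each request to the currently least-loaded member."""

    def __init__(self, member_names):
        # Heap of (active_requests, member_name); the names are hypothetical.
        self._heap = [(0, name) for name in member_names]
        heapq.heapify(self._heap)

    def route(self):
        """Pick the least-loaded member and record one more active request on it."""
        load, name = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (load + 1, name))
        return name

    def add_member(self, name):
        """Scale out: a new member joins with no load; shared data is not moved."""
        heapq.heappush(self._heap, (0, name))

cluster = Cluster(["member0", "member1"])
first_four = [cluster.route() for _ in range(4)]
cluster.add_member("member2")
next_two = [cluster.route() for _ in range(2)]

print(first_four)  # ['member0', 'member1', 'member0', 'member1']
print(next_two)    # ['member2', 'member2']
```

Because all members share the same data, the new member can start taking requests immediately, which is the contrast with a partitioned layout where adding a partition implies redistributing rows.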

Summary
Traditional high-performance RDBMS like DB2 have their strengths. They are very strong in maintaining data integrity and quality through constraints, foreign keys and other validation mechanisms. They are also strong in transactional integrity, providing a superior locking model, automatic deadlock resolution and so on. Initially, however, they were not seen as well suited to the Big Data processing needs of enterprises.

With the enhancements made by their vendors, databases like DB2 now include Big Data processing features. This makes them strong candidates for enterprises that want best-of-breed features from both traditional RDBMS and Big Data processing systems while leveraging their existing investments.

More Stories By Srinivasan Sundara Rajan

Srinivasan is passionate about ownership and driving things on his own. With his breadth and depth in enterprise technology, he could run any aspect of the IT industry and make it a success.

He is a seasoned enterprise IT expert, mainly in the areas of solution, integration and architecture across structured and unstructured data sources, especially in the manufacturing domain.

He currently works as Technology Head for GAVS Technologies.
