Columnar vs. Key-Value Storage Models

Pay attention to specific configuration and tuning around three points

What are the performance differences between in-memory columnar databases like SAP HANA and GridGain's In-Memory Database (IMDB), which uses distributed key-value storage? This question comes up regularly in conversations with our customers, and the answer is not obvious.

Storage Models
First off, let's clearly state that we are talking about the storage model only and its implications for performance in various use cases. It's important to note that:

  • The storage model doesn't dictate or preclude particular transactionality or consistency guarantees; there are columnar databases that support ACID transactions (HANA) and those that don't (HBase), and there are distributed key-value databases that support ACID transactions (GridGain) and those that don't (for example, Riak and memcached).
  • The storage model doesn't dictate a specific query language; using the above examples, GridGain and HANA support SQL, while HBase, for example, doesn't.

Unlike transactionality and query language, however, performance considerations are not that straightforward.

Note also: SAP HANA has a pluggable storage model and an experimental row-based storage implementation. We'll concentrate on columnar storage, which apparently accounts for all HANA usage at this point.

HANA's Columnar Storage Model
Let's recall what the columnar storage model entails in general and note its HANA specifics.

Some of its standout characteristics include:

  • Data in the columnar model is kept in columns (vs. rows, as in row-based storage models).
  • Since data in a single column is almost always homogeneous, it's frequently compressed for storage (especially in in-memory systems like HANA).
  • Aggregate functions (i.e. column functions) are very fast on a columnar data model, since the entire column can be fetched very quickly and effectively indexed.
  • Inserts, updates and row functions, however, are significantly slower than their row-based counterparts as a trade-off of the columnar approach (inserting a row leads to multiple column inserts - see the sketch after this list). Because of this characteristic, columnar databases are typically used in R/OLAP scenarios (where data doesn't change) and very rarely in OLTP use cases (where data changes frequently).
  • Since columnar storage is fairly compact, it doesn't generally require distribution (i.e. data partitioning) to store large datasets - the entire database can often be logically stored in the memory of a single server. HANA, however, provides comprehensive support for data partitioning.
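
The trade-off above is easy to see in code. Below is a minimal, vendor-neutral Java sketch - the Trade class, column names and data sizes are hypothetical and this is not HANA's internal format - showing why a column aggregate is a tight scan over one array, while inserting a row into a column store touches every column.

    import java.util.ArrayList;
    import java.util.List;

    public class StorageModels {
        // Row-based layout: one object per row.
        static class Trade {
            final long id; final String symbol; final double price;
            Trade(long id, String symbol, double price) {
                this.id = id; this.symbol = symbol; this.price = price;
            }
        }

        public static void main(String[] args) {
            int n = 1_000_000;

            // Row store: a list of row objects.
            List<Trade> rowStore = new ArrayList<>(n);

            // Column store: one array per column.
            long[] idCol = new long[n];
            String[] symbolCol = new String[n];
            double[] priceCol = new double[n];

            for (int i = 0; i < n; i++) {
                // A row insert touches a single object...
                rowStore.add(new Trade(i, "SYM" + (i % 100), i * 0.01));
                // ...while the same insert into the column store writes to every column.
                idCol[i] = i;
                symbolCol[i] = "SYM" + (i % 100);
                priceCol[i] = i * 0.01;
            }

            // Aggregate (column function): a scan over one contiguous array, no other
            // columns are touched - this is where columnar storage shines.
            double sum = 0;
            for (double p : priceCol) sum += p;
            System.out.println("sum(price) over column store: " + sum);
        }
    }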

It is important to emphasize that the columnar storage model is ideally suited for very compact memory utilization for two main reasons:

  • The columnar model is a natural fit for compression, which often provides a dramatic reduction in memory consumption (a minimal encoding sketch follows this list).
  • Since column-based functions are very fast, there is no need for materialized views of aggregated values - the necessary values can simply be computed on the fly, which leads to a significantly reduced memory footprint as well.
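
To make the compression point concrete, here is a small sketch of dictionary encoding, one of the simplest column compression schemes. Real products (HANA included) use more elaborate techniques; the class and column below are illustrative assumptions, the point is only that a homogeneous column compresses very well.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DictionaryEncodedColumn {
        private final Map<String, Integer> dictionary = new HashMap<>(); // value -> code
        private final List<String> values = new ArrayList<>();           // code  -> value
        private final List<Integer> codes = new ArrayList<>();           // the encoded column

        void append(String value) {
            Integer code = dictionary.get(value);
            if (code == null) {                 // first time we see this value
                code = values.size();
                dictionary.put(value, code);
                values.add(value);
            }
            codes.add(code);                    // store a small int instead of the string
        }

        String get(int row) {
            return values.get(codes.get(row));  // decode on access
        }

        public static void main(String[] args) {
            DictionaryEncodedColumn country = new DictionaryEncodedColumn();
            for (int i = 0; i < 1_000_000; i++) country.append(i % 2 == 0 ? "DE" : "US");
            // One million column entries, but only two distinct strings are stored.
            System.out.println(country.get(0) + ", distinct values: " + country.values.size());
        }
    }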

GridGain's IMDB Key-Value Storage Model
The key-value (KV) storage model is less rigidly defined than its columnar counterpart and usually involves a fair amount of vendor specifics.

Historically, there are two schools of KV storage models:

  • Traditional (examples include Riak, memcached, Redis). The common characteristic of these systems is a raw, language-independent storage format for keys and values.
  • Data Grid (examples include GridGain IMDB, GigaSpaces, Coherence). The common trait of these systems is reliance on the JVM as the underlying runtime platform and the treatment of keys and values as user-defined JVM objects (a minimal contrast of the two styles follows this list).
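
The sketch below contrasts the two styles. Plain in-JVM maps stand in for the actual stores (memcached, GridGain, etc.), and the Trade record and key names are hypothetical - this is not any product's API, only an illustration of raw byte[] values vs. typed JVM objects.

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class KvSchools {
        // "Traditional" style: the store sees opaque, language-independent bytes;
        // serialization is the client's problem.
        static final ConcurrentMap<String, byte[]> rawStore = new ConcurrentHashMap<>();

        // "Data grid" style: keys and values are user-defined JVM objects.
        record Trade(long id, String symbol, double price) { }
        static final ConcurrentMap<Long, Trade> objectStore = new ConcurrentHashMap<>();

        public static void main(String[] args) {
            rawStore.put("trade:1",
                    "{\"symbol\":\"ACME\",\"price\":10.5}".getBytes(StandardCharsets.UTF_8));
            objectStore.put(1L, new Trade(1L, "ACME", 10.5));

            // The raw store returns bytes the application still has to parse...
            System.out.println(new String(rawStore.get("trade:1"), StandardCharsets.UTF_8));
            // ...the object store returns a typed object the application can use directly.
            System.out.println(objectStore.get(1L).symbol());
        }
    }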

GridGain's IMDB belongs to the Data Grid branch of KV storage models. Some of its key characteristics are:

  • Data is stored in a set of distributed maps (a.k.a. dictionaries or caches); in a simple approximation you can think of a value as a row in a row-based model, and of a key as that row's primary key. Following this analogy, a single KV map can be approximated as a row-based table with an automatic primary key index (see the sketch after this list).
  • Keys and values are represented as user-defined JVM objects, and therefore no automatic compression can be performed.
  • Data distribution is designed in from the ground up. Data is partitioned across the cluster, mitigating, in part, the lack of compression. Unlike in HANA, data partitioning is mandatory.
  • MapReduce is the main API for data processing (SQL is supported as well).
  • Strong affinity and co-location semantics are provided by default.
  • There is no bias toward aggregate or row-based processing performance, and therefore no bias toward either OLAP or OLTP applicability.
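
The "distributed map as a table" analogy from the first item looks roughly like this. A plain ConcurrentHashMap stands in for a GridGain cache, the Trade record plays the role of a row, and the Long key plays the role of its primary key; the stream-based aggregation only mimics the spirit of a MapReduce-style query, not GridGain's actual API.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class MapAsTable {
        record Trade(long id, String symbol, double price) { }     // the "row"

        public static void main(String[] args) {
            // key = primary key, value = row; a lookup by key is effectively an index hit.
            ConcurrentMap<Long, Trade> trades = new ConcurrentHashMap<>();
            trades.put(1L, new Trade(1L, "ACME", 10.5));
            trades.put(2L, new Trade(2L, "ACME", 11.0));
            trades.put(3L, new Trade(3L, "INIT", 99.9));

            Trade byPk = trades.get(2L);                            // "SELECT * WHERE id = 2"
            System.out.println(byPk);

            // "Map" over the values and "reduce" to an aggregate - in a real data grid
            // each node would do this over its own partition and only the partial
            // results would travel over the network.
            double acmeTotal = trades.values().stream()
                    .filter(t -> t.symbol().equals("ACME"))
                    .mapToDouble(Trade::price)
                    .sum();
            System.out.println("ACME total: " + acmeTotal);
        }
    }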

Performance Considerations
It is somewhat expected that, for heavy transactional processing, GridGain will provide better overall performance in most cases:

  • The columnar model is rather inefficient at updating or inserting values in multiple columns.
  • Transactional locking is also less efficient in the columnar model.
  • The required decompression and re-compression further degrades performance.
  • The KV storage model, on the other hand, is ideal for individual updates, as individual objects can be accessed, locked and updated very effectively (see the sketch after this list).
  • The lack of compression in GridGain IMDB makes updates even faster than in a columnar model with compression.
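
A minimal sketch of why per-object updates are cheap in a KV model: the update touches exactly one entry and can be locked and applied atomically on that key alone. Here ConcurrentHashMap.compute() stands in for a transactional cache update; GridGain's real API and locking semantics are richer than this, and the Account record is a hypothetical example.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class PerKeyUpdate {
        record Account(long id, double balance) { }

        public static void main(String[] args) {
            ConcurrentMap<Long, Account> accounts = new ConcurrentHashMap<>();
            accounts.put(42L, new Account(42L, 100.0));

            // Atomic read-modify-write of a single value; no other entries (and no
            // other "columns") are touched, decompressed or re-compressed.
            accounts.compute(42L, (id, acc) -> new Account(id, acc.balance() + 25.0));

            System.out.println(accounts.get(42L));
        }
    }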

As an example, GridGain just won a public tender for one of the biggest financial institutions in the world by achieving 1 billion transactional updates per second on 10 commodity blades costing less than $25K altogether. That transactional performance and the associated TCO are clearly not territory any columnar database can approach.

For OLAP workloads the picture is less obvious. HANA is heavily biased toward OLAP processing, while GridGain IMDB is neutral toward it. Both GridGain IMDB and SAP HANA provide comprehensive data partitioning capabilities and allow for processing parallelization - the MPP traits necessary for scale-out OLAP processing. I believe the actual difference observed by customers will be driven primarily by three factors rooted deeply in the differences between the columnar and KV implementations in the respective products (a small partitioning sketch follows the list):

  • Optimizations around data affinity and co-location.
  • Optimizations around the distribution overhead.
  • Optimizations around indexing of partitioned data.
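
To give a feel for the first two factors, here is a small sketch of affinity-based partitioning. The partition count, the choice of customerId as the affinity key and the modulo scheme are illustrative assumptions, not GridGain's or HANA's actual partitioning functions; the idea is that co-located data keeps distribution overhead off the critical path.

    import java.util.ArrayList;
    import java.util.List;

    public class AffinityPartitioning {
        static final int PARTITIONS = 8;                    // assumed partition count

        // All keys that share an affinity key land in the same partition (and hence
        // on the same node), so joins and aggregations over them stay local.
        static int partitionFor(Object affinityKey) {
            return Math.floorMod(affinityKey.hashCode(), PARTITIONS);
        }

        record OrderKey(long orderId, long customerId) { }  // customerId is the affinity key

        public static void main(String[] args) {
            List<OrderKey> orders = List.of(
                    new OrderKey(1, 1001), new OrderKey(2, 1001), new OrderKey(3, 2002));

            List<List<OrderKey>> partitions = new ArrayList<>();
            for (int i = 0; i < PARTITIONS; i++) partitions.add(new ArrayList<>());

            for (OrderKey k : orders)
                partitions.get(partitionFor(k.customerId())).add(k);   // route by affinity key

            // Orders 1 and 2 (same customer) end up co-located; an aggregate over that
            // customer never crosses a partition boundary.
            for (int p = 0; p < PARTITIONS; p++)
                if (!partitions.get(p).isEmpty())
                    System.out.println("partition " + p + " -> " + partitions.get(p));
        }
    }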

Unfortunately, there's no way to provide generalized guidance on the performance difference here. We always recommend trying both in your particular scenario, paying attention to specific configuration and tuning around the three points mentioned above, and seeing what results you get. It does take time and resources, but you may be surprised by your findings!

More Stories By Nikita Ivanov

Nikita Ivanov is founder and CEO of GridGain Systems, founded in 2007 and funded by RTP Ventures and Almaz Capital. Nikita has led GridGain to develop advanced distributed in-memory data processing technologies - the top Java in-memory computing platform, which today is started every 10 seconds around the world.

Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, and contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. Nikita was one of the pioneers in using Java technology for server-side middleware development while working for one of Europe’s largest system integrators in 1996.

He is an active member of the Java middleware community, a contributor to the Java specification, and holds a Master’s degree in Electro Mechanics from Baltic State Technical University, Saint Petersburg, Russia.
