Welcome!

SDN Journal Authors: Liz McMillan, Elizabeth White, Pat Romanski, Greg Schulz, Jerome McFarland

Related Topics: Containers Expo Blog, Microservices Expo, Microsoft Cloud, Linux Containers, Cloud Security, SDN Journal

Containers Expo Blog: Article

Data Efficiency at Scale

Overcoming limitations in data efficiency features

The initial wave of data efficiency features for primary storage focus on silos of information organized in terms of individual file systems. Deduplication and compression features provided by some vendors are limited by the scalability of those underlying file systems, essentially the file systems have become silos of optimized data. For example, NetApp deduplication can't scale beyond a 100 TB limit, because that's the limit in size of its WAFL file system. But ask anyone who's ever used NetApp deduplication if they've done it on a 100 TB file system, and you're likely to hear "are you crazy?" It's one thing to claim that data efficiency features can scale, quite a different one to actually use them with performance at scale.

Challenges around scalability generally center on two areas: scalability of random IO and memory overhead. Older solutions, like the one from NetApp, face the first challenge while newer flash-based storage systems are struggling with the second. I'll review both here:

The IO Challenge
Primary data-oriented storage devices handle both streaming and random throughput and therefore are sensitive to latency effects. Data efficiency requirements for primary storage must have fast hashing techniques to reduce the impact of latency. Fast hashes are non-cryptographic in nature and so require data comparison when used to do deduplication. It works like this:

  1. When a new chunk of data is read in it is first given a name using the hash algorithm.
  2. The system then checks a deduplication index to see if a chunk with that name has been seen before (note that this can consume disk IO and tremendous amounts of memory if done wrong).
  3. If the name has been seen we need to take extra steps. Because fast hashes are non-cryptographic, it is possible to have a name match while the data content differs. This is known in computer science as a hash-collision. To account for this, the existing copy of the chunk must be read in and compared bit-by-bit to the new. If they match, only a reference to the chunk is created. If not, then the new chunk must be written.

Essentially, this form of deduplication means trading a write of a duplicate chunk for a read. Depending on the design of the underlying block virtualization layer, duplicate chunks may be widely dispersed throughout the system. In that case, the bigger the system gets, the more expensive reads get - so processing of duplicate data becomes slower and slower as the storage system fills - this is why you won't find many 100 TB NetApp file systems with deduplication turned on. Certainly not for primary storage applications, the system would be flooded with random read requests and NetApp's deduplication process can end up taking months, years or even never complete.

A number of techniques have been used to reduce the impact of IO in other products. For example, the Hitachi NAS (HNAS) and Hitachi Unified Storage (HUS) solutions from HDS make use of hardware-acceleration to generate cryptographically secure hashes that do not require a data compare at all - this allows for linear scaling of deduplication performance on volumes up to 256 TB in size. Data is also written out before it is deduplicated to avoid introducing any latency through the hash computation process itself.

Permabit's own Albireo Virtual Data Optimizer (VDO) product, a plug-in module for Linux-based storage solutions, takes a different approach but with a similar result. VDO works inline to provide immediate data reduction. When data is written out, the VDO process intelligently lays it out in a sequential pattern, so that subsequent read compares of duplicates are more likely to be sequential as well. Both solutions do a fine job at solving the problem in real world scenarios, they just take different approaches.

The Memory Challenge
Many of today's flash array vendors are providing deduplication using similar fast hashing techniques to what I outlined above. With flash, the cost of doing random reads for read compares is a non-issue (random seeks on flash are much less expensive than for hard drive environments) so the use of the fast hash alone is enough to minimize latency. These systems (such as EMC's recently launched XtremIO product) are focused on delivering performance and the big challenge to performance at scale is available memory (DRAM). As above, after chunks are read in, they are named using a fast hashing algorithm. After that, the flash system must determine whether or not a chunk has been seen before. To get at this information as quickly as possible, flash-based storage systems have tended to use huge amounts of DRAM to cache chunk names in memory. It's not uncommon to see flash storage systems that allocate 16 GB of working cache per TB of storage. To support a 256 TB storage volume, such a system would require a TBs of DRAM. The increased hard costs in terms of more expensive (denser) DIMMS, as well as the increased cost of the server board required to support this many DIMMs combine to make this an extremely costly and unpopular proposition. Combine this with the fact that DRAM prices are not falling at the same rate as flash prices, and you can see why no vendor today makes a 256TB flash storage array with global deduplication capabilities.

The solution to the memory challenge is coming, in the form of a next generation of flash storage products that utilize Albireo indexing and Albireo VDO. Unlike the flash arrays described above, flash-optimized arrays with VDO takes advantage of advanced caching techniques to operate with 128 MB of working cache per TB of storage and deliver excellent performance. With VDO, a 256 TB system can be delivered with as little as 32 GB of RAM while delivering 1M IOPS performance. The net result is a cost effective and easily deployed data efficiency solution for flash arrays.

Conclusion

Deduplication Scalability by Vendor

As you can see in the table above, forward thinking vendors like HDS have done a good job at overcoming limitations in their data efficiency features and have products on the market today that can scale to meet the requirements of the large enterprise. Many other vendors are lagging behind, because of their inability to address IO and/or memory requirements, a serious downfall since data efficiency is at the core of distinguishing storage solutions, a critical end user requirement, and a ‘must have' component for 2014. Permabit's VDO product overcomes both of these limitations through the use of advanced memory-efficient caching techniques.

More Stories By Louis Imershein

As Senior Director of Product Strategy at Permabit Technology Corporation, Louis Imershein is responsible for product evolution and strategic planning for the Albireo family of products. He has 22 years of technical leadership experience in product management, software development and support. Prior to joining Permabit, Imershein was a Senior Product Marketing Manager for the Sun Microsystems Data Management Group. He has a Bachelor's degree in Biological Science from the University of California, Santa Cruz.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@CloudExpo Stories
The IoTs will challenge the status quo of how IT and development organizations operate. Or will it? Certainly the fog layer of IoT requires special insights about data ontology, security and transactional integrity. But the developmental challenges are the same: People, Process and Platform. In his session at @ThingsExpo, Craig Sproule, CEO of Metavine, will demonstrate how to move beyond today's coding paradigm and share the must-have mindsets for removing complexity from the development proc...
SYS-CON Events announced today TechTarget has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. TechTarget is the Web’s leading destination for serious technology buyers researching and making enterprise technology decisions. Its extensive global networ...
SYS-CON Events announced today that Commvault, a global leader in enterprise data protection and information management, has been named “Bronze Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Commvault is a leading provider of data protection and information management...
Many banks and financial institutions are experimenting with containers in development environments, but when will they move into production? Containers are seen as the key to achieving the ultimate in information technology flexibility and agility. Containers work on both public and private clouds, and make it easy to build and deploy applications. The challenge for regulated industries is the cost and complexity of container security compliance. VM security compliance is already challenging, ...
The essence of data analysis involves setting up data pipelines that consist of several operations that are chained together – starting from data collection, data quality checks, data integration, data analysis and data visualization (including the setting up of interaction paths in that visualization). In our opinion, the challenges stem from the technology diversity at each stage of the data pipeline as well as the lack of process around the analysis.
Designing IoT applications is complex, but deploying them in a scalable fashion is even more complex. A scalable, API first IaaS cloud is a good start, but in order to understand the various components specific to deploying IoT applications, one needs to understand the architecture of these applications and figure out how to scale these components independently. In his session at @ThingsExpo, Nara Rajagopalan is CEO of Accelerite, will discuss the fundamental architecture of IoT applications, ...
A strange thing is happening along the way to the Internet of Things, namely far too many devices to work with and manage. It has become clear that we'll need much higher efficiency user experiences that can allow us to more easily and scalably work with the thousands of devices that will soon be in each of our lives. Enter the conversational interface revolution, combining bots we can literally talk with, gesture to, and even direct with our thoughts, with embedded artificial intelligence, wh...
SYS-CON Events announced today that Tintri Inc., a leading producer of VM-aware storage (VAS) for virtualization and cloud environments, will exhibit at the 18th International CloudExpo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, New York, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that MangoApps will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. MangoApps provides modern company intranets and team collaboration software, allowing workers to stay connected and productive from anywhere in the world and from any device. For more information, please visit https://www.mangoapps.com/.
Enterprise networks are complex. Moreover, they were designed and deployed to meet a specific set of business requirements at a specific point in time. But, the adoption of cloud services, new business applications and intensifying security policies, among other factors, require IT organizations to continuously deploy configuration changes. Therefore, enterprises are looking for better ways to automate the management of their networks while still leveraging existing capabilities, optimizing perf...
SYS-CON Events announced today that Alert Logic, Inc., the leading provider of Security-as-a-Service solutions for the cloud, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Alert Logic, Inc., provides Security-as-a-Service for on-premises, cloud, and hybrid infrastructures, delivering deep security insight and continuous protection for customers at a lower cost than traditional security solutions. Ful...
In his session at 18th Cloud Expo, Bruce Swann, Senior Product Marketing Manager at Adobe, will discuss how the Adobe Marketing Cloud can help marketers embrace opportunities for personalized, relevant and real-time customer engagement across offline (direct mail, point of sale, call center) and digital (email, website, SMS, mobile apps, social networks, connected objects). Bruce Swann has more than 15 years of experience working with digital marketing disciplines like web analytics, social med...
SYS-CON Events announced today Object Management Group® has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that EastBanc Technologies will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. EastBanc Technologies has been working at the frontier of technology since 1999. Today, the firm provides full-lifecycle software development delivering flexible technology solutions that seamlessly integrate with existing systems – whether on premise or cloud. EastBanc Technologies partners with p...
SYS-CON Events announced today that AppNeta, the leader in performance insight for business-critical web applications, will exhibit and present at SYS-CON's @DevOpsSummit at Cloud Expo New York, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. AppNeta is the only application performance monitoring (APM) company to provide solutions for all applications – applications you develop internally, business-critical SaaS applications you use and the networks that deli...
The cloud market growth today is largely in public clouds. While there is a lot of spend in IT departments in virtualization, these aren’t yet translating into a true “cloud” experience within the enterprise. What is stopping the growth of the “private cloud” market? In his general session at 18th Cloud Expo, Nara Rajagopalan, CEO of Accelerite, will explore the challenges in deploying, managing, and getting adoption for a private cloud within an enterprise. What are the key differences betwee...
SYS-CON Events announced today that ContentMX, the marketing technology and services company with a singular mission to increase engagement and drive more conversations for enterprise, channel and SMB technology marketers, has been named “Sponsor & Exhibitor Lounge Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York City, New York. “CloudExpo is a great opportunity to start a conversation with new prospects, but what happens after the...
As machines are increasingly connected to the internet, it’s becoming easier to discover the numerous ways Industrial IoT (IIoT) is helping to shape the business world. This is exactly why we have decided to take a closer look at this pervasive movement and to examine the desire to connect more things! Now if you need a refresher on IIoT and how it is changing the world, take a moment and listen to Greg Gorbach with ARC Advisory Group. Gorbach believes, "IIoT will significantly change the worl...
The IoT is changing the way enterprises conduct business. In his session at @ThingsExpo, Eric Hoffman, Vice President at EastBanc Technologies, discuss how businesses can gain an edge over competitors by empowering consumers to take control through IoT. We'll cite examples such as a Washington, D.C.-based sports club that leveraged IoT and the cloud to develop a comprehensive booking system. He'll also highlight how IoT can revitalize and restore outdated business models, making them profitable...
SYS-CON Events announced today the Docker Meets Kubernetes – Intro into the Kubernetes World, being held June 9, 2016, in conjunction with 18th Cloud Expo | @ThingsExpo, at the Javits Center in New York, NY. Register for 'Docker Meets Kubernetes Workshop' Here! This workshop led by Sebastian Scheele, co-founder of Loodse, introduces participants to Kubernetes (container orchestration). Through a combination of instructor-led presentations, demonstrations, and hands-on labs, participants learn ...