By Srinivasan Sundara Rajan | March 14, 2013 11:15 AM EDT | Reads: 3,258
Data Warehouse as a Service
Amazon recently announced Redshift, a data warehouse as a service, as a beta offering. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth of the cost of most traditional data warehousing solutions.
Architecture Behind Redshift
Any data warehouse service meant to serve petabyte-scale data needs a robust architecture as its backbone. The following are the salient features of the Redshift service.
- Shared Nothing Architecture: As indicated in one of my earlier articles, Cloud Database Scale Out Using Shared Nothing Architecture, the shared nothing pattern is the most desirable for databases of this scale, and Redshift adheres to it. The core component of Redshift is a cluster. Each cluster consists of multiple compute nodes, and each node has its own dedicated storage, following the shared nothing principle.
- Massively Parallel Processing (MPP): Hand in hand with the shared nothing pattern, MPP lets large data warehouses scale out horizontally rather than scale up individual servers, and it enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to the final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data. With the concept of node slices, Redshift extends MPP down to the cores of a compute node. A compute node is partitioned into slices, one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.

Refer to the data warehouse system architecture diagram in the AWS documentation.
- Columnar Data Storage: Storing database table information in a columnar fashion reduces the number of disk I/O requests and the amount of data that must be loaded from disk, because a query reads only the columns it references. This is an important factor in optimizing analytic query performance.
- Leader Node: The leader node manages most communications with client programs and all communication with compute nodes. It parses queries and develops execution plans to carry out database operations, in particular the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node distributes compiled code to the compute nodes and assigns a portion of the data to each compute node.
- High Speed Network Connect: The nodes in a cluster are connected internally by a 10 Gigabit Ethernet network, providing very fast communication between the leader node and the compute nodes.
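The division of labor between the leader node and the slices can be illustrated with a toy scatter-gather sketch. This is not Redshift code: the cluster size, the function names, and the use of threads to stand in for slices are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NODES = 2            # hypothetical cluster size
SLICES_PER_NODE = 2  # one slice per core, as described above


def distribute(rows, n_slices):
    """Leader-node step: give each slice a portion of the data."""
    slices = [[] for _ in range(n_slices)]
    for i, row in enumerate(rows):
        slices[i % n_slices].append(row)
    return slices


def slice_partial_sum(rows):
    """Every slice runs the same query segment on its own portion."""
    return sum(rows)


def leader_sum(rows):
    """Scatter the work to all slices, then aggregate the partial results."""
    slices = distribute(rows, NODES * SLICES_PER_NODE)
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(slice_partial_sum, slices))
    return sum(partials), partials


total, partials = leader_sum(list(range(1, 101)))
print(total)  # 5050
```

Each slice touches only its own rows, and the leader sees only the partial results, which is the essence of the shared nothing and MPP combination.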
Best Practices in Application Design on Redshift
The enablement of Big Data analytics through Redshift has created a lot of excitement in the community. Alternatives to traditional data warehousing such as this work best when their features are used according to best practices. The following are some best practices to consider when designing applications on Redshift.
1. Colocated Tables: It is good practice to avoid sending data between nodes to satisfy JOIN queries. Colocation between two joined tables occurs when the matching rows of the two tables are stored on the same compute node, so that the data need not be sent between nodes.
When you add data to a table, Amazon Redshift distributes the rows in the table to the cluster slices using one of two methods:
- Even distribution
- Key distribution
Even distribution is the default distribution method. With even distribution, the leader node spreads data rows across the slices in a round-robin fashion, regardless of the values that exist in any particular column. This approach is a good choice when you don't have a clear option for a distribution key.
If you specify a distribution key when you create a table, the leader node distributes the data rows to the slices based on the values in the distribution key column. Matching values from the distribution key column are stored together.
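The two distribution methods can be compared with a small simulation. This is a sketch under stated assumptions: the function names are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function Redshift uses internally.

```python
import zlib
from collections import defaultdict


def even_distribution(rows, n_slices):
    """Round-robin placement, ignoring the values in the rows."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        placement[i % n_slices].append(row)
    return placement


def key_distribution(rows, n_slices, key):
    """Placement by a hash of the distribution key (crc32 is a stand-in)."""
    placement = defaultdict(list)
    for row in rows:
        slice_id = zlib.crc32(str(row[key]).encode()) % n_slices
        placement[slice_id].append(row)
    return placement


orders = [{"order_id": i, "customer_id": i % 3} for i in range(12)]

even = even_distribution(orders, 4)
keyed = key_distribution(orders, 4, "customer_id")

print({s: len(rows) for s, rows in even.items()})   # equal row counts per slice
print({s: len(rows) for s, rows in keyed.items()})  # rows grouped by customer_id
```

Even distribution balances row counts but scatters each customer's orders. Key distribution keeps all rows for one customer on one slice, at the risk of uneven slices when the key values are skewed.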
Colocation is best achieved by choosing appropriate distribution keys rather than using even distribution.
If you frequently join a table, specify the join column as the distribution key. If a table joins with multiple other tables, distribute on the foreign key of the largest dimension that the table joins with. If the dimension tables are filtered as part of the joins, compare the size of the data after filtering when you choose the largest dimension. This ensures that the rows involved with your largest joins will generally be distributed to the same physical nodes. Because local joins avoid data movement, they will perform better than network joins.
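To see why a shared distribution key makes joins local, consider the following sketch. The table contents, the names, and the crc32 hash are illustrative assumptions, not Redshift internals. Both tables are distributed on the join column, and each slice joins only the rows it already holds.

```python
import zlib

N_SLICES = 4  # hypothetical slice count


def slice_for(value, n_slices=N_SLICES):
    """Map a distribution-key value to a slice (crc32 is a stand-in hash)."""
    return zlib.crc32(str(value).encode()) % n_slices


def distribute_by_key(rows, key):
    slices = [[] for _ in range(N_SLICES)]
    for row in rows:
        slices[slice_for(row[key])].append(row)
    return slices


def local_join(fact_slices, dim_slices, key):
    """Join each slice only against the co-resident slice of the other table."""
    joined = []
    for fact_rows, dim_rows in zip(fact_slices, dim_slices):
        dim_by_key = {d[key]: d for d in dim_rows}
        for f in fact_rows:
            if f[key] in dim_by_key:
                joined.append({**dim_by_key[f[key]], **f})
    return joined


customers = [{"customer_id": c, "name": f"cust{c}"} for c in range(5)]
orders = [{"order_id": o, "customer_id": o % 5} for o in range(20)]

result = local_join(distribute_by_key(orders, "customer_id"),
                    distribute_by_key(customers, "customer_id"),
                    "customer_id")
print(len(result))  # 20: every order found its customer without leaving its slice
```

Because equal key values always hash to the same slice, no row has to cross the network to find its match.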
2. De-Normalization: In a traditional RDBMS, database storage is optimized by applying normalization principles, so that a particular attribute (column) is associated with one and only one entity (table). However, in shared nothing scalable databases like Redshift this technique will not yield the desired results. Instead, keeping certain columns redundant through de-normalization is very important.
For example, the following is one of the examples of a high-performance query in the Redshift documentation.
SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '1/1/2013'
AND tab2.timestamp > '1/1/2013';
Even if a predicate is already being applied on a table in a join query but transitively applies to another table in the query, it's useful to re-specify the redundant predicate if that other table is also sorted on the column in the predicate. That way, when scanning the other table, Redshift can efficiently skip blocks from that table as well.
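The block skipping described above can be modeled in a few lines, assuming that each block of a sorted column records the minimum and maximum value it contains. The block size and function names here are hypothetical, and real blocks are far larger.

```python
from datetime import date

BLOCK_SIZE = 3  # hypothetical; chosen small so the effect is visible


def build_blocks(sorted_values, block_size=BLOCK_SIZE):
    """Split a sorted column into blocks that remember their min and max."""
    blocks = []
    for i in range(0, len(sorted_values), block_size):
        chunk = sorted_values[i:i + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "rows": chunk})
    return blocks


def scan_greater_than(blocks, threshold):
    """Return matching values and the number of blocks actually read."""
    scanned, hits = 0, []
    for block in blocks:
        if block["max"] <= threshold:
            continue  # no row in this block can satisfy the predicate
        scanned += 1
        hits.extend(v for v in block["rows"] if v > threshold)
    return hits, scanned


# Twelve monthly timestamps, July 2012 through June 2013, already sorted.
timestamps = ([date(2012, m, 1) for m in range(7, 13)]
              + [date(2013, m, 1) for m in range(1, 7)])
blocks = build_blocks(timestamps)

hits, scanned = scan_greater_than(blocks, date(2013, 1, 1))
print(len(hits), scanned, len(blocks))  # 5 matching rows, 2 of 4 blocks read
```

Repeating the predicate on the second table gives the scan of that table the same opportunity to skip blocks, provided it is sorted on the same column.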
By carefully applying de-normalization to bring the required redundancy, Amazon Redshift can perform at its best.
3. Native Parallelism: One of the biggest advantages of a shared nothing MPP architecture is parallelism. Parallelism is achieved in multiple ways.
- Inter Node Parallelism: This refers to the ability of the database system to break up a query into multiple parts that run on multiple instances across the cluster.
- Intra Node Parallelism: This refers to the ability to break up a query into multiple parts within a single compute node.
Typically in MPP architectures, Inter Node Parallelism and Intra Node Parallelism are combined and used at the same time to provide dramatic performance gains.
Amazon Redshift provides many operations that use both Intra Node and Inter Node parallelism.
When you use a COPY command to load data from Amazon S3, first split your data into multiple files instead of loading all the data from a single large file.
The COPY command then loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. Each XL compute node has two slices, and each 8XL compute node has 16 slices. Name each file with a common prefix. For example, if you have a cluster with two XL nodes, you might split your data into four files named customer_1, customer_2, customer_3, and customer_4. Amazon Redshift does not take file size into account when dividing the workload, so make sure the files are roughly the same size.
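A minimal sketch of the file-splitting step might look like the following. The helper name and its parameters are hypothetical. It only plans which input line goes to which file and leaves the actual writing and upload to S3 to the reader.

```python
def plan_split(lines, n_slices, prefix, files_per_slice=1):
    """Spread input lines over a file count that is a multiple of the slices."""
    n_files = n_slices * files_per_slice
    files = {f"{prefix}_{i + 1}": [] for i in range(n_files)}
    names = list(files)
    for i, line in enumerate(lines):
        # Round-robin keeps the files within one line of each other in length.
        files[names[i % n_files]].append(line)
    return files


# Two XL nodes with two slices each give four slices in total.
files = plan_split([f"row{i}" for i in range(10)], n_slices=2 * 2, prefix="customer")
for name, rows in files.items():
    print(name, len(rows))
```

Round-robin assignment balances line counts, which approximates equal file sizes only when the rows are of similar length. Data with very uneven row sizes would need a split based on bytes instead.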
4. Pre-Processing Data: RDBMS engines have long taken pride in physical data independence. Codd's 12 rules for an RDBMS state the following:
Rule 8: Physical data independence:
Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.
However, in columnar database services like Redshift, the physical ordering of data has a major impact on performance.
Sorting data is a mechanism for optimizing query performance.
When you create a table, you can define one or more of its columns as the sort key. When data is loaded into the table, the values in the sort key column (or columns) are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns.
The VACUUM command also makes sure that new data in tables is fully sorted on disk. Vacuum as often as you need to in order to maintain consistent query performance.
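The merge join mentioned above depends on both inputs arriving in sorted order, which is why the sort key matters. The sketch below is a simplified illustration, not the planner's actual algorithm. It assumes both inputs are already sorted on the join column and that the right-hand (dimension) table has unique keys.

```python
def merge_join(left, right, key):
    """Join two inputs already sorted on `key`; right-side keys are unique."""
    joined, j = [], 0
    for l_row in left:
        # Advance the right cursor. It never moves backwards, so each input
        # is read once, with no hash table to build.
        while j < len(right) and right[j][key] < l_row[key]:
            j += 1
        if j < len(right) and right[j][key] == l_row[key]:
            joined.append({**right[j], **l_row})
    return joined


orders = [{"order_id": 10, "customer_id": 1},
          {"order_id": 11, "customer_id": 1},
          {"order_id": 12, "customer_id": 2},
          {"order_id": 13, "customer_id": 4}]
customers = [{"customer_id": 1, "name": "A"},
             {"customer_id": 2, "name": "B"},
             {"customer_id": 3, "name": "C"}]

result = merge_join(orders, customers, "customer_id")
print([r["order_id"] for r in result])  # [10, 11, 12]
```

If newly loaded rows are left unsorted, the single forward pass no longer works, which is one reason to vacuum after large loads.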
Summary
Platform as a Service (PaaS) is one of the greatest benefits the cloud delivery model has brought to the IT community. From its beginnings in pure-play programming models like Windows Azure and Elastic Beanstalk, it has moved to high-end services such as the data warehouse as a service. Industry analysts see good adoption of this service because of its large cost advantage over traditional data warehouse platforms, and the best practices mentioned above will help achieve the desired level of performance. Detailed documentation is also available on the vendor site in the form of developer and administrator guides.
Copyright © 2013 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Srinivasan Sundara Rajan
Srinivasan Sundara Rajan (also known as Sundar) is an enterprise technology enabler for realizing business capabilities. His primary focus is enabling agile enterprises by facilitating the adoption of the Everything as a Service model, with particular concentration on BPaaS (Business Process as a Service). He also helps enterprises get meaningful insights from their structured, unstructured, and real-time data sources. All the views expressed are Srinivasan's independent analysis of industry and solutions and are not necessarily those of his current or past organizations. Srinivasan would like to thank everyone who augmented his architectural skills with analytical ideas.