Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze an organization’s data using existing business intelligence tools.

In the first and second blog posts in this series, we noted that our experience assisting clients with slow Redshift queries points to three main contributing factors: ineffective zone maps, excessive network traffic, and too many query steps. Addressing these three issues can resolve the majority of significantly slow Redshift queries.

To learn more about the first two of these factors, check out our prior posts Improving Redshift Query Performance Through Effective Use of Zone Maps and Improving Redshift Query Performance By Decreasing Network Traffic.

This blog post focuses on the third and final issue, too many query steps, and how it negatively impacts Redshift query times.

How Too Many Query Steps Impacts Redshift Performance

Let’s face it: real-world queries bear little resemblance to the examples that populate docs and blogs on Redshift optimization. That is because real-world SQL queries tend to be complex! This complexity often makes it difficult to convey information about the topic at hand, so most documentation focuses on simple, isolated query clauses. But that leaves a gap as practitioners scale up the complexity of their queries, and it brings us to the third top reason that many queries run much slower than they need to: too many query steps. A clustered columnar database like Redshift performs best when multiple parallel actions occur on the data at once. When data scale is large, enabling such parallelism is critical, but it is possible to express your SQL in ways that defeat or minimize it.

Several scenarios may lead practitioners or upstream tools to generate SQL that defeats this parallelism. Use of Common Table Expressions (CTEs); auto-generation of SQL by ETL, Business Intelligence, or Data Science tools and packages; and the need for readable, supportable, and maintainable code can all lead to queries that Redshift must decompose into a large number of query steps. While there are usually valid drivers for such usage, it is important to understand the significant performance impact that too many query steps can have.

Let’s look at how these structures can lead to poor query performance.

A single query can end up appearing as dozens to Redshift

Each SELECT in a query creates a set of temporary results (really, an intermediate dataset) that is passed on to the next step of the query. While Redshift’s query planner can optimize some of this away, each SELECT should be seen as creating an intermediate result made available to the next step, and therefore carries a cost roughly proportional to the amount of data it creates. In our work with customers, it is not uncommon to observe queries with over 50 SELECTs, some of which are nested 20 deep. When such a query is executed, an enormous amount of intermediate result data is produced. If the extraneous SELECTs act on large data sets, the impact on query performance can be severe.

Text view of how one large query can appear to Redshift

A common driver for such lengthy, serialized queries is the need for readability and maintainability of the SQL code, along with the need to easily check the validity and expected behavior of intermediate data. These needs are quite real and should be given credence when designing and implementing new queries. It is possible to take this to an extreme, however, and doing so can produce very slow queries whose cost does not appear until data sizes grow in production. Consider this example:

WITH tab1 AS (SELECT cola, colb, colc FROM big_table),
     tab2 AS (SELECT * FROM tab1 WHERE cola > 1000),
     tab3 AS (SELECT *, SUM(cola) OVER () AS suma, MAX(colb) OVER () AS maxb FROM tab2),
     tab4 AS (SELECT *, cola / suma AS percenta, colb / maxb AS weightb FROM tab3)
SELECT * FROM tab4;

This is very readable. Each step has a defined purpose, and new queries can be built flexibly by reusing steps from this one. However, if big_table is large, this query will take a long time to run because: (1) a large quantity of distinct intermediate data must be generated between each step; (2) limited or no block rejection (see our first post in this series) occurs on these intermediate results; and (3) computations expressed as distinct serialized steps run one after another rather than in parallel. In fact, all the steps above could be performed in a single SELECT, and the scan of the table could then achieve good block rejection as it processes the WHERE clause.
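As a sketch of that flattening (big_table, cola, colb, and colc are the hypothetical names from the example above), the four CTE steps collapse into one SELECT whose WHERE clause can drive block rejection during the scan:

```sql
-- One SELECT replacing tab1 through tab4: filter first, then compute
-- the grand totals and derived ratios as window aggregates over the
-- filtered rows, all in a single pass over big_table.
SELECT cola,
       colb,
       colc,
       cola / SUM(cola) OVER () AS percenta,
       colb / MAX(colb) OVER () AS weightb
FROM big_table
WHERE cola > 1000;
```

The window aggregates (SUM and MAX with an empty OVER clause) compute the same grand totals the tab3 step produced, without materializing any intermediate result set.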

Symptoms of Excessive Query Steps

An easily recognized symptom that excessive query steps may be occurring is the appearance of “volt_tt” tables as Redshift executes your query. When Redshift needs to manage a very large number of steps on significant data segments, it will decompose the query into a number of sub-queries, each with its own resulting data table. Each of these will look like “CREATE TEMP TABLE volt_tt_53411c7e9018f …” in the console. Redshift needs to perform this decomposition to handle the largest of queries on the largest of data but we see it most often when the intermediate data size is artificially inflated by poor query construction.
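One way to check for this symptom after the fact is to search Redshift’s query logs for the volt_tt prefix. A minimal sketch, assuming your user can read the SVL_QLOG system view (which retains only roughly the first 60 characters of each statement in its substring column):

```sql
-- Recent statements whose logged text mentions a volt_tt temp table
SELECT userid, query, starttime, substring
FROM svl_qlog
WHERE substring LIKE '%volt_tt%'
ORDER BY starttime DESC
LIMIT 20;
```

Queries surfaced this way are candidates for the optimization review described below.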

Optimization Tips

Queries that create such volt_tt intermediates should always be reviewed. They are either linchpin queries central to your solution or, more likely, queries that need to be optimized to remove excessive SELECTs. Eliminating extraneous steps is fairly straightforward and entails flattening SELECTs. This is not to say that the query should, or even can, be flattened to a single SELECT, nor that achieving the minimum number of SELECTs is the goal; rather, reducing the number of extraneous SELECTs will likely be needed to optimize query performance. One rule of thumb is to view each new GROUP BY set of columns as a new query: SELECTs acting on the same level of data aggregation can likely be combined. If Redshift is still producing volt_tt intermediates after you remove extraneous SELECTs, additional optimization is likely warranted. If you and your team find that you need help removing extraneous query steps to speed up your queries, 47Lining’s team of experts offers Redshift “Blackbelt” Optimization services. Contact us today to discuss your needs.
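To illustrate the GROUP BY rule of thumb, here is a minimal sketch (sales, custid, and amount are hypothetical names): two steps that each aggregate over the same set of grouping columns can be merged into a single SELECT.

```sql
-- Before: two serialized steps over the same grouping columns
-- WITH s1 AS (SELECT custid, SUM(amount) AS total FROM sales GROUP BY custid),
--      s2 AS (SELECT custid, COUNT(*) AS order_count FROM sales GROUP BY custid)
-- SELECT s1.custid, total, order_count FROM s1 JOIN s2 USING (custid);

-- After: one step at the same level of aggregation
SELECT custid,
       SUM(amount) AS total,
       COUNT(*)    AS order_count
FROM sales
GROUP BY custid;
```

Because both aggregates act on the same level of data aggregation, combining them removes an intermediate result set and a join.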

Next up, the next tier…

— The Redshift Whisperer