In this post, we will talk about how to use shuffle partitions to speed up Spark SQL queries.
You will learn about two types of transformations: narrow and wide.
Narrow Transformations.
The data required to compute the records in a single partition resides in at most one partition of the parent DataFrame, so no data needs to move between partitions.
Examples (a sketch follows this list):
- SELECT
- DROP
- WHERE
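As a quick illustration, here is a query built only from narrow transformations. The `sales` table and its columns are hypothetical:

```sql
-- Narrow transformations only: each output row depends on a single
-- input row, so every partition can be processed independently
-- and no shuffle is triggered.
-- (The `sales` table and its columns are hypothetical.)
SELECT order_id, amount  -- SELECT: narrow
FROM sales
WHERE amount > 100;      -- WHERE: narrow
```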
Wide Transformations.
The data required to compute the records in a single partition may reside in many partitions of the parent DataFrame, so Spark must shuffle data across the cluster.
Examples (a sketch follows this list):
- DISTINCT
- GROUP BY
- ORDER BY
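For contrast, here is a query that forces a shuffle, again using the hypothetical `sales` table:

```sql
-- A wide transformation: rows sharing the same customer_id may live
-- in different partitions, so Spark shuffles them together before
-- aggregating. The shuffle produces spark.sql.shuffle.partitions
-- output partitions.
SELECT customer_id,
       SUM(amount) AS total_spent
FROM sales
GROUP BY customer_id;    -- GROUP BY: wide
```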
In Databricks, there is a feature called AQE (Adaptive Query Execution), which is enabled by default and performs some tuning at runtime, such as coalescing shuffle partitions based on the actual size of the data.
You can disable this feature like this:
SET spark.sql.adaptive.enabled = FALSE
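You can also inspect the current value before changing anything; running `SET` with just the key name returns the setting:

```sql
-- Show the current value of the flag (SET with no value reads it).
SET spark.sql.adaptive.enabled;

-- Re-enable AQE after experimenting with a fixed partition count.
SET spark.sql.adaptive.enabled = TRUE;
```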
When configuring a cluster, you will also see a Spark config text box under the advanced options.
Typing in ‘spark.sql.shuffle.partitions 8’ there will set the number of shuffle partitions to 8 for that cluster.
This setting determines the number of partitions Spark SQL uses whenever it shuffles data for a wide transformation; the default is 200.
You can change the number of partitions in a notebook too, but that means you will have to configure it each time you create a new notebook. Setting it within the notebook also means that users collaborating on your notebook will not share the same configuration.
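For a session-scoped change, the same `SET` syntax works in a notebook cell; 8 here is just an illustrative value:

```sql
-- Session-scoped: applies only to queries run in this SparkSession.
SET spark.sql.shuffle.partitions = 8;

-- Confirm the new value.
SET spark.sql.shuffle.partitions;
```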
However, setting it when you configure your cluster means that any subsequent notebook attached to that cluster will have this value as its default.
By following these steps and matching the shuffle partition count to your data and cluster size, you will be able to speed up your Spark SQL queries.