In this post, we will talk about how to use shuffle partitions to speed up Spark SQL queries.
You will learn about two types of transformations: narrow and wide.
Narrow Transformations.
The data required to compute the records in a single partition resides in at most one partition of the parent DataFrame, so no data needs to move between partitions.
Examples (a sketch follows this list):
- SELECT
- DROP
- WHERE
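As a quick illustration, here is a query built only from narrow transformations. The `sales` table and its columns are hypothetical:

```sql
-- Narrow transformations only: each output row depends on a single
-- input row, so every partition can be processed independently
-- and no shuffle is triggered.
-- (The `sales` table and its columns are hypothetical.)
SELECT order_id, amount  -- SELECT: narrow
FROM sales
WHERE amount > 100;      -- WHERE: narrow
```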
Wide Transformations.
The data required to compute the records in a single partition may reside in many partitions of the parent DataFrame, so Spark must shuffle data across the cluster.
Examples (a sketch follows this list):
- DISTINCT
- GROUP BY
- ORDER BY
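For contrast, here is a query that forces a shuffle, again using the hypothetical `sales` table:

```sql
-- A wide transformation: rows sharing the same customer_id may live
-- in different partitions, so Spark shuffles them together before
-- aggregating. The shuffle produces spark.sql.shuffle.partitions
-- output partitions.
SELECT customer_id,
       SUM(amount) AS total_spent
FROM sales
GROUP BY customer_id;    -- GROUP BY: wide
```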
In Databricks, there is a feature called AQE (Adaptive Query Execution), which is enabled by default and performs some tuning at runtime, such as coalescing shuffle partitions based on the actual size of the data.
You can disable this feature like this:
SET spark.sql.adaptive.enabled = FALSE
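You can also inspect the current value before changing anything; running `SET` with just the key name returns the setting:

```sql
-- Show the current value of the flag (SET with no value reads it).
SET spark.sql.adaptive.enabled;

-- Re-enable AQE after experimenting with a fixed partition count.
SET spark.sql.adaptive.enabled = TRUE;
```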
When configuring a cluster, you will also see a Spark config text box under the advanced options.
Typing in ‘spark.sql.shuffle.partitions 8’ there will set the number of shuffle partitions to 8 for that cluster.
This setting determines the number of partitions Spark SQL uses whenever it shuffles data for a wide transformation; the default is 200.
You can change the number of partitions in a notebook too, but that means you will have to configure it each time you create a new notebook. Setting it within the notebook also means that users collaborating on your notebook will not share the same configuration.
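For a session-scoped change, the same `SET` syntax works in a notebook cell; 8 here is just an illustrative value:

```sql
-- Session-scoped: applies only to queries run in this SparkSession.
SET spark.sql.shuffle.partitions = 8;

-- Confirm the new value.
SET spark.sql.shuffle.partitions;
```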
However, setting it when you configure your cluster means that any subsequent notebook attached to that cluster will have this value as its default.
By following these steps and matching the shuffle partition count to your data and cluster size, you will be able to speed up your Spark SQL queries.