Writing Data

Kasim Ali
Jan 3, 2023


In this post, you will learn how to employ a write design pattern and control partitions for database writes using the REPARTITION and COALESCE hints.


As you already know, partitions control how reads and writes are parallelized. We can alter the number of partitions using hints, and the syntax is a little different from what you have seen before.

In Spark, writing to a database involves considering several factors related to its distributed nature. One key concept to understand is partitioning, which involves dividing your data set into smaller pieces that can be distributed across a cluster for more efficient processing. Another important factor is the use of slots, also known as cores, which represent resources available for parallel computation. Optimizing these variables can help improve performance when working with databases in Spark.

As a general rule, the number of partitions should be a multiple of the number of cores. For instance, with 5 partitions and 8 slots, 3 of the slots sit idle. With 9 partitions and 8 slots, a job takes roughly twice as long as it would with 8 partitions, because everything waits for the ninth partition to finish on its own.
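As a quick check (a minimal sketch, assuming a Databricks notebook where spark is the active session), you can see how many slots your cluster exposes; partition counts are checked later in this post with getNumPartitions():

%python
# Total number of slots (cores) available for parallel tasks across the cluster
print(spark.sparkContext.defaultParallelism)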

Import the dataset.

CREATE OR REPLACE TEMPORARY VIEW fireCallsCSV
USING CSV
OPTIONS (
  path "your path goes here",
  header "true",
  sep ":"
)

You can also switch to the Python API to run the same command. You can do this by adding %python to the cell and wrapping your query in sql().

%python
df = sql("SELECT * FROM fireCallsCSV")

display(df)

This runs the query through the SQL engine and displays the resulting DataFrame, as expected:

To write the data out, you will use the .write method on the DataFrame.

You might also consider using overwrite mode to replace any data that has already been written to the target path.

There is also an append mode, which adds new data to an existing location instead of replacing it. Note that the output path below is prefixed with your username, so you won’t conflict with someone else using the same workspace.

%python
df.write.mode("OVERWRITE").csv(username + "/fire-calls.csv")
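For comparison, appending to the existing output instead of replacing it would look like this (a sketch only, not run in this walkthrough):

%python
# Add new rows to the existing directory rather than overwriting it
df.write.mode("append").csv(username + "/fire-calls.csv")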

Let’s also take a look at the file we just wrote:

%python
dbutils.fs.ls(username + "/fire-calls.csv")

You will notice that the output shows your username followed by fire-calls.csv, which is actually a directory of part files rather than a single file:

You can check how many partitions your data has by using the following:

%python
df.rdd.getNumPartitions()

Controlling Concurrency

You have two options to control concurrency:

  • SELECT /*+ COALESCE(n) */
  • SELECT /*+ REPARTITION(n) */

COALESCE is a narrow transformation: it reduces the number of partitions without a full shuffle, which means the resulting partitions may be unevenly sized.

REPARTITION is a wide transformation: it performs a full shuffle, which lets it increase the number of partitions and distributes the data evenly across them.
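Both hints have DataFrame API counterparts. A minimal sketch, reusing the df created earlier:

%python
# Narrow: collapse down to a single partition without a full shuffle
one_partition = df.coalesce(1)

# Wide: shuffle the data evenly across 8 partitions
eight_partitions = df.repartition(8)

print(one_partition.rdd.getNumPartitions(), eight_partitions.rdd.getNumPartitions())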

Here is how we can implement this in SQL:

CREATE OR REPLACE TEMPORARY VIEW fireCallsCSV1p
AS
SELECT /*+ COALESCE(1) */ *
FROM fireCallsCSV

Let’s check the number of partitions we now have:

%python
sql("SELECT * FROM fireCallsCSV1p").rdd.getNumPartitions()

You could also increase the number of partitions. The next cell queries a view named fireCallsCSV8p, which you would create with the REPARTITION hint, as sketched below.
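A minimal sketch of that view definition (assuming the same fireCallsCSV source view):

CREATE OR REPLACE TEMPORARY VIEW fireCallsCSV8p
AS
SELECT /*+ REPARTITION(8) */ *
FROM fireCallsCSV

You can then check its partition count: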

%python
sql("SELECT * FROM fireCallsCSV8p").rdd.getNumPartitions()

You can now save the result:

%python
sql("SELECT * FROM fireCallsCSV8p").write.mode("OVERWRITE").csv(username + "/fire-calls-repartitioned.csv")

Summary

  • You can alter the number of partitions using SELECT /*+ COALESCE(n) */ or SELECT /*+ REPARTITION(n) */.
  • COALESCE in a hint is not the same as the SQL COALESCE() function, which returns the first non-null value among its arguments.
  • Files like CSV and Parquet might look like single files, but Spark saves them as a directory of part files.

Do let me know what you think of this post. I am still a learner myself and I would love to hear your thoughts. You are more than welcome to message me on LinkedIn or Twitter.
