Spark Terminology

Kasim Ali
Dec 26, 2022

In this post, we will discuss some important Spark Terminology that you will need to know so that you can better understand Spark Architecture.

Spark Architecture.

As you may already know, a Spark cluster is made up of one driver and one or more executors. One way to keep costs down is to run Spark in local mode, where the driver and the executor run on the same physical machine. In this case, the driver still tells the executor what to do, but the work is done on that same machine. This is the architecture that ‘Databricks Community Edition’ runs on.
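
To make local mode a little more concrete, here is a minimal Python sketch of how a local master setting maps to the number of worker threads on the machine. In Spark, `local` means one worker thread, `local[K]` means K threads, and `local[*]` means one thread per logical core. The helper name `local_mode_threads` is my own and is not part of Spark, and it ignores less common forms such as `local[K,F]`.

```python
import os
import re


def local_mode_threads(master):
    """Return how many worker threads a local-mode master string asks for.

    Hypothetical helper for illustration only; Spark parses this itself.
    """
    if master == "local":
        # Plain "local" runs everything with a single worker thread.
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local-mode master string: {master!r}")
    value = match.group(1)
    if value == "*":
        # "*" means one thread per logical core on this machine.
        return os.cpu_count() or 1
    return int(value)


print(local_mode_threads("local"))     # 1
print(local_mode_threads("local[4]"))  # 4
```

In PySpark, a string like this is what you would pass to `SparkSession.builder.master(...)` when creating a session.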

Spark operates on a cluster of machines. Each of these machines can have multiple cores, and each core can run multiple threads. The smallest unit of parallelism is called a slot: a core or thread that is available to run one task at a time.

Depending on the task that is running, Spark might decide not to use all of the available slots.
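
As a rough illustration of why some slots can sit idle, the sketch below counts the slots in a cluster and works out how many rounds of tasks are needed. It assumes one slot per core and one slot per task. The function names and the example cluster size are made up for illustration.

```python
import math


def total_slots(executors, cores_per_executor):
    # Assumption: every core on every executor is one slot.
    return executors * cores_per_executor


def schedule(num_tasks, slots):
    """Return (rounds needed, idle slots in the final round).

    Assumes num_tasks >= 1 and that each task occupies exactly one slot.
    """
    rounds = math.ceil(num_tasks / slots)
    tasks_in_last_round = num_tasks - (rounds - 1) * slots
    return rounds, slots - tasks_in_last_round


slots = total_slots(executors=4, cores_per_executor=4)
print(slots)                # 16
print(schedule(10, slots))  # (1, 6): one round, 6 slots never used
print(schedule(20, slots))  # (2, 12): the second round only needs 4 slots
```

With 16 slots and only 10 tasks, 6 slots have nothing to do, however powerful the cluster is.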

What is a partition?

A partition is a portion of a large distributed data set.

The number of partitions is determined by:

  • Size of data.
  • Underlying partitioning of data.
  • Cluster configurations.
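
Here is a rough Python sketch of how these three factors can interact when Spark reads files. It is loosely modeled on the way Spark sizes file splits and is a simplification, not Spark's exact algorithm. The defaults of 128 MiB and 4 MiB mirror the default values of `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`; the function names and the `default_parallelism` parameter are my own.

```python
import math

MIB = 1024 * 1024


def estimate_split_bytes(file_sizes, default_parallelism,
                         max_partition_bytes=128 * MIB,
                         open_cost_bytes=4 * MIB):
    """Pick a target split size from data size and cluster parallelism."""
    # Each file carries a fixed "open cost" on top of its size.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimate_splits(file_sizes, default_parallelism):
    """Upper-bound estimate: Spark may pack several small splits together."""
    split = estimate_split_bytes(file_sizes, default_parallelism)
    return sum(math.ceil(size / split) for size in file_sizes)


one_gib_file = [1024 * MIB]
print(estimate_splits(one_gib_file, default_parallelism=8))   # 8
print(estimate_splits(one_gib_file, default_parallelism=16))  # 16
```

The same 1 GiB file ends up with a different number of pieces depending on how much parallelism the cluster offers, which is the "cluster configurations" point from the list above.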

Pokémon Analogy.

Imagine you are trying to buy 100 Poké Balls. Which of the following do you choose?

  • Pick up all 100 in one trip?
  • Make 100 trips picking up 1 per trip?
  • Make 10 trips picking up 10 each time?

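This is how we balance computation and communication. Too few partitions leave slots with nothing to do, while too many add overhead for every small piece of work. It is also why not every slot is always used.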
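
To put some numbers on the analogy, here is a toy cost model in Python. The assumptions are mine, not Spark's: 10 helpers (think slots) can make trips at the same time, every trip has a fixed overhead of 5 units, and carrying each ball costs 1 unit.

```python
import math


def total_time(items, trips, workers, trip_overhead, per_item_cost):
    """Toy model: trips run in parallel rounds, one trip per worker per round."""
    items_per_trip = math.ceil(items / trips)
    rounds = math.ceil(trips / workers)
    return rounds * (trip_overhead + per_item_cost * items_per_trip)


for trips in (1, 10, 100):
    cost = total_time(100, trips, workers=10, trip_overhead=5, per_item_cost=1)
    print(trips, cost)
# 1 105   -> one huge trip, nine helpers stand idle
# 10 15   -> every helper makes exactly one trip
# 100 60  -> the fixed overhead is paid over and over
```

Under these made-up numbers, 10 trips of 10 wins, but the best answer shifts as the overhead and the number of helpers change.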

Do let me know what you think of this post. I am still a learner myself and I would love to hear your thoughts. You are more than welcome to message me on LinkedIn or Twitter.
