Spark DataFrames

Kasim Ali
2 min read · Dec 1, 2022

In this post, we explore what a Spark DataFrame is. After reading it, you will be able to explain the differences between RDDs and DataFrames in Spark.

You can find the video that these notes are based on here.

When Spark was first introduced, it only had the RDD API. The DataFrame API followed in Spark 1.3 (2015) and brought additional optimisations and features. The Dataset API arrived shortly afterwards in Spark 1.6 (2016); it is available in Java and Scala only. The DataFrame is the most popular API and the one that Data Scientists like myself use most often. However, let us go over some of the properties of RDDs before we delve deeper into the DataFrame API.

Properties of RDD.

Resilient:

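  • An RDD is resilient, which means it is fault-tolerant.
  • Spark records how each RDD was built as a DAG (Directed Acyclic Graph) of transformations, so a lost partition can be recomputed from its lineage.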

If we think back to our M&M counting example, let's say you invited 10 friends. If one friend leaves for lunch, we do not want the other nine friends to recount their piles just to make up for the tenth friend's uncounted pile. Instead, only the pile of the friend who left for lunch is recounted.
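
The idea can be sketched in plain Python. This is a toy model and not how Spark is implemented: the piles, the `count_pile` function and the choice of which friend leaves are all made up for illustration. It only shows that when we know which input and which function produced each result, a missing result can be rebuilt on its own.

```python
from collections import Counter

# Ten piles of M&Ms, one per friend (toy data, not from the original example).
piles = [["red", "blue", "green"] * (i + 1) for i in range(10)]


def count_pile(pile):
    """The work each friend does on their own pile."""
    return Counter(pile)


# Every friend counts their pile, except friend 9, who left for lunch,
# so that one result is missing.
results = {i: count_pile(pile) for i, pile in enumerate(piles) if i != 9}

# Recovery: we know which pile and which function produce each result
# (the "lineage"), so we recompute only the result that is missing.
recomputed = []
for i, pile in enumerate(piles):
    if i not in results:
        results[i] = count_pile(pile)
        recomputed.append(i)

# Combine the per-friend counts into one overall count.
total = sum(results.values(), Counter())

print(recomputed)    # [9]
print(total["red"])  # 55
```

The nine existing results are never touched; only pile 9 is counted again.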

Distributed:

  • Computed across multiple nodes.

Dataset:

  • Collection of partitioned data.

DataFrame:

  • DataFrames inherit all RDD properties and add metadata: a schema of named, typed columns.
  • Highly optimised.
  • Always use DataFrames where possible.
  • Spark SQL commands execute against DataFrames.

Let's quickly go over some things that DataFrames are not.

Spark is not a database:

  • It is a compute engine that can read from databases. The data it holds is ephemeral, meaning it lasts only a short time unless you write it out to storage.
  • DataFrames are not SQL tables, Excel files, etc.

If you look at the image above, you can see that DataFrames far outperform RDDs; the RDD processes are much slower.

DataFrame queries are planned by Spark's query optimiser, called Catalyst, which works out an efficient way to execute them. This is possible because the DataFrame API specifies what you want to be done, and not how you want it to be done, so Catalyst can optimise the ‘how’ for you. This style of API is what we call ‘declarative’.

Do let me know what you think of this post. I am still a learner myself and I would love to hear your thoughts. You are more than welcome to message me on LinkedIn or Twitter.
