Why Distributed Computing?

Kasim Ali
Dec 1, 2022


In this post, we explore why you would want to use distributed computing.

What is big data?

Businesses increasingly see the value of data, and the amount of data is growing rapidly: by one widely cited estimate, there will be 175 zettabytes of data worldwide by 2025.

The 5 V’s of big data:

  • Volume
  • Velocity
  • Variety
  • Veracity
  • Value

Big data can also describe data that is too large to fit on any single machine.

One of the benefits of Apache Spark is that it accepts a range of languages for running queries: data analysts can use SQL, data scientists can use R or Python, and software engineers can use Scala or Java. In short, Spark is an open-source engine for manipulating big data.
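For instance, the same aggregation can be expressed in SQL or through the DataFrame API. Here is a minimal PySpark sketch; the file name sales.csv and its columns (region, amount) are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Load a CSV into a DataFrame, letting Spark infer the column types.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# An analyst can query the data with SQL...
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

# ...while the DataFrame API expresses the same query in Python.
df.groupBy("region").sum("amount").show()

spark.stop()
```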

M&M Counting Example

If you were given a small portion of M&Ms to count, it would be very easy to count them one by one.

However, the difficulty arises when you are given thousands or millions of M&Ms. How would you count all of these?

You could invite some friends and have each of them count a small portion, just as you did. This is essentially how Apache Spark works with big data.

In this example, you are the ‘Driver’, delegating work to your friends, the ‘Executors’. As the driver, you do no counting yourself and simply coordinate the process; your friends, the executors, do the actual work.

[Figure: the relationship between the ‘Driver’ and its ‘Executors’ (also called ‘Workers’).]
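Here is a rough sketch of the same idea in PySpark; the colour list and partition count are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mm-count").getOrCreate()
sc = spark.sparkContext

# Pretend each element is one M&M. The driver splits this collection
# into partitions and hands one partition to each executor.
mms = sc.parallelize(["red", "blue", "green", "red", "yellow"] * 1_000_000,
                     numSlices=8)

# Each executor tallies its own partition; Spark then merges the
# partial tallies and returns the combined counts to the driver.
totals = mms.map(lambda colour: (colour, 1)).reduceByKey(lambda a, b: a + b)
print(totals.collect())

spark.stop()
```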

Amdahl’s Law: Linear Scalability

In computer science, the idea of distributing a computation across multiple resources is called parallelism, and Amdahl’s law describes its limits: the speedup you can gain from parallelising a task depends on how much of the task can actually be computed in parallel.
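Formally, if p is the fraction of a task that can be parallelised and n is the number of processors, the maximum speedup is Speedup(n) = 1 / ((1 − p) + p / n). As n grows, this approaches 1 / (1 − p), so a task that is 95% parallelisable can never run more than 20× faster, no matter how many machines you add.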

In the classic plot of Amdahl’s law, the X axis shows the number of processors and the Y axis the speedup. A task that is 95% parallelisable stops improving after roughly 2,000 processors, topping out at about a 20× speedup; a task that is only 50% parallelisable hits diminishing returns after just 2 or 4 processors and can never exceed 2×. Spark works well on big-data workloads because they are typically highly parallelisable, so it can scale close to linearly: each machine you add contributes a near-unit increase in performance.
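You can check these numbers with a few lines of plain Python:

```python
def speedup(p, n):
    """Amdahl's law: p is the parallel fraction, n the processor count."""
    return 1 / ((1 - p) + p / n)

for n in (2, 4, 16, 256, 2048):
    print(f"n={n:4d}   95% parallel: {speedup(0.95, n):5.2f}x"
          f"   50% parallel: {speedup(0.50, n):4.2f}x")
```

Running this shows the 95% case climbing towards 20× while the 50% case stalls just below 2× almost immediately.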

When to use Spark?

We use Spark when we need to scale out because the data is too large to process on a single machine, or when we need to speed up a job because we can benefit from a faster result.

Do let me know what you think of this post. I am still a learner myself and I would love to hear your thoughts. You are more than welcome to message me on LinkedIn or Twitter.
