Add a Little Spark to Your Toolbelt

By | July 27, 2021

This tutorial is about playing around with Spark using a Jupyter notebook. Jupyter notebooks let you create documents that contain live code, equations, visualizations, and narrative text.

We will use this notebook to become familiar with Spark and demonstrate some big data concepts.

To keep costs low, Spark will be installed on your local machine in a Docker container. Typically Spark runs in the cloud in a clustered environment like AWS EMR, which can be very expensive.

Why spark?

You can do big data operations in Spark that would otherwise be impractical in a traditional database. Spark can parallelize data, allowing it to be spread over a cluster of machines. Once parallelized, you can perform actions on it that return a value, or transform it with map, filter, join, and more.

Create a directory and create a csv file

First, create a directory on your machine using mkdir ~/Developer/pyspark-example and navigate to it using cd ~/Developer/pyspark-example.

Next, create some data using Google Sheets. Navigate to and choose Personal. Select a blank worksheet, enter some data like the example below, and name the file test1.csv.

Download the file as a csv to the directory you just created. We will use this file later in this tutorial.

Install Docker on your Mac

You need Docker Desktop for this tutorial. You can install it using the following instructions.

Is it running?

Type docker version to see if Docker is running properly. You should see both the client and the server listed with no errors.

Then run the following container to start Spark. It will mount the current folder we just set up.

docker run -it --rm -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/pyspark-notebook

Once the container starts, a URL will be listed in the console that we can use to open our Jupyter notebook. In my case I navigated to

Your notebook will look like the following with your local directory mapped to the work folder.

Open the work folder and find the csv file we just created. The data in the folder will persist on your local computer as long as you continue to launch the container with docker run -it --rm -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/pyspark-notebook. How cool is that!

We don’t have a notebook in our folder yet, so select Python 3 under the New drop-down. The following will be displayed.

A notebook consists of cells where you can enter code, markdown, or a heading. Enter the following code in the cell and press Run.

mytext = 'Hello World'
mytext

If you see something like the output below, you are all set!

Let's try some SQL

Now let's play around with the csv file we created earlier. In the next cell of the notebook, type in the following and the contents of the csv file will be output as shown.

import pandas as pd
pd.read_csv('test1.csv')

Next we will use pyspark.sql to create a data frame with a header specified. Entering df will display the data frame's schema, and will show the data as a table.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()
df ='header', 'true').csv('test1.csv')
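If you want to see what the 'header' option actually changes without spinning up Spark, pandas (used in the previous cell) behaves analogously. This is a plain-Python sketch with made-up column names and values, not the contents of test1.csv:

```python
import io
import pandas as pd

# A small CSV in memory; the names and values here are invented
# purely for illustration.
csv_text = "name,age\nalice,34\nbob,29\n"

# With a header row (like .option('header', 'true') in Spark),
# the first line becomes the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

# Without one (Spark's default of header 'false'), every line is
# treated as data and integer column labels are assigned.
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(with_header.columns))   # ['name', 'age']
print(len(with_header))            # 2 data rows
print(len(without_header))         # 3 rows: the header line counted as data
```

The same trade-off applies in Spark: forget the header option and your column names end up as the first row of data.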

RDDs and huge datasets

Big data processing power comes when you use an RDD, a Resilient Distributed Dataset. RDDs are immutable and fault tolerant. But the best part is they run on clusters, so your data can be huge!

There are two types of operations that can be performed on an RDD.

  • Actions, which are applied to an RDD and return a value; examples are count and reduce.
  • Transformations, which return a new RDD; examples are filter and map.
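A key detail behind this split is that transformations are lazy: Spark does no work until an action asks for a result. As a rough analogy (this is plain Python, not Spark), the built-in map and filter behave the same way:

```python
# Python's map/filter return lazy iterators, much like Spark
# transformations build a plan without executing it.
words = ["spark", "hadoop", "pyspark"]

# "Transformations": build a lazy pipeline -- nothing has run yet.
pipeline = map(str.upper, filter(lambda w: "spark" in w, words))

# "Action": forcing the pipeline with list() finally does the work.
result = list(pipeline)
print(result)  # ['SPARK', 'PYSPARK']
```

In Spark the payoff is bigger: laziness lets the engine optimize the whole chain of transformations before touching any data.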

Let’s create one as follows

from pyspark import SparkContext
sc = SparkContext.getOrCreate()
words = sc.parallelize([
   "spark vs hadoop",
   "pyspark and spark"
])

and perform the count() action on it. The result will be 2.

counts = words.count()

Next let’s perform a collect() action on the words. collect returns all the elements of the RDD. This is commonly done after a filter or map transformation so you can print out the result.

coll = words.collect()

Next let’s perform a filter() transformation as follows. filter takes a lambda function. If you want to learn more about lambda functions in Python check out

words_filter = words.filter(lambda x: 'spark' in x)
filtered = words_filter.collect()
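Since the filter above leans on a lambda, here is a quick plain-Python refresher: a lambda is just an anonymous function, equivalent to a one-line def. The names below are made up for illustration:

```python
# These two definitions are interchangeable.
def has_spark(x):
    return 'spark' in x

has_spark_lambda = lambda x: 'spark' in x

print(has_spark("pyspark and spark"))   # True
print(has_spark_lambda("hadoop"))       # False
```

Either form can be passed to filter(); the lambda just saves naming a throwaway function.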

Next let’s perform a map() transformation, which applies a function to every element.

words_map = x: (x, 1))
mapping = words_map.collect()

and finally the reduce() action applies a function pairwise across the elements, reducing them to a single value. In the case of the add operator, it adds all the values together.

from operator import add
nums = sc.parallelize([1, 2, 3, 4, 5])
adding = nums.reduce(add)
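Spark's reduce mirrors Python's own functools.reduce, so you can preview the folding behavior locally. This is a plain-Python sketch, not Spark:

```python
from functools import reduce
from operator import add

nums = [1, 2, 3, 4, 5]

# reduce folds the list pairwise: ((((1 + 2) + 3) + 4) + 5) = 15,
# which is what nums.reduce(add) computes on the RDD.
total = reduce(add, nums)
print(total)  # 15

# Any two-argument function works, e.g. keeping the larger value.
largest = reduce(lambda a, b: a if a > b else b, nums)
print(largest)  # 5
```

One caveat unique to Spark: because partitions are combined in no guaranteed order, the function you pass to an RDD's reduce should be commutative and associative (add and max both qualify).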

A big data example with 1 billion numbers :).

big_list = range(1000000000)
rdd = sc.parallelize(big_list, 2)
odds = rdd.filter(lambda x: x % 2 != 0)
evens = rdd.filter(lambda x: x % 2 == 0)
my_odds = odds.take(5)
my_evens = evens.take(5)
for p in my_odds:
    print(p)
for p in my_evens:
    print(p)
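What makes this fast is the second argument to parallelize: the data is split into 2 partitions that can be filtered independently, and take(5) stops as soon as 5 results are in hand. Here is a simplified plain-Python sketch of that idea (a toy model, not how Spark is actually implemented), using a small range so it runs instantly:

```python
# Toy model of parallelize(data, 2): split data into contiguous partitions.
data = range(20)
num_partitions = 2
chunk = len(data) // num_partitions
partitions = [list(data)[i * chunk:(i + 1) * chunk]
              for i in range(num_partitions)]

# Each partition is filtered independently, as Spark workers would do.
odd_parts = [[x for x in part if x % 2 != 0] for part in partitions]

# take(5): gather results and stop early once 5 are collected,
# without scanning the remaining data.
my_odds = []
for part in odd_parts:
    for x in part:
        my_odds.append(x)
        if len(my_odds) == 5:
            break
    if len(my_odds) == 5:
        break

print(my_odds)  # [1, 3, 5, 7, 9]
```

With 1 billion numbers the same early-exit logic is why take(5) returns quickly: Spark only computes as many partitions as it needs to satisfy the request.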

The power continues with dataframes and datasets

RDDs have been the primary user-facing API in Spark since its inception, but there are now DataFrames and Datasets as well. A discussion of these will continue in future blogs.


There are other notebooks; Zeppelin is one you might want to check out. My next blog repeats what was done here in a Zeppelin notebook. Check that out here.


These are very simple examples, but imagine if you had terabytes of data. These operations would be impractical in a typical database. Have fun with Spark.