This tutorial is about playing around with Spark using a Jupyter notebook. A Jupyter notebook lets you create documents that contain live code, equations, visualizations and narrative text.
We will use this notebook to become familiar with Spark and demonstrate some big data concepts.
To keep costs low, Spark will be installed on your local machine in a Docker container. Typically Spark is run in the cloud in a clustered environment like AWS EMR, which can be very expensive.
You can do a lot of big data operations in Spark that would be impractical with a traditional single-machine database. Spark can parallelize the data, which allows it to be spread over a cluster of machines. Once the data is parallelized you can perform actions on it that return a value, or transform it with map, filter, join and more.
Create a directory and create a csv file
First create a directory on your machine using
mkdir -p ~/Developer/pyspark-example and navigate to it using
cd ~/Developer/pyspark-example
Next create some data using Google Sheets. Navigate to https://www.google.com/sheets/about/ and choose Personal. Select a blank worksheet, enter some data like below and name the file.
Download the file as a CSV to the directory you just created. The code later in this tutorial reads it as test1.csv, so save it under that name.
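If you would rather not use Google Sheets, a small CSV can also be generated with Python's standard library. This is only a sketch: the column names and rows are made-up placeholders (the original sample data is not reproduced here) and write_sample_csv is a hypothetical helper. Only the file name test1.csv comes from the code later in this tutorial.

```python
import csv

# Placeholder data: these column names and rows are illustrative assumptions,
# not the data used in the original tutorial.
SAMPLE_ROWS = [
    {"name": "Alice", "age": 34, "city": "Boston"},
    {"name": "Bob", "age": 45, "city": "Denver"},
    {"name": "Carol", "age": 29, "city": "Austin"},
]


def write_sample_csv(path="test1.csv"):
    """Write SAMPLE_ROWS to `path` with a header row and return the row count."""
    fieldnames = list(SAMPLE_ROWS[0].keys())
    # newline="" is what the csv module documentation recommends for its files.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return len(SAMPLE_ROWS)
```

Calling write_sample_csv() from inside ~/Developer/pyspark-example creates test1.csv in that folder, with a header row followed by three data rows.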
Install Docker on your Mac
You need Docker Desktop for this tutorial. You can install it using the following instructions: https://docs.docker.com/docker-for-mac/install/
Is it running?
Run docker version to see if Docker is running properly. You should see both a client and a server listed with no errors.
Then start the following container to run Spark. It mounts the current folder we just set up into the container.
docker run -it --rm -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/pyspark-notebook
Once the container starts, a URL will be listed in the console that we can use to navigate to our Jupyter notebook. In my case I navigated to
Your notebook will look like the following with your local directory mapped to the work folder.
Open the work folder and find the CSV file we just created. The data in the folder will persist on your local computer as long as you continue to launch the container from this directory with
docker run -it --rm -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/pyspark-notebook
How cool is that!
We don’t have a notebook in our folder yet, so select
Python 3 under the New drop-down. The following will be displayed.
A notebook consists of cells where you can enter code, markdown or a heading. Enter the following code in the cell and press Run.
mytext = 'Hello World'
print(mytext)
If you get something like below you are all set!
Let's try some SQL
Now let's play around with the CSV file we created earlier. In the next cell in the notebook type in the following, and the contents of the CSV file will be displayed as shown
import pandas as pd
pd.read_csv('test1.csv')
Next we will use pyspark.sql to create a data frame, telling Spark that the first row of the file is a header. Entering
df will show the data frame's column names and types, and
df.show() will show the data in the table.
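from pyspark.sql import SparkSession
# Reuse the running session if there is one, otherwise start a new one
spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()
# Treat the first row of the CSV as column names
df = spark.read.option('header', 'true').csv('test1.csv')
df.show()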
RDDs and huge datasets
Big data processing power comes when you use an RDD, a Resilient Distributed Dataset. RDDs are immutable and fault tolerant. But the best part is that they can be spread across a cluster, so your data can be huge!
There are two types of operations that can be performed on an RDD.
- Action: applied to an RDD and returns a value. Examples: count and reduce.
- Transformation: returns a new RDD. Examples: filter and map.
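One way to picture the difference: in Spark, transformations are lazy and only describe work, while an action triggers the computation. The toy class below is a plain-Python sketch of that idea on a single machine. It is not Spark's implementation, and ToyRDD and its methods are hypothetical names that only mimic the shape of the RDD API.

```python
from functools import reduce as fold


class ToyRDD:
    """A single-machine stand-in for an RDD (illustration only, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable, e.g. a list or a range
        self._steps = steps    # recorded transformations, not yet executed

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, func):
        return ToyRDD(self._source, self._steps + (("map", func),))

    def filter(self, func):
        return ToyRDD(self._source, self._steps + (("filter", func),))

    def _iterate(self):
        # Chain the recorded steps as lazy built-in map/filter iterators.
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    # Actions: run the recorded steps and return a plain Python value.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())

    def reduce(self, func):
        return fold(func, self._iterate())


calls = []  # records every element the map function actually touches


def double(x):
    calls.append(x)
    return x * 2


doubled = ToyRDD([1, 2, 3, 4]).map(double)  # transformation: nothing runs
touched_before_action = len(calls)
result = doubled.filter(lambda x: x > 2).collect()  # action: work happens now

print(touched_before_action)  # 0
print(result)                 # [4, 6, 8]
```

The map function is not called at all until collect() runs, which is the behaviour to keep in mind when reading the real RDD examples that follow.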
Let’s create an RDD as follows
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
words = sc.parallelize(
    ["scala", "java", "hadoop", "spark", "akka",
     "spark vs hadoop", "pyspark", "pyspark and spark"]
)
and perform the
count() action on it. The results will be 8.
counts = words.count()
print(counts)
Next let’s perform a
collect() action on the words. Collect returns all the elements of the RDD to the driver program. This is commonly done after a filter or map transformation so you can print out the result.
coll = words.collect()
print(coll)
Next let’s perform a
filter() transformation as follows. Filter takes a function, here a lambda, and keeps only the elements for which it returns true. If you want to learn more about lambda functions in Python check out https://www.w3schools.com/python/python_lambda.asp
words_filter = words.filter(lambda x: 'spark' in x)
filtered = words_filter.collect()
print(filtered)
Next let’s perform a map() transformation which applies a function to every element.
words_map = words.map(lambda x: (x, 1))
mapping = words_map.collect()
print(mapping)
and finally the
reduce action, which combines the elements two at a time with a function until a single value is left. So in the case of the
add operator it adds all the values together.
from operator import add
nums = sc.parallelize([1, 2, 3, 4, 5])
adding = nums.reduce(add)
print(adding)
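To make the combining step concrete, here is a plain-Python sketch (no Spark required) of how a reduce with add can work partition by partition. The two-way split is an assumption for illustration only; Spark decides the real partitioning.

```python
from functools import reduce
from operator import add

nums = [1, 2, 3, 4, 5]

# Assumed split into two partitions, purely for illustration.
partitions = [nums[:2], nums[2:]]

# Step 1: reduce each partition on its own (on a cluster, in parallel).
partials = [reduce(add, part) for part in partitions]

# Step 2: reduce the per-partition results into one value.
total = reduce(add, partials)

print(partials)  # [3, 12]
print(total)     # 15
```

Because partial results may be combined in any grouping, Spark's documentation asks that the function passed to reduce be commutative and associative, which add is.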
A big data example with 1 billion numbers :).
big_list = range(1000000000)
rdd = sc.parallelize(big_list, 2)
odds = rdd.filter(lambda x: x % 2 != 0)
evens = rdd.filter(lambda x: x % 2 == 0)
my_odds = odds.take(5)
my_evens = evens.take(5)
print("Odds")
for p in my_odds:
    print(p)
print("Even")
for p in my_evens:
    print(p)
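Part of the reason an example like this can return quickly is that take(5) needs only the first few matching elements rather than the whole data set. A rough plain-Python analogy uses itertools.islice over a lazy range. It mimics the behaviour on one machine and is not how Spark schedules the work.

```python
from itertools import islice

big_range = range(1_000_000_000)  # lazy: no billion-element list is built

# filter() is lazy too; islice stops pulling after five matches.
my_odds = list(islice(filter(lambda x: x % 2 != 0, big_range), 5))
my_evens = list(islice(filter(lambda x: x % 2 == 0, big_range), 5))

print("Odds", my_odds)   # Odds [1, 3, 5, 7, 9]
print("Even", my_evens)  # Even [0, 2, 4, 6, 8]
```

Replacing the islice calls with a full list(...) over the filter would walk all one billion numbers, which is the kind of work collect() would ask for.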
The power continues with dataframes and datasets
The RDD has been the primary user-facing API in Spark since its inception, but there are now dataframes and datasets as well. A discussion of these will continue in future blogs.
There are other notebooks. Zeppelin is another you might want to check out. My next blog repeats what was done here in a Zeppelin notebook. Check that out here.
These are very simple examples, but imagine if you had terabytes of data. These operations would be impractical in a typical single-machine database. Have fun with Spark.