Spark Specifics

Introduction

Last week we learned a bit about maps and reduces and used Spark to solve a problem or two. But we kind of left it as a black box: run some commands, stuff gets changed around a bit, run the right commands, and you get an answer that seems reasonable, faster than you could have gotten it without Spark. This week we are not going to go into much more detail in terms of problem solving; maps and reduces, and permutations thereof, are really where it is at. Instead we are going to look at reading content and working with partitions, the way data is grouped and divided in Spark.

Key Questions

  • What is the difference between data on disk and data loaded into Spark?
  • What is the significance of partitions?
  • How many partitions do we typically want?
  • How can we see if our partitions are set up well? What does that even mean? (There is a short sketch after this list.)
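
As a first taste of that last question, here is a minimal sketch of peeking at partitions from PySpark. It assumes a local SparkSession; the names and numbers are illustrative, not part of any assignment.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .appName("partition-peek")
         .getOrCreate())
sc = spark.sparkContext

# parallelize() splits the data across partitions as it loads it.
numbers = sc.parallelize(range(1000))

# How many partitions did we end up with?
print(numbers.getNumPartitions())          # 4 with local[4]

# glom() turns each partition into a list, so mapping len() over it
# shows how evenly records are spread across partitions.
print(numbers.glom().map(len).collect())   # e.g. [250, 250, 250, 250]

spark.stop()
```

If glom() shows one partition holding most of the records, the work piles up on one executor while the others sit idle; that kind of skew is exactly what we want to spot early.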

Assignment Overview

This week you can keep working on the assignment from last week if you were struggling with it. In addition, there may be a few partition exercises to work on. Also, start thinking about a final project.

Explore the Topics

Web Interfaces
This is a quick overview of connecting to Spark's web interface, which will be helpful when we inspect jobs and partitions later.
Spark Partitions
Partitioning data is generally a very important concept, and in Spark it is especially so. This exploration will go over partitions in Spark: what they are and how we can use them.
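
To make that concrete, here is a hedged sketch of the two basic knobs for changing a partition count, repartition() and coalesce(). Again this assumes a local SparkSession; the data and counts are just for illustration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .appName("partition-knobs")
         .getOrCreate())
sc = spark.sparkContext

# Ask for 10 partitions up front while creating the RDD.
rdd = sc.parallelize(range(100), numSlices=10)
print(rdd.getNumPartitions())       # 10

# repartition() does a full shuffle; it can raise or lower the count.
wider = rdd.repartition(20)
print(wider.getNumPartitions())     # 20

# coalesce() merges existing partitions without a shuffle, so it can
# only lower the count; it is the cheaper way to shrink.
narrower = wider.coalesce(5)
print(narrower.getNumPartitions())  # 5

spark.stop()
```

Getting the count right at load time, as the numSlices argument does, is usually cheaper than shuffling the data into shape afterwards.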
Spark IO
Last module we did some hand-wavy work loading data from BigQuery. It was pretty slick: we got data out of it without much effort, but that also required us to first load the data into BigQuery. Let's look at other ways to get data into Spark.
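
For example, Spark can read files directly, no database required. The sketch below uses hypothetical local paths; swap in whatever files (or HDFS / cloud storage URIs) you actually have.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-io").getOrCreate()

# Plain text: one record per line, loaded straight into an RDD.
lines = spark.sparkContext.textFile("data/logs.txt")
print(lines.count())

# Structured data: the DataFrame reader can infer a schema from CSV.
df = spark.read.csv("data/records.csv", header=True, inferSchema=True)
df.printSchema()

# Formats like Parquet carry their schema with them, so no inference
# is needed:
# df = spark.read.parquet("data/records.parquet")

spark.stop()
```

Note that each input comes with its own partitioning behavior. For instance, textFile() gives you roughly one partition per file block, which ties this exploration back to the partitions one.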

Additional Resources

Setting Up Spark UI
Here is the documentation for connecting to your instances through an SSH tunnel. If you were handling sensitive data you would want to take this approach, as it is more secure. For our purposes it really isn't needed, but it is a valuable tool to have in your belt.
Spark Documentation
The Spark documentation; as documentation goes, it is pretty good.

Review

This module should have gotten you up to speed on the inner workings of Spark. As we move forward we will see that there are abstractions that make Spark easier to work with for day-to-day tasks, but in the end they are all limited by the topics we have covered in this module.