Spark Wrap-Up

Introduction

This week you will start working on your final project. At the same time we will look at the Spark SQL interface (SQL just pops up everywhere, huh?). This is a library that lets you use Spark in a fashion similar to SQL: it automatically converts SQL-like queries into the appropriate maps, reduces, collects, and so forth in Spark. For tasks that SQL is well suited for, this is a great option. For tasks that are too complex for SQL, you might need to fall back on writing your own functions for maps and folds.
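To make that concrete, here is a minimal sketch (the data, names, and app name are made up for illustration) showing the same aggregation written once as a Spark SQL query and once as hand-written RDD operations:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-vs-rdd").getOrCreate()

    # Hypothetical sales data: (product, amount)
    df = spark.createDataFrame(
        [("apples", 3), ("pears", 5), ("apples", 2)],
        ["product", "amount"],
    )

    # SQL-style: register a temporary view and query it like a table.
    df.createOrReplaceTempView("sales")
    spark.sql(
        "SELECT product, SUM(amount) AS total FROM sales GROUP BY product"
    ).show()

    # The same aggregation written by hand against the underlying RDD:
    # map each row to a (key, value) pair, then fold the values per key.
    totals = df.rdd.map(lambda r: (r["product"], r["amount"])) \
                   .reduceByKey(lambda a, b: a + b)
    print(totals.collect())

Both versions produce the per-product totals; the SQL version just saves you from writing the map and reduce yourself.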

Key Questions

  • What is a Spark DataFrame?
  • In what situations would we want to use a DataFrame?
  • What are the limitations of a DataFrame?
  • What are the drawbacks of using DataFrames compared to working with RDDs directly?

Assignment Overview

None. I mean, it's Spark, with SQL. You already did the Spark material, and you covered SQL a while ago. Go work on your project instead. :)

Explore the Topics

Spark DataFrames
This exploration looks at DataFrames in Spark: sets of columnar data that can be treated much the same as tables in SQL.
Spark Rows, Views and Schemas
This exploration takes a quick look at rows and views. These are important but not terribly complex topics in the land of Spark. We also take a quick look at schemas, which are more complex but can be automatically generated. (A short sketch pulling these ideas together follows below.)
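Here is the sketch mentioned above; the names and data are invented for illustration. It builds Rows, lets Spark infer a schema, spells one out explicitly, and registers a view that SQL queries can refer to:

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("rows-views-schemas").getOrCreate()

    # A Row is essentially a named tuple of column values.
    people = [Row(name="Ada", age=36), Row(name="Grace", age=45)]

    # Spark can infer the schema from the Rows automatically...
    inferred = spark.createDataFrame(people)
    inferred.printSchema()

    # ...or you can spell it out when you want control over types and
    # nullability.
    schema = StructType([
        StructField("name", StringType(), False),
        StructField("age", IntegerType(), True),
    ])
    explicit = spark.createDataFrame([("Ada", 36), ("Grace", 45)], schema)

    # A view gives the DataFrame a table name that SQL can refer to.
    explicit.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()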

Review

This sums up our coverage of Spark. This week was lighter in order to give you time to get started on your final project. If you want to use DataFrames in the project you are welcome to, but you don't need to. They are just one more tool in the tool belt: they make SQL-style tasks a lot easier, but they don't help much when you need to perform a complex sequence of operations or change the number or kinds of attributes in your data.