Non-Relational Databases and Spark

Introduction

This week introduces non-relational databases, mainly to make you aware that they exist. You do not need to worry much about actually using them; you just need to know the potential issues you can run into when working with them. We also get into the technical details of getting started with Apache Spark. Working with Spark will be a multi-week process, and this week is designed to get you up and running. It is going to be a hard week, but the assignment is iterative, so you can get feedback after your first attempt.

Key Questions

  • What is the difference between a relational and non-relational database?
  • What happens when a database isn’t consistent?
  • What happens when a database uses eventual consistency?
  • What are the map and reduce functions in Spark?
  • How can you work on groups in Spark? Why should you try to avoid doing so?
  • How do you run your Spark jobs on Google Dataproc?
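
As a rough illustration of the map, reduce, and grouping questions above, here is a minimal sketch in plain Python rather than Spark itself. The sample records and the helper names count_by_grouping and count_by_reducing are made up for illustration. In PySpark the corresponding RDD calls are map, reduce, groupByKey, and reduceByKey, but those run distributed across a cluster, which this sketch does not attempt to show.

```python
from functools import reduce
from collections import defaultdict

# Hypothetical sample data: (word, count) pairs standing in for a large data set.
records = [("spark", 1), ("cap", 1), ("spark", 1), ("nosql", 1), ("spark", 1)]

# map: apply a function to every element independently of the others.
lengths = list(map(lambda rec: len(rec[0]), records))

# reduce: combine elements pairwise with an associative function.
total_length = reduce(lambda a, b: a + b, lengths)


def count_by_grouping(pairs):
    """Grouping style: collect every value for a key, then aggregate."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # every value is kept until the end
    return {key: sum(values) for key, values in groups.items()}


def count_by_reducing(pairs):
    """Reducing style: keep only one running total per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


grouped = count_by_grouping(records)
reduced = count_by_reducing(records)
print(total_length, reduced, grouped == reduced)
```

Both helpers return the same counts, but the grouping version holds every value for a key before aggregating. In Spark, grouping generally means moving all of a key's values across the network to one place, which is one reason to prefer a per-key reduce when the aggregation allows it.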

Assignment Overview

You will use Spark to do some analysis on a large data set. Get as far as you can, then document how far you got and what problems you ran into. Next week you can pick up where you left off after getting some additional help and feedback.

Explore the Topics

CAP Examples
This exploration looks at a few different database systems in terms of the CAP theorem.
Brewer's CAP Theorem
This goes over the limitations of database engines at a fairly high level. There is a lot more nuance these days with database engines, but these tradeoffs still exist.
Hello Spark
This section goes over the basics of using Apache Spark on Google Dataproc. It is going to be the most technical thing you have had to do so far.

Additional Resources

Spark Docs
This is the official home of the Spark documentation. You will probably want to look through it. In particular, pay attention to the API docs for Python.
The BigQuery Spark Tutorial
This is what I modeled the BigQuery connection to Spark on. The configuration bits should simply be copied and the values changed where appropriate. The rest of the documentation in this area may be of some use.

Review

Hopefully you have made some progress. For many, I expect this is the hardest week in the term because there isn’t really a way to divide the work into pieces. Either you can do some things in Spark and make them run on Dataproc, or the job just fails, you get nothing, and it takes a while to iterate. So have patience and remember, the message boards are your friend. Please head there before you pull your hair out trying to figure out how to make it run.