Odds and Ends at the End

Introduction

You made it to the end! About six months ago, simply opening a file and reading its first line might have been a challenge for many of you. Now you have probably opened several thousand files at once on several different computers, potentially spread across the United States, and run calculations on tens of gigabytes of data. At the end of the last class, the largest data set you had worked with was probably around 500 KB. That is roughly the difference between the mass of a paperclip and the mass of a person, so it is a pretty big jump!

In this module we are going to look at a few odds and ends related to big data. You will not need any of it to complete the remainder of your work for this class, but you might find it interesting or useful in the future.

Key Questions

  • How can R be used with Spark?
  • What are some current developments in the area of distributed computing and big data?

Assignment Overview

There is no additional assignment this week. Keep working on your final project.

Explore the Topics

SparkR
SparkR is an R package that provides an R front end to Apache Spark. This module covers the very basics of using it.
Local Parallelization
We have covered distributed computing extensively. This section looks at what you can do on a single machine to work with bigger data sets.
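
As a small taste of the local parallelization topic, here is a minimal sketch in R. It assumes only the parallel package that ships with R. The workload, the worker count, and the names chunk_mean, values, and chunks are hypothetical and chosen purely for illustration; they are not taken from the course material.

```r
library(parallel)  # bundled with R, no extra install needed

# Hypothetical workload: compute the mean of one chunk of data.
chunk_mean <- function(chunk) mean(chunk)

# Illustrative data: 1000 numbers split into 4 equal-sized chunks.
values <- as.numeric(1:1000)
chunks <- split(values, rep(1:4, each = 250))

# Assumption: 2 local worker processes, so the sketch runs on a small machine.
n_workers <- 2
cluster <- makeCluster(n_workers)

# Each chunk is sent to a worker process and processed independently.
chunk_means <- parLapply(cluster, chunks, chunk_mean)

# Always release the worker processes when finished.
stopCluster(cluster)

# Because the chunks are equal in size, averaging the chunk means
# gives the mean of the full data set.
overall_mean <- mean(unlist(chunk_means))
print(overall_mean)
```

The split, apply, and combine pattern is the same one used on a cluster; here the workers are simply processes on your own machine.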

Additional Resources

SparkR
This is the documentation for SparkR. There are other options for distributed R programming, but managed environments for them are not as readily available as they are for Spark.
gpuR
This is an R package specifically for GPU processing. It can greatly speed up certain kinds of computational tasks in which one operation needs to be applied many times across a large set of data.

Review

You made it through! This class tried to blend depth and breadth in the area of cloud computing and big data. We looked at a wide assortment of tools and went into depth on some of them. When you move on to doing this work in the professional or academic world, expect to go through much the same learning process as we did here, but with whatever tools your employer or department uses. You will never know all the tools, but hopefully you have developed habits that will help you learn new tools and debug problems when they arise. In addition, you should have a good idea of what sorts of problems these tools are meant to solve.