Introduction to Big Data

Introduction

So far, the topics we have covered have been fairly general. We talked about SQL, which works well on smaller databases, but those same database engines tend to struggle once you get up into the terabyte or petabyte range. There are, however, some SQL-like databases that do quite well with larger data sets. Likewise, data cleaning and visualization need to happen for data sets of every size.

But we start to run into new challenges when the size of our data begins to exceed our computer's main memory or, more problematic still, when it begins to exceed the computer's storage space and we need to host the data across several computers. This section will get us started looking at some of these issues.
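To make the memory problem a little more concrete, here is a minimal sketch of one common workaround: reading a file one row at a time instead of loading it all at once. The function name streaming_mean and the tiny in-memory sample are made up for illustration and are not part of this week's assignment; a real data set would be a file on disk or in cloud storage.

```python
import csv
import io


def streaming_mean(lines, column):
    """Compute the mean of a numeric column while holding one row in memory at a time."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    for row in reader:
        # Only the running total and the row count are kept between rows.
        total += float(row[column])
        count += 1
    return total / count if count else None


# Hypothetical stand-in for a file that is too large to load in one go.
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n")
result = streaming_mean(sample, "value")
print(result)
```

The same function accepts an open file handle in place of the sample, so memory use stays roughly constant no matter how many rows the file has. It does nothing for the second problem, data that will not fit on one machine, which is where the tools in this section come in.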

Key Questions

  • What is the definition of Big Data?
  • What additional concerns must we consider when working with large data sets?
  • What are some common tools available to data scientists to work with large data sets?
  • What are the major differences in dealing with static vs changing data sets?

Assignment Overview

This week we will be uploading data into Google BigQuery in preparation for working with it in upcoming weeks. To get the data there, you will need to use a variety of cloud-based tools.

Explore the Topics

Big Data Overview
This exploration discusses the idea of big data broadly, goes over a bit of its history, and identifies some of the problems it raises.
Big Data Product Examples
This module looks at some offerings in the area of big data and discusses which niches they fill and when you might want to use them.
Data Storage Walkthrough
This exploration will guide you through the process of getting a rather large data set into the Google Cloud Platform so that you are ready to do some work on it.

Additional Resources

Google Cloud Products
Most of what we are doing this week is standard usage of the various Google products, so their tutorials and documentation should generally cover the topics quite well.

Review

This transition is always a little rough. It is like when you started using external libraries in Python. To begin with, you only need to know a few dozen keywords. Then you introduce libraries, and suddenly there are hundreds of libraries to pick from, each with dozens of functions you might use. Now you are moving on from libraries to sets of external tools. The amount of stuff you have to work with will continue to grow at a rather non-linear rate for a while.

The Google products are some of the easier ones to use, so hopefully you can get some experience reading documentation in a slightly more welcoming environment than you might find with other tool sets (ahem, I am looking at you, Amazon). And, as always, please come to the forums with questions!