Big Data Overview

Introduction

I am not aware of any specific origin for the term “Big Data”. I was able to find a talk at a conference from the late 90s that referenced Big Data growing at a rate of around 2x per year. That pace has fallen quite a bit with the recent emphasis on SSDs for consumer storage. These are Solid State Drives; they don’t have a spinning platter, so data can be read and written much faster, but they tend to be more expensive per unit of storage.

Around this same time, cloud storage options really started to take off. From services like Google Drive, which anyone with a Google account can access, to things like Amazon S3 storage buckets, which offer essentially unlimited storage for more involved applications, the options for cloud storage have expanded very rapidly.

So you might look at graphs of hard drive size over time and think that the rate of growth of computer storage has started to slow, but in reality storage is simply moving to data centers, which have rack upon rack of hard drives, both solid state and spinning platter, for storing data in the cloud.

With all of this growth in data storage options, the ways in which we can process large data sets have grown quite a bit too, so we are going to take a look at the problems we might run into and some of the tools that exist to deal with those problems.

Transferring Data

This can be a fairly significant bottleneck, and it may not be something you have thought about. Many people have reasonably fast broadband internet: you can watch HD Netflix without a problem, and you might be able to download a 30GB game in less than an hour. For most connections, though, these speeds are not symmetric; that is, you can download much faster, even an order of magnitude faster, than you can upload.

This can pose quite a problem if you want to work on big data from home as a project or small business. As an example, I was downloading data from a service that archives its data set daily. The data set is around 30GB. With my internet connection I was able to download it within an hour or so, but it took close to a full day to upload it to a separate server to do work on it. Were the data set much larger or my connection much slower, the service would be producing data faster than I could relay it somewhere else to work on it.
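
To get a feel for the asymmetry, here is a small back-of-the-envelope calculation in Python. The download and upload speeds are assumed, illustrative figures, not measurements from my connection.

    # Back-of-the-envelope transfer times for a 30GB data set.
    # The connection speeds are assumed, illustrative values, not measurements.
    data_set_gb = 30
    download_mbps = 100   # hypothetical download speed, in megabits per second
    upload_mbps = 5       # hypothetical upload speed (much slower, as is typical)

    data_set_megabits = data_set_gb * 8 * 1000           # GB -> megabits (decimal units)
    download_hours = data_set_megabits / download_mbps / 3600
    upload_hours = data_set_megabits / upload_mbps / 3600

    print(f"Download: about {download_hours:.1f} hours")   # ~0.7 hours
    print(f"Upload:   about {upload_hours:.1f} hours")     # ~13.3 hours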

There are several options to deal with this. The simplest is to buy faster internet. Another, probably better and more scalable option is to work remotely on a server. Google offers App Engine and Compute Engine, which let you run applications or work directly on machines running in Google's data centers. This cuts out the need to relay data through your location and gives you access to much faster connections.
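
As a rough sketch of what that looks like in practice, the snippet below streams a large archive directly onto the remote machine's disk, so the data never passes through a home connection. The URL and filename are hypothetical placeholders, not a real service.

    # Sketch: stream a large archive straight onto the remote server's disk.
    # The URL and filename are hypothetical placeholders.
    import requests

    ARCHIVE_URL = "https://example.com/archive/latest.tar.gz"

    with requests.get(ARCHIVE_URL, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open("latest.tar.gz", "wb") as out_file:
            # Write in 1 MB chunks so the whole file is never held in RAM.
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                out_file.write(chunk)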

And, on the most extreme end of things, there are options like the Google Transfer Appliance. You might think this is one of those goofy names where they make it sound like one thing when it is another, totally unrelated thing… but no, Google actually mails you a desktop-computer-sized piece of equipment that you plug into your server. You load it with data and then ship it back to Google. The appliances hold data in the range of hundreds of terabytes.

All of these options center around the idea that you have a single set of data that you want to work with. Another option, which we won't cover in much detail in this class, is working with a data stream. In this case you create an application that consumes data as it comes in, either in real time or in periodic jobs, rather than working on the entire set at once. An example might be monitoring securities or currencies in order to trade them automatically. It might not be feasible to run calculations over the entire price history every time, but it is possible to track current trends in real time and make quick decisions based on that analysis.
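
Here is a minimal sketch of that streaming idea. It keeps only a small window of recent prices and reacts as each new value arrives; the short list of prices is made up and stands in for a live feed.

    # Sketch of stream processing: consume values one at a time and keep a
    # rolling average instead of loading the full price history at once.
    from collections import deque

    def track_moving_average(price_feed, window_size=50):
        window = deque(maxlen=window_size)      # only the most recent prices are kept
        for price in price_feed:
            window.append(price)
            moving_average = sum(window) / len(window)
            # React in real time when the latest price runs ahead of the trend.
            if price > 1.02 * moving_average:
                print(f"Price {price:.2f} is more than 2% above the recent average")

    # A short made-up list stands in for a live feed of quotes.
    fake_feed = [100.0, 100.5, 101.2, 99.8, 104.1, 103.9]
    track_moving_average(fake_feed, window_size=3)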

Storing Data

The next piece of the puzzle we are going to discuss is the storage of data. The largest hard drives are in the range of 10 to 20TB, and many motherboards support somewhere in the neighborhood of 8 hard drives. So an expensive desktop computer is going to have a couple hundred terabytes of capacity. Keep in mind, however, that when dealing with large data sets redundancy becomes very important, so you would likely want data duplicated across hard drives at a minimum, cutting that capacity in half.

In terms of scale, a terabyte is roughly 1,600 hours of CD-quality audio. You could probably still manage to squeeze the electronic content of the Library of Congress into a home computer designed specifically for data storage, but it would be rather expensive and would leave no room for redundancy.
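
The arithmetic behind those figures is straightforward; the drive size and drive count below are assumed, representative values.

    # Rough storage arithmetic for the figures above (decimal units throughout).
    # The drive size and drive count are assumed, representative values.
    drive_tb = 16            # a large consumer hard drive, within the 10-20TB range
    drives = 8               # roughly what a high-end motherboard might support

    raw_tb = drive_tb * drives        # 128 TB of raw capacity
    mirrored_tb = raw_tb / 2          # duplicating every drive halves usable space

    # CD-quality audio: 44,100 samples/s * 2 bytes/sample * 2 channels
    cd_bytes_per_hour = 44_100 * 2 * 2 * 3600       # about 635 MB per hour
    hours_per_tb = 1e12 / cd_bytes_per_hour         # about 1,575 hours per terabyte

    print(raw_tb, mirrored_tb, round(hours_per_tb))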

As we move up to the petabyte scale, that is 1,000 terabytes, we move into the range of large servers. For example, the game World of Warcraft requires several petabytes of storage to maintain the state of the game, and the Large Hadron Collider creates around 15 petabytes of data a year.

At that scale, storing data in one machine becomes a physical impossibility; one simply cannot cram that many hard drives into a computer case. Instead the data gets distributed among several networked computers, and more complex file management systems are required. This is where things like Apache's Hadoop Distributed File System (HDFS) come into play: it is a file system that handles spreading files across multiple computers. It is designed to work with cheap hardware under the assumption that nodes will fail, and that is OK.
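
As a small example of what working with HDFS looks like, the sketch below shells out to the standard hdfs dfs command-line tool. It assumes a configured Hadoop installation, and the file and directory names are hypothetical.

    # Sketch: copy a local file into HDFS via the standard `hdfs dfs` CLI.
    # Assumes a configured Hadoop installation; the paths are hypothetical.
    import subprocess

    local_file = "measurements.csv"
    hdfs_dir = "/data/measurements/"

    # HDFS splits the file into blocks and replicates each block across
    # several machines behind the scenes.
    subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)

    # List the directory to confirm the file landed where we expect.
    subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)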

Then, moving up the chain, there are services offered by cloud providers like Google and Amazon, which use higher performance hardware and provide faster access and redundancy, but typically at a higher price. This generally captures the range of options one might encounter in terms of data storage.

Data Processing

This is probably the most interesting piece of the puzzle, or at least the most interesting for data scientists. When it comes to doing computations on large data sets we run into a variety of issues and bottlenecks. The first and probably most obvious issue is that there are simply a whole lot of calculations to do if you have a whole lot of data.

This is approached first by writing good, efficient code, and then by spreading work across multiple computers and doing the computations in parallel where possible. Some computations are really easy to do in parallel, like calculating a sum: a bunch of computers can each sum a different subset of the numbers, and then those partial sums can be added together at the end.
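
Here is a minimal sketch of that parallel sum using Python's multiprocessing module; the data and worker count are arbitrary choices for illustration.

    # Parallel sum sketch: each worker sums its own chunk of the numbers and the
    # partial sums are combined at the end.
    from multiprocessing import Pool

    def chunk_sum(chunk):
        # Each worker handles its slice of the data independently.
        return sum(chunk)

    if __name__ == "__main__":
        numbers = list(range(1_000_000))
        n_workers = 4
        chunk_size = len(numbers) // n_workers
        chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

        with Pool(n_workers) as pool:
            partial_sums = pool.map(chunk_sum, chunks)

        total = sum(partial_sums)         # combine the partial results
        print(total == sum(numbers))      # True: matches the serial answer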

Other operations are not so easy, such as a simulation of agents where the actions of one agent might affect the actions of another. Things like navigation or Newton's method cannot be calculated in parallel because each step is based on the previous step, so the steps cannot be worked on independently.
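
Newton's method makes that dependence explicit: each new estimate is computed from the previous one, so there is nothing to hand out to other workers. A short sketch, here approximating the square root of 2:

    # Newton's method for the square root of 2: every estimate is computed from
    # the previous one, so the iterations cannot be split across workers.
    def newton_sqrt2(steps=6):
        x = 1.0                              # initial guess
        for _ in range(steps):
            # f(x) = x^2 - 2, f'(x) = 2x  =>  x_new = x - f(x) / f'(x)
            x = x - (x * x - 2) / (2 * x)
        return x

    print(newton_sqrt2())   # ~1.414213..., converging on sqrt(2)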

So some problems we can spread across multiple workers, while others we cannot.

But that is not the limit of our problems. When computers do work they like to use RAM to do it. RAM is much faster to access than data on a hard drive, but even the largest servers typically have RAM measured only in hundreds of gigabytes, which is just a fraction of the size of large data sets. So we can divide solutions broadly into two categories: those that try to keep the working data in memory, and those that write intermediate results back to disk after each step. Apache Spark is an example of a tool that tries to keep data in memory, while Hadoop's MapReduce writes results back to disk between phases. The former is better for smaller data sets or processing of streaming data; the latter is better for very large, static data sets.
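
As a small taste of the in-memory style, here is a minimal Spark sketch in Python; it assumes a local Spark installation with the pyspark package, and the numbers are arbitrary.

    # Minimal Spark sketch (assumes a local Spark installation and the pyspark
    # package). Spark keeps the intermediate results of transformations in
    # memory where it can, rather than writing to disk between every phase.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("in-memory-example").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1_000_000), numSlices=8)   # distributed data set
    squares = numbers.map(lambda x: x * x)                    # transformation, kept in memory
    total = squares.reduce(lambda a, b: a + b)                # action, triggers the work

    print(total)
    spark.stop()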

Review

This should give you a good bit of background on the sorts of questions you will need to ask when figuring out how to work with a data set. You need to consider the logistics of getting your data to someplace you can work on it, where you will store it, and finally how you will actually process it.

This all requires quite a bit more planning than you would run into on a small home project. If you spend several days moving data around to get it all onto a server and then realize you need a server with more RAM to do the work, you are going to be in a fair bit of trouble. Likewise if you later find out you actually need to distribute the data across several computers for parallel processing.

You really need to devise a plan in advance, because mistakes can cost days of work, whereas the same sort of mistake on a small home project might quickly yield an error that can be debugged in minutes.