Big Data Product Examples

Introduction

This section is designed to give you a little bit of depth on a few specific tools. This is not a comprehensive set of tools; instead it is a small sample representing a few kinds of tools you might encounter. We will be looking at Google BigQuery, a massively parallel data store; Google Cloud Dataproc, a managed provider of Apache Spark and Hadoop (essentially an easy way to get Spark and Hadoop instances up and running without having to manage the servers yourself); and Spark and Hadoop themselves, two frameworks for processing large data sets, Spark being primarily in-memory and Hadoop being primarily disk based.

Initial Data Processing

The details of initial data processing are a little beyond the scope of this class. As you get more experience you might start working with data sources directly, but connecting physical sensors, server logs, or other data sources up to data storage in the first place is often more in the realm of programmers.

These services do something similar to what you have seen with Google Cloud Dataprep: they take data and convert it into a usable form. The difference is that they tend to do it in real time rather than in batches, and using them requires knowledge of both the supplier and the consumer of the data.

Google Cloud Dataflow

This is one example of such a product. It allows you to take in batches or streams of data and run them through a pipeline that produces well-structured results. When you run a Google Cloud Dataprep job it actually gets executed using Google Cloud Dataflow, but the creation of that pipeline is automated for you.
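
To give a flavor of what such a pipeline looks like in code, here is a minimal sketch using the Apache Beam Python SDK, which is the programming model Dataflow runs. The bucket paths and the word-count logic are made up for illustration, not taken from a real project.

    # A minimal Apache Beam pipeline sketch (Dataflow runs Beam pipelines).
    # The input and output paths are placeholders, not real buckets.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read"  >> beam.io.ReadFromText("gs://example-bucket/raw/logs.txt")
            | "Clean" >> beam.Map(lambda line: line.strip().lower())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Write" >> beam.io.WriteToText("gs://example-bucket/clean/counts")
        )

Run locally this executes on your machine; pointed at the Dataflow runner, the same pipeline is distributed across Google's workers.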

Amazon Kinesis

This is another such product. It allows you to hook up to various data streams and collect data from them. For example, you could use Kinesis to see where incoming traffic to a website is coming from: it would communicate with the server, be notified whenever a visitor arrives from an external site, note that site, and produce it as output for analysis.
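
As a rough sketch of how a website might push one of those raw signals into a stream, here is what writing a single record to Kinesis could look like with the boto3 library. The stream name and the payload are hypothetical.

    # Sketch: pushing one raw "page visit" signal into a Kinesis stream.
    # The stream name and payload are placeholders for illustration.
    import json
    import boto3

    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName="site-traffic",                                # hypothetical stream
        Data=json.dumps({"referrer": "example.com"}).encode("utf-8"),  # the raw signal
        PartitionKey="example.com",                               # controls shard routing
    )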

Very little analysis is done at this stage, but raw signals (e.g. page visits) are turned into data that can be analyzed later. That is the idea behind these tools that do the first pass of data processing.

Managed Cloud Storage

The next sort of level in the cloud data system is storage. Once the data is initially processed it needs to be stored somewhere. Managed storage is a cloud offering where the company providing the storage handles the operating system, data replication and hardware for you. Typically they provide various interfaces to upload or access data but you have very little say in exactly how that data is stored. Sometimes they even keep that sort of information a trade secret.

There is a fair bit of variety in how data is stored, and the layers between data storage and data access can get a little blurry. But on the clearly unstructured, purely storage side of things you have products like Amazon S3 or Google Cloud Storage.

Google Cloud Storage

Google Cloud Storage gives you storage buckets. These generally determine what project is associated with the data and are used for organization and billing. At the bucket level you can decide whether you want fast (more expensive) or slow (cheaper) access. You can also do things like pick the region your data will be stored in.

The things you store are referred to as objects. These are rather different from the objects you learned about in your programming class. Essentially they are just files, but they have some associated metadata that other tools can use when doing data analysis.

Another important factor to consider is data mutability. Some data storage options allow for very quick updates of data; others, not so much. Google Cloud Storage is one of those 'not so much' options. In fact, it is about as restrictive as it gets: objects uploaded to Google Cloud Storage are immutable. Once an object is uploaded it cannot be modified; a new version of the file needs to be uploaded to replace the old version if you want to make a change. This might affect whether you split files up into many smaller files or keep them as a single large file.
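
For a sense of what working with buckets and objects looks like, here is a minimal sketch using the google-cloud-storage Python client. The bucket and file names are placeholders, and note that "updating" an object really just means uploading a replacement.

    # Sketch: uploading an object with the google-cloud-storage client.
    # The bucket and object names are placeholders.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("example-bucket")
    blob = bucket.blob("survey-data/responses.csv")
    blob.upload_from_filename("responses.csv")

    # Objects are immutable: to "change" one, you upload a new version
    # over the same name, which replaces the old object.
    blob.upload_from_filename("responses-corrected.csv")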

Amazon S3

Amazon's S3 is a similar product to Google's Cloud Storage. It allows you to store objects, you can pay for various levels of access frequency, and the objects are immutable. There are very few notable differences between the two offerings. Probably the main deciding factors in choosing between these sorts of products will be the price in the region where you want to store your data and what other tools you will be using in conjunction with them. If you are generally in the Amazon ecosystem you might find it better to stick with S3, and if you are generally using Google products, Cloud Storage may be the way to go.
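
The equivalent sketch with boto3 looks almost the same in spirit; again, the bucket and file names are placeholders.

    # Sketch: uploading an object to S3 with boto3 (names are placeholders).
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file("responses.csv", "example-bucket", "survey-data/responses.csv")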

Managed Structured Cloud Storage

The next topic to look at is storing structured data. This is probably where you will spend most of your time. The previous storage options are where you might stage data that you are uploading to convert to structured data, or where you might write streaming raw data as it comes in prior to processing it in a batch. These options range from the affordable Cloud SQL and Bigtable to the fairly pricey Cloud Spanner.

Amazon Aurora and Google Cloud SQL

These products are really similar to a traditional relational database (like the MySQL database you used earlier). They are affordable in terms of pricing; you would have no problem using them quite a bit on the student grant that these sites offer. However, they don't scale well to huge data sizes. Use these when you need a basic relational database and you don't want to host your own.
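
To show how little changes compared to a database you run yourself, here is a minimal sketch of querying a Cloud SQL instance (its MySQL flavor) with the mysql-connector-python package. The host, credentials, and table are all made up for illustration.

    # Sketch: Cloud SQL (MySQL flavor) behaves like any other MySQL server.
    # Host, credentials, and table names are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(
        host="203.0.113.10",      # the instance's IP address (placeholder)
        user="student",
        password="secret",
        database="classdb",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT name, score FROM results WHERE score > %s", (90,))
    for name, score in cursor:
        print(name, score)
    conn.close()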

Google BigQuery and Amazon Redshift

These are massively parallel SQL-ish data storage options. Basically you store tabular data that is laid out internally by column, and you can query that data very quickly. You can either be billed by the hour, which is good for businesses that are constantly querying the data, or you can be charged monthly for storage and then billed based on the number of bytes scanned by each query.

These databases typically do not have the same sort of robust relational enforcement that traditional relational databases do, but they can process queries much faster. So they won’t enforce foreign keys, but you can still do a join on related data.
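
As a sketch of what querying BigQuery from Python looks like, here is a small example using the google-cloud-bigquery client. It runs against one of BigQuery's public sample tables (assuming that dataset is still available); the aggregation itself is just for illustration.

    # Sketch: running a SQL query against BigQuery from Python.
    # On the on-demand plan you are billed by the bytes the query scans.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """
    for row in client.query(query).result():   # result() waits for the job to finish
        print(row.corpus, row.total_words)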

Other Options

There are a lot of other storage options for structured data but they are going to be more application specific. There are databases optimized for storing graphs (nodes and edges), others that are designed to work with Hadoop, some that offer incredibly quick read and write access. You can generally expect to pay more for the more application specific options, but sometimes that cost is well worth it.

In general, the way you will find yourself working with one of these databases is by joining a team that is already using it. Alternatively, you might find that a more standard database has limitations that are really getting in the way of your work, and after some research you decide one of these other databases is a better fit. In any case, don't go looking to use one of these more specialized tools unless you actually need it, mainly for cost reasons.

Data Analysis Tools

Here things get even murkier. The main data analysis tools we will be talking about are Spark and Hadoop, both made by Apache. They are not statistics packages in the sense of R or SAS, which have a bunch of built-in statistical power. Instead they are tools that let you map operations onto data, spread that work out over many worker computers, and then compile the results.

They also have some closely related products that do some magic in terms of distributing data across lots of nodes to make it faster to process. All of that said, you won't generally see Spark or Hadoop offered as cloud products directly. Instead you will find managed services that make it easy to run these jobs.
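
To make the "map work onto data, then compile the results" idea concrete, here is a minimal PySpark word-count sketch. The input path is a placeholder, and the cluster details are left to whatever environment runs it.

    # Sketch: the map -> shuffle -> reduce pattern in PySpark.
    # The input path is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    counts = (
        spark.sparkContext.textFile("hdfs:///data/logs/*.txt")  # distributed read
        .flatMap(lambda line: line.split())                     # map: line -> words
        .map(lambda word: (word, 1))                            # map: word -> (word, 1)
        .reduceByKey(lambda a, b: a + b)                        # reduce across workers
    )
    print(counts.take(10))                                       # pull a sample back to the driver
    spark.stop()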

Google Cloud Dataproc

This is a fully managed service for running the previously mentioned Spark and Hadoop. This tool lets you easily create Spark or Hadoop jobs, provision workers, and generally control the process of running these jobs. It uses a variety of other Google Cloud Platform tools along the way. As I said earlier, this can be a major factor in deciding what tools to use: for example, it will use a Google Cloud Storage bucket to store setup and configuration information, and it can read data from a variety of sources, like Bigtable and BigQuery.

Amazon EMR

This is Amazon’s counterpart to Dataproc. It coordinates a bunch of Amazon tools to let you easily spin up Hadoop or Spark jobs. It runs the job on Amazon EC2 instances and leverages some of their storage as well. It is pretty similar to Cloud Dataproc and again you should probably pick an option based on what the rest of your data stack looks like.

General Purpose Computing

Google Compute Engine

Basically this lets you run a full operating system on Google's servers. You can pick the OS from a variety of available images, many of which come preconfigured with common setups. Commonly you see them set up as web servers, with the database and web server software already installed. One common setup is LAMP (Linux, Apache, MySQL and PHP); another is Node.js and MongoDB.

If you need to do some general basic command line work on their servers, this is the tool to use.

Amazon EC2

It is almost the exact same thing as Compute Engine. The configuration is a little different and there are a lot more images available because it is a more popular service, but that is about it.

Review

Whew, that was quite the list. And this is maybe 10% of the services Amazon and Google provide. I only talk about these two here because they are the most popular providers; Amazon is probably the more popular of the two, but Google is a little more user friendly to get started with.

We will probably end up using most of the Google tools for the work that will be going on in the rest of the class. But if you want to take the plunge and use the Amazon services instead, that should be fine if you talk it over with your instructor first.