Data Storage Walkthrough

Introduction

We are going to grab some data from ADSBExchange, a website where people share aircraft location data. We are going to get the data for January 15th, 2018. In the end we should have a set of JSON files in a folder that we can import into Cloud Dataprep.

Planning!

Let’s consider the steps we are going to need to go through here. It is often easier to work backwards through a problem like this.

  • Upload files Dataprep can deal with to Dataprep
    • It turns out it can deal with JSON files, so that’s good
  • Get a collection of JSON files in a place Dataprep can deal with them
    • We can’t easily select 1,440 files manually to upload via the GUI
    • Even if we could, let’s pretend we want to do this with many days’ worth of data, so let’s figure out a more scalable solution
    • Dataprep can easily access files in Google Cloud Storage, so let’s get a folder of JSON files there
      • I know we want a folder because you can’t easily select multiple files, but you can select an entire directory in Dataprep
  • We need a bucket of JSON files for the day in Google Cloud Storage
  • We need to upload more than 1,000 JSON files to Google Cloud Storage
    • Our home computer isn’t going to have the bandwidth to do this, so let’s figure out another option
  • We will need to download the zip and unzip it.
  • This isn’t a computationally difficult operation, so we can use Google Compute Engine to do it!
  • We need to provision a Google Compute Engine instance and get the files there
    • It is a zip file, so we will need to install unzip if it is not already installed
    • We will also need to install gsutil, if it isn’t already there, to access our Google Cloud Storage bucket

That pretty much lays out the flow. The things I learned when I went through this: there is no way to unzip a file in Google Cloud Storage or Dataprep, so files need to be uploaded individually; they need to be uploaded to a folder, because Dataprep can’t easily select all the files in the root directory of a bucket; and it is way easier to set up the Compute Engine permissions in advance.
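As a rough sketch, if you create the instance with the gcloud command line tool, you can grant it read/write access to Cloud Storage at creation time; the instance name and zone below are just placeholders:

    gcloud compute instances create adsb-worker \
        --zone=us-central1-a \
        --scopes=storage-rw

The same access scope can also be set in the GUI when you create the instance.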

Extra Tools

unzip
This is the Unix tool for dealing with zip files. It usually isn’t installed by default.
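For example, on a Debian-based instance (the Compute Engine default) you can install it with apt-get and unpack an archive into a folder; the file and folder names here are just examples:

    sudo apt-get install unzip
    unzip 2018-01-15.zip -d json/
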
wget
This is the Unix tool for downloading files. wget url will download the contents of the url. It is pretty straightforward.
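For example, to pull down a zip archive (this URL is just a placeholder, not the real ADSBExchange address):

    wget https://example.com/2018-01-15.zip
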
Unix command line
If you haven’t used it before, there are quite a few good tutorials out there. Basically you will need the cd command to change directories, the ls command to list the contents of a directory, cp to copy things, mv to move things, rm to remove things, and mkdir to make directories. That is pretty much all we will need for this journey.
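As a quick illustration, here is what a session with those commands might look like, with made-up file and folder names:

    mkdir json               # make a directory
    cd json                  # move into it
    ls                       # list what is in it
    cp a.json backup.json    # copy a file
    mv backup.json old.json  # rename (move) a file
    rm old.json              # delete a file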

The Walk Through of Tools

This intentionally does not cover all the steps. It should get you through most of the stuff that is really unfamiliar and it should show you all the tools you need to get the job done. But it does not show how to actually solve the problem.

Google Compute Engine

This is the first step: we need to spin up a Google Compute Engine instance, download the files we want, and decompress them.

The above command line tools will be useful here.
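As a rough sketch, once the instance is running you can SSH in with gcloud and run the download and unzip steps; the instance name, zone, and URL are placeholders for whatever you actually use:

    gcloud compute ssh adsb-worker --zone=us-central1-a
    # then, on the instance:
    sudo apt-get install unzip
    wget https://example.com/2018-01-15.zip
    unzip 2018-01-15.zip -d json/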

Google Cloud Storage

This is where we want the data to end up; once it is here, we can pretty easily get access to it via Cloud Dataprep.

You have options for how to work with it: either via the gsutil command line tool or via the GUI.
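As a sketch, copying the unzipped files from the Compute Engine instance into a folder of a bucket with gsutil looks something like this (the bucket name is made up):

    gsutil mb gs://my-adsb-data                               # create the bucket once
    gsutil -m cp json/*.json gs://my-adsb-data/2018-01-15/

The -m flag uploads files in parallel, which matters when you have 1,440 of them.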

Google BigQuery

We won’t be doing extensive work with BigQuery. Its query language is pretty much just SQL, so there is not a whole lot new to learn, but we will be storing quite a bit of data here, so learning how to export job results from Dataprep to BigQuery is an important tool in your toolbelt.
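Once a Dataprep job has published a table, a quick sanity check from the bq command line tool might look like this; the dataset and table names are hypothetical:

    bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM adsb.positions_20180115'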

Review

This should do a fair bit to get you through the process of uploading data to the cloud. It is purposefully a little vague. I expect that you will run into roadblocks, and you should spend some time trying to solve them yourself; that is a great deal of what you will be doing when working with data in the future. But don’t spend too long: at some point the right answer is to ask for help, and you should not be afraid of doing that.