We are going to grab some data from ADSBExchange. This is a website where people share aircraft location data. We are going to get the data for January 15th, 2018. In the end we should have a set of JSON files in a folder that we can import into Cloud Dataprep.
Let’s consider the steps we are going to need to go through here. It is often easier to work backwards through these.
We will need a couple of command line tools:

- unzip, if it is not already installed
- gsutil, if it isn’t installed, to access our Google Cloud Storage bucket

That pretty much lays out the flow. The things I learned when I went through this were: there is no way to unzip a file in Google Cloud Storage or Dataprep, so the files need to be uploaded individually; they need to be uploaded to a folder, because Dataprep can’t easily select all the files in the root directory of a bucket; and it is way easier to set up the Compute Engine permissions in advance.

unzip will decompress zip files; it usually isn’t installed by default. wget url will download the contents of the url, which is pretty straightforward. You will also want the cd command to change directory, the ls command to list the contents of a directory, cp to copy things, mv to move things, rm to remove things, and mkdir to make directories. That is pretty much all we will need for this journey.

This intentionally does not cover all the steps. It should get you through most of the stuff that is really unfamiliar, and it should show you all the tools you need to get the job done, but it does not show how to actually solve the problem.
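As a quick sketch of those basic file commands in action (the directory and file names here are just placeholders):

```shell
# make a working directory and move into it
mkdir work
cd work

# create a file, copy it, then rename the copy
echo "hello" > notes.txt
cp notes.txt backup.txt
mv backup.txt archive.txt

# list what we have
ls

# remove one file and step back out of the directory
rm archive.txt
cd ..
```

If you can read and predict what this little sequence does, you know enough of the shell for everything that follows.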
This is the first step: we need to spin up a Google Compute Engine instance, download the files we want, and decompress them.
The above command line tools will be useful here.
This is where we want the data to end up; once it is here, we can pretty easily get access to it via Cloud Dataprep.
You have options for how to work with this: either via the command line or via the GUI.
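On the command line side, the upload boils down to a couple of gsutil calls. The bucket name here is a placeholder — use your own:

```shell
# placeholder bucket name -- substitute your own
BUCKET=gs://my-adsb-bucket

# create the bucket if it does not exist yet
gsutil mb "$BUCKET"

# copy the JSON files into a folder (not the bucket root),
# since Dataprep can't easily select everything in the root directory;
# -m uploads in parallel, which helps with many small files
gsutil -m cp json/*.json "$BUCKET/json/"

# check that the files arrived
gsutil ls "$BUCKET/json/"
```

The -m flag matters here because we are uploading the files individually, and there are a lot of them.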
We won’t be doing extensive work with BigQuery; its query language is pretty similar to standard SQL, so there is not a whole lot to learn. But we will be storing quite a bit of data here, so learning how to export jobs from Dataprep to BigQuery is an important tool in your toolbelt.
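Once a Dataprep job has written its output to BigQuery, you can sanity-check it from the command line with the bq tool. The dataset and table names below are hypothetical — match them to whatever your Dataprep job produced:

```shell
# hypothetical dataset.table -- match it to your Dataprep output
QUERY='SELECT COUNT(*) AS positions FROM adsb.positions_20180115'

# run it with standard (non-legacy) SQL
bq query --use_legacy_sql=false "$QUERY"
```

A row count is a cheap first check that the export actually landed the data you expected.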
This should do a fair bit to get you through the process of uploading data to the cloud. It is purposefully a little vague. I expect that you will run into roadblocks, and you should spend some time trying to solve them yourself — that is a great deal of what you will be doing in the future when working with data — but don’t spend too long. At some point the right answer is to ask for help, and you should not be afraid of doing that.