So far, the topics we have covered have been fairly general in nature. We talked about SQL, which works well on smaller databases, but those same engines don’t fare as well once you get into the terabyte or petabyte range. There are, however, some SQL-like databases that do quite well with larger data sets. Likewise, data cleaning and visualization need to happen for data sets of all sizes.
But we start to run into new challenges when the size of our data exceeds our computer’s main memory, or, even more problematic, when it exceeds our computer’s storage space and we need to host the data across several machines. This section will get us started looking at some of these issues.
This week we will be uploading data into Google BigQuery in preparation for working on it in upcoming weeks. To get the data there, you will need to use a variety of cloud-based tools.
This transition is always a little rough. It is like when you started using external libraries in Python: at first you only needed to know a few dozen keywords, then you introduced libraries, and suddenly there were hundreds of libraries to pick from, each with dozens of functions you might use. Now you are moving on from libraries to sets of external tools. The amount of material you have to work with will continue to grow at a rather non-linear rate for a while.
The Google products are some of the easier ones to use, so hopefully you can get some experience reading documentation in a more welcoming environment than you might find with other tool sets (ahem, I am looking at you, Amazon). And as always, please come to the forums with questions!