Class Overview

Introduction

This section will go over the overall course trajectory and help set up expectations. When this section is done you should have a good idea of what you can expect out of the class and what the class will expect out of you.

Course Layout

Learning Material

Every week there will be an overview of the upcoming content and a list of key questions. These are questions you should be able to answer by the end of the week; if you can’t, seek help on the message boards. There will also be an assignment overview, which gives you an idea of what you will be doing during the week. Next is a section for exploring the topics, which is where you will find the course material. Depending on the week, this might be the primary source of material (like this week), or it might serve more as a guide to the additional resources, which provide third-party documentation or information on the topics covered. Finally, there is a review, where you should reflect on what you learned over the week and think about how it fits into the larger picture.

Graded Coursework

All graded coursework is listed in the Canvas modules, so check there if you are unsure what work needs to be done for a grade. Pay extra attention to assignments that span multiple weeks: in Canvas they are listed in the week they are assigned, not the week they are due.

Incoming Expectations

Coming into this class you should be comfortable with basic tasks on the command line: opening files, navigating directories, and running a command with arguments supplied to it. If you don’t know how to do these things, you should learn to do them promptly.

You should also be proficient with basic Python programming. You should be able to write a class that has properties and methods, make instances of that class, and call its methods. You should also be able to work with a collection of instances and write functions that accept instances of classes as arguments.
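As a rough self-check of the level described above, you should be able to read and write something like the following sketch without much trouble. The `Student` class and `class_average` function are hypothetical names made up for illustration; they are not part of any assignment.

```python
class Student:
    """A hypothetical class used only to illustrate the expected skill level."""

    def __init__(self, name, grades):
        self.name = name
        self.grades = list(grades)

    @property
    def average(self):
        """Mean of this student's grades, or 0.0 if there are none."""
        if not self.grades:
            return 0.0
        return sum(self.grades) / len(self.grades)

    def add_grade(self, grade):
        """Record one more grade for this student."""
        self.grades.append(grade)


def class_average(students):
    """Accept a collection of Student instances and average their averages."""
    students = list(students)
    if not students:
        return 0.0
    return sum(s.average for s in students) / len(students)


# A collection of instances, a method call, and a function taking instances.
roster = [Student("Ada", [90, 100]), Student("Grace", [80])]
roster[1].add_grade(100)
print(class_average(roster))
```

If any part of this looks unfamiliar (the property, the list of instances, or passing instances to a function), that is probably the area to review before the course picks up speed.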

You should also know how to use third-party libraries, since many of them are used in this course. You should be comfortable using something like pip to install third-party packages and know how to import and use them in a Python program.

Proper Course Expectations

This course deals with content that will be out of date within months of when you learn the material. That is a fact of life in the landscape of big data, which is an exceptionally hot topic right now. So what does that mean for this course?

This course will not directly teach you how to use most of the tools. It will leverage tools which are currently in use to demonstrate common principles, but it is quite likely that between the terms in which this course is offered the best tools to do a job may have changed.

What this course will do is point you to the right material to learn a tool. This is almost always material provided directly by the creator of the tool. For example, the Google Cloud Platform that we will be using has a whole host of tutorials to help you learn its tools.

This course will also present to you a place to practice and to get help when the tools don’t work like you expect. This is where you will get the most value out of the course.

As you go through Google’s documentation, you will find that things don’t work like you expect, or you will find parts of the tutorials confusing or unclear. You have an environment of other people learning the same tools, for the same reasons, at the same time, and an instructor who is experienced in helping students get past these sorts of roadblocks.

To reiterate, the course will teach you the underlying concepts which are important in big data management. There are certain concepts that don’t really change when it comes to the way data is managed. There are certain algorithms that are unlikely to change much from year to year. The course will try to demonstrate these with current tools. But it will be up to you to learn how to use these current tools. And it will be up to you to actively seek help when you have trouble. The course is not going to walk you through the use of all of the tools.

Course Overview

Databases

At the heart of all of this is the concept of databases. These are structured stores of related data. The course will begin by talking about SQL, the query language used by traditional relational databases, which have been around for decades. You will almost certainly interact with a SQL database at some point if you are working with data. Beyond that, many other databases suited for different purposes use syntax similar to SQL or otherwise try to seem familiar by relating their features to SQL.
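To give a feel for what interacting with a SQL database looks like, here is a minimal sketch using the `sqlite3` module from the Python standard library. SQLite is used here only because it needs no setup; it is an assumption for illustration and not necessarily the database used in the course. The `students` table and its columns are made up.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Structure comes first: a table with named, typed columns.
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, credits INTEGER)"
)

# Insert a few hypothetical records using parameter placeholders.
conn.executemany(
    "INSERT INTO students (name, credits) VALUES (?, ?)",
    [("Ada", 12), ("Grace", 16), ("Linus", 8)],
)

# Ask a question of the data with a SELECT query.
rows = conn.execute(
    "SELECT name FROM students WHERE credits >= ? ORDER BY name", (12,)
).fetchall()
print(rows)

conn.close()
```

The pattern of defining a table, inserting rows, and selecting with a condition carries over to most relational databases, even though the connection details differ.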

Next we will talk about the constraints that exist when using SQL and look at why there has been a whole wave of databases that describe themselves as NoSQL, that is, basically not SQL. In general, these are the sorts of databases you might be looking at if you are dealing with hundreds of millions or billions of records.

Finally, we will talk about cleaning up data to get it into a database. With millions of records, manually checking the capitalization of names or validating addresses is not reasonable. We will look at tools and practices that help automate the task of getting data into a consistent format that can easily be put into a database.
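The capitalization example above can be sketched in a few lines of Python. The `clean_name` helper and the sample names are hypothetical, and the rule used (capitalize each word, collapse extra whitespace) is a deliberately simple assumption rather than a complete solution.

```python
def clean_name(raw):
    """Collapse extra whitespace and capitalize each word of a name."""
    return " ".join(part.capitalize() for part in raw.split())


# Hypothetical messy input, as it might arrive from a spreadsheet export.
raw_names = ["  aDA   lovelace", "GRACE HOPPER ", "linus torvalds"]

cleaned = [clean_name(name) for name in raw_names]
print(cleaned)
```

A rule this simple will mangle names such as "McDonald" or "van der Berg", which is part of why data cleaning gets its own unit: the easy cases automate well, and the exceptions need deliberate handling.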

Data Visualization

Next we will have a small unit on data visualization. There are hundreds of techniques and tools that can be used. We will look at a small subset of tools specific to Python that can be used to visualize data. Beware that this is one of those sections that relies pretty extensively on installing and using more complex third party libraries.

NoSQL Databases

This section will go into some depth on using non-relational databases. These are the alternatives to SQL that were mentioned in the databases section, and they are essentially the bread and butter of big data. These databases let us access data more rapidly. They also let us more easily distribute our data over multiple servers so we can have several computers working on it at once. This section will go into how we can leverage these features to do work faster.
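One common way data ends up spread over multiple servers is by hashing each record's key to pick a server. The sketch below imitates that idea with plain dictionaries standing in for servers. Everything in it (`NUM_SHARDS`, `shard_for`, `put`, `get`, and the sample records) is hypothetical, and real NoSQL systems handle this for you in more sophisticated ways.

```python
import hashlib

NUM_SHARDS = 3  # hypothetical number of servers

# Each dictionary stands in for one server's key-value store.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key):
    """Pick a shard index from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def put(key, document):
    """Store a document on whichever shard its key hashes to."""
    shards[shard_for(key)][key] = document


def get(key):
    """Look a document up on the one shard that could hold it."""
    return shards[shard_for(key)].get(key)


# Documents need not share the same fields, unlike rows in a SQL table.
put("user:1", {"name": "Ada", "tags": ["math"]})
put("user:2", {"name": "Grace"})
print(get("user:1"))
```

Because the key alone determines where a record lives, a lookup only has to contact one server, and adding servers spreads both the storage and the work.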

Map Reduce

This is a major programming model used to process data very rapidly. Certain problems can be divided up, sent to multiple computers, and then later collected and combined into a final result. This will require a similar amount of Python programming as the data visualization section but will conceptually be a bit more challenging.
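The divide, process, and combine idea can be sketched on a single machine with a word count, the usual introductory example. This is only an illustration of the pattern using the standard library, not the framework the course will use. The `map_chunk` and `reduce_counts` names and the sample text are made up.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# The input is divided into chunks. In a real system each chunk would be
# sent to a different computer; here they are processed one after another.
chunks = [
    ["big data big tools"],
    ["data tools change", "big ideas stay"],
]

partials = [map_chunk(chunk) for chunk in chunks]
total = reduce(reduce_counts, partials, Counter())
print(total["big"], total["data"], total["tools"])
```

The key property is that each chunk can be counted without knowing anything about the others, so the map step can run on as many machines as there are chunks.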

Projects

There will be two projects to let you experiment with your own data sets and with tools the course may not have covered in the depth you had hoped for. Each project will require taking a data set, finding out something interesting about the set as a whole, and producing a visualization of it. The first project will use a small data set that can reasonably be stored in a relational database and analyzed on a single computer. The final project will need to use a much larger data set.

Assignment Expectations

Code submitted is expected to run without errors. If we execute your code and it throws an error without generating any output you will not get any credit for the code. If your code partially works, you should submit only the working portions that at least produce some output. You should comment out any code that is causing errors.

In other words, you should run your code right before you turn it in and make sure that when you do you do not get any errors. If you do, you should remove or fix the code that throws the errors until it is free from errors.

A program which produces at least some output will fare better than a program which produces none, regardless of the underlying code.

Non-code assignments should be well presented. They should be in a PDF format that is well laid out and easy to read. Text supporting a visualization should be close to the visualization in the document. There should be proper headings and some organizational structure so a reader can quickly find the important information.

Some of the grading can be objective when it comes to correctness, but some of the content in this class is more subjective in nature, especially when it comes to the projects where you have more freedom in what to do.

Review

At this point you should have a pretty good idea of the content in the class. You should consider what sort of class load and work load you have this quarter and whether it is reasonable to fit this challenging CS course into the mix. It is not unusual for a 4-credit CS class to average around 20 hours of work a week. The end products should be impressive and exciting, but it will be hard work producing them.