Data Formats and Data Wrangling

Introduction

When working with data sets, there are a few data formats you are most likely to encounter. In no particular order, these are CSV/TSV files (comma- or tab-separated values), XML (eXtensible Markup Language), and JSON (JavaScript Object Notation).

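These formats have different requirements for how data is stored, so you sometimes need to do some extra work to switch between them. CSV and TSV hold flat rows of untyped text, for example, while JSON and XML can nest records inside one another, and JSON also distinguishes numbers from strings.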
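
As a rough illustration of that extra work, here is a minimal sketch of converting CSV to JSON using Python's standard csv and json libraries. The sample data, the column names, and the csv_to_json helper are all made up for this example. Note that the age column has to be converted by hand, because CSV carries no type information.

```python
import csv
import io
import json

# Hypothetical sample data. A real file would be opened with
# open(path, newline="") instead of wrapping a string in StringIO.
CSV_TEXT = "name,age,city\nAda,36,London\nGrace,45,New York\n"


def csv_to_json(csv_text, int_columns=("age",)):
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # Every CSV value arrives as a string, so numeric columns
        # must be converted explicitly before writing JSON.
        for column in int_columns:
            row[column] = int(row[column])
        rows.append(row)
    return json.dumps(rows, indent=2)


json_text = csv_to_json(CSV_TEXT)
print(json_text)
```

Going the other direction (JSON to CSV) is usually harder, since nested JSON objects have to be flattened into columns first.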

In addition, you will often find data sets that are formatted inconsistently or that have missing or incomplete data. This section tries to address how to work with or around these issues.
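
To make those issues concrete, here is a small sketch of cleaning up a messy CSV sample. Everything in it is an assumption for illustration: the sample rows, the set of strings treated as "missing", and the choice to either drop incomplete rows or fill them with the mean. Which option is appropriate depends on your data set and what you plan to do with it.

```python
import csv
import io

# Hypothetical sample with stray whitespace, mixed capitalization,
# and two different ways of writing "no value".
RAW = (
    "name,age,city\n"
    "  Ada ,36,london\n"
    "Grace,,New York\n"
    "Alan,n/a,LONDON\n"
)

# Assumed conventions for missing values; real data sets vary.
MISSING_MARKERS = {"", "n/a", "na", "null", "none"}


def clean_row(row):
    """Trim whitespace, map missing markers to None, normalize types."""
    cleaned = {}
    for key, value in row.items():
        value = (value or "").strip()
        cleaned[key] = None if value.lower() in MISSING_MARKERS else value
    if cleaned["age"] is not None:
        cleaned["age"] = int(cleaned["age"])
    if cleaned["city"] is not None:
        cleaned["city"] = cleaned["city"].title()
    return cleaned


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]

# Option 1: drop rows that are missing the value you need.
complete = [r for r in rows if r["age"] is not None]

# Option 2: fill missing values, here with the mean of the known ages.
known_ages = [r["age"] for r in complete]
mean_age = sum(known_ages) / len(known_ages)
filled = [
    dict(r, age=r["age"] if r["age"] is not None else mean_age)
    for r in rows
]
```

Dropping rows keeps only values you actually observed but shrinks the data set. Filling keeps every row but adds values that were never measured.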

Key Questions

  • What are the major differences between the common data formats?
  • What are the major factors one should consider when deciding on a data format?
  • What are some options for dealing with missing data?
  • What are some tools and techniques to deal with inconsistently formatted data?

Assignment Overview

This week you are going to start exploring a data set, hopefully your own. The task for the week is to get it into a state where you are ready to do things like run programmatic analysis or visualize your data. That means either loading it into a local database or getting it into a format that a tool can parse easily.
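
As one possible picture of what "loaded into a local database" can look like, here is a minimal sketch using Python's built-in sqlite3 module. The people table, its columns, and the sample data are hypothetical. An in-memory database is used so the example leaves no files behind, whereas for your project you would pass a file name to sqlite3.connect.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for your own data set.
CSV_TEXT = "name,age,city\nAda,36,London\nGrace,45,New York\n"


def load_into_sqlite(csv_text, conn):
    """Create a table and insert one row per CSV record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people "
        "(name TEXT, age INTEGER, city TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    # Convert types up front so the database stores real integers.
    rows = [(r["name"], int(r["age"]), r["city"]) for r in reader]
    # Placeholders (?) let sqlite3 handle quoting safely.
    conn.executemany(
        "INSERT INTO people (name, age, city) VALUES (?, ?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_into_sqlite(CSV_TEXT, conn)

# Once loaded, analysis becomes a query.
result = conn.execute(
    "SELECT city, AVG(age) FROM people GROUP BY city ORDER BY city"
).fetchall()
print(result)
```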

Explore the Topics

Data Formats
This exploration will look at some of the more common data formats you are likely to encounter.
A Data Wrangling Case Study
This looks at one particular data wrangling project that involves starting from hundreds of thousands of poorly standardized text files and ending with a list of colors that make the best rat poison.

Additional Resources

JSON.org
A good reference for JSON structure. Includes a syntax reference and examples along with a number of libraries to parse JSON in different programming languages.
Microsoft XML Example
XML is another popular format. This page has some good samples of what XML looks like.
SQLite3
This is the site for SQLite. It is independent of any platform or programming language, so you can see what the syntax looks like and what data types are supported. If you want to see how to use it in Python, you will need to look at the Python documentation.
Python CSV and JSON Libraries
These are going to be very useful in helping you tackle formatting data quickly and easily.

Review

So this was a bit of a non-standard week. There will be a couple more like it that focus more on your project, but this one focused on mine. Hopefully you can see what the process looks like and take from it the pieces that might be helpful for you.