Data Formats

Introduction

As mentioned in the section introduction there are 3 major kinds of data storage formats we are going to look at. CSV/TSV which use commas or tabs to seperate attributes and newline characters to represent each datum. XML which uses <key>value</key> formatting to structure data, this is very similar to HTML. And then there is JSON which you have seen in Python which uses various characters to represent objects, arrays and elements. All of these data types have their benefits and drawbacks.

CSV/TSV

CSV/TSV is probably the simplest data format to grasp. It stores tabular data and each column is separated by either a tab or a comma. There is no way to define data types, you need to rely on the source of the data to provide instruction on what the meaning and type of data is in each column or you need to intuit it yourself.

first_name,last_name,phone
Justin,Wolford,(123)456-7890
Alice,Alison,(234)567-8901
Bob,Bobberson,(456)789-0123

This is pretty straight forward. Maybe not the easiest to read, but each row represents a person with their first name, last name and phone number. Nothing is ambiguous, we have smooth sailing ahead with this data set!

If that is smooth sailing, let us visit a boat with a missing rudder which is quickly sinking in a hurricane. Suppose we want to store a list of E.E. Cummings poems. The things we want to store are the title, the text and the year.

This is !blac by E. E. Cummings

!blac
k
agains
t

(whi)

te sky
?t
rees whic
h fr

om droppe

d
,
le
af

a:;go

e
s wh
IrlI
n

.g

I promise, I did not just let my cat loose on my keyboard. There are empty lines. There is a line with just a , (!!!).

There are more typical poems as well like even a pencil has fear to

even a pencil has fear to
do the posed body luckily made
a pen is dreadfully afraid
of her of this of the smile's two
eyes.too, since the world's but
a piece of eminent fragility.
Well and when-Does susceptibility
imply perspicuity, or?
shut
up.
Seeing
seeing her is not
to something or to nothing as much as
being by her seen, which has got
nothing on something as i think,
did you ever hear a jazz
Band?

or unnoise men don't make soup who drink.

I hope you can quickly see the disaster of string parsing that we are about to encounter. We use commas and newlines to convey yet commas and newlines are used liberally in these poems. In essentially all modern programming languages we have specific rules about how we escape special characters. Things like ' or " have special meaning so we use \' or \" when we want the literal character to show up.

CSV and TSV formats do not specify any hard rules. Many libraries will use conventions similar to the language they are implemented in but you need to be careful because there are additional special characters like , that are not a special character in Python, but are in a CSV file. The same for the tab character in a TSV.

Other conventions are to use doubles of quote characters like "" to represent a single instance of the " character. The point of all of this is that one of the most challenging things you will run into when working with CSV or TSV files is dealing with character escaping. Where possible you should use the same library to both encode and decode something from CSV or TSV so that you are most likely to not run into issues or change data accidentally.

One major issue that is an upshot of this is that trying to read a CSV file as a human can be really difficult and because there are not great standards for CSV or TSV files it might be hard to load your data into a program like Excel that would let you quickly view the data in a more reasonable to read format to get your head around what is going on.

XML

Extensible Markup Language or XML has been around for since the mid 90s. It uses tags to define or structure data. There are three kinds of tags

<start-tag>
</end-tag>
<empty-tag />

Start and end tags share the same name but the end tag is prefixed with a /. Here is a simple example:

<person>
    <first-name>
    Justin
    </first-name>
    <last-name>
    Wolford
    </last-name>
    <phone-number>
    (123)456-7890
    </phone-number>
</person>

This is the same data that could be stored in a single line in our CSV. So we can see it is a bit (a lot) more verbose. In this case we have tags that make up several elements. We have a person element, a first name element, last name element and a phone number element. We can also do things like have image elements. It would look like this:

<img src="kitten.jpg" alt="Kitten" />

This element could represent an image. It has attributes within the tag that provide additional metadata, in this case it provides the source of the image and the alternate text to display for screen readers or as hover text.

Unlike CSV or TSV files XML has a great deal of specification around how it is written and read, what characters are permitted and how to escape characters.

What is nice about formats like XML is that you have a lot more flexibility. In between tags almost all characters are valid. The only characters you are going to have issues with are the < and > characters, and as we will see in a second, the & character.

Like Python, we have a way for displaying characters that would otherwise be problematic. We use the following strings to represent single characters < represents the less than symbol. > the greater than and & to represent the ampersand symbol. The reason we need that last one is because XML decided to use it to denote the start of an escape sequence.

With that in mind our structure for these poems would be quite simple

<poem>
    <author>
    E. E. Cummings
    </author>
    <text>
!blac
k
agains
t

(whi)

te sky
.
.
.
    </text>
</poem>

The only thing we need to be careful of is preserving whitespace. A lot of poetry has intention indentation so we want to make sure that the text in the text field preserves whitespace and that we don’t insert tabs for readability. Whereas for the author we may want to leave tabs in to preserve readability and trim them upon using the content of the author tag.

Using XML is never that easy. It is also generally not that hard. It can just take up a very large amount of space for a very small amount of data. Remeber that every character is a byte or two depending on encoding and that data needs to be saved. So if you data set consists of very small data but your XML tags are very large, you could easily end up multiplying your file size by a factor of 10 or more.

JSON

I do a lot of web development and work with a lot of web APIs (An API is a service that sends data over the web in this context). In that particular use case JSON is the most commonly used datatype. I find that it is generally a good balance between XML and CSV. There is a standard and it can reasonably be summarized in one page.

XML let you store whatever structure you want and lets you define whatever tags you want. This is very flexible. JSON is a little more strict. You can store objects, arrays and elements. With elements being strings, numbers, booleans or null. You can’t just define a tag called <image> and be on your way.

When you printed out dictionaries in Python you saw notation that looks really similar to JSON.

{"people":[
{"first_name":"Justin", "last_name":"wolford", "phone":"(123)456-7890"},
{"first_name":"Alice", "last_name":"Alison", "phone":"(234)567-8901"}
]}

JSON is primarily a collection of objects stored as key value pairs. So here we have a object with one attribute people which is a list of other objects, those objects contain three properties first_name, last_name and phone.

This is pretty much analogous to having a dict containing a list of dicts in Python.

We see that we can’t add meta data to tags like we can in XML, but we can store stuff in around half of the space because we don’t need closing tags. Objects and lists are also denoted with a single starting and ending character.

So it is a little less flexible than XML, but I would argue that makes it easier to read. There are only so many options.

However, note that we need to repeat property names, so in our list of people, each object has a string for each property, "first_name" for example. In a CSV file we would need to list the column name only once. So JSON tends to be quite a bit larger than a CSV file, but it is well defined with standards for how strings are to be escaped. You can have a high level of confidence than if you encode a JSON file on one machine, someone else will be able to decode it on another with little to no configuration. The same cannot be said for CSV where options need to be specified for things like header rows, escape characters and newline characters.

Activity

There are lots of online tools that will check JSON and XML syntax. To make sure you have a basic understanding of the syntax of JSON go here and create a simple JSON object by hand with a couple properties and an array and make sure it validates.

Do the same with XML here. For XML have a couple nested tags and include at least one attribute on the tag.

As CSV does not really follow a standard there is not much you can do to make sure you get that syntax.

Review

This should introduce you to the major data formats you are likely to encounter in your day-to-day work with data sets. You should understand, to some extent, the advantages and disadvantages of these formats and have an idea as to when one might be preferable to others.