A Data Wrangling Case Study

Introduction

Now for something a little different. So far there has been a pretty clear one to one correlation between tools and tasks. I could teach a tool and give you a task to do with it. Learn how a constructor works, make a class with a constructor.

Data wrangling is a different beast. It is going to feel a bit like art class rather than math class… except maybe a bit less fun. You should already have learned the tools: loops, regular expressions, reading and writing from files. There might be a few variations on those general ideas that you will need to use like a JSON or CSV specific file reader, but thats about it.

Instead it is something that comes with practice and experience. So that is really what this week is going to be about. I can show you what my workflow looked like for a particularly nasty data set consisting of a couple hundred thousand Material Safety Data Sheets, it will illustrate a variety of tools that got used, but ultimately you will get a lot more out of it by going out and doing this work yourself. So let the the following be a bit of a guide as to the sort of track you might want to take, but really you should be experimenting with your own data set and thats going to potentially look a lot different than what I am doing here.

Initial Work on the Server

Orientation

I did my work on an archive of 230,000 MSDSs. This dataset was stored as a hard drive image on Amazon Cloud Services, a product fairly similar to Google and had around 1.9GB of data.

The first step was to explore the format and decide what I want to do with it. Usually I would get handed a set of data and a task, but all of the work I do of this nature is proprietary so I can’t really use it as an example.

Initial Processing

The data came as hundreds of thousands of text files. If it had come as some sort of more structured data type I probably could have simply uploaded it directly to some variety of table based tool on Google and done my work there more quickly. But I wasn’t so lucky. It came as messy text files.

So the first task was to convert something that looks like this into a few properties I was interested in looking at. Being Material Safety Data Sheets it seems like an interesting task would be to find the most dangerous chemicals on the list and because that isn’t hard enough, lets break it down by color too!

So that was the mission I put out for myself. Maybe there was more interesting stuff to pull out of here, maybe not. Either way I had my quest.

Regular Expressions

The obvious tool for the job was going to be regular expressions

'.*Product ID: ?(.*?)\n.*MSDS Number: ?([A-Z]+).*LD ?50.*?RAT.*?
(>? ?[\d,.]+ ?M?[GL]/KG).*Appearance and Odor:(.*?)\n'

This one in particular was what I settled on after a few runs. I will talk about what did and didn’t work here.

.*Product ID: ?(.*?)\n

This grabbed everything on the line after Product ID:. That reliably gave me the market name of the substance, PALCO 840 for example.

.*MSDS Number: ?([A-Z]+)

This reliably grabbed the MSDS number which was always 5 upper case characters that came after the MSDS Number: prefix.

These portions worked very well.

.*LD ?50.*?RAT.*?(>? ?[\d,.]+ ?M?[GL]/KG)

This portion worked less well that I had initially thought. The lines I wanted to capture often looked something like

LD50:(ORAL,RAT) 470 MG/KG.

But sometimes they had other words in between, sometimes there were no spaces, sometimes there were commas or decimal points, sometimes it was ML instead of MG. This regular expression actually did a pretty good job with all of that.

Another factor I had to capture was some LD50 measurements were >5000 MG/KG meaning that they were unable to determine a value that was lethal to half of rats. This is different than actually measuring a value of 5000 MG/KG so that is why the > character is in the regex.

So where this went wrong most of the time is due to omissions of the ?. This is the optional modifier. It is also the non-greedy modifier. Take the string AAABAB. .*B will match the whole string, AAABAB because it will match as much as possible. .*?B will match AAAB because that is the shortest match from the beginning of the string. As soon as it hits the first B the pattern is satisfied and the search is done.

So the problem was that (>? ?[\d,.]+ ?M?[GL]/KG) was not followed by a ?. So if at some point later in the file there was another MG/KG reference, it would capture that in the group rather than the measurement that was closest to the LD50 RAT. That got me some bad data.

Finally

Appearance and Odor:(.*?)\n

captured the appearance and odor line, this worked as expected but would prove to be difficult to analyse later.

The rest of the contents of this wrangling script dealt with converting the results to JSON. This was accomplished by making each MSDS a dict with these captured groups and appending them all to a list. That was written to a file with the following code

with open('msds_ld50.json', 'w') as fp:
    json.dump(msds_list, fp)

This opens the file msds_ld50.json for writing, passes that pointer to json.dump along with the list of MSDSs. That will write the contents of the object to a file, in this case a list of around 30k MSDSs that matched the regular expression. Ones that didn’t match probably didn’t have an LD50 listed.

Conversion

A minor step along the way was to convert it to human readable JSON for my benefit and to CSV to upload it to Google Cloud DataPrep.

import json
import csv

msds_list = []

with open('msds_ld50_with_colors.json', 'r') as fp:
    msds_list = json.load(fp)

with open('msds_readable.json', 'w') as fp:
    json.dump(msds_list, fp, indent=2)

with open('msds_ld50.csv', 'w', newline='') as fp:
    fieldnames = msds_list[0].keys()
    writer = csv.DictWriter(fp, fieldnames)
    writer.writeheader()
    for msds in msds_list:
        writer.writerow(msds)

This code is pretty uneventful. Again we are opening a variety of files, those with r are opened for reading, w for writing. json.load is the opposite of json.dump, in that it returns a JSON object from a file. We can write it as human readable just by including the indent argument and passing it a positive integer. This will add newlines and that much indentation where needed.

The CSV writer takes a little more, you can pass DictWriter a file pointer and a list of field names and it will return a writer. Then you can call its writerow method and pass it a dict with the the keys listed in the fields and it will write them to a row.

After this it was off to Cloud Dataprep.

Colors

A quick aside, I wanted to do some work with the colors of the substances. This would mean that I would need to identify what color things were. The list of color names I had on hand was from XKCD’s color survey. A humorous read if you need a break.

I used the list of colors as a pattern to match against in the substance descriptions. That let me extract a list of matches for each substance.

Dataprep

Google Cloud Dataprep is a tool to help with organizing structured data. It is reasonably well documented and I am not an expert on it. But it is very easy to experiment.

The important part was that I had rows that looked like this

LD50_Rat,MSDS_Number,Appearance and Odor,colors,Product_ID
"> 5,000 MG/KG",BPBQL,PALE YELLOW VISCOUS LIQUID - FAINT ODOR,"['PALE', 'PALE YELLOW', 'YELLOW']",FLDCON ANTISCALANT 100-PART C

and I needed to get it into a form where I could easily do some math on the LD50.

I ended up converting that into this JSON

{
    "MSDS_Number": "BPQLQ",
    "LD50_Rat1": "20",
    "LD_Greater_Than": true,
    "LD50_Rat2": "G/KG",
    "Appearance_And_Odor": "DARK BROWN LIQUID WITH SLIGHTLY AROMATIC ODOR",
    "Colors": "[\"DARK\", \"DARK BROWN\", \"BROWN\"]",
    "Product_ID": "GPW-A, POLYMERIC MDI GUARDPAK-W COMPONENT A"
}

It is best seen in the video as it was a lot of steps to get there. It also shows using corrected LD50 values to replaces ones with a decimal point which were broken.

SQLite

So now I had a nice list of JSON broken down to almost what I was looking for. I would have to do a little more work to get it to be proper JSON, but it was not too bad and is detailed in the video.

Table Creation

The next step was to get it loaded into a SQL database. I didn’t want to use the ONID database as that is difficult to access from a Python program if you are off campus. So I used SQLite, a version of SQL that misses a lot of features, but is 100% stored in a single file and very easy to access from Python.

import sqlite3

conn = sqlite3.connect('msds.db')
c = conn.cursor()

c.execute('''CREATE TABLE IF NOT EXISTS msds
            (MSDS_Number text PRIMARY KEY NOT NULL,
            LD50_Rat1 real,
            LD50_Rat2 text,
            LD_Greater_Than integer,
            Product_ID text)''')          
conn.commit()
conn.close()

This is an example of creating a table. You open a connection to a file. Then you get a cursor from that connection, which is how you interact with the database. The execute method of the cursor will execute SQL. Then you call the connections commit method to actually save your work to disk and finally close the connection.

Inserts

Inserts are a little more interesting.

c.execute("INSERT INTO msds VALUES (?,?,?,?,?)",
(msds["MSDS_Number"],float(msds["LD50_Rat1"]),msds["LD50_Rat2"],int(msds["LD_Greater_Than"]),msds["Product_ID"]))

The INSERT syntax is the same, I don’t specify column names here because I am just inserting into every column in the order they were defined. Then I use ? to be placeholders for things I am going to be changing in the query. I provide a single tuple that has one value for each question mark. This helps prevent people from injecting baq SQL code and is good practice. I am not even going to show you the bad way to do it. Do it this way.

Selects

Selects are quite similar to any other SQL syntax.

for row in c.execute('''SELECT color.Color, AVG(LD50_Rat1) AS AVGLethalMG FROM color
                        INNER JOIN msds_color
                        ON color.Color = msds_color.Color
                        INNER JOIN msds
                        ON msds_color.MSDS_Number = msds.MSDS_Number
                        WHERE msds.LD_Greater_Than = 1 AND LD50_Rat2 = "MG/KG"
                        GROUP BY color.Color ORDER BY AVGLethalMG ASC
'''):
    print(row)

This runs a fairly complex SQL query to average the LD50 of substances by their color. Notice you can just use a for loop to loop over the results of the execute call, the rows are returned as tuples.

With that we know that if you have pet rats you might want to keep them away from things which are ‘blue violet’ colored or which are ‘steel’ colored. Those substances seem to be particularly fatal to rats. On the other hand it seems like they could probably eat about a quarter of their body weight in ‘tan brown’ substances and be OK.

Parting Thoughts

The code I wrote for this could have been a lot cleaner. I could have abstracted stuff and made things reusable. But for the most part I don’t really expect to be parsing MSD Sheets to figure out what color substances are the most fatal to rats any time soon.

This sort of work isn’t really about making elegant code. It is about getting data into a format where you can do interesting work. Don’t get hung up on making pretty solutions.

Do get hung up on testing often to make sure your code is doing what you expect it to be doing. It is really easy to make it a long way through a multi-step process and realize you made an error back at step 1 that makes you run all the subsequent steps again.

Activity

Go do stuff like this on data you are interested in. And keep doing it. That is how you will get better at it.