Dating Tips for Data Analysts
Data is like dating.
As a freshman, all data looks good if you get a chance to lay your hands on it. With a certain maturity, however, you learn to look for the qualities that promise a fulfilling and sustainable relationship.
I once got a psychophysical data set to analyze, and when I opened it, it looked like this:
PK\x03\x04\x14\x00\x00\x00\x00\x00\xce\xad\xca<X\xdb\xba\xc7\xafx\x00\x00
\xafx\x00\x00\x17\x00\x00\x00QuickLook/Thumbnail.jpg\xff\xd8\xff\xe0\x00
\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x19\xfcICC_PROFILE
\x00\x01\x01\x00\x00\x19\xecappl\x02\x10\x00\x00mntrRGB XYZ \x07\xda\x00\x05
It was a Numbers document. I was on a Linux machine. It could just as well have been some obscure proprietary MATLAB file, or some esoteric format that whoever gathered the data made up during their lunch break. Lesson learned:
1. There’s no data like human readable data.
Then, very often I get data that remind me of the cryptic exercises I had to solve in my second year:
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
Data connoisseurs will easily identify this as the first four samples of the Iris data set, but in reality this is fatal! In the best cases, there is some description or a set of column labels, but more often than is humanly bearable I have had to look into the code that collected the data to find out what I am actually dealing with. At the very least, your data set should look like this:
sepal length, sepal width, petal length, petal width, species
5.1, 3.5, 1.4, 0.2, setosa
Even better, tell me who recorded the data, when, and what for. Hence my second rule of sustainable data happens to be identical to the second commandment of the Zen of Python:
2. Explicit is better than implicit.
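A labeled header row also pays off the moment the file is read by a program. The following is a minimal sketch using Python's standard `csv` module; the helper name `read_labeled` and the inline sample text are assumptions made for illustration, not part of any existing tool.

```python
import csv
import io

# Hypothetical file contents: the header row from the example above plus two samples.
RAW = """sepal length, sepal width, petal length, petal width, species
5.1, 3.5, 1.4, 0.2, setosa
4.9, 3.0, 1.4, 0.2, setosa
"""

def read_labeled(text):
    """Parse CSV text whose first row names the columns."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    samples = []
    for row in reader:
        sample = {}
        for name, value in row.items():
            # Convert numeric fields to float; keep labels such as the species as text.
            try:
                sample[name] = float(value)
            except ValueError:
                sample[name] = value
        samples.append(sample)
    return samples

samples = read_labeled(RAW)
print(samples[0]["petal length"])
```

Because every value is addressed by its column name, nobody has to remember that petal length happens to live in the third column.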
Now, just like your teenage relationships, no two data sets are alike. In reality, the perfect case where your data fits into a big, N-dimensional matrix will almost never happen. You’ll have to deal with missing values, with values of variable length and dimensionality and ƒüṉǹƴ ểŋcǿƋıȵɠş. Make sure to pick a data format that can deal with all your possible special needs:
3. Blessed are the flexible, for they shall not be bent out of shape.
Which is why I have come to store all my data as JSON. So our beloved iris set would look like this:
[
    {
        "sepal_length": 5.1,
        "sepal_width": 3.5,
        "petal_length": 1.4,
        "petal_width": 0.2,
        "species": "setosa"
    },
    {
        "sepal_length": 4.9,
        "sepal_width": 3.0,
        "petal_length": 1.4,
        "petal_width": 0.2,
        "species": "setosa"
    },
    ...
]
Too explicit, you say? Wastefully redundant? Yes. But honestly: who cares? Usually, time is a more limiting factor than storage space. And if it comes down to it, we can always compress the file with standard algorithms: the original dataset is 4500 bytes, compared to 16500 for the JSON-formatted data, but once zipped, the JSON file is only moderately larger than the original (1400 bytes versus 1000). JSON can reflect virtually any complexity we may encounter in data sets. And there are already tools to parse and generate JSON files in pretty much any useful programming language.
Which brings me to a fundamental problem of data analysis: even the most explicit data doesn't analyze itself. And being a generally sane person with a finite amount of time and patience, I like to use tools for that. Being a Python monogamist, I found myself writing a few very similar lines over and over again:
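The size argument is easy to check on your own data. The sketch below compares a compact comma-separated encoding with the explicit JSON encoding, raw and gzip-compressed. The rows are synthetic stand-ins generated on the spot, not the real Iris measurements, so the printed sizes will differ from the figures quoted above. The `report` helper is likewise just an illustration.

```python
import gzip
import json

FIELDS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
SPECIES = ["setosa", "versicolor", "virginica"]

# Synthetic stand-in data: 150 deterministic rows, NOT the real Iris measurements.
rows = [
    [
        round(4.3 + (i % 36) * 0.1, 1),
        round(2.0 + (i % 24) * 0.1, 1),
        round(1.0 + (i % 59) * 0.1, 1),
        round(0.1 + (i % 25) * 0.1, 1),
        SPECIES[i // 50],
    ]
    for i in range(150)
]

# Compact form: one comma-separated line per sample, no labels.
csv_bytes = "\n".join(",".join(str(v) for v in row) for row in rows).encode("utf-8")

# Explicit form: every sample repeats its field names.
records = [dict(zip(FIELDS, row)) for row in rows]
json_bytes = json.dumps(records, indent=1).encode("utf-8")

def report(label, payload):
    """Print the raw and gzip-compressed size of a byte string."""
    compressed = gzip.compress(payload)
    print(label, len(payload), "bytes raw,", len(compressed), "bytes gzipped")

report("CSV ", csv_bytes)
report("JSON", json_bytes)
```

The repeated field names are exactly the kind of redundancy a general-purpose compressor removes, which is why the gap between the two formats narrows after compression.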
controls = [sample for sample in data if sample['condition'] == 'control']
Essentially, filtering the data set. I did this so often that I ended up writing utility functions for it, and eventually these functions matured into a slick module for super-rapid pre-processing of my datasets. It's called knyfe, and it's now open-source. If you're dealing with data, you may want to check it out. It allows you to do stuff like this:
>>> cereals = knyfe.Data("examples/cereals.json")
>>> print set(cereals.manufacturer)
set(['Kelloggs', 'Nabisco', 'Ralston Purina', 'Quaker Oats', 'Post', 'General Mills'])
>>> kelloggs_products = cereals.filter(manufacturer="Kelloggs")
>>> kelloggs_products.export("kelloggs.xls")
and it looks great in an IPython console.
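For readers who want to see the filtering idea without installing anything, here is a minimal plain-Python sketch. The function `filter_samples` and the sample records are hypothetical and written for this illustration. This is not how knyfe is implemented, only the general pattern of matching keyword criteria against a list of dictionaries.

```python
def filter_samples(data, **criteria):
    """Return the samples whose fields equal all given keyword values.

    Hypothetical helper for illustration; it is not knyfe's implementation.
    """
    return [
        sample for sample in data
        # .get() tolerates samples that lack a field entirely.
        if all(sample.get(key) == value for key, value in criteria.items())
    ]

# Made-up samples, including one with a missing 'condition' field.
data = [
    {"subject": 1, "condition": "control", "rt": 0.41},
    {"subject": 2, "condition": "treatment", "rt": 0.38},
    {"subject": 3, "rt": 0.52},
    {"subject": 4, "condition": "control", "rt": 0.45},
]

controls = filter_samples(data, condition="control")
print([sample["subject"] for sample in controls])  # prints [1, 4]
```

Note the comparison with `==` rather than `is`: identity checks on strings only work by accident, when the interpreter happens to reuse the same string object.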






