Python on love:
import this
love = this
this is love, love is not True or False, love is love
# (True, True, True)
Data is like dating.
As a freshman, all data looks good if you get a chance to lay your hands on it. With a certain maturity, however, you learn to look for the qualities that promise a fulfilling and sustainable relationship.
I once got a psychophysical data set to analyze, and when I opened it, it looked like this:
PK\x03\x04\x14\x00\x00\x00\x00\x00\xce\xad\xca<X\xdb\xba\xc7\xafx\x00\x00
\xafx\x00\x00\x17\x00\x00\x00QuickLook/Thumbnail.jpg\xff\xd8\xff\xe0\x00
\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x19\xfcICC_PROFILE
\x00\x01\x01\x00\x00\x19\xecappl\x02\x10\x00\x00mntrRGB XYZ \x07\xda\x00\x05
It was a Numbers document. I was on a Linux machine. It could just as well have been some obscure proprietary MATLAB file, or some esoteric format that whoever gathered the data made up during their lunch break. Lesson learned:
Then, very often I get data that remind me of the cryptic exercises I had to solve in my second year:
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
Data connoisseurs will easily identify this as the first four samples of the Iris data set, but for real-world data this lack of labels is fatal! In the best cases, there is some description or there are column labels, but more often than humanly bearable I have had to look into the code that collected the data to find out what I was actually dealing with. At the very least, your data set should look like this:
sepal length, sepal width, petal length, petal width, species
5.1, 3.5, 1.4, 0.2, setosa
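Once the header row is there, standard tools can pick up the labels on their own. A minimal sketch in Python 3, assuming the file looks exactly like the two lines above (the variable names `raw` and `samples` are mine, and a second Iris row is added for illustration):

```python
import csv
import io

# The labelled example from above, plus one more Iris row.
raw = """sepal length, sepal width, petal length, petal width, species
5.1, 3.5, 1.4, 0.2, setosa
4.9, 3.0, 1.4, 0.2, setosa
"""

# skipinitialspace drops the blank that follows each comma,
# in the header row as well as in the data rows.
reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
samples = list(reader)

for sample in samples:
    # Values are addressed by name instead of by column position.
    print(sample["species"], float(sample["petal length"]))
```

Note that `csv` hands back every value as a string, so the numbers still have to be converted by hand.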
Even better, tell me who recorded the data, when, and what for. Hence my second rule of sustainable data happens to be identical to the second commandment of the Zen of Python:
Explicit is better than implicit.
Now, just like your teenage relationships, no two data sets are alike. In reality, the perfect case where your data fits into a big, N-dimensional matrix will almost never happen. You’ll have to deal with missing values, with values of variable length and dimensionality and ƒüṉǹƴ ểŋcǿƋıȵɠş. Make sure to pick a data format that can deal with all your possible special needs:
Which is why I have come to store all my data as JSON. So our beloved iris set would look like this:
[
    {
        "sepal_length": 5.1,
        "sepal_width": 3.5,
        "petal_length": 1.4,
        "petal_width": 0.2,
        "species": "setosa"
    },
    {
        "sepal_length": 4.9,
        "sepal_width": 3.0,
        "petal_length": 1.4,
        "petal_width": 0.2,
        "species": "setosa"
    },
    ...
]
Too explicit, you say? Wastefully redundant? Yes. But honestly: who cares? Usually, time is a more limiting factor than storage space. And if it comes down to it, we can always compress the file with standard algorithms: the original dataset is 4500 bytes, compared to 16500 bytes for the JSON-formatted data, but the zipped JSON file is only marginally larger than the zipped original file (1400 bytes versus 1000). JSON can reflect virtually any complexity we may encounter in data sets. And there are already tools to parse and generate JSON files in pretty much any useful programming language.
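To get a feeling for the size argument, here is a rough sketch in Python 3 that turns unlabelled CSV rows into JSON records and compares raw and compressed sizes. It rests on a few assumptions of mine: the four sample rows are simply repeated as a stand-in for the full file, `zlib` stands in for "zipping", and the field names are the ones from the JSON example. The numbers it prints are therefore not the ones quoted above.

```python
import csv
import io
import json
import zlib

# Four Iris rows, repeated as a stand-in for the full data set (an assumption).
raw = """5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
""" * 25

fields = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# Attach a name to every value and convert the measurements to numbers.
records = []
for row in csv.reader(io.StringIO(raw)):
    record = dict(zip(fields, row))
    for name in fields[:4]:
        record[name] = float(record[name])
    records.append(record)

as_json = json.dumps(records, indent=2)


def sizes(text):
    """Return (raw size, zlib-compressed size) in bytes."""
    data = text.encode("utf-8")
    return len(data), len(zlib.compress(data))


csv_size, csv_zipped = sizes(raw)
json_size, json_zipped = sizes(as_json)
print("CSV:  %d bytes raw, %d bytes compressed" % (csv_size, csv_zipped))
print("JSON: %d bytes raw, %d bytes compressed" % (json_size, json_zipped))
```

Because the repeated rows compress unusually well, treat the output as an illustration of the trend only: the JSON overhead is mostly repeated key names, which is exactly what compression removes.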
Which brings me to a fundamental problem of data analysis: even the most explicit data doesn’t analyze itself. And being a generally sane person with a finite amount of time and patience, I like to use tools for that. Being a Python monogamist, I found myself writing a few very similar lines over and over again:
controls = [sample for sample in data if sample['condition'] == 'control']
Essentially, filtering the data set. I did this so often that I ended up writing utility methods for it, and eventually these methods matured into a slick module for super-rapid pre-processing of my datasets. It’s called knyfe, and it’s now open-source. If you’re dealing with data, you may want to check it out. It allows you to do stuff like this:
>>> cereals = knyfe.Data("examples/cereals.json")
>>> print set(cereals.manufacturer)
set(['Kelloggs', 'Nabisco', 'Ralston Purina', 'Quaker Oats', 'Post', 'General Mills'])
>>> kellogs_products = cereals.filter(manufacturer="Kelloggs")
>>> kellogs_products.export("kellogs.xls")
and looks great in an IPython console:

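For readers who just want the idea without the library, the kind of utility method I started out with can be sketched in a few lines of plain Python. This is not knyfe’s implementation; the function name `filter_samples` and the sample records are made up for illustration.

```python
def filter_samples(data, **conditions):
    """Return the samples whose fields equal all of the given keyword values."""
    return [
        sample for sample in data
        if all(sample.get(key) == value for key, value in conditions.items())
    ]


# Hypothetical records, in the list-of-dicts shape used for the JSON data above.
data = [
    {"condition": "control", "subject": "A", "response_time": 0.61},
    {"condition": "treatment", "subject": "A", "response_time": 0.48},
    {"condition": "control", "subject": "B", "response_time": 0.70},
]

controls = filter_samples(data, condition="control")
controls_b = filter_samples(data, condition="control", subject="B")
print(len(controls), len(controls_b))
```

Using `sample.get(key)` means that samples lacking a field are silently skipped rather than raising a `KeyError`, which suits data sets with missing values.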
The @property syntax may be syntactic sugar, but offers a few surprisingly edgy use cases:
class Lady(object):
    def __init__(self, age=42):
        self._age = age

    @property
    def age(self):
        return self._age - 10

madonna = Lady(age=53)
print madonna.age
# 43
If you’re working in the natural sciences, sooner or later you will have to learn how to typeset in LaTeX. There are a number of LaTeX editors out there, but I’m a bit old-fashioned and have a profound dislike for obese IDEs, so I usually stick to my coder’s editor of choice (lately Sublime Text 2). However, there are some tools that have become a fixed part of my workflow, and for good reasons.
I don’t have a strong preference for Mendeley over any other article management system (such as Papers), other than that it’s free. But the one feature your reference manager should have is automatically generating a .bib file containing all your articles. Put that file in a directory where BibTeX looks for bibliographies (for example one listed in the BIBINPUTS environment variable) and reference any paper from any article you’re writing.
VisionObjects have a great demo that instantly converts your handwriting to LaTeX formulas. It works with an HTML5 canvas, so you can use it on your tablet as well.

You can do all your presentation slides in LaTeX using the Beamer package, but unless your slides are packed with formulas I’d hardly recommend doing so. If you just need to insert a formula every now and then, LaTeXiT is an incredibly useful little tool for OS X: it’s a graphical front end to LaTeX that renders your equation in the application window, from where you can just drag and drop it into Keynote or PowerPoint to insert it as a PDF.

(Image copyright lies with Pierre Yves Chatelier)
An online alternative with the same feature set is the CodeCogs equation editor; however, it occasionally comes up with, eh, non-standard interpretations of your equations.
Well, you knew that sooner or later some slippery innuendo would come. Anyway, ShareLaTeX is a collaborative LaTeX editor. Think EtherPad with an integrated LaTeX compiler and syntax highlighting. Free.
If you think something is missing here let me know and I’ll update the article.
Busted Tees had this nice shirt today; I turned it into a widescreen wallpaper for your personal enjoyment.
1920 x 1080 | 1440 x 900 | 1280 x 1024 | 1280 x 800 | 1024 x 1024 (for iPad)
If you’re a Mac user and productivity geek, you’ll already know nvALT (née Notational Velocity). I won’t go into detail about why it is the best note-taking tool around (other people have done so already). Bottom line is: nvALT is a text editor that lets you store your notes as plain text files in a Dropbox folder and offers full-text search through all your notes. And it’s dead simple.
Better still: you can share that Dropbox folder with your friends and effectively have a private wiki. That gave me the idea of managing all my recipes with nvALT as well (I was using SousChef before, using only 15% of its feature set), and of sharing them with friends who love cooking as much as I do.

At this year’s Chaos Communication Congress I had the opportunity to try out a new angle on sensory substitution with a short lightning talk.
Abstract: A brain hack means identifying a neuronal mechanism that (probably) evolved to do one thing, and then exploiting or hijacking this mechanism to do another. While this is not a novel idea as such, advances in neuroscience have allowed us to model a number of such mechanisms that are particularly prone to being “hacked”. What annoys me, however, is that if you google for “brain hacks” you will find loads of “brain improvement tips”, yet no satisfactory explanation of how and why these so-called hacks are supposed to work. In this talk, I will present one exemplary model mechanism, how we hacked it, and the surprising results we found.
There’s only so much information and so many ideas you can fit into 6 minutes (even if you’re as ruthlessly fast a talker as I am), but it features Louis Armstrong and an Orc wielding a sledgehammer to teach Latin grammar (thanks to Nico for supplying the superb visuals). It’s also a Pecha Kucha talk, meaning I had 20 slides which were shown for precisely 20 seconds each. Considering that it didn’t take me much longer per slide to put the talk together, it went off quite well:
The folks at Instagram posed an interesting challenge last week. Basically, your task was to take an image like this:

and undo the random shredding process with a few lines of code in a language of your choice. Bonus points for determining the width of the shreds from the image. In their blog post, they wrote that it took them 150 lines of code in Python. So I thought a reasonable challenge would be to solve the task in 15 lines of code (comments omitted; I will explain the code later in this post).
import PIL.Image, numpy, fractions
image = numpy.asarray(PIL.Image.open('TokyoPanoramaShredded.png').convert('L'), dtype=float)
diff = numpy.diff([numpy.mean(column) for column in image.transpose()])
threshold, width = 1, 0
while width < 5 and threshold < 255:
    boundaries = [index+1 for index, d in enumerate(diff) if d > threshold]
    width = reduce(lambda x, y: fractions.gcd(x, y), boundaries) if boundaries else 0
    threshold += 1
shreds = range(image.shape[1] / width)
bounds = [(image[:,width*shred], image[:,width*(shred+1)-1]) for shred in shreds]
D = [[numpy.linalg.norm(bounds[s2][1] - bounds[s1][0]) if s1 != s2 else numpy.inf for s2 in shreds] for s1 in shreds]
neighbours = [numpy.argmin(D[shred]) for shred in shreds]
walks = [sequence(neighbours, start) for start in shreds]
new_order = max(walks)[1]
Without blank lines, that’s exactly 14 lines of code (admittedly of varying beauty and elegance), including imports.
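The listing calls a helper named sequence that does not appear in it. The following is only a guess at what such a helper could look like, written for Python 3 and exercised on a toy neighbours list instead of a real image. It assumes that neighbours[s] is the shred judged to sit directly to the left of shred s (which is how D is built above) and that the helper returns a (length, order) pair, so that max(walks)[1] picks the longest walk.

```python
def sequence(neighbours, start):
    """Follow left-neighbour links from `start` until a shred repeats.

    Returns (number of shreds visited, those shreds in left-to-right order).
    Hypothetical helper: the name is taken from the listing, the body is a guess.
    """
    order = [start]
    while neighbours[order[-1]] not in order:
        order.append(neighbours[order[-1]])
    order.reverse()  # the walk runs right to left, so flip it
    return len(order), order


# Toy example: the true left-to-right order is 2, 0, 3, 1.
# Shred 2 is the leftmost one, so its "best" left neighbour (here 0) is spurious.
neighbours = [2, 3, 0, 0]
shreds = range(len(neighbours))

walks = [sequence(neighbours, start) for start in shreds]
new_order = max(walks)[1]
print(new_order)
```

The longest walk is the one that starts at the rightmost shred, because it passes through every other shred before hitting the spurious link. One limitation of this sketch: if the leftmost shred’s spurious match happens to be the rightmost shred, all walks have the same length and the length alone no longer identifies the right order.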