The Portwem Preludium

Portwem, n.: A Martian instrument that works by inducing currents
into electric circuits, unperceivable by human beings.

Python on love:

import this
love = this
this is love, love is not True or False, love is love
# (True, True, True)

Dating Tips for Data Analysts

Data is like dating.

As a freshman, all data looks good if you get a chance to lay your hands on it. With a certain maturity, however, you learn to look for the qualities that promise a fulfilling and sustainable relationship.

I once got a psychophysical data set to analyze, and when I opened it, it looked like this:

PK\x03\x04\x14\x00\x00\x00\x00\x00\xce\xad\xca<X\xdb\xba\xc7\xafx\x00\x00
\xafx\x00\x00\x17\x00\x00\x00QuickLook/Thumbnail.jpg\xff\xd8\xff\xe0\x00
\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x19\xfcICC_PROFILE
\x00\x01\x01\x00\x00\x19\xecappl\x02\x10\x00\x00mntrRGB XYZ \x07\xda\x00\x05

It was a Numbers document. I was on a Linux machine. It could have been some obscure proprietary MATLAB file, or some esoteric format that whoever gathered the data made up during their lunch break. Lesson learned:

1. There’s no data like human-readable data.

Then, very often I get data that reminds me of the cryptic exercises I had to solve in my second year:

5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa

Data connoisseurs will easily identify this as the first four samples of the Iris data set, but in practice, unlabeled data like this is fatal. In the best cases, there is some description or column labels, but more often than humanly bearable I have had to look into the code that collected the data to find out what I am actually dealing with. At the very least, your data set should look like this:

sepal length, sepal width, petal length, petal width, species
5.1, 3.5, 1.4, 0.2, setosa

Even better, tell me who recorded the data, when, and what for. Hence my second rule of sustainable data happens to be identical to the second commandment of the Zen of Python:

2. Explicit is better than implicit.

Now, just like your teenage relationships, no two data sets are alike. In reality, the perfect case where your data fits into a big, N-dimensional matrix will almost never happen. You’ll have to deal with missing values, with values of variable length and dimensionality and ƒüṉǹƴ ểŋcǿƋıȵɠş. Make sure to pick a data format that can deal with all your possible special needs:

3. Blessed are the flexible, for they shall not be bent out of shape.

Which is why I have come to store all my data as JSON. So our beloved iris set would look like this:

[
    {
        "sepal_length": 5.1,
        "sepal_width":  3.5,
        "petal_length": 1.4,
        "petal_width":  0.2,
        "species":      "setosa"
    },
    {
        "sepal_length": 4.9,
        "sepal_width":  3.0,
        "petal_length": 1.4,
        "petal_width":  0.2,
        "species":      "setosa"
    },
    ...
]

Too explicit, you say? Wastefully redundant? Yes. But honestly: who cares? Usually, time is a more limiting factor than storage space. And if it comes down to it, we can always compress the file with standard algorithms: the original data set is 4500 bytes, compared to 16500 for the JSON-formatted data, but the zipped JSON file is only marginally larger than the zipped original (1400 bytes versus 1000). JSON can reflect virtually any complexity we may encounter in data sets. And there are already tools to parse and generate JSON files in pretty much any useful programming language.

Which brings me to a fundamental problem of data analysis: even the most explicit data doesn’t analyze itself. And being a generally sane person with a finite amount of time and patience, I like to use tools for that. Being a Python monogamist, I found myself writing a few very similar lines over and over again:
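To get a feeling for the trade-off, here is a minimal sketch (Python 3, standard library only) that turns headered CSV into JSON records and compares raw and compressed sizes. It only uses the four samples quoted above, so the numbers it produces are not the ones reported for the full data set. The helper names `csv_to_records` and `compressed_size` are made up for this illustration, and the underscore column names are an assumption chosen to match the JSON keys.

```python
import csv
import io
import json
import zlib

# Header plus the four samples quoted earlier (underscore names are an assumption)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
"""


def csv_to_records(text):
    """Turn headered CSV text into a list of dicts, one per sample."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        species = row.pop("species")
        # Every remaining column is a measurement, so store it as a number
        record = {key: float(value) for key, value in row.items()}
        record["species"] = species
        records.append(record)
    return records


def compressed_size(text):
    """Number of bytes after zlib compression of the UTF-8 encoded text."""
    return len(zlib.compress(text.encode("utf-8")))


records = csv_to_records(csv_text)
json_text = json.dumps(records, indent=4)

sizes = {
    "csv": len(csv_text),
    "json": len(json_text),
    "csv_zipped": compressed_size(csv_text),
    "json_zipped": compressed_size(json_text),
}
```

Inspecting `sizes` shows how much of the JSON overhead is repetition that a standard compressor removes; with a real data set you would read the text from a file instead of a string literal.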

controls = [sample for sample in data if sample['condition'] == 'control']

Essentially, filtering the data set. I did this so often that I ended up writing utility methods for it, and eventually these methods matured into a slick module for super-rapid pre-processing of my data sets. It’s called knyfe, and it’s now open source. If you’re dealing with data, you may want to check it out. It allows you to do stuff like this:

>>> cereals = knyfe.Data("examples/cereals.json")
>>> print set(cereals.manufacturer)
set(['Kelloggs', 'Nabisco', 'Ralston Purina', 'Quaker Oats', 'Post', 'General Mills'])
>>> kelloggs_products = cereals.filter(manufacturer="Kelloggs")
>>> kelloggs_products.export("kelloggs.xls")

and looks great in an IPython console:

knyfe in an IPython shell
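The filtering one-liner from earlier in this post is easy to wrap in a small helper. The following is a rough sketch of that idea in plain Python, not knyfe's actual implementation or API: the function name `where` and the sample records are invented for illustration.

```python
def where(data, **conditions):
    """Return the samples whose fields equal all of the given values."""
    return [
        sample for sample in data
        # == compares values; `is` would compare object identity instead
        if all(sample.get(key) == value for key, value in conditions.items())
    ]


# Made-up samples, for illustration only
data = [
    {"subject": 1, "condition": "control", "rt": 0.41},
    {"subject": 2, "condition": "treatment", "rt": 0.37},
    {"subject": 3, "condition": "control", "rt": 0.45},
]

controls = where(data, condition="control")
first_control = where(data, condition="control", subject=1)
```

Keyword arguments keep the call site close to how you would describe the filter out loud, and several conditions combine with a logical AND.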

May 7

Python for Ladies and Chauvinists

The @property syntax may be syntactic sugar, but offers a few surprisingly edgy use cases:

class Lady(object):
    def __init__(self, age = 42):
        self._age = age
    
    @property
    def age(self):
        return self._age - 10
    
madonna = Lady(age=53)
print madonna.age
43

Nerds: you’re welcome. Everybody else: I’m sorry.

Feb 5

Using LaTeX like it’s 2012

If you’re working in the natural sciences, sooner or later you will have to learn how to typeset in LaTeX. There are a number of LaTeX editors out there, but I’m a bit old-fashioned and have a profound dislike for obese IDEs, so I usually stick to my coder’s editor of choice (lately Sublime Text 2). However, there are some tools that have become a fixed part of my workflow, and for good reasons.

Mendeley

I don’t have a strong preference for Mendeley over any other article management system (such as Papers), other than that it’s free. But the one feature your reference management should have is automatically generating a .bib file containing all your articles. Put that file somewhere BibTeX can find it (for example, a directory listed in the BIBINPUTS environment variable) and reference any paper from any article you’re writing.

Handwriting to LaTeX

VisionObjects have a great demo that instantly converts your handwriting to LaTeX formulas. It works with an HTML5 canvas, so you can use it on your tablet as well.

VisionObjects LaTeX recognition

Equations in presentations

You can do all your presentation slides in LaTeX using the Beamer package, but unless your slides are packed with formulas I wouldn’t recommend doing so. If you just need to insert a formula every now and then, LaTeXiT is an incredibly useful little tool for OS X: it’s a graphical front end to LaTeX that renders your equation in the application window, from where you can just drag and drop it into Keynote or PowerPoint to insert it as a PDF.

LaTeXiT

(Image copyright lies with Pierre Yves Chatelier)

An online alternative with the same feature set is the CodeCogs equation editor; however, it occasionally comes up with, eh, non-standard interpretations of your equations.

Throw a LaTeX party

Well, you knew that sooner or later some slippery innuendo would come. Anyway, ShareLaTeX is a collaborative LaTeX editor. Think EtherPad with an integrated LaTeX compiler and syntax highlighting. Free.

If you think something is missing here let me know and I’ll update the article.

Busted Tees had this nice shirt today, so I turned it into a widescreen wallpaper for your personal enjoyment.

1920 x 1080 | 1440 x 900 | 1280 x 1024 | 1280 x 800 | 1024 x 1024  (for iPad)

Jan 3

Sharing Recipes with NVAlt

If you’re a Mac user and productivity geek, you’ll already know NVAlt (née Notational Velocity). I won’t go into detail about why it is the best note-taking tool around (other people have done so already). The bottom line is: NVAlt is a text editor that lets you store your notes as plain text files in a Dropbox folder and offers full-text search through all your notes. And it’s dead simple.

Better still: you can share that Dropbox folder with your friends and effectively have a private wiki. That gave me the idea of managing all my recipes with NVAlt as well (I was using SousChef before, using only 15% of its feature set), and sharing them with friends who love cooking as much as I do.

NVAlt Recipes Screenshot


Brain Hacks: Retrofitting the 6th sense

At this year’s Chaos Communication Congress I had the opportunity to try out a new angle on sensory substitution with a short lightning talk.

Abstract: A brain hack is identifying a neuronal mechanism that (probably) evolved to do one thing, and then exploiting or hijacking this mechanism to do another. While the idea as such is not novel, advances in neuroscience have allowed us to model a number of such mechanisms that are particularly prone to being “hacked”. What annoys me, however, is that if you google “brain hacks” you will find loads of “brain improvement tips”, yet no satisfactory explanation of how and why these so-called hacks should work. In this talk, I will present one exemplary model mechanism, how we hacked it, and the surprising results we found.

There’s only so much information and so many ideas you can fit into 6 minutes (even if you’re a ruthlessly fast talker, as I am), but the talk features Louis Armstrong and an Orc wielding a sledgehammer to teach Latin grammar (thanks to Nico for supplying the superb visuals). It’s also a Pecha Kucha talk, meaning I had 20 slides which were shown for exactly 20 seconds each. Considering that it didn’t take me much longer per slide to put the talk together, it went off quite well:

Violence

  • Her: "Manuel, violence is not an answer!"
  • Me: "What is not an answer?"
  • Her: "Viol - I hate you."

Instagram Unshredding: 15 lines of code

The folks at Instagram posed an interesting challenge last week. Basically, your task was to take an image like this:

and undo the random shredding process with a few lines of code in a language of your choice. Bonus points for determining the width of the shreds from the image. In their blog post, they wrote that it took them 150 lines of code in Python. So I thought a reasonable challenge would be to solve the task in 15 lines of code (comments omitted; I will explain the code later in this post).

import PIL.Image, numpy, fractions
image = numpy.asarray(PIL.Image.open('TokyoPanoramaShredded.png').convert('L'))
diff = numpy.diff([numpy.mean(column) for column in image.transpose()])
threshold, width = 1, 0

while width < 5 and threshold < 255:
    boundaries = [index+1 for index, d in enumerate(diff) if d > threshold]
    width = reduce(lambda x, y: fractions.gcd(x, y), boundaries) if boundaries else 0
    threshold += 1

shreds = range(image.shape[1] / width)
bounds = [(image[:,width*shred], image[:,width*(shred+1)-1]) for shred in shreds]
D = [[numpy.linalg.norm(bounds[s2][1] - bounds[s1][0]) if s1 != s2 else numpy.inf for s2 in shreds] for s1 in shreds]
neighbours = [numpy.argmin(D[shred]) for shred in shreds]
walks = [sequence(neighbours, start) for start in shreds]
new_order = max(walks)[1]

Without blank lines, that’s exactly 14 lines of code (admittedly of varying beauty and elegance), including imports.
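The shred-width detection in the first half of the snippet can be tried in isolation. What follows is a minimal sketch in Python 3 using only the standard library, not the original code: the function name `estimate_shred_width` and the synthetic `column_means` are made up for illustration, and `math.gcd` with `functools.reduce` stands in for `fractions.gcd` and the built-in `reduce` used above.

```python
from functools import reduce
from math import gcd


def estimate_shred_width(column_means, min_width=5, max_threshold=255):
    """Guess the shred width from the mean brightness of each pixel column."""
    # Brightness change between each pair of adjacent columns
    diff = [b - a for a, b in zip(column_means, column_means[1:])]
    threshold, width = 1, 0
    while width < min_width and threshold < max_threshold:
        # Columns where brightness jumps up by more than the threshold
        boundaries = [i + 1 for i, d in enumerate(diff) if d > threshold]
        # All true shred edges are multiples of the shred width, so their
        # greatest common divisor is a candidate for that width
        width = reduce(gcd, boundaries) if boundaries else 0
        # A stricter threshold drops small jumps that are not shred edges
        threshold += 1
    return width


# Synthetic "image": six shreds of 8 columns each, flat brightness per shred
levels = [10, 200, 50, 240, 90, 250]
column_means = [level for level in levels for _ in range(8)]
width = estimate_shred_width(column_means)
```

On a real photograph the column means are noisy, which is why the loop keeps raising the threshold until the greatest common divisor of the detected edges reaches a plausible minimum width.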


My name is Manuel Ebert and I'm a scientist, web developer, designer and musician (not necessarily in this order) living in Barcelona, Spain. I'm currently doing a PhD in neuro-psychology and no, I cannot read your thoughts.

This blog is about code & consciousness, data & design, serotonin & statistics.