IO in R vs Python

One of the perennial annoyances of modern scientific computing is how slow it is to read in large files. Bioinformaticians have a fetish for plain text, and this, while generally regarded as a Good Thing, has a few unfortunate consequences:

  • Flat text creates the opportunity for ambiguity
  • Flat text files aren't very good at conveying metadata, so either the software author has to do a lot of guessing or the user has to keep track of lots of "annotation files", as they're often called in the science world. Annotation and data files often get mixed up, and that can have really nasty consequences.
  • Not knowing the dimensions of data ahead of time limits the approaches the programmer can take.
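To make that last point concrete, here is a minimal Python sketch (my own illustration, with made-up function names and data). With no row count stored in the file, a reader can either grow its result one value at a time, or pay for a first pass just to count the lines so that storage can be allocated up front.

```python
import io
from array import array


def read_growing(stream):
    """Single pass: the length is unknown, so the list grows as we go."""
    values = []
    for line in stream:
        values.append(float(line))
    return values


def read_two_pass(stream):
    """Two passes: count the lines first, then fill preallocated storage."""
    n_rows = sum(1 for _ in stream)    # first pass only counts lines
    stream.seek(0)                     # rewind for the second pass
    values = array("d", [0.0]) * n_rows
    for i, line in enumerate(stream):
        values[i] = float(line)
    return values


# Hypothetical one-column data file, held in memory for the demonstration.
demo = "1.5\n2.5\n4.0\n"
grown = read_growing(io.StringIO(demo))
preallocated = read_two_pass(io.StringIO(demo))
```

Neither option is free: the first keeps reallocating as the result grows, and the second reads the file twice. A format that recorded its own dimensions would avoid both.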

R has a large number of convenience functions for IO. These include (from highest level to lowest):

  • read.table and its siblings read.csv, read.delim, and read.csv2, which return a data.frame
  • scan and readLines, which work very similarly to read.table but return a vector rather than a data.frame (scan can also return a list of vectors when its what argument is a list)

Both of these classes of functions can operate on a number of inputs:

  • "standard" files. This simply means that

    myfile <- "~/somefile.txt"
    mydata <- read.table(

I've been using Python a lot more lately. One thing I find interesting about Python is its use of generators. A generator (as far as I can tell) is a function that produces its values lazily, one at a time, which makes it easy to iterate over something without holding all of it in memory. One obvious application is reading a file line by line.
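Here is a rough sketch of what I mean. The function names, the tab-delimited layout, and the little demo file are all made up for illustration. Each generator hands back one line at a time, so only the current line is held in memory.

```python
import os
import tempfile


def read_lines(path):
    """Yield one line at a time, without its trailing newline."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def read_fields(path, sep="\t"):
    """Yield each line split into fields; chains onto read_lines."""
    for line in read_lines(path):
        yield line.split(sep)


# Write a small hypothetical tab-delimited file to demonstrate on.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("gene\tcount\nabc1\t10\nxyz2\t25\n")
    demo_path = tmp.name

rows = read_fields(demo_path)
header = next(rows)                                # first line is the header
total = sum(int(count) for _gene, count in rows)   # streams the remaining rows
os.remove(demo_path)
```

Nothing is read until a value is asked for: next(rows) pulls the header, and the sum pulls the remaining rows one by one.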