Zach Rubenstein

Many data science projects follow the same general workflow:

  1. Define the problem
  2. Get the data
  3. Analyze the data (which may include building predictive models)
  4. Report the results of the analysis
  5. Implement the results in an operational system

The move from step 1 to step 2 is often not trivial. First, the project partner needs to extract data from their systems and encode it so that it can be transferred to us. Second, we need to decode that data into a format that lets us analyze it. The encoding method is up to the project partner, so our decoding method depends on whatever they chose. With the Diogenes project, we aim to make this decoding step easier, in addition to making it easier to perform a number of other common machine learning tasks.

In the world of data processing, there are a number of well-structured, unambiguous formats that we could use for data interchange. There are, for example, the ubiquitous JSON, the classic XML, and the fashionable YAML. However, when you have to deal with databases of arbitrary flavor from a variety of clients who have varying levels of familiarity with databases and data extraction, you often end up working with CSVs.

How can such a simple format make our lives so hard?

The CSV (comma-separated values) format has been around for decades and, in principle, has a lot of advantages. It is a simple, universal specification that is easy for humans to read, (mostly) easy for computers to parse, and has a long legacy of good software support.

In practice, however, the CSV format has a few problems: it’s often not human readable, it is not universally formatted, it can be problematic to parse, and it has limited support in the Python ecosystem.

A CSV file has a number of records, each consisting of a number of fields. Fields are separated by commas and records are separated by newlines:

1,Alfred Adams,1234 Michigan Ave,Chicago
2,Belle Banks,4567 Randolph,Chicago

Except maybe fields are separated by tabs or pipes:

1|Alfred Adams|1234 Michigan Ave|Chicago
2|Belle Banks|4567 Randolph|Chicago

And, heck, maybe records are separated by commas.

1|Alfred Adams|1234 Michigan Ave|Chicago,2|Belle Banks|4567 Randolph|Chicago

We can (usually) put field separators in the data if we quote it:

1,"Adams, Alfred",1234 Michigan Ave,Chicago
2,"Banks, Belle",4567 Randolph,Chicago

Or maybe that should be single quotes.

1,'Adams, Alfred',1234 Michigan Ave,Chicago
2,'Banks, Belle',4567 Randolph,Chicago

But maybe we should quote all our strings.

1,'Adams, Alfred','1234 Michigan Ave','Chicago'
2,'Banks, Belle','4567 Randolph','Chicago'

Or we could just quote everything.

'1','Adams, Alfred','1234 Michigan Ave','Chicago'
'2','Banks, Belle','4567 Randolph','Chicago'

But is quoted '1' different from just 1? Is the first a string and the second a number? For that matter, was the unquoted 1 a number or a string to begin with?

We can put a literal quote in our data if we double it.

1,'Adams, Alfred ''The Giant''',1234 Michigan Ave,Chicago
2,'Banks, Belle',4567 Randolph,Chicago

Unless we are supposed to use backslash.

1,'Adams, Alfred \'The Giant\'',1234 Michigan Ave,Chicago
2,'Banks, Belle',4567 Randolph,Chicago

Unless the parser just disallows field separators in data, and every quote is literal and we must use an obscure character to separate fields.

1|Adams, Alfred 'The Giant'|1234 Michigan Ave|Chicago
2|Banks, Belle|4567 Randolph|Chicago

And what do we do about missing data? Do we leave the field blank?

1|Adams, Alfred 'The Giant'|1234 Michigan Ave|Chicago
2|Banks, Belle|4567 Randolph|Chicago
3|Doe, John||Chicago

In that case, how do we know that John Doe’s address is missing rather than being an empty string? Maybe we need a special constant for missing data.

1|Adams, Alfred 'The Giant'|1234 Michigan Ave|Chicago
2|Banks, Belle|4567 Randolph|Chicago
3|Doe, John|NO_DATA_HERE|Chicago

We hope that nobody has NO_DATA_HERE as an address. There are other problems with CSVs too. Many of the issues with the format propagate to issues with CSV parsers. We will discuss the software ecosystem in more detail below.
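
For what it’s worth, Python’s csv module can cope with many of these variations through its dialect parameters, provided you already know (or can guess) which convention a given file follows. A minimal sketch, using a hypothetical pipe-delimited, single-quoted file:

import csv

# Hypothetical file that separates fields with pipes, wraps strings in
# single quotes, and doubles a quote character to escape it.
with open("partner_export.csv") as f:
    reader = csv.reader(
        f,
        delimiter="|",       # fields separated by pipes instead of commas
        quotechar="'",       # single quotes instead of double quotes
        doublequote=True,    # '' inside a field means a literal quote
        # escapechar="\\",   # use this instead if quotes are backslash-escaped
    )
    for record in reader:
        print(record)        # note: every field comes back as a string

The catch, of course, is that nothing in the file itself tells you which of these parameters to use.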

Our Project

Since Diogenes aims to make common machine learning tasks easier, we want to decode CSVs in a way that makes analysis easy without giving our users unnecessary headaches. Specifically, we want a reader that:

  1. Results in a Numpy structured array
  2. Correctly interprets the data type of each column
  3. Handles missing data sensibly
  4. Loads multi-gigabyte CSVs quickly enough that our users don’t give up and write their own tools

The combination of those four items took us on an exciting journey as we profiled our way through the ecosystem of CSV-reading Python libraries.

The Ecosystem

  • The Python Standard Library CSV module is quite good at reading oddly-formatted CSVs, but unconditionally makes every field a string, which makes it difficult to interpret data types.
  • Numpy’s genfromtxt can produce structured arrays and correctly interpret the types of fields, but it ignores quotation, so if a literal field separator appears anywhere in the data, the read will fail.
  • Pandas’ read_csv can produce structured arrays via the to_records method. The primary issue with using Pandas is that the resulting structured arrays represent strings as pointers to Python objects (as a Pandas dataframe does) rather than as fixed-width Numpy strings. Until very recently, Numpy had a number of issues handling these pointer types. A rough sketch of all three approaches follows.
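
To make the trade-offs concrete, here is a sketch of the three approaches side by side (the file name data.csv is just a placeholder):

import csv
import numpy as np
import pandas as pd

# Standard library: tolerant of odd formatting, but every field is a string.
with open("data.csv") as f:
    rows = list(csv.reader(f))

# Numpy: produces a typed structured array, but quoting is ignored, so a
# field separator inside a quoted field breaks the read.
sa = np.genfromtxt("data.csv", dtype=None, names=True, delimiter=",")

# Pandas: fast and quote-aware, but string columns in the resulting record
# array are pointers to Python objects (dtype('O')), not fixed-width strings.
ra = pd.read_csv("data.csv").to_records(index=False)
print(ra.dtype)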

Profiling

Initially, we opted to ensure correctness rather than aiming for performance. The process (sketched in code after the list) was:

  1. Read the data using the csv module from the standard library.
  2. Figure out the types of all the columns.
  3. Cast columns to the correct data type.
  4. Create a structured array from the cast columns.
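
A simplified sketch of that process (not the actual Diogenes code, and with deliberately naive type inference) looks something like this:

import csv
import numpy as np

def infer_dtype(values):
    # Naive type inference: try int, then float, otherwise fall back to a
    # fixed-width string wide enough for the longest value.
    for cast, dtype in ((int, np.int64), (float, np.float64)):
        try:
            for v in values:
                cast(v)
            return dtype
        except ValueError:
            continue
    return "S{}".format(max(len(v) for v in values))

with open("data.csv") as f:
    reader = csv.reader(f)
    names = next(reader)               # 1. read the data (header first)
    columns = list(zip(*reader))       #    and transpose rows into columns
dtypes = [infer_dtype(col) for col in columns]                          # 2. figure out types
cast_cols = [np.array(col, dtype=dt) for col, dt in zip(columns, dtypes)]  # 3. cast columns
sa = np.core.records.fromarrays(cast_cols, names=names)                 # 4. structured array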

One of our colleagues informed us that it was taking many hours to read a CSV of only a few gigabytes, so we ran some experiments. These experiments were run on a mid-2012 MacBook Air (1.8 GHz, 2 cores, 8GB of RAM, solid-state drive). They’re all back-of-the-envelope rather than statistically sound, but for the purpose of these experiments we were only interested in rough estimates of performance.

First, we generated a 1GB CSV of floating-point numbers with a custom-built random CSV generator (sketched below).
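
The generator itself is nothing special; a minimal sketch of the idea (the function name, row count, and column count here are made up, chosen to land near 1GB) is:

import csv
import random

def generate_float_csv(path, n_rows, n_cols):
    # Write a header row plus n_rows records of random floating-point numbers.
    with open(path, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["col{}".format(i) for i in range(n_cols)])
        for _ in range(n_rows):
            writer.writerow([random.random() for _ in range(n_cols)])

# ~20 bytes per field * 10 fields * 5 million rows is roughly 1GB of text.
generate_float_csv("GB.csv", n_rows=5 * 10**6, n_cols=10)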

When loading the CSV with Diogenes, we got:

$ time python -c 'from diogenes.read import open_csv; open_csv("GB.csv")'
real   17m7.468s
user   15m50.126s
sys 0m9.647s

It took over 17 minutes to read our 1 GB CSV. Next, we tried loading the CSV with numpy.genfromtxt:

$ time python -c 'import numpy as np; sa = np.genfromtxt("GB.csv", dtype=None, names=True, delimiter=",")'
real   8m13.824s
user   6m33.998s
sys 1m10.819s

Numpy took less than half the time our solution took, but we still hoped to find something better than genfromtxt, since genfromtxt doesn’t handle quoted strings. Next, we tried Pandas.

$ time python -c 'import pandas as pd; df = pd.read_csv("GB.csv"); df.to_records()'
real   0m48.100s
user   0m44.678s
sys 0m2.619s

Wow! Pandas took less than a minute: roughly 10 times faster than genfromtxt and more than 20 times faster than our original solution. At this point, it looks like it might be worth using Pandas and trying to find a work-around for the string representation issue.

For comparison, we wrote a C program that converts from CSV to Numpy’s efficient .npy format. The .npy format is very close to Numpy’s internal format, and Numpy can handle .npy files without loading them entirely into memory. Consequently, converting to .npy is about as close as we can get to writing our own CSV loader without breaking into Numpy internals. In addition, our program does a minimal amount of work: mostly just parsing a text representation of doubles from standard input and writing their binary representation to standard output. It should give us something approximating the minimum amount of time theoretically required to read a CSV into Numpy.
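
We won’t reproduce the C program here; the point of targeting .npy is that Numpy can then consume the file lazily, for example via a memory map:

import numpy as np

# Open the converted file without reading it all into memory; data is paged
# in from disk only as slices are accessed.
arr = np.load("GB.npy", mmap_mode="r")
print(arr.shape, arr.dtype)
print(arr[:5])

Timing our converter on the same 1GB file gave: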

$ time ./csv_to_npy GB.csv > GB.npy
real   2m6.890s
user   2m1.276s
sys 0m3.282s

It looks like Pandas does more than 2x better than our “minimal” baseline. Let’s try commenting out the line that copies the binary representation to standard output.

$ time ./csv_to_npy GB.csv > GB.npy
real   2m3.673s
user   2m0.239s
sys 0m2.928s

We’re still doing far worse than Pandas. We can probably perform more optimizations, but the apparent conclusion is that we should just use Pandas if we can.

Making Pandas Work

Given that we’re going with Pandas, we next experimented to see if we could convert Pandas’ pointers to fixed-width strings. For this experiment, we modified our CSV generator slightly so that it generates CSVs of string data rather than numerical data.

If we let Pandas represent data as pointers, we still get very good performance:

$ time python -c 'import pandas as pd; df= pd.DataFrame.from_csv("GBstr.csv")'
real   0m49.716s
user   0m43.496s
sys 0m4.294s

We then rewrote Diogenes’ CSV import so that it uses Pandas to parse the CSV, then converts the pointers to fixed-width strings.
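
The conversion we tried is roughly the following (a sketch, not the exact Diogenes code): for each pointer column, find the longest string and cast the column to a fixed-width string type of that width.

import numpy as np
import pandas as pd

df = pd.read_csv("GBstr.csv")
columns = []
for name in df.columns:
    col = df[name].values
    if col.dtype == object:
        # Size the fixed-width string type to the longest value in the
        # column (assumes no missing values in the string columns).
        width = max(len(s) for s in col)
        col = col.astype("S{}".format(width))
    columns.append(col)
sa = np.core.records.fromarrays(columns, names=list(df.columns))

With that conversion in place, we timed the rewritten reader: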

$ time python -c 'from diogenes.read import open_csv; sa = open_csv("GBstr.csv")'

I don’t have a result for this experiment because I killed it after 30 minutes. It’s worse than the version that uses Python’s standard library. Converting pointers to fixed-width strings in the obvious way clearly isn’t viable.

Fortunately, we didn’t end up having to find a work-around: Numpy’s trouble with handling pointer types is fixed as of Numpy 1.10.1. That’s why Diogenes requires a very recent version of Numpy, even though none of its other dependencies are very strict.

Incidentally, we still have to post-process the output of Pandas’ CSV reader to handle missing datetimes correctly, but datetimes are a whole different, unexpectedly challenging, issue.

Conclusion

Reading CSVs is an unexpectedly complex task, with a number of gotchas and obscure issues. After our odyssey through CSV-reading methods, we found that, unsurprisingly, the very mature and popular Pandas is the best tool for the job… if we tweak a few things.

Check out the Diogenes Project and see where we can save you time in your data analysis.