A few weeks ago my employer helped the NY State Senate parse the MTA budget information into a machine-searchable format. (The MTA originally published the budget as a PDF.) To parse the PDF, I first used a utility called pdftohtml to convert it into an XML document. I then used the Python library lxml to convert that document into a set of CSV files. The results of this labor can be seen on TOPP’s data site.
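For the curious, here is a minimal sketch of the XML-to-CSV step. It is not the actual script: it assumes XML shaped like what `pdftohtml -xml` emits (one `<text>` element per fragment, positioned by `top` and `left` attributes), and it uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs. The sample document, its values, and the function names `xml_to_rows` and `rows_to_csv` are all made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample, shaped like pdftohtml's XML output. The values are
# placeholders, not real budget figures.
SAMPLE_XML = """<pdf2xml>
  <page number="1">
    <text top="100" left="50">Agency</text>
    <text top="100" left="300">Amount</text>
    <text top="120" left="50">Example Agency</text>
    <text top="120" left="300">1.5</text>
  </page>
</pdf2xml>"""


def xml_to_rows(xml_text):
    """Group text fragments into table rows by their vertical position."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        lines = defaultdict(list)
        for node in page.iter("text"):
            top = int(node.get("top"))
            left = int(node.get("left"))
            content = "".join(node.itertext()).strip()
            lines[top].append((left, content))
        # Rows run top to bottom; cells within a row run left to right.
        for top in sorted(lines):
            rows.append([content for _, content in sorted(lines[top])])
    return rows


def rows_to_csv(rows):
    """Serialize a list of rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

Grouping strictly by identical `top` values is the simplest thing that could work. A real PDF will likely need some tolerance for fragments that sit a pixel or two apart on the same visual line.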

Soon after I published this data, however, a number of people told me it would be more useful in other formats. At first I just wrote a bunch of command-line Python scripts that would suck in these CSV files and spit them out in different formats. I quickly realized that I could gather these scripts into a quick-and-dirty web application.
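To give a flavor of what such a script might look like, here is a small sketch that reads CSV text and returns JSON. It is an illustration only, not one of the original scripts; the function name `csv_to_json` and the sample column names are hypothetical.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Turn CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row becomes an object keyed by the header names.
    return json.dumps(list(reader))


if __name__ == "__main__":
    # Placeholder data, not real budget figures.
    print(csv_to_json("agency,receipts\nA,1\nB,2\n"))
```

Note that every value comes back as a string. Converting numeric columns to numbers would be a separate step.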

Over a few train rides I created an application called DataIO, and today, I finally got a chance to upload it to Google App Engine. The application is pretty simple to interact with; instructions are located on its front page.

Currently the application can only transpose a data set and multiply it by a given factor. I hope to soon add a JSONP API that will make it trivial to convert a given data set into a format that plays nicely with Google Charts and Flot.
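The two operations can be sketched in a few lines of Python. This is not DataIO's actual code. It assumes the data arrives as a list of rows of strings, that the first row and first column hold labels, and that every row has the same length. The names `transpose` and `multiply` are chosen here for illustration.

```python
def transpose(rows):
    """Swap rows and columns; assumes every row has the same length."""
    return [list(column) for column in zip(*rows)]


def multiply(rows, factor):
    """Scale numeric cells by factor, skipping the first row and column.

    The first row and first column are assumed to hold labels. Cells that
    do not parse as numbers are left unchanged.
    """
    scaled = [list(rows[0])] if rows else []
    for row in rows[1:]:
        new_row = list(row[:1])
        for cell in row[1:]:
            try:
                new_row.append(str(float(cell) * factor))
            except ValueError:
                new_row.append(cell)
        scaled.append(new_row)
    return scaled


if __name__ == "__main__":
    # Placeholder data, not real budget figures.
    data = [["Year", "2009", "2010"], ["Revenue", "1.5", "2.0"]]
    print(multiply(transpose(data), 1000000))
```

Transposing first puts the years in the label column, so only the revenue figures get scaled.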

The code for this application is hosted at Bitbucket.

Just for fun, here is some data sent through DataIO.

Operating Revenue (transposed and multiplied by 1000000):

Total Receipts by Agency (transposed and returned in JSON):

Bridges and Tunnels Summary of Total Budgeted Debt Service (multiplied by 100 and returned in CSV):





2 responses to “DataIO”

  1. Silona

    Hi Anil,

    We are talking about similar issues at I am going to crosslink your post there. I am also working on getting my PDF expert that parsed all the PDF’s from and to participate.

  2. Anil

    @Silona: Neat. I actually have done some more work on DataIO. The additional features I added are outlined in this post:
