Tagged: lxml

DataIO

A few weeks ago my employer helped the NY State Senate parse the MTA budget information into a machine-searchable format. (The MTA originally published the budget as a PDF.) To parse the PDF, I first used a utility called pdftohtml to convert it into an XML document. I then used the Python library lxml to convert that document into a set of CSV files. The results of this labor can be seen on TOPP’s data site.
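For readers curious what the XML-to-CSV step might look like, here is a minimal sketch. It is not the script I actually ran: the sample document, its element and attribute names (`page`, `text`, `top`, `left`), and the helper names `xml_to_rows` and `rows_to_csv` are assumptions, modeled on the kind of output `pdftohtml -xml` typically produces. The idea is to group text fragments that share a vertical position into one row, then order each row left to right.

```python
import csv
import io
from collections import defaultdict

try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml installed;
    # the calls used below exist in both libraries.
    import xml.etree.ElementTree as etree

# Hypothetical sample, shaped like typical pdftohtml XML output.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="80" height="12" font="0">Agency</text>
    <text top="100" left="300" width="60" height="12" font="0">Budget</text>
    <text top="120" left="50" width="80" height="12" font="0">Subways</text>
    <text top="120" left="300" width="60" height="12" font="0">1,200</text>
  </page>
</pdf2xml>
"""


def xml_to_rows(xml_bytes):
    """Group <text> fragments by their 'top' value, ordered by 'left'."""
    root = etree.fromstring(xml_bytes)
    rows = []
    for page in root.iter("page"):
        lines = defaultdict(list)
        for text in page.iter("text"):
            # itertext() also picks up text nested in <b> or <i> children.
            content = "".join(text.itertext()).strip()
            if content:
                key = int(text.get("top"))
                lines[key].append((int(text.get("left")), content))
        for top in sorted(lines):
            rows.append([cell for _, cell in sorted(lines[top])])
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

Real budget tables are messier than this: fragments on the same visual line can differ by a pixel or two in `top`, so a real script would likely need some tolerance when grouping.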

Soon after I published this data, however, a number of people told me that it would be more useful in other formats. At first I just started writing a bunch of command-line Python scripts that would suck in these CSV files and spit them out in different formats. I quickly realized that I could collect these scripts into a quick and dirty web application.
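One of those conversion scripts could be as small as the sketch below, which turns CSV text into JSON using only the standard library. The function name `csv_to_json` and the sample columns are made up for illustration, and JSON is just one example target format.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Read CSV text with a header row and return a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every value stays a string; a real script might convert numeric columns.
    return json.dumps([dict(row) for row in reader], indent=2)


if __name__ == "__main__":
    sample = "Agency,Budget\nSubways,1200\nBuses,800\n"
    print(csv_to_json(sample))
```

Each output format gets its own small function like this one, which is what makes the scripts easy to collect behind a single web front end later.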

Over a few train rides I created an application called DataIO, and today, I finally got a chance to upload it to Google App Engine. The application is pretty simple to interact with; instructions are located on its front page.
