You should also check out Dan’s write-up. Here’s the toolchain we ended up with:
- DigitalOcean for our (heavy) compute
- GNU Parallel for multiprocessing
- Ghostscript for image conversion
- Tesseract for OCR
- Apache Solr for full text search and indexing
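For the Solr end of that list, indexing one OCR’d page is a single JSON POST to the update endpoint. This is a minimal sketch: the core name `hillsborough` and the field names are my illustrative assumptions, not the real schema we used.

```python
import json
import urllib.request

# Core name and field names below are illustrative assumptions, not the real schema.
SOLR_UPDATE_URL = "http://localhost:8983/solr/hillsborough/update/json/docs?commit=true"

def build_index_request(doc_id, text):
    """Build an HTTP POST request that adds one OCR'd page to a Solr core."""
    payload = json.dumps({"id": doc_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_index_request("page-0001", "OCR text of the page")
# urllib.request.urlopen(req)  # uncomment once a Solr instance is actually running
```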
Dan and I arrived in London in time for some discussion about ideas. I had been thinking about Hillsborough and had come up with a bizarre idea about crowd simulation from the data. We bounced ideas around for a while and pared it all down to an idea lots of people at hackdays seem to have: get the text, do machine learning, see what we get.
After some fantastic breakfast and intros we got down to work. Despite some initial hope (and some excellent work by Dan categorising documents) we rapidly discovered that the documents, however plentiful, weren’t in the best format ever. They were typewritten, photocopied, scrawled over, photocopied again and finally scanned. The instinct at this point is to stop and do something else, but we aren’t that easily beaten! So our goal was to OCR the documents and then attempt our original machine-learning goals.
We eventually reached a reasonably elegant toolchain, but please remember that this was the result of much stress, top-notch caffeine, a nearly-melted MacBook Pro, and dismantling the floor for a faster network connection, all with no idea whether we’d get good data out at the end!
So, why our final toolchain?
The first problem with the Hillsborough data is that it was designed with humans in mind first and machines second; part of that is that it was published as PDF files. So the first thing we needed to do was convert the PDFs to TIFF, which Tesseract can process. We chose Ghostscript after ImageMagick wouldn’t give us multi-page TIFFs. We used GNU Parallel (an excellent drop-in replacement for xargs) to run the conversion across all the cores of Dan’s Mac, getting it to 100° in the process.
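The conversion step looks roughly like the sketch below. The Ghostscript flags, the `tiffg4` device, and the 300 dpi default are my assumptions, not necessarily the exact settings we used on the day:

```python
import shutil
import subprocess

def gs_pdf_to_tiff_cmd(pdf_path, tiff_path, dpi=300):
    """Build a Ghostscript command that renders a PDF into one multi-page TIFF.
    The tiffg4 device (bilevel, CCITT G4) suits typewritten scans; 300 dpi is
    an assumed default, not necessarily the hackday settings."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-sDEVICE=tiffg4",         # multi-page TIFF output device
        f"-r{dpi}",                # rendering resolution
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]

cmd = gs_pdf_to_tiff_cmd("scan.pdf", "scan.tiff")
if shutil.which("gs"):             # only invoke Ghostscript if it is installed
    subprocess.run(cmd, check=True)
```

On the day we fanned this out across cores with GNU Parallel, roughly `ls *.pdf | parallel ...` in shell, with one Ghostscript invocation per file.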
Now, to process it there is only one choice. Just. Use. Tesseract. It coped with all sorts of horrible-looking documents. It gave us some mess where there were scribbles, and it’s CPU-intensive, but it produced very accurate text.
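Each TIFF then goes through essentially one Tesseract invocation. The filenames here are placeholders; `tesseract input outputbase` is the standard CLI shape, with English as the default language:

```python
import shutil
import subprocess

def tesseract_cmd(tiff_path, out_base):
    """Build a Tesseract command: reads a (multi-page) TIFF and writes
    out_base + '.txt'. With no language flag, Tesseract defaults to English."""
    return ["tesseract", tiff_path, out_base]

cmd = tesseract_cmd("scan.tiff", "scan")  # would produce scan.txt when run
if shutil.which("tesseract"):             # only run if Tesseract is installed
    subprocess.run(cmd, check=True)
```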
To do this we needed some serious horsepower. Our university’s 24-core system wouldn’t run Tesseract, so in the final hours we spawned a huge 24-core, 96GB-RAM machine on DigitalOcean (who are cheap, and fast to the UK). Even with that we were still processing at the four-hour mark when I finally hit the big stop button.
We are really hoping that we can do some more work on this, and get it finished.
I personally would like to thank everyone for their really kind comments, and the prize!
Also, while I do hope we can improve the content of the data, I would like to say a huge thank you to the Hillsborough Independent Panel, who have done some great work and should be an inspiration to all future inquiries and panels for releasing so much data in such difficult circumstances. I just want to see what we can do with it!
Finally, I love the Rewired State Hackdays. The organisers work so hard and bend over backwards to help everyone there. So a big thanks to them.