July 21, 2015

MonMet now with OCR magic

I’ve just received my new iPhone and I wanted to install MonMet on it, but I discovered that the service provided by the bus company has changed. Since my favorite bus stops were already saved on my Nexus 5, I did not see it coming.

The timetable page has changed. It now displays a PNG image instead of textual content… an image. They don’t provide any public API either, so there’s no way to retrieve the information without running some OCR magic on the image. Even the PDF version embeds an image. Who’s got time for that? I do, I do have one hour of my life to lose. So let’s open IntelliJ and just do that.

I replaced the old parsing code with an OCR version using Tesseract and Tesseract4J. It’s been a while since I last used it, so it took me some time to refresh my memory and install the dependencies (namely, I was missing GhostScript on my Mac). Here’s some code for the impatient:


It correctly extracts the timetable from the image, in the following form:

week = [04:57, 05:34, 06:01, 06:31, 06:59, 07:15, 07:30, 07:45, 08:00, 08:15, 08:30, 08:45, 09:01, 09:16, 09:31, 09:46, 10:01, 10:16, 10:31, 10:46, 11:01, 11:16, 11:31, 11:46, 12:01, 12:16, 12:31, 12:46, 13:01, 13:16, 13:31, 13:46, 14:01, 14:16, 14:31, 14:46, 15:01, 15:16, 15:31, 15:46, 16:00, 16:15, 16:30, 16:45, 17:00, 17:15, 17:30, 17:45, 18:01, 18:16, 18:31, 18:46, 19:01, 19:17, 19:32, 19:47, 20:02, 20:17, 20:32, 20:52, 21:12, 21:33, 22:03, 22:33, 23:03, 23:33, 00:03]
saturday = ...
sunday = [07:59, 08:29, 09:00, 09:30, 10:00, 10:30, 11:00, 11:30, 12:00, 12:30, 13:00, 13:30, 14:00, 14:30, 15:00, 15:30, 15:59, 16:29, 16:59, 17:29, 18:00, 18:30, 19:01, 19:32, 20:02, 20:32, 21:03, 21:33, 22:03, 22:33, 23:03, 23:33, 00:03]

I now need to check that it’s working correctly for various timetables.
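One way to run that check could look like the sketch below. It is not the code MonMet uses: the class and method names (TimetableCheck, extractTimes, isPlausible) are made up for illustration, and it assumes the OCR step hands back plain text containing HH:mm tokens. It keeps only well-formed times, then verifies that they move forward, tolerating a single wrap past midnight such as 23:33 followed by 00:03.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MonMet: sanity-checks OCR output.
public class TimetableCheck {

    // Matches valid 24-hour times such as 04:57 or 23:33.
    private static final Pattern TIME =
            Pattern.compile("\\b([01]\\d|2[0-3]):([0-5]\\d)\\b");

    /** Pulls every well-formed HH:mm token out of raw OCR text. */
    static List<LocalTime> extractTimes(String ocrText) {
        List<LocalTime> times = new ArrayList<>();
        Matcher m = TIME.matcher(ocrText);
        while (m.find()) {
            int hour = Integer.parseInt(m.group(1));
            int minute = Integer.parseInt(m.group(2));
            times.add(LocalTime.of(hour, minute));
        }
        return times;
    }

    /**
     * Returns true when each time is strictly later than the previous one,
     * allowing at most one step backwards (the wrap past midnight).
     * Two identical consecutive times are treated as an OCR error.
     */
    static boolean isPlausible(List<LocalTime> times) {
        int wraps = 0;
        for (int i = 1; i < times.size(); i++) {
            LocalTime previous = times.get(i - 1);
            LocalTime current = times.get(i);
            if (current.equals(previous)) {
                return false;
            }
            if (current.isBefore(previous)) {
                wraps++;
            }
        }
        return wraps <= 1;
    }
}
```

A misread digit often produces a time that jumps backwards in the middle of the day, so a list that fails isPlausible is a good candidate for a second look at the image.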

Who’s talking about open data?

I’m a big fan of open data. We live in the era of big data. Data is what matters, and APIs are essential so that anyone can make something good out of it. But what is open data?

Open data is data that can be freely used, re-used and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.

  • Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
  • Re-use and Redistribution: the data must be provided under terms that permit re-use and redistribution including the intermixing with other datasets.
  • Universal Participation: everyone must be able to use, re-use and redistribute – there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.

I don’t understand why in 2015 some big websites and corps don’t offer a simple endpoint to retrieve such data. I’m not asking for a full OAuth 2 REST API, but a simple JSON endpoint for such a thing would be great. Even XML, I don’t care. At the end of the day it would benefit end users, empowering them with a choice of application. Or are timetables copyrighted, sensitive data? I don’t know, maybe someone should explain it to me like I’m five years old, but I don’t think they make money out of schedules? Bus tickets are already more expensive than in Paris, for buses where air conditioning is not always available.

What should be available

For a bus company serving a whole city, this is the kind of API that I think should be available at the very least:

  • Retrieve all the lines
  • For each line
    • all the bus directions
    • all the bus stops in a specific direction
  • For each bus stop
    • the schedules for each line and direction
    • information about the stop including:
      • lines and directions available at that stop
      • geolocation
      • real-time information about the waiting time before next N rides
      • number of seats
      • availability of a ticket distributor
      • open/closed due to maintenance
      • availability of an access ramp for disabled persons
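To make the wish list above a bit more concrete, here is a rough sketch of how it might map to Java types. Nothing here exists: TransitApiSketch, TransitApi, Stop and every method and field name are invented for illustration, and a real service would most likely expose this as JSON over HTTP rather than as a Java interface.

```java
import java.util.List;

// Hypothetical shape of the API described above; all names are assumptions.
public class TransitApiSketch {

    /** Static information about one bus stop. */
    static final class Stop {
        final String id;
        final String name;
        final double latitude;
        final double longitude;
        final int seats;
        final boolean ticketMachine;
        final boolean open;        // false when closed for maintenance
        final boolean accessRamp;

        Stop(String id, String name, double latitude, double longitude,
             int seats, boolean ticketMachine, boolean open, boolean accessRamp) {
            this.id = id;
            this.name = name;
            this.latitude = latitude;
            this.longitude = longitude;
            this.seats = seats;
            this.ticketMachine = ticketMachine;
            this.open = open;
            this.accessRamp = accessRamp;
        }
    }

    /** The operations from the list above, one method per bullet. */
    interface TransitApi {
        List<String> lines();
        List<String> directions(String line);
        List<Stop> stops(String line, String direction);
        List<String> linesAt(String stopId);
        // Scheduled departures as HH:mm strings.
        List<String> schedule(String stopId, String line, String direction);
        // Real-time waiting time, in minutes, before each of the next n rides.
        List<Integer> nextRidesInMinutes(String stopId, int n);
    }
}
```

Even a read-only version of this, refreshed once a day for the schedules, would be enough to build an app like MonMet without any OCR.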

If you’ve got some time, check the Why Open Data? guide (in French) to learn why it matters and why I care.

Alexandre Grison - //grison.me - @algrison