Web-Based Visual-Meta Parser

The web-based Visual-Meta parser by Christopher Gutteridge is available at:

chris.totl.net/pdf-inspect

  

Chris’ Developer Notes

It’s not super robust because it makes some assumptions, like the order of text information in the PDF being the order it appears on the page. (It’s all got positions, so there’s no reason it *has* to be in order.)

It assumes that two text blocks on the page, A & B, should be concatenated without any joining text.

It assumes that two text blocks after each other at different heights on the page should be joined with a newline.
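
The joining assumptions above can be sketched roughly as follows. This is only an illustration in Python, not the parser's actual code: the (y, text) tuple shape and the name join_blocks are hypothetical, and it assumes the blocks arrive in the order they occur in the PDF.

```python
from typing import List, Optional, Tuple

# Hypothetical input shape: (y_position, text) pairs, listed in the order
# they occur in the PDF. The real parser's data structures may differ.
TextBlock = Tuple[float, str]


def join_blocks(blocks: List[TextBlock]) -> str:
    """Join text blocks in stream order.

    Blocks are concatenated with no joining text, except that a newline
    is inserted whenever the height on the page changes.
    """
    parts: List[str] = []
    previous_y: Optional[float] = None
    for y, text in blocks:
        if previous_y is not None and y != previous_y:
            parts.append("\n")  # different height: treat as a new line
        parts.append(text)
        previous_y = y
    return "".join(parts)


blocks = [(700.0, "@book{foo"), (700.0, "bar,"), (688.0, "title = {X},")]
print(join_blocks(blocks))
```

If a PDF stores its text blocks out of page order, a sketch like this produces scrambled output, which is the robustness problem described above.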

There’s no formal grammar for parsing this, so I just kinda guessed. BibTeX is sadly a bit vague itself.

The code looks for the last instance of the @{visual-meta-start} term in the document and reads from there to the next ‘end’ tag.
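
A minimal sketch of that search, in Python rather than whatever the parser itself uses. The function name extract_visual_meta is hypothetical, and the exact spelling of the end tag, @{visual-meta-end}, is an assumption here since only the start tag is spelled out above.

```python
from typing import Optional

START_TAG = "@{visual-meta-start}"
END_TAG = "@{visual-meta-end}"  # assumed spelling of the 'end' tag


def extract_visual_meta(text: str,
                        start_tag: str = START_TAG,
                        end_tag: str = END_TAG) -> Optional[str]:
    """Return the text between the last start tag and the next end tag.

    Returns None if there is no start tag, or no end tag after it.
    """
    start = text.rfind(start_tag)  # last occurrence of the start tag
    if start == -1:
        return None
    body_start = start + len(start_tag)
    end = text.find(end_tag, body_start)  # first end tag after that start
    if end == -1:
        return None
    return text[body_start:end]
```

Taking the last start tag means that an earlier mention of the tag in the body of the document does not get mistaken for the real Visual-Meta block.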

I’ve just decided to tell it that a valid ID string is anything except whitespace and the characters { } , and =. For example, in

@book{foobar, 

it’s not 100% clear what a legal “foobar” can be… so I had to guess that it’s anything except the things it can’t be!
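
That guess can be written as a regular expression. The sketch below is an illustration in Python, not the parser's own code: the names ID_PATTERN, ENTRY_HEADER and parse_entry_header are hypothetical, and applying the same rule to the entry type (the "book" part) is an extra assumption.

```python
import re
from typing import Optional, Tuple

# An ID is one or more characters that are not whitespace, '{', '}', ',' or '='.
ID_PATTERN = r"[^\s{},=]+"

# Matches an entry header such as "@book{foobar,".
# Assumption: the entry type follows the same character rule as the ID.
ENTRY_HEADER = re.compile(
    r"@(" + ID_PATTERN + r")\s*\{\s*(" + ID_PATTERN + r")\s*,"
)


def parse_entry_header(text: str) -> Optional[Tuple[str, str]]:
    """Return (entry_type, entry_id) for the first entry header found, else None."""
    match = ENTRY_HEADER.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_entry_header("@book{foobar, "))
```

Defining the ID by exclusion keeps the rule permissive, so keys containing colons, slashes or hyphens are accepted, while anything containing a space or one of the structural characters is rejected.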

Code is available ‘as is’ from here: https://github.com/cgutteridge/visual-meta-pdf-inspect but it’s not production quality, just a demo.