Sunday 12 July 2015

texinfo is still TeX

I've been poking at the texinfo parser this week. I was hoping to do a quick-and-dirty parser of a subset of it, plus a bit more ... but that bit more turns it into something a lot more complex.

The problem is that texinfo isn't a 'file format' as such; it's just a language built on TeX. And TeX is a sophisticated formatting language that can change its input syntax on the fly, amongst other possibilities; for example, texinfo's @verbatim suspends command processing entirely and @macro can define new commands mid-document. Unlike XML or SGML (HTML), there are no universal rules that apply to basic lexical tokens, let alone hierarchical structuring.

After many abortive attempts I think I've finally come up with a workable solution.

The objects on the parser state stack are the parsers themselves, so the parsing and lexical analysis vary based on the current environment. Pseudo-environments are used for argument processing and so on. The lexical analyser provides multiple interfaces, which allows each environment to switch analysis on the fly.
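
To make that concrete, here is a minimal sketch of the shape of the thing in Java; the names (Environment, Lexer, Parser) and the lexer methods are illustrative assumptions, not the actual code:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Each environment is itself the parser for its own content, and decides
    // how the lexer should tokenise while it sits on top of the stack.
    interface Environment {
        // Consume input via the lexer, pushing child (or pseudo-) environments
        // as needed; return false once this environment is finished.
        boolean parse(Lexer lex, Deque<Environment> stack);
    }

    // The lexer exposes several tokenising interfaces so an environment can
    // switch analysis on the fly (e.g. command processing vs. raw text).
    interface Lexer {
        String nextCommand();   // read an @command token
        String nextWord();      // read a plain word in normal mode
        String nextVerbatim();  // read raw text up to a terminator
        boolean hasMore();
    }

    class Parser {
        // Drive whatever environment is currently on top of the stack.
        void run(Lexer lex, Environment root) {
            Deque<Environment> stack = new ArrayDeque<>();
            stack.push(root);
            while (!stack.isEmpty() && lex.hasMore()) {
                if (!stack.peek().parse(lex, stack))
                    stack.pop();    // environment finished, resume its parent
            }
        }
    }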

Error handling or recovery? Yeah no idea yet.

Streaming would be nice but I will leave that for another day; so far it dumps the result to a DOM-like structure. I could implement the W3C DOM interfaces but that's just too much work and not much use unless I wanted to process it as XML directly (which I don't).
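
Roughly this sort of thing for the tree, again just a sketch with assumed names rather than the real structure:

    import java.util.ArrayList;
    import java.util.List;

    // A minimal DOM-like node: enough to hold the parsed document tree
    // without pulling in the full org.w3c.dom interfaces.
    class Node {
        final String name;                               // command or environment name, e.g. "chapter"
        final StringBuilder text = new StringBuilder();  // accumulated text content
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node addChild(String childName) {
            Node child = new Node(childName);
            children.add(child);
            return child;
        }
    }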

I still need to fill out the solution a bit more but it's nice to have the foundation of the design sorted out. It's been a long time since I tried to write a `decent' parser, since normally a hack will suffice, and I was pretty rusty at it.
