Parsing HTML in Lisp

Posted: Thu Mar 14, 2013 8:23 am
by vityok

In my pet project I have to parse HTML pages from a variety of sources and then extract meta-information from them (Open Graph, RDFa, Dublin Core, etc.). Currently I use closure-html together with xpath. But closure-html and cxml are very sensitive to markup problems: even minor deviations cause errors.

What HTML parsers do you use? What would you advise for extracting meta-information from HTML pages more reliably?


Re: Parsing HTML in Lisp

Posted: Mon Mar 18, 2013 8:37 pm
by pjstirling
Personally I use closure-html, but I'm fortunate in that the HTML I'm reading is well-formed (or at least closure-html isn't choking on it).
html5lib has an actively maintained Python parser for HTML5. I've been pondering whether you could run it directly with CLPython or burgled-batteries, or even write a small Python script that prints the DOM as LHTML and invoke it with SB-EXT:RUN-PROGRAM. I'm in no hurry to look into it, though, because as I said above, chtml works for me so far :)
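
A minimal sketch of the "small Python script" idea, under loud assumptions: it uses the standard library's html.parser as a stand-in for html5lib (so it does NOT implement the HTML5 tree-construction algorithm), and the names LHTMLBuilder, to_sexp and html_to_lhtml are made up for illustration. The output only approximates the LHTML shape (:TAG (attributes) children...), and names containing characters special to the Lisp reader are not escaped.

```python
from html.parser import HTMLParser

# Elements that can never have children, so no closing tag is expected.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LHTMLBuilder(HTMLParser):
    """Collects nodes as nested lists: [tag, attrs, child, child, ...]."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ["root", []]      # hypothetical container, never printed
        self.stack = [self.root]      # currently open elements

    def handle_starttag(self, tag, attrs):
        node = [tag, attrs]
        self.stack[-1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; skip stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():              # drop whitespace-only text
            self.stack[-1].append(data)


def lisp_string(text):
    """Quote text as a Lisp string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_sexp(node):
    if isinstance(node, str):
        return lisp_string(node)
    tag, attrs, *children = node
    # Assumption: a valueless attribute is printed with an empty string.
    pairs = ["(:%s %s)" % (name.upper(), lisp_string(value or ""))
             for name, value in attrs]
    parts = [":" + tag.upper(), "(" + " ".join(pairs) + ")" if pairs else "NIL"]
    parts += [to_sexp(child) for child in children]
    return "(" + " ".join(parts) + ")"


def html_to_lhtml(markup):
    builder = LHTMLBuilder()
    builder.feed(markup)
    builder.close()                   # flush any buffered trailing text
    return " ".join(to_sexp(child) for child in builder.root[2:])


print(html_to_lhtml('<p class="x">Hi<br>there</p>'))
# prints: (:P ((:CLASS "x")) "Hi" (:BR NIL) "there")
```

To call something like this from Lisp, the script would read the page from standard input instead of a literal string, and the Lisp side would READ the printed form from the output stream of SB-EXT:RUN-PROGRAM.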

Re: Parsing HTML in Lisp

Posted: Thu Mar 21, 2013 10:12 am
by vityok
Hi, thanks for your suggestions. However, running Python code for this task is a rather awkward way to parse HTML.

I am thinking about either using the libxml2 HTML parser or patching closure-html.

Re: Parsing HTML in Lisp

Posted: Thu Jul 11, 2013 1:05 am
by neslepaks
cl-html5-parser is a port of html5lib.

Re: Parsing HTML in Lisp

Posted: Tue Jul 16, 2013 12:33 pm
by marcoxa
XHTMLambda includes a preliminary XML parser, which is a symptom of my personal NIH syndrome. You can take it for a spin. :) YMMV, etc.



Re: Parsing HTML in Lisp

Posted: Wed Jul 17, 2013 1:44 pm
by pjstirling
HTML5 explicitly steps back from XML features. It's something that I disagree with, but they decided that since 'nobody' was writing valid XML there was no point in supporting it. Specifically:

  • boolean attributes need no value: only the presence or absence of the name matters,
  • nodes like BR that cannot have children need no self-closing syntax, and
  • nodes that CAN have children but DON'T in your markup (DIV etc.) MUST have a closing tag (if you try the self-closing syntax from XML, it gets treated as an opening tag)

And all that ignores the now-properly-defined algorithm for parsing bad markup in general, which was why I linked the Python library, even though accessing it via some form of Python bridge is quite funky. Parsing HTML properly is evil, but as neslepaks said, there is now a CL port of it.
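
The three rules listed above can be made concrete with a rough sketch. It assumes Python's standard html.parser purely as a tokenizer and applies the rules by hand; the names Html5Outline and outline are hypothetical, the depth tracking is deliberately naive (end tags are not matched by name), and none of this covers the error-recovery algorithm itself.

```python
from html.parser import HTMLParser

# Elements that cannot have children (BR and friends).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Html5Outline(HTMLParser):
    """Records an indented outline of the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Rule 1: a boolean attribute arrives with value None;
        # only its presence is recorded.
        shown = [name if value is None else '%s="%s"' % (name, value)
                 for name, value in attrs]
        self.lines.append("  " * self.depth + " ".join([tag] + shown))
        # Rule 2: void elements take no children, so nesting does not deepen.
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Rule 3: the XML-style "/>" is ignored, so <div/> merely OPENS a div
        # (html.parser's default would open and close it instead).
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1


def outline(markup):
    parser = Html5Outline()
    parser.feed(markup)
    parser.close()
    return parser.lines


for line in outline('<div/><input disabled><br><p>x</p></div>'):
    print(line)
# prints:
# div
#   input disabled
#   br
#   p
```

Everything after the "self-closed" div ends up nested inside it, which is exactly the surprise described in the third rule.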