Parsing HTML in Lisp

vityok · Post by **vityok** » Thu Mar 14, 2013 8:23 am

Hello!

In my pet project I have to parse HTML pages from different sources and then extract meta-information from them (Open Graph, RDFa, Dublin Core, etc.). Currently I use closure-html together with xpath. But closure-html and cxml are very sensitive to markup problems: even minor deviations cause errors.

What HTML parsers do you use? What would you advise for extracting meta-information from HTML pages more reliably?

Thanks

pjstirling · Post by **pjstirling** » Mon Mar 18, 2013 8:37 pm

Personally I use closure-html, but I'm fortunate in that the HTML I'm reading is well-formed (or at least closure-html isn't choking).
html5lib has an actively maintained Python parser for HTML5. I've been pondering whether you could run it directly with clpython or burgled-batteries, or even write a small Python script that prints the DOM as LHTML and invoke it via SB-EXT:RUN-PROGRAM. I'm not looking into it very quickly, though, because as I said above, chtml works for me so far.
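A rough sketch of what such a helper script might look like. To stay dependency-free it uses Python's standard `html.parser` as a stand-in for html5lib, so it does not implement the HTML5 error-recovery algorithm; it only tolerates unclosed and stray tags. The names `LhtmlBuilder`, `to_lhtml` and `html_to_lhtml` are made up for illustration, and the output is only LHTML-like (`(:tag attributes children...)`, with `NIL` for an empty attribute list); whether the Lisp side reads it back exactly as closure-html's LHTML is an assumption.

```python
from html.parser import HTMLParser

# Elements that never have children, so they are never left open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LhtmlBuilder(HTMLParser):
    """Builds a nested [tag, attrs, children] tree from lenient HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ["document", [], []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, attrs, []]
        self.stack[-1][2].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # a stray end tag with no match is silently ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only text nodes
            self.stack[-1][2].append(data)


def lisp_string(s):
    """Quote a Python string as a Lisp string literal."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_lhtml(node):
    """Serialize a node as (:tag attrs children...).

    Names are emitted as plain keywords; names containing characters
    that are special to the Lisp reader are not escaped in this sketch.
    """
    if isinstance(node, str):
        return lisp_string(node)
    tag, attrs, children = node
    if attrs:
        # A boolean attribute has no value; reuse its name as the value.
        pairs = [f"(:{k} {lisp_string(v if v is not None else k)})"
                 for k, v in attrs]
        attr_text = "(" + " ".join(pairs) + ")"
    else:
        attr_text = "NIL"
    parts = [f":{tag}", attr_text] + [to_lhtml(c) for c in children]
    return "(" + " ".join(parts) + ")"


def html_to_lhtml(text):
    builder = LhtmlBuilder()
    builder.feed(text)
    builder.close()  # flush any buffered trailing text
    return " ".join(to_lhtml(child) for child in builder.root[2])


if __name__ == "__main__":
    import sys
    print(html_to_lhtml(sys.stdin.read()))
```

From Lisp, the script's standard output could then be read with `READ` after starting it through SB-EXT:RUN-PROGRAM; that part is left out here.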

vityok · Post by **vityok** » Thu Mar 21, 2013 10:12 am

Hi, thanks for your suggestions. However, running Python code for this task is a rather awkward way to parse HTML.

I'm thinking about either using the libxml2 HTML parser or patching closure-html.

neslepaks · Post by **neslepaks** » Thu Jul 11, 2013 1:05 am

cl-html5-parser is a port of html5lib.

marcoxa · Post by **marcoxa** » Tue Jul 16, 2013 12:33 pm

In XHTMLambda there is a preliminary XML parser which is a symptom of my personal NIH syndrome. You can take it for a spin.

YMMV etc etc.

Cheers

MA

pjstirling · Post by **pjstirling** » Wed Jul 17, 2013 1:44 pm

HTML5 explicitly steps back from XML features. It's something I disagree with, but they decided that since 'nobody' was writing valid XML there was no point in supporting it. Specifically:

- boolean attributes do not have a value; only the presence or absence of the name matters,
- nodes like BR that have no children need no self-closing, and
- nodes that CAN have children but DON'T in your markup (DIV etc.) MUST have a closing tag (if you try the self-closing syntax from XML, it gets treated as an opening tag)
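To make these three rules concrete, here is a small sketch in Python. It uses the standard `html.parser` purely as a tokenizer and applies the rules by hand; it is not html5lib and not a full HTML5 tree builder. `OpenElementTracker` and `trace` are hypothetical names. For each start tag it records the nesting depth, the tag name and the boolean attributes it saw.

```python
from html.parser import HTMLParser

# Elements that never have children.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class OpenElementTracker(HTMLParser):
    """Records (depth, tag, boolean_attrs) for every start tag."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Rule 1: a boolean attribute has no value, only a name;
        # html.parser reports it as (name, None).
        flags = sorted(name for name, value in attrs if value is None)
        self.seen.append((len(self.open_elements), tag, flags))
        # Rule 2: void elements such as BR are complete on their own
        # and are never left open.
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Rule 3: the XML-style "/>" is ignored, so <div/> behaves
        # like a plain opening <div>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.open_elements:
            # Pop everything up to and including the nearest match.
            while self.open_elements.pop() != tag:
                pass


def trace(markup):
    tracker = OpenElementTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.seen


print(trace("<div/><input disabled><br><p>x</p>"))
```

With this input, everything after `<div/>` is recorded at depth 1, that is, inside the DIV, which is the behaviour described in the third point.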

And all that ignores the now-properly-defined algorithm for parsing bad markup in general, which was why I linked the Python library, even though accessing it via some form of Python bridge is quite funky. Parsing HTML properly is evil, but as neslepaks said, there is now a CL port of it: https://github.com/copyleft/cl-html5-parser

LispForum
