Parsing HTML in Lisp

Post by vityok » Thu Mar 14, 2013 8:23 am

Hello!

In my pet project I have to parse HTML pages from a variety of sources and then extract meta-information from them (Open Graph, RDFa, Dublin Core, etc.). Currently I use closure-html together with xpath, but closure-html and cxml are very sensitive to markup problems: even minor deviations cause errors.

Which HTML parsers do you use? What would you advise for extracting meta-information from HTML pages more reliably?

Thanks
vityok
 
Posts: 20
Joined: Fri Jul 11, 2008 6:20 am
Location: Kyiv, Ukraine

Re: Parsing HTML in Lisp

Post by pjstirling » Mon Mar 18, 2013 8:37 pm

Personally I use closure-html, but I'm fortunate in that the HTML I'm reading is well-formed (or at least closure-html isn't choking on it).
html5lib provides an actively maintained Python parser for HTML5. I've been pondering whether you could run it directly with clpython or burgled-batteries, or even write a small Python script that prints the DOM as LHTML and invoke it with SB-EXT:RUN-PROGRAM. I'm not in a hurry to look into it, though, because as I said above, chtml works for me so far :)
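
A minimal sketch of the "small Python script" idea, under loud assumptions: it uses the standard library's forgiving html.parser as a stand-in for html5lib (so it does NOT implement the HTML5 parsing algorithm), and it assumes the LHTML shape is (:tag ((:attr "value") ...) child ...). The names LHTMLBuilder, lisp_keyword, to_lhtml and html_to_lhtml are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Elements that never have children or end tags in HTML.
VOID_ELEMENTS = frozenset(
    "area base br col embed hr img input link meta source track wbr".split()
)


class LHTMLBuilder(HTMLParser):
    """Builds nested [tag, attrs, child, ...] lists from tag events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ["document", []]  # synthetic container, never printed
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, attrs]
        self.stack[-1].append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Treat <div/> like <div>: the trailing slash is ignored.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore strays.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth][0] == tag:
                del self.stack[depth:]
                return

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only text nodes
            self.stack[-1].append(data)


def lisp_string(text):
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def lisp_keyword(name):
    # Plain names rely on the Lisp reader upcasing; odd ones are pipe-quoted.
    if re.fullmatch(r"[a-z][a-z0-9-]*", name):
        return ":" + name
    return ":|" + name.upper().replace("\\", "\\\\").replace("|", "\\|") + "|"


def to_lhtml(node):
    if isinstance(node, str):
        return lisp_string(node)
    tag, attrs, *children = node
    attr_text = " ".join(
        # Assumption: a valueless (boolean) attribute is printed as "".
        "(%s %s)" % (lisp_keyword(k), lisp_string(v if v is not None else ""))
        for k, v in attrs
    )
    parts = [lisp_keyword(tag), "(" + attr_text + ")" if attrs else "nil"]
    parts.extend(to_lhtml(child) for child in children)
    return "(" + " ".join(parts) + ")"


def html_to_lhtml(markup):
    builder = LHTMLBuilder()
    builder.feed(markup)
    builder.close()
    return "\n".join(to_lhtml(child) for child in builder.root[2:])
```

Calling html_to_lhtml on the unclosed fragment '<p class="a">Hi <br>there' returns (:p ((:class "a")) "Hi " (:br nil) "there"). Wrapped in a script that prints the result for its standard input, the output could be consumed on the Lisp side with READ; swapping in html5lib would be needed to get real HTML5 error recovery.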
pjstirling
 
Posts: 75
Joined: Sun Nov 28, 2010 4:21 pm

Re: Parsing HTML in Lisp

Post by vityok » Thu Mar 21, 2013 10:12 am

Hi, thanks for your suggestions. However, running Python code just to parse HTML seems like an awkward approach.

I am thinking about either using the libxml2 HTML parser or patching closure-html.

Re: Parsing HTML in Lisp

Post by neslepaks » Thu Jul 11, 2013 1:05 am

cl-html5-parser is a port of html5lib.
neslepaks
 
Posts: 1
Joined: Thu Jul 11, 2013 1:04 am

Re: Parsing HTML in Lisp

Post by marcoxa » Tue Jul 16, 2013 12:33 pm

In XHTMLambda there is a preliminary XML parser which is a symptom of my personal NIH syndrome. You can take it for a spin. :) YMMV etc etc.

Cheers

MA
Marco Antoniotti
marcoxa
 
Posts: 69
Joined: Thu Aug 14, 2008 6:31 pm

Re: Parsing HTML in Lisp

Post by pjstirling » Wed Jul 17, 2013 1:44 pm

HTML5 explicitly steps back from XML features. It's something that I disagree with, but they decided that since 'nobody' was writing valid XML there was no point in supporting it. Specifically:

  • boolean attributes do not have a value; only the presence or absence of the name matters,
  • nodes like BR that have no children do not need to be self-closed, and
  • nodes that CAN have children but DON'T in your markup (DIV etc.) MUST have a closing tag (if you try the self-closing syntax from XML, it gets treated as an opening tag)

And all that ignores the now properly defined algorithm for parsing bad markup in general, which is why I linked the Python library even though accessing it via some form of Python bridge is quite funky. Parsing HTML properly is evil, but as neslepaks said, there is now a CL port of it: https://github.com/copyleft/cl-html5-parser
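
The three rules in the list above can be illustrated with a toy sketch. It is built on Python's standard html.parser and is an assumption-laden illustration, not the HTML5 tree-construction algorithm and not how html5lib or cl-html5-parser are implemented; OutlineParser and outline are hypothetical names.

```python
from html.parser import HTMLParser

# Void elements: they have no children and need no self-closing slash.
VOID_ELEMENTS = frozenset(
    "area base br col embed hr img input link meta source track wbr".split()
)


class OutlineParser(HTMLParser):
    """Records (depth, tag, attrs) for every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open non-void tags
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Rule 1: a boolean attribute arrives as (name, None), i.e. no value.
        self.outline.append((len(self.open_elements), tag, dict(attrs)))
        # Rule 2: void elements are never pushed, so they need no closing.
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Rule 3: the XML-style slash is ignored, so <div/> only OPENS a div.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open element, if any.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def outline(markup):
    parser = OutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.outline
```

For '<div/><br><input disabled><p>x</p>' this yields div at depth 0 and br, input and p at depth 1: the "self-closed" div stays open and swallows everything after it, br does not nest, and disabled shows up with the value None.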

