Hello!
In my pet project I have to parse HTML pages from various sources and then extract meta-information from them (Open Graph, RDFa, Dublin Core, etc.). Currently I use closure-html together with XPath, but closure-html and cxml are very sensitive to problems in the markup: even minor deviations cause errors.
What HTML parsers do you use? What would you advise doing to extract meta-information from HTML pages more reliably?
Thanks
Parsing HTML in Lisp
-
- Posts: 166
- Joined: Sun Nov 28, 2010 4:21 pm
Re: Parsing HTML in Lisp
Personally I use closure-html, but I'm fortunate in that the html I'm reading is well-formed (or at least closure-html isn't choking).
html5lib has an actively maintained Python parser for HTML5. I've been pondering whether you could run it directly with clpython or burgled-batteries, or even write a small Python script that prints the DOM as LHTML and invoke it via SB-EXT:RUN-PROGRAM. I'm not in a hurry to look into it, though, because, as I said above, chtml works for me so far.
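For illustration, here is a rough sketch of what such a "print the DOM as LHTML" script might look like. It is only a sketch under several assumptions: it uses the standard library's html.parser as a lenient tokenizer rather than html5lib (so it does not implement the full HTML5 tree-building rules), the names LHTMLBuilder, to_lhtml and html_to_lhtml are made up for this example, VOID is only a subset of the void elements, and valueless attributes are written with the attribute name as their value.

```python
from html.parser import HTMLParser

# Subset of void elements (assumption: extend as needed); these never take children.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class LHTMLBuilder(HTMLParser):
    """Builds a nested [tag, attrs, children] tree from tag events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ["document", [], []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, attrs, []]
        self.stack[-1][2].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Ignore the trailing slash, so "<div/>" counts as a plain start tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only text nodes
            self.stack[-1][2].append(data)


def lisp_string(s):
    """Quote a Python string as a Lisp string literal."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_lhtml(node):
    """Serialize a text node or a [tag, attrs, children] node as an s-expression."""
    if isinstance(node, str):
        return lisp_string(node)
    tag, attrs, children = node
    attr_text = " ".join(
        f"(:{name} {lisp_string(value if value is not None else name)})"
        for name, value in attrs
    )
    parts = [f":{tag}", f"({attr_text})"] + [to_lhtml(c) for c in children]
    return "(" + " ".join(parts) + ")"


def html_to_lhtml(html):
    builder = LHTMLBuilder()
    builder.feed(html)
    builder.close()  # flush any buffered trailing text
    return " ".join(to_lhtml(child) for child in builder.root[2])
```

Printing the result of html_to_lhtml on the page read from standard input would give output that the Lisp side could READ after capturing it from RUN-PROGRAM. Unclosed elements are simply left open until the end of the input, which is cruder than what html5lib does.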
Re: Parsing HTML in Lisp
Hi, thanks for your suggestions. However, running Python code for this task seems like a rather awkward way to parse HTML.
I'm thinking about either using the libxml2 HTML parser or patching closure-html.
Re: Parsing HTML in Lisp
cl-html5-parser is a port of html5lib.
Re: Parsing HTML in Lisp
In XHTMLambda there is a preliminary XML parser which is a symptom of my personal NIH syndrome. You can take it for a spin. YMMV etc etc.
Cheers
MA
Marco Antoniotti
Re: Parsing HTML in Lisp
HTML5 explicitly steps back from XML features. It's something I disagree with, but they decided that since 'nobody' was writing valid XML there was no point in supporting it. Specifically:
- boolean attributes do not need a value: only the presence or absence of the name matters,
- nodes like BR that have no children need no self-closing, and
- nodes that CAN have children but DON'T in your markup (DIV etc.) MUST have a closing tag (if you try the self-closing syntax from XML, it gets treated as an opening tag)
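To make the three rules concrete, here is a minimal sketch of a serializer that follows them. The function name render and its calling convention are invented for this example, the VOID_ELEMENTS set is the list of void elements as I understand the current HTML standard, and children are assumed to be already-rendered, already-escaped strings.

```python
from html import escape

# Elements that can never have children (assumption: list per the HTML standard).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def render(tag, attrs=None, children=()):
    """Render one element; attrs maps names to strings, or True/False for booleans."""
    parts = [tag]
    for name, value in (attrs or {}).items():
        if value is True:
            # Rule 1: a boolean attribute is just the bare name.
            parts.append(name)
        elif value is not False and value is not None:
            parts.append(f'{name}="{escape(str(value), quote=True)}"')
        # False or None means the attribute is absent, so nothing is emitted.
    open_tag = "<" + " ".join(parts) + ">"
    if tag in VOID_ELEMENTS:
        # Rule 2: void elements get no closing tag and no self-closing slash.
        return open_tag
    # Rule 3: every other element gets an explicit closing tag, even when empty.
    return open_tag + "".join(children) + f"</{tag}>"
```

For example, render("div") produces an explicit open and close pair rather than the XML-style self-closing form, while render("br") produces only the bare tag.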