SBCL - BSD Socket

Discussion of Common Lisp

Postby plu9in » Wed Jun 06, 2012 3:24 pm

Hi all,

In my Lisp code, I open a socket to download web pages, but I have a problem with the socket's charset. When I open the socket, I have to give it a charset, but while surfing the web the socket has to deal with other charsets.

Before downloading a page, how can my code know which charset the page uses? Is it possible to change the charset of an existing socket during its lifetime?

Thank you for your help.
plu9in
 
Posts: 1
Joined: Wed Jun 06, 2012 3:15 pm

Re: SBCL - BSD Socket

Postby pjstirling » Thu Jun 07, 2012 3:45 am

The short answer is that this is a very ugly problem.

HTTP itself speaks an 8-bit encoding compatible with ASCII, but the HTML inside the response can be in any encoding, including multi-byte encodings like UTF-16, UTF-32, or Shift-JIS.

You need to open the socket with a raw 8-bit encoding (that is, as a stream of octets), speak HTTP to the server to make the request, and then download the body of the response.
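
A minimal sketch of that approach in SBCL, assuming the sb-bsd-sockets contrib module and plain HTTP/1.0 on port 80 (no TLS, no redirects, no chunked transfer). The helper names ascii-octets, build-request, fetch-raw and split-response are made up for illustration; they are not part of SBCL.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :sb-bsd-sockets))

(defvar *crlf* (coerce '(#\Return #\Linefeed) 'string))

(defun ascii-octets (string)
  "Encode STRING (assumed to be pure ASCII) as a vector of octets."
  (map '(vector (unsigned-byte 8)) #'char-code string))

(defun build-request (host path)
  "Build a minimal HTTP/1.0 GET request as octets.
Connection: close lets the client simply read until end of file."
  (ascii-octets
   (format nil "GET ~A HTTP/1.0~AHost: ~A~AConnection: close~A~A"
           path *crlf* host *crlf* *crlf* *crlf*)))

(defun fetch-raw (host path &optional (port 80))
  "Return the whole response (headers and body) as a vector of octets."
  (let ((address (sb-bsd-sockets:host-ent-address
                  (sb-bsd-sockets:get-host-by-name host)))
        (socket (make-instance 'sb-bsd-sockets:inet-socket
                               :type :stream :protocol :tcp)))
    (unwind-protect
         (progn
           (sb-bsd-sockets:socket-connect socket address port)
           ;; The stream carries octets, so no charset is chosen here.
           (let ((stream (sb-bsd-sockets:socket-make-stream
                          socket :input t :output t
                          :element-type '(unsigned-byte 8)
                          :buffering :full))
                 (response (make-array 0 :element-type '(unsigned-byte 8)
                                         :adjustable t :fill-pointer 0)))
             (write-sequence (build-request host path) stream)
             (finish-output stream)
             (loop for octet = (read-byte stream nil nil)
                   while octet
                   do (vector-push-extend octet response))
             response))
      (sb-bsd-sockets:socket-close socket))))

(defun split-response (octets)
  "Return two values: the header block as a string (each octet mapped
to the character with the same code) and the body as raw octets."
  (let ((end (search #(13 10 13 10) octets)))
    (if end
        (values (map 'string #'code-char (subseq octets 0 end))
                (subseq octets (+ end 4)))
        (values (map 'string #'code-char octets)
                (subseq octets 0 0)))))
```

Something like (split-response (fetch-raw "example.com" "/")) would then give the headers as text and the body as undecoded octets, so the choice of charset can wait until the body has been inspected.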

Once you have the body you can:

  • check for a Unicode byte-order mark (BOM) in its UTF-8, UTF-16, or UTF-32 form, in which case the body is in that encoding,
  • look for an XML declaration with an encoding attribute,
  • scan the HTML HEAD element for a meta charset declaration, and
  • finally: guess based on byte frequencies.
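
The first of those checks could be sketched like this in portable Common Lisp. The function names are hypothetical, and the keywords returned are only labels chosen for this example.

```lisp
(defun starts-with-octets-p (prefix octets)
  "True when the sequence OCTETS begins with every element of PREFIX."
  (let ((mismatch (mismatch prefix octets)))
    (or (null mismatch)
        (>= mismatch (length prefix)))))

(defun detect-bom (octets)
  "Return a keyword naming the encoding announced by a byte-order mark
at the start of OCTETS, or NIL when there is no mark."
  (cond ((starts-with-octets-p #(#xEF #xBB #xBF) octets) :utf-8)
        ;; The UTF-32 marks are tested before the UTF-16 ones because
        ;; FF FE 00 00 also begins with the UTF-16LE mark FF FE.
        ((starts-with-octets-p #(#xFF #xFE #x00 #x00) octets) :utf-32le)
        ((starts-with-octets-p #(#x00 #x00 #xFE #xFF) octets) :utf-32be)
        ((starts-with-octets-p #(#xFF #xFE) octets) :utf-16le)
        ((starts-with-octets-p #(#xFE #xFF) octets) :utf-16be)
        (t nil)))
```

When a mark is found, SBCL's sb-ext:octets-to-string with a matching :external-format should be able to decode the body. A body that opens with FF FE 00 00 is ambiguous (UTF-32LE, or UTF-16LE starting with a NUL character); the sketch assumes UTF-32LE.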
pjstirling
 
Posts: 79
Joined: Sun Nov 28, 2010 4:21 pm

Re: SBCL - BSD Socket

Postby wvxvw » Fri Jun 08, 2012 1:44 am

To add one more way the page may specify its character encoding: the HTTP Content-Type header. For example, "Content-Type: text/html; charset=utf-8" specifies UTF-8. You also need to look at the Content-Encoding header (and, less commonly, Transfer-Encoding), because whatever character encoding the content uses, it may then have been compressed, for example with gzip.

Much, much worse than all of the above is that the web was 90% crafted by people who didn't have a clue how these things work. So very often the information you receive in headers, meta http-equiv elements, and other "less visible" attributes of the page will be wrong! In effect, browsers often end up guessing the encoding based on character frequencies.
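
As a rough sketch, the charset parameter could be pulled out of a Content-Type header value like this. The function name is made up, and the parsing is deliberately naive: it takes the first "charset=" it finds and does not handle every quoting rule HTTP allows.

```lisp
(defun content-type-charset (value)
  "Return the charset parameter of the Content-Type header VALUE as a
lowercase string, or NIL when the header does not carry one."
  (let ((start (search "charset=" value :test #'char-equal)))
    (when start
      (let* ((from (+ start (length "charset=")))
             ;; The parameter ends at the next semicolon or at the end
             ;; of the header value.
             (to (or (position #\; value :start from) (length value)))
             (raw (string-trim '(#\Space #\Tab #\" #\')
                               (subseq value from to))))
        (when (plusp (length raw))
          (string-downcase raw))))))
```

A NIL result means the header gives no hint, so the body itself has to be inspected.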
wvxvw
 
Posts: 127
Joined: Sat Mar 26, 2011 6:23 am

Re: SBCL - BSD Socket

Postby Kompottkin » Sun Jun 10, 2012 7:12 am

Or use Drakma.
Kompottkin
 
Posts: 94
Joined: Mon Jul 21, 2008 7:26 am
Location: München, Germany

