babel:octets-to-string throws out INVALID-UTF8-CONTINUATION-BYTE

https://stackoverflow.com/questions/8545777

19-03-2021

Question

I'm writing a Lisp program to fetch a page from a Chinese website, and I have a problem decoding the Chinese text from the binary stream. I already have a vector of (unsigned-byte 8) containing the whole page, but when I pass it to babel:octets-to-string, it throws an exception.

(setf buffer (babel:octets-to-string buffer :encoding :utf-8))

The exception is:

Illegal :UTF-8 character starting at position 437. [Condition of type BABEL-ENCODINGS:INVALID-UTF8-CONTINUATION-BYTE]

I found that it throws this exception whenever it encounters a Chinese character. How can I solve it?

Solution

The error message says it all: there is an invalid UTF-8 byte sequence in your data.

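The most probable cause of this error is that the page text is not encoded in UTF-8 at all, but in some other encoding used for Chinese text (GBK and GB2312 are common on older sites). You should check the HTML 'META HTTP-EQUIV' tag and the 'Content-Type' HTTP response header to find out which encoding the page declares, then pass that encoding to babel:octets-to-string instead of :utf-8.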
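
If the declared encoding is not available, one workaround is to try a short list of candidate encodings in turn. The following is only a sketch: it assumes Babel is already loaded (for example via Quicklisp), that your Babel version provides the :gbk encoding, and that GBK is a plausible guess for this site. DECODE-WITH-FALLBACK is a hypothetical helper name, not part of Babel.

```lisp
;; Sketch only: assumes Babel is loaded, e.g. (ql:quickload :babel).
;; DECODE-WITH-FALLBACK is a hypothetical helper, not a Babel function.
(defun decode-with-fallback (octets &optional (encodings '(:utf-8 :gbk)))
  "Try each encoding in ENCODINGS in order. Return the decoded string and
the encoding that worked, or signal an error if none of them did."
  (dolist (encoding encodings
                    ;; Result form: reached only when every attempt failed.
                    (error "None of ~S could decode the data." encodings))
    (handler-case
        (return (values (babel:octets-to-string octets :encoding encoding)
                        encoding))
      ;; Parent condition of INVALID-UTF8-CONTINUATION-BYTE and similar
      ;; decoding errors; on failure, fall through to the next candidate.
      (babel-encodings:character-decoding-error ()
        nil))))
```

With this in place, the call from the question could become (setf buffer (decode-with-fallback buffer)). Guessing is less reliable than reading the charset the server declares, because bytes that are invalid UTF-8 may still decode without error under the wrong legacy encoding and produce garbage text.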

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow