AllegroGraph - UTF-8 characters in N-Triples

https://stackoverflow.com/questions/10986640

13-06-2021
|

문제

When I use the AllegroGraph 4.6 Python API, I can use the connection.addTriple() method to try to add a triple that ends in a literal containing a unicode character (×):

conn.addTriple( ..., ..., '5 × 10**5' )

This doesn't work. I get the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position...

Here's the full traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 357, in addTriple
    self._convert_term_to_mini_term(obj), cxt)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 235, in _convert_term_to_mini_term
    return self._to_ntriples(term)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 367, in _to_ntriples
    else: return term.toNTriples();
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/model/literal.py", line 182, in toNTriples
    sb.append(strings.encode_ntriple_string(self.getLabel()))
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/util/strings.py", line 52, in encode_ntriple_string
    string = unicode(string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)

Instead I can add the triple like this:

conn.addTriple( ..., ..., u'5 × 10**5' )

That way I don't get an error.

But if I load a file of ntriples that contains some UTF-8 encoded characters using connection.addFile(filename, format=RDFFormat.NTRIPLES), I get this error message if the ntriples file is saved as ANSI encoding from Notepad++:

400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream @ #x10046f9ea2> at line 12764 (last character was
#\×): nil
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
    commitEvery=self.add_commit_size)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
    nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
    if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing

I get this error message if the file is saved as UTF-8 encoding:

400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream @ #x100486e8b2> at line 1 (last character was
#\): Subjects must be resources (i.e., URIs or blank nodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
    commitEvery=self.add_commit_size)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
    nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
    if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing

However, if the file is set to ANSI encoding in Notepad++, I can go in and paste the × character, save, and then the file loads fine. Or, if I change the file encoding to UTF-8 after I paste the character, then the character changes to some strange xD7 character. If the file is set to UTF-8 encoding and I paste the × in there, then if I change the encoding to ANSI the × changes to a Ã—.

When the file was given to me, it had Ã— where the × should have been, and when I tried to load it in AllegroGraph I got the first 400 MALFORMED DATA error, which fails at the line where the character actually appears in the file (12764), instead of just at the first line. I assume that the reason I get the second 400 MALFORMED DATA error on line 1 has something to do with the header written by Notepad++ for UTF-8 encoded files. So apparently, I have to save a file as ANSI if I want AllegroGraph not to hiccup immediately, but there has to be some way to tell AllegroGraph to read things like Ã— as UTF-8 characters.

In the file, the triple looks like:

<...some subject URI...> <...some predicate URI...> "5 × 10**5" .

해결책

\xd7 is the Latin-1 encoding of ×.

Ã— is what you get if you mistakenly decode × to cp1252 (often Windows' default codec) if it's been encoded in UTF-8.

When you're given files that show Ã—, try changing the codec that's used to display them to UTF-8.

For an overview of Unicode in Python see here. ~ Thanks to Daenyth.

As you found out from AllegroGraph support:

AllegroGraph can take unicode characters in nTriples using \uXXXX notation. Alternatively one can use RDFXML, which allows you to leave the unicode characters as they are.

다른 팁

use codecs module.

import codecs
f = codecs.open('file.txt','r','utf8')

this will open your file forcing the utf8 encoding

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow