AllegroGraph - UTF-8 characters in N-Triples

https://stackoverflow.com/questions/10986640

13-06-2021

Question

When I use the AllegroGraph 4.6 Python API, I can use the connection.addTriple() method to try to add a triple that ends in a literal containing a unicode character (×):

conn.addTriple( ..., ..., '5 × 10**5' )

This doesn't work. I get the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position...

Here's the full traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 357, in addTriple
    self._convert_term_to_mini_term(obj), cxt)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 235, in _convert_term_to_mini_term
    return self._to_ntriples(term)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 367, in _to_ntriples
    else: return term.toNTriples();
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/model/literal.py", line 182, in toNTriples
    sb.append(strings.encode_ntriple_string(self.getLabel()))
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/util/strings.py", line 52, in encode_ntriple_string
    string = unicode(string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)

Instead I can add the triple like this:

conn.addTriple( ..., ..., u'5 × 10**5' )

That way I don't get an error.

But if I load a file of ntriples that contains some UTF-8 encoded characters using connection.addFile(filename, format=RDFFormat.NTRIPLES), I get this error message if the ntriples file is saved as ANSI encoding from Notepad++:

400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream @ #x10046f9ea2> at line 12764 (last character was
#\×): nil
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
    commitEvery=self.add_commit_size)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
    nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
    if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing

I get this error message if the file is saved as UTF-8 encoding:

400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream @ #x100486e8b2> at line 1 (last character was
#\): Subjects must be resources (i.e., URIs or blank nodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
    commitEvery=self.add_commit_size)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
    nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
  File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
    if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing

However, if the file is set to ANSI encoding in Notepad++, I can go in and paste the × character, save, and then the file loads fine. Or, if I change the file encoding to UTF-8 after I paste the character, then the character changes to some strange xD7 character. If the file is set to UTF-8 encoding and I paste the × in there, then if I change the encoding to ANSI the × changes to a Ã—.

When the file was given to me, it had Ã— where the × should have been, and when I tried to load it in AllegroGraph I got the first 400 MALFORMED DATA error, which fails at the line where the character actually appears in the file (12764), instead of just at the first line. I assume that the reason I get the second 400 MALFORMED DATA error on line 1 has something to do with the header written by Notepad++ for UTF-8 encoded files. So apparently, I have to save a file as ANSI if I want AllegroGraph not to hiccup immediately, but there has to be some way to tell AllegroGraph to read things like Ã— as UTF-8 characters.

In the file, the triple looks like:

<...some subject URI...> <...some predicate URI...> "5 × 10**5" .

Solution

\xd7 is the Latin-1 encoding of ×.

Ã— is what you get when × has been encoded as UTF-8 (the two bytes 0xC3 0x97) and those bytes are then mistakenly decoded as cp1252 (often the default codec on Windows).

When you're given files that show Ã—, try changing the codec that's used to display them to UTF-8.
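The byte-level relationship between these forms can be checked directly. The following is a minimal sketch, written for Python 3 (the question itself uses Python 2), and the variable names are only illustrative:

```python
times = "\u00d7"  # the multiplication sign from the question

latin1_bytes = times.encode("latin-1")  # single byte 0xD7
utf8_bytes = times.encode("utf-8")      # two bytes 0xC3 0x97

# Decoding UTF-8 bytes with the wrong codec yields the two-character mojibake.
mojibake = utf8_bytes.decode("cp1252")

# The mistake is reversible as long as no bytes were lost:
# re-encode with the wrong codec, then decode with the right one.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(repr(latin1_bytes), repr(utf8_bytes))
print(repaired == times)
```

This only shows why the characters look the way they do; whether a given file can be repaired this way depends on how it was saved along the way.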

For an overview of Unicode in Python see here. ~ Thanks to Daenyth.

As you found out from AllegroGraph support:

AllegroGraph can take unicode characters in nTriples using \uXXXX notation. Alternatively one can use RDFXML, which allows you to leave the unicode characters as they are.
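One way to produce that \uXXXX notation is to escape every non-ASCII character before writing the N-Triples file. The sketch below assumes Python 3; the helper name escape_non_ascii and the example.org URIs are hypothetical, and it deliberately leaves ASCII characters (including quotes and backslashes) untouched, on the assumption that the rest of the line is already valid N-Triples:

```python
def escape_non_ascii(text):
    # Replace each non-ASCII character with a backslash-u (4 hex digits)
    # or backslash-U (8 hex digits) escape; ASCII passes through unchanged.
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            out.append(ch)
        elif code <= 0xFFFF:
            out.append("\\u%04X" % code)
        else:
            out.append("\\U%08X" % code)
    return "".join(out)


# Hypothetical triple using the literal from the question.
line = '<http://example.org/s> <http://example.org/p> "5 \u00d7 10**5" .'
print(escape_non_ascii(line))
```

The resulting file contains only ASCII bytes, so it loads the same way regardless of which encoding the editor or the server assumes.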

Other suggestions

Use the codecs module:

import codecs
f = codecs.open('file.txt', 'r', 'utf8')

This opens the file and decodes its contents as UTF-8, so reads return unicode strings rather than raw bytes.
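If the line-1 parse error in the question is indeed caused by a byte order mark (BOM) at the start of the UTF-8 file, as the asker suspects, Python's "utf-8-sig" codec can drop it while decoding. This is a sketch under that assumption, written for Python 3; the helper name and the temporary file are illustrative only:

```python
import codecs
import io
import os
import tempfile


def read_utf8_dropping_bom(path):
    # "utf-8-sig" decodes UTF-8 and silently skips a leading BOM if present.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


# Build a small file that starts with a UTF-8 BOM, as some editors write it.
raw = codecs.BOM_UTF8 + (
    '<http://example.org/s> <http://example.org/p> "5 \u00d7 10**5" .\n'
).encode("utf-8")
fd, path = tempfile.mkstemp(suffix=".nt")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

try:
    text = read_utf8_dropping_bom(path)
finally:
    os.remove(path)

# The decoded text starts with the subject URI, not with U+FEFF.
print(text.startswith("<http://example.org/s>"))
```

The decoded text can then be written back without the BOM, or escaped to ASCII, before it is handed to addFile.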

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow