Question

I'm hitting an encoding error when inserting data into Cassandra with pycassa. The field is named 'text' and holds tweet content, which can contain non-ASCII characters. I tried encoding the field with encode('UTF-8'), and the value does get converted from 'unicode' to 'str', but the insert still fails. The exact errors are:

-'ascii' codec can't encode character u'\xbf' in position 0: ordinal not in range(128).
-'ascii' codec can't encode character u'\u2026' in position 139: ordinal not in range(128).
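
For context, this kind of error appears whenever a Unicode string containing characters outside the ASCII range is encoded with the 'ascii' codec, whether explicitly or implicitly by a library. The following is a minimal sketch, independent of pycassa, that reproduces the failure and shows the explicit UTF-8 encoding succeeding. The sample text is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative text containing U+00BF and U+2026, the two characters
# named in the error messages above.
text = u"\xbfQu\xe9 pasa\u2026"

# Encoding to ASCII fails because these characters have no ASCII form.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print(exc)

# Encoding to UTF-8 succeeds and yields a byte string.
utf8_bytes = text.encode("utf-8")
```

If a value that was supposedly encoded still triggers the 'ascii' error, it is worth checking whether some other value in the same request is still a Unicode string.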

EDIT 1: The Cassandra field where this fails has no default validator type defined. Could that be the problem? What does Cassandra store a value as if no type is specified?

EDIT 2: This answers EDIT 1. I just noticed that the field where it fails has no default type defined and, per the docs, Cassandra then stores the value as a hex byte array (BytesType), whereas I am trying to insert a UTF-8 encoded string. Could this be the problem?

Traceback:

Traceback (most recent call last):
  File "/opt/socialflow/prod/api-reporting/api-reporting/CassFH/app/c.py", line 40, in send
    Mutator.send(self, *a, **kw)
  File "/usr/local/lib/python2.6/dist-packages/pycassa/batch.py", line 126, in send
    allow_retries=self.allow_retries)
  File "/usr/local/lib/python2.6/dist-packages/pycassa/pool.py", line 124, in new_f
    result = f(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/pycassa/cassandra/Cassandra.py", line 1005, in batch_mutate
    self.send_batch_mutate(mutation_map, consistency_level)
  File "/usr/local/lib/python2.6/dist-packages/pycassa/cassandra/Cassandra.py", line 1013, in send_batch_mutate
    args.write(self._oprot)
  File "/usr/local/lib/python2.6/dist-packages/pycassa/cassandra/Cassandra.py", line 5200, in write
    oprot.trans.write(fastbinary.encode_binary(self, (self.__class__, self.thrift_spec)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 0: ordinal not in range(128)
[2013-05-20 21:31:14,450] root CRITICAL:


Solution

This issue has been fixed. Here is what was going on:

  • The encoding problem existed in several column families, all for the same tweet text field, which can contain non-ASCII characters.
  • I used the pycassa Mutator to batch requests across multiple column families.
  • I had fixed the encoding for 2 of the column families but not for the other 3.
  • Because the pycassa batch is sent as a whole, a failure in any one column family made the insert fail for all of them.
  • I recommend reading the pycassa documentation and the Cassandra data model thoroughly (three times over).
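
One way to avoid missing a column family is to encode values in a single place before anything is added to the batch. The sketch below is an assumption about how that could look, not the code used in the original fix. The helper `encode_columns` and the column family names are hypothetical, and the pycassa calls themselves are left out so the example runs without a Cassandra connection.

```python
# -*- coding: utf-8 -*-
import sys

# Text type: `unicode` on Python 2, `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def encode_columns(columns, encoding="utf-8"):
    """Return a copy of `columns` with every text value encoded to bytes.

    Hypothetical helper: only values are encoded, column names are untouched.
    """
    encoded = {}
    for name, value in columns.items():
        if isinstance(value, text_type):
            value = value.encode(encoding)
        encoded[name] = value
    return encoded


# Hypothetical batch: one row per column family, all carrying the tweet text.
tweet = u"\xbfQu\xe9 tal? \u2026"
rows_by_cf = {
    "tweets_by_user": {"text": tweet, "user": u"alice"},
    "tweets_by_day": {"text": tweet},
    "tweets_by_tag": {"text": tweet},
}

# Encode every column family's columns through the same helper,
# so no column family is skipped by accident.
encoded_rows = dict(
    (cf_name, encode_columns(columns))
    for cf_name, columns in rows_by_cf.items()
)
```

Each entry of `encoded_rows` could then be passed to the batch insert for its column family, so every value reaching the batch is already a byte string.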

Hope this helps.

Licensed under: CC-BY-SA with attribution