Question

I'm following along with the Disco example for counting words in a file:

Counting Words as a map/reduce job

The example works for me as-is, but I want to read one specific field from a text file in which each line is a JSON string.

The file has lines like:

{"favorited": false, "in_reply_to_user_id": 306846931, "contributors": null, "truncated": false, "text": "@CataDuarte8 No! av\u00edseme cuando vaya ah salir para yo salir igual!", "created_at": "Wed Apr 04 20:25:37 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187636960632901632, "coordinates": null, "id": 187637067415683073, "entities": {"user_mentions": [{"indices": [0, 12], "id_str": "306846931", "id": 306846931, "name": "Catalina Ria\u00f1o!\u2661", "screen_name": "CataDuarte8"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187636960632901632", "id_str": "187637067415683073", "in_reply_to_screen_name": "CataDuarte8", "user": {"follow_request_sent": null, "profile_use_background_image": true, "id": 286402064, "description": "Cada quien RECOJE lo que SIEMBRA (:\r\n\u2551\u258c\u2502\u2551\u2502\u2551\u258c\u2502\u2588\u2551\u2502\u2551\u258c\u2502\u2551\u258c\u2551 ", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1858805061/ginri_normal.jpg", "profile_sidebar_fill_color": "525252", "is_translator": false, "geo_enabled": false, "profile_text_color": "ffffff", "followers_count": 620, "protected": false, "location": "", "default_profile_image": false, "id_str": "286402064", "utc_offset": -21600, "statuses_count": 16395, "profile_background_color": "000000", "friends_count": 537, "profile_link_color": "ff0000", "profile_image_url": "http://a0.twimg.com/profile_images/1858805061/ginri_normal.jpg", "notifications": null, "show_all_inline_media": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/419254765/Scan0004.jpg", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/419254765/Scan0004.jpg", "screen_name": "LadyRomeroo", "lang": "es", "profile_background_tile": true, "favourites_count": 136, "name": "Lady Romero \u2605", "url": "http://www.facebook.com/profile.php?id=1640385164", "created_at": "Fri Apr 22 23:04:41 +0000 2011", 
"contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "0a5b80", "default_profile": false, "following": null, "listed_count": 0}, "place": null, "retweet_count": 0, "geo": null, "in_reply_to_user_id_str": "306846931", "source": "web"}

I'm only interested in the value of the "text" key. In Python I can do:

import simplejson
f = open("file.json", "r")
for line in f:
    r = simplejson.loads(line).get('text')
    print r

which returns all the text field values like:

@_MuitoMais_  ´vcs são d  msm amei o pode ou ão pode e a entrevist com a @claudialeitte =)

This works fine. However, when I apply the same approach to the sample count_words.py example that comes with Disco, like so:

from disco.core import Job, result_iterator
import simplejson

def map(line, params):
    r = simplejson.loads(line).get('text')
    for word in r.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["/tmp/file.json"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print word, count

I get the following error:

# python test.py 
Job@549:b4c76:9cbb1:
Status: [map] 0 waiting, 1 running, 0 done, 0 failed
2012/11/24 02:01:10  master     New job initialized!
2012/11/24 02:01:10  master     Starting job
2012/11/24 02:01:10  master     Starting map phase
2012/11/24 02:01:10  master     map:0 assigned to comp1
2012/11/24 02:01:11  master     ERROR: Job failed: Worker at 'comp1' died: Traceback (most recent call last):
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 329, in main                               
    job.worker.start(task, job, **jobargs)                                                                                                                              
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 290, in start                              
    self.run(task, job, **jobargs)                                                                                                                                      
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 286, in run                          
    getattr(self, task.mode)(task, params)                                                                                                                              
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 302, in map                          
    part = str(self['partition'](key, self['partitions'], params))                                                                                                      
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/func.py", line 341, in default_partition              
    return hash(str(key)) % nr_partitions                                                                                                                               
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)                                                               

2012/11/24 02:01:11  master     WARN: Job killed
Status: [map] 1 waiting, 0 running, 0 done, 1 failed
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    for word, count in result_iterator(job.wait(show=True)):
  File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 348, in wait
    timeout, poll_interval * 1000)
  File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 309, in check_results
    raise JobError(Job(name=jobname, master=self), "Status %s" % status)
disco.error.JobError: Job Job@549:b4c76:9cbb1 failed: Status dead

It seems like this should be straightforward, but I'm obviously missing something.

Can anyone help?


Solution

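Your problem is in disco/worker/classic/func.py: as the traceback shows, default_partition calls str() on each key. In Python 2, str() on a unicode object encodes it with the ASCII codec, so it fails on any non-ASCII character: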

>>> str(u'\xb4')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)
>>>
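
The failing call can also be reproduced outside Disco. The sketch below imitates the `hash(str(key)) % nr_partitions` line from the traceback. `ascii_partition` and `can_partition` are hypothetical stand-ins, not Disco functions, and the ASCII encoding is spelled out explicitly (Python 2's `str()` does it implicitly with the default settings) so that the snippet behaves the same on Python 2 and Python 3.

```python
def ascii_partition(key, nr_partitions):
    # Hypothetical stand-in for the line in the traceback:
    #     hash(str(key)) % nr_partitions
    # Python 2's str() on a unicode object encodes it with the default
    # codec (ASCII); the explicit encode() below does the same thing.
    return hash(key.encode('ascii')) % nr_partitions


def can_partition(key, nr_partitions=4):
    # True if the key survives the ASCII encoding step, False otherwise.
    try:
        ascii_partition(key, nr_partitions)
        return True
    except UnicodeEncodeError:
        return False


print(can_partition(u'hello'))    # plain ASCII word
print(can_partition(u'\xb4vcs'))  # starts with U+00B4, as in the error message
```

The first call prints True and the second prints False, which matches the worker dying on the first key that contains a non-ASCII character.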

Since you are only counting words, you could convert your unicode data into ASCII byte strings with the unicodedata module. Note that this strips accents and drops any character that has no ASCII equivalent:

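import json
import unicodedata

with open('file.json') as f:
    for line in f:
        r = json.loads(line).get('text')
        # Decompose accented characters (NFD), then drop the non-ASCII parts.
        s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
        print r   # original text
        print s   # ASCII-only version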

Output:

@CataDuarte8 No! avíseme cuando vaya ah salir para yo salir igual!
@CataDuarte8 No! aviseme cuando vaya ah salir para yo salir igual!
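
To see what the conversion keeps and what it throws away, here is a minimal sketch of the same two steps wrapped in a function. The name `strip_accents` is made up for illustration, and the sample strings are taken from the JSON line in the question.

```python
import unicodedata


def strip_accents(text):
    # NFD splits a character such as u'\xed' (i with acute) into the base
    # letter 'i' followed by a combining accent (U+0301).
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' then silently drops every code point
    # above 127: the combining accents, but also symbols with no ASCII base.
    return decomposed.encode('ascii', 'ignore')


print(strip_accents(u'av\u00edseme'))        # accent removed, letters kept
print(strip_accents(u'Ria\u00f1o!\u2661'))   # the heart symbol is dropped entirely
```

The result is a byte string (`str` in Python 2, `bytes` in Python 3), so words that differ only by an accent end up counted together.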

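Applying this to your problem, rewrite your map() function as:

def map(line, params):
    import unicodedata
    r = simplejson.loads(line).get('text')
    # Decompose accented characters (NFD), then drop the non-ASCII parts
    # so that str(key) in the partition function no longer fails.
    s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
    for word in s.split():
        yield word, 1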
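
If the accents matter for the counts, one possible alternative (not part of the answer above) is to encode each word as UTF-8 instead of stripping it. In Python 2 the result is a byte string, which str() returns unchanged, so the partition step should no longer raise. `words_utf8` is a hypothetical helper; inside map() one would loop over its result and yield `word, 1`. Skipping lines without a "text" key is an assumption about the data, not something the question states.

```python
import json


def words_utf8(line):
    # Hypothetical helper: parse one JSON line and return the words of
    # its "text" field as UTF-8 encoded byte strings.
    text = json.loads(line).get('text')
    if text is None:
        # Assumption: some lines may lack a "text" field; skip them.
        return []
    return [word.encode('utf-8') for word in text.split()]


sample = '{"text": "No! av\\u00edseme cuando vaya"}'
print(words_utf8(sample))
```

With this approach the keys in the job output are UTF-8 bytes, so they need to be decoded with `.decode('utf-8')` before being displayed as text.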
Licensed under: CC-BY-SA with attribution