Pergunta

I'm running dozens of map reduce jobs for a number of different purposes using disco. My data has grown enormous and I thought I would try using DDFS for a change rather than standard txt files.

I've followed the DISCO map/reduce example Counting Words as a map/reduce job, without to much difficulty and with the help of others, Reading JSON specific data into DISCO I've gotten past one of my latest problems.

I'm trying to read data in/out of ddfs to better chunk and distribute it but am having a bit of trouble.

Here's an example file: file.txt

{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "I'll call him back tomorrow I guess", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016843603968", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/305726905/FASHION-3.png", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1818996723/image_normal.jpg", "profile_sidebar_fill_color": "292727", "is_translator": false, "id": 113532729, "profile_text_color": "000000", "followers_count": 78, "protected": false, "location": "With My Niggas In Paris!", "default_profile_image": false, "listed_count": 0, "utc_offset": -21600, "statuses_count": 6733, "description": "Made in CHINA., Educated && Making My Own $$. Fear GOD && Put Him 1st. #TeamFollowBack #TeamiPhone\n", "friends_count": 74, "profile_link_color": "b03f3f", "profile_image_url": "http://a2.twimg.com/profile_images/1818996723/image_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "1f9199", "id_str": "113532729", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/305726905/FASHION-3.png", "name": "Bee'Jay", "lang": "en", "profile_background_tile": true, "favourites_count": 19, "screen_name": "OohMyBEEsNice", "url": "http://www.bitchimpaid.org", "created_at": "Fri Feb 12 03:32:54 +0000 2010", "contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "000000", "default_profile": false, "following": null}, "in_reply_to_screen_name": null, "retweet_count": 0, "geo": null, "id": 168931016843603968, "source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"}
{"favorited": false, "in_reply_to_user_id": 50940453, "contributors": null, "truncated": false, "text": "@LegaMrvica @MimozaBand makasi om artis :D kadoo kadoo", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": "168653037894770688", "coordinates": null, "in_reply_to_user_id_str": "50940453", "entities": {"user_mentions": [{"indices": [0, 11], "screen_name": "LegaMrvica", "id": 50940453, "name": "Lega_thePianis", "id_str": "50940453"}, {"indices": [12, 23], "screen_name": "MimozaBand", "id": 375128905, "name": "Mimoza", "id_str": "375128905"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": 168653037894770688, "id_str": "168931016868761600", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "profile_sidebar_fill_color": "DDFFCC", "is_translator": false, "id": 48293450, "profile_text_color": "333333", "followers_count": 182, "protected": false, "location": "\u00dcT: -6.906799,107.622383", "default_profile_image": false, "listed_count": 0, "utc_offset": -28800, "statuses_count": 3052, "description": "Fashion design maranatha '11 // traditional dancer (bali) at sanggar tampak siring & Natya Nataraja", "friends_count": 206, "profile_link_color": "0084B4", "profile_image_url": "http://a3.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "9AE4E8", "id_str": "48293450", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "name": "nana afiff", "lang": "en", "profile_background_tile": true, "favourites_count": 2, "screen_name": "hasnfebria", "url": null, "created_at": "Thu Jun 18 08:50:29 +0000 2009", "contributors_enabled": false, "time_zone": "Pacific Time (US & Canada)", "profile_sidebar_border_color": "BDDCAD", "default_profile": false, "following": null}, "in_reply_to_screen_name": "LegaMrvica", "retweet_count": 0, "geo": null, "id": 168931016868761600, "source": "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry\u00ae</a>"}
{"favorited": false, "in_reply_to_user_id": 27260086, "contributors": null, "truncated": false, "text": "@justinbieber u were born to be somebody, and u're super important in beliebers' life. thanks for all biebs. I love u. follow me? 84", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": "27260086", "entities": {"user_mentions": [{"indices": [0, 13], "screen_name": "justinbieber", "id": 27260086, "name": "Justin Bieber", "id_str": "27260086"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016856178688", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/416005864/Captura.JPG", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "profile_sidebar_fill_color": "f5e7f3", "is_translator": false, "id": 406750700, "profile_text_color": "333333", "followers_count": 1122, "protected": false, "location": "Adentro de una supra.", "default_profile_image": false, "listed_count": 0, "utc_offset": -14400, "statuses_count": 20966, "description": "Mi \u00eddolo es @justinbieber , si te gusta \u00a1genial!, si no, solo respetalo. El cambi\u00f3 mi vida completamente y mi sue\u00f1o es conocerlo #TrueBelieber . ", "friends_count": 1015, "profile_link_color": "9404b8", "profile_image_url": "http://a1.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "f9fcfa", "id_str": "406750700", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/416005864/Captura.JPG", "name": "neversaynever,right?", "lang": "es", "profile_background_tile": false, "favourites_count": 22, "screen_name": "True_Belieebers", "url": "http://www.wehavebieber-fever.tumblr.com", "created_at": "Mon Nov 07 04:17:40 +0000 2011", "contributors_enabled": false, "time_zone": "Santiago", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "in_reply_to_screen_name": "justinbieber", "retweet_count": 0, "geo": null, "id": 168931016856178688, "source": "<a href=\"http://yfrog.com\" rel=\"nofollow\">Yfrog</a>"}

I load it into DDFS with:

# ddfs chunk data:test1 ./file.txt
created: disco://localhost/ddfs/vol0/blob/44/file_txt-0$549-db27b-125e1

I test that the file is indeed loaded into ddfs with:

# ddfs xcat data:test1
{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "I'll call him back tomorrow I guess", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016843603968", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/305726905/FASHION-3.png", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1818996723/image_normal.jpg", "profile_sidebar_fill_color": "292727", "is_translator": false, "id": 113532729, "profile_text_color": "000000", "followers_count": 78, "protected": false, "location": "With My Niggas In Paris!", "default_profile_image": false, "listed_count": 0, "utc_offset": -21600, "statuses_count": 6733, "description": "Made in CHINA., Educated && Making My Own $$. Fear GOD && Put Him 1st. #TeamFollowBack #TeamiPhone\n", "friends_count": 74, "profile_link_color": "b03f3f", "profile_image_url": "http://a2.twimg.com/profile_images/1818996723/image_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "1f9199", "id_str": "113532729", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/305726905/FASHION-3.png", "name": "Bee'Jay", "lang": "en", "profile_background_tile": true, "favourites_count": 19, "screen_name": "OohMyBEEsNice", "url": "http://www.bitchimpaid.org", "created_at": "Fri Feb 12 03:32:54 +0000 2010", "contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "000000", "default_profile": false, "following": null}, "in_reply_to_screen_name": null, "retweet_count": 0, "geo": null, "id": 168931016843603968, "source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"}
{"favorited": false, "in_reply_to_user_id": 50940453, "contributors": null, "truncated": false, "text": "@LegaMrvica @MimozaBand makasi om artis :D kadoo kadoo", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": "168653037894770688", "coordinates": null, "in_reply_to_user_id_str": "50940453", "entities": {"user_mentions": [{"indices": [0, 11], "screen_name": "LegaMrvica", "id": 50940453, "name": "Lega_thePianis", "id_str": "50940453"}, {"indices": [12, 23], "screen_name": "MimozaBand", "id": 375128905, "name": "Mimoza", "id_str": "375128905"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": 168653037894770688, "id_str": "168931016868761600", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "profile_sidebar_fill_color": "DDFFCC", "is_translator": false, "id": 48293450, "profile_text_color": "333333", "followers_count": 182, "protected": false, "location": "\u00dcT: -6.906799,107.622383", "default_profile_image": false, "listed_count": 0, "utc_offset": -28800, "statuses_count": 3052, "description": "Fashion design maranatha '11 // traditional dancer (bali) at sanggar tampak siring & Natya Nataraja", "friends_count": 206, "profile_link_color": "0084B4", "profile_image_url": "http://a3.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "9AE4E8", "id_str": "48293450", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "name": "nana afiff", "lang": "en", "profile_background_tile": true, "favourites_count": 2, "screen_name": "hasnfebria", "url": null, "created_at": "Thu Jun 18 08:50:29 +0000 2009", "contributors_enabled": false, "time_zone": "Pacific Time (US & Canada)", "profile_sidebar_border_color": "BDDCAD", "default_profile": false, "following": null}, "in_reply_to_screen_name": "LegaMrvica", "retweet_count": 0, "geo": null, "id": 168931016868761600, "source": "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry\u00ae</a>"}
{"favorited": false, "in_reply_to_user_id": 27260086, "contributors": null, "truncated": false, "text": "@justinbieber u were born to be somebody, and u're super important in beliebers' life. thanks for all biebs. I love u. follow me? 84", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": "27260086", "entities": {"user_mentions": [{"indices": [0, 13], "screen_name": "justinbieber", "id": 27260086, "name": "Justin Bieber", "id_str": "27260086"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016856178688", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/416005864/Captura.JPG", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "profile_sidebar_fill_color": "f5e7f3", "is_translator": false, "id": 406750700, "profile_text_color": "333333", "followers_count": 1122, "protected": false, "location": "Adentro de una supra.", "default_profile_image": false, "listed_count": 0, "utc_offset": -14400, "statuses_count": 20966, "description": "Mi \u00eddolo es @justinbieber , si te gusta \u00a1genial!, si no, solo respetalo. El cambi\u00f3 mi vida completamente y mi sue\u00f1o es conocerlo #TrueBelieber . ", "friends_count": 1015, "profile_link_color": "9404b8", "profile_image_url": "http://a1.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "f9fcfa", "id_str": "406750700", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/416005864/Captura.JPG", "name": "neversaynever,right?", "lang": "es", "profile_background_tile": false, "favourites_count": 22, "screen_name": "True_Belieebers", "url": "http://www.wehavebieber-fever.tumblr.com", "created_at": "Mon Nov 07 04:17:40 +0000 2011", "contributors_enabled": false, "time_zone": "Santiago", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "in_reply_to_screen_name": "justinbieber", "retweet_count": 0, "geo": null, "id": 168931016856178688, "source": "<a href=\"http://yfrog.com\" rel=\"nofollow\">Yfrog</a>

At this point everything is great, I load up the script that resulted from a previous Stack Post:

from disco.core import Job, result_iterator
import gzip

def map(line, params):
    import unicodedata
    import json
    r = json.loads(line).get('text')
    s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
    for word in s.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["tag://data:test1"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print word, count

NOTE: That this script runs file if the input=["file.txt"], however when I run it with "tag://data:test1" I get the following error:

 # DISCO_EVENTS=1 python count_normal_words.py
Job@549:db30e:25bd8:
Status: [map] 0 waiting, 1 running, 0 done, 0 failed
2012/11/25 21:43:26  master     New job initialized!
2012/11/25 21:43:26  master     Starting job
2012/11/25 21:43:26  master     Starting map phase
2012/11/25 21:43:26  master     map:0 assigned to solice
2012/11/25 21:43:26  master     ERROR: Job failed: Worker at 'solice' died: Traceback (most recent call last):
  File "/home/DISCO/data/solice/01/Job@549:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 329, in main                                                                                                                     
    job.worker.start(task, job, **jobargs)                                                                                     
  File "/home/DISCO/data/solice/01/Job@549:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 290, in start                                                                                                                    
    self.run(task, job, **jobargs)                                                                                             
  File "/home/DISCO/data/solice/01/Job@549:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 286, in run                                                                                                                
    getattr(self, task.mode)(task, params)                                                                                     
  File "/home/DISCO/data/solice/01/Job@549:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 299, in map                                                                                                                
    for key, val in self['map'](entry, params):                                                                                
  File "count_normal_words.py", line 12, in map                                                                                
  File "/usr/lib64/python2.7/json/__init__.py", line 326, in loads                                                             
    return _default_decoder.decode(s)                                                                                          
  File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode                                                             
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())                                                                          
  File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode                                                         
    raise ValueError("No JSON object could be decoded")                                                                        
ValueError: No JSON object could be decoded                                                                                    

2012/11/25 21:43:26  master     WARN: Job killed
Status: [map] 1 waiting, 0 running, 0 done, 1 failed
Traceback (most recent call last):
  File "count_normal_words.py", line 28, in <module>
    for word, count in result_iterator(job.wait(show=True)):
  File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 348, in wait
    timeout, poll_interval * 1000)
  File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 309, in check_results
    raise JobError(Job(name=jobname, master=self), "Status %s" % status)
disco.error.JobError: Job Job@549:db30e:25bd8 failed: Status dead

The Error states: ValueError: No JSON object could be decoded. Again, this works fine using the text file as input but now DDFS.

Any ideas, I'm open to suggestions?

Foi útil?

Solução

It appears that the data was incomplete when loaded into DDFS. The very last line that was loaded did not have an } as the final character. As such the job failed. I guess I need to have a Try Continue to deal with non JSON lines.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top