HadoP MapReduce وظيفة في الملف يحتوي على علامات HTML

https://stackoverflow.com/questions/1842747

12-09-2019
|

سؤال

لدي مجموعة من ملفات HTML الكبيرة وأريد تشغيل وظيفة Hadoop MapReduce عليها للعثور على الكلمات الأكثر استخداما. كتبت كل من مخططي ومخديا في بيثون واستخدمت هادوب بث لتشغيلها.

هنا هو بلدي mapper:

#!/usr/bin/env python

import sys
import re
import string

def remove_html_tags(in_text):
'''
Remove any HTML tags that are found. 

'''
    global flag
    in_text=in_text.lstrip()
    in_text=in_text.rstrip()
    in_text=in_text+"\n"

    if flag==True: 
        in_text="<"+in_text
        flag=False
    if re.search('^<',in_text)!=None and re.search('(>\n+)$', in_text)==None: 
        in_text=in_text+">"
        flag=True
    p = re.compile(r'<[^<]*?>')
    in_text=p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag=False
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTMl tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
       # write the results to STDOUT (standard output);
       # what we output here will be the input for the
       # Reduce step, i.e. the input for reducer.py
       #
       # tab-delimited; the trivial word count is 1
       if word =='': continue
       for c in string.punctuation:
           word= word.replace(c,'')

       print '%s\t%s' % (word, 1)

هنا مخفض بلدي:

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(), 
key=lambda(k,v):(v,k),reverse=True)

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s'% (word, count)

كلما أنا فقط أنابيب عينة صغيرة سلسلة صغيرة مثل "Hello World Hello Hello World ..." أحصل على الإخراج المناسب لقائمة المرتبة. ومع ذلك، عندما أحاول استخدام ملف HTML صغير، وحاول استخدام Cat إلى توجيه HTML إلى MAPPER الخاص بي، أحصل على الخطأ التالي (يحتوي Input2 على بعض كود HTML):

rohanbk@hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
  File "/home/rohanbk/reducer.py", line 15, in <module>
    word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack

هل يمكن لأي شخص أن يفسر لماذا أحصل على هذا؟ أيضا، ما هو وسيلة جيدة لتصحيح برنامج عمل ذاكرة؟

المحلول

يمكنك إعادة إنتاج الخلل حتى مع فقط:

echo "hello - world" | ./mapper.py  | sort | ./reducer.py

القضية هنا:

if word =='': continue
for c in string.punctuation:
           word= word.replace(c,'')

إذا word هي علامة علامات الترقيم واحدة، كما سيكون الحال بالنسبة للإدخال أعلاه (بعد الانقسام)، ثم يتم تحويله إلى سلسلة فارغة. لذلك، فقط حرك الشيك للحصول على سلسلة فارغة بعد الاستبدال.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow