Hadoop MapReduce job on a file containing HTML tags

https://stackoverflow.com/questions/1842747

12-09-2019

Question

I have a bunch of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and my reducer in Python and used Hadoop streaming to run them.

Here is my mapper:

#!/usr/bin/env python

import sys
import re
import string

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.
    '''
    global flag
    in_text=in_text.lstrip()
    in_text=in_text.rstrip()
    in_text=in_text+"\n"

    if flag==True: 
        in_text="<"+in_text
        flag=False
    if re.search('^<',in_text)!=None and re.search('(>\n+)$', in_text)==None: 
        in_text=in_text+">"
        flag=True
    p = re.compile(r'<[^<]*?>')
    in_text=p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag=False
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
       # write the results to STDOUT (standard output);
       # what we output here will be the input for the
       # Reduce step, i.e. the input for reducer.py
       #
       # tab-delimited; the trivial word count is 1
       if word =='': continue
       for c in string.punctuation:
           word= word.replace(c,'')

       print '%s\t%s' % (word, 1)

Here is my reducer:

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(),
                           key=lambda(k,v):(v,k), reverse=True)

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s'% (word, count)

Whenever I pipe in a small sample string such as 'hello world hello hello world ...', I get the correct output: a ranked list. However, when I take a small HTML file and use cat to pipe the HTML into my mapper, I get the following error (input2 contains some HTML code):

rohanbk@hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
  File "/home/rohanbk/reducer.py", line 15, in <module>
    word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack

Can someone explain why I am getting this? Also, what is a good way to debug a MapReduce job?

Solution

You can reproduce the error with just:

echo "hello - world" | ./mapper.py  | sort | ./reducer.py

The problem is here:

if word =='': continue
for c in string.punctuation:
    word= word.replace(c,'')

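If word is a single punctuation mark, as happens with the input above once it has been split, the replacement turns it into an empty string. The mapper then prints a line consisting of only a tab and the count. The reducer's line.strip() removes that leading tab, so line.split('\t', 1) returns a single element and the unpacking fails. The fix is to move the empty-string check to after the replacement.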
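A minimal sketch of the reordered check, assuming the rest of mapper.py stays as it is. The helper names `clean_words` and `run_mapper` are hypothetical and do not appear in the original script, and the HTML tag removal is left out to keep the example short. The output is written with `out.write` so the sketch runs under both Python 2 and Python 3.

```python
import string
import sys


def clean_words(line):
    """Yield each word of `line` with punctuation removed, skipping empties.

    `clean_words` is a hypothetical helper, not part of the original mapper.
    """
    for word in line.split():
        for c in string.punctuation:
            word = word.replace(c, '')
        # Test for emptiness only after the punctuation has been stripped,
        # so a lone '-' is dropped instead of producing an empty key.
        if word == '':
            continue
        yield word


def run_mapper(stream=sys.stdin, out=sys.stdout):
    """Emit tab-delimited "word<TAB>1" pairs, as the original mapper does."""
    for line in stream:
        for word in clean_words(line.strip().lower()):
            out.write('%s\t%s\n' % (word, 1))
```

Calling `run_mapper()` at the bottom of the script would make it usable in the same pipeline as before, for example `echo "hello - world" | ./mapper.py | sort | ./reducer.py`.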

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow