Question

I am trying to partition an input file using AWS EMR. I use a streaming step that reads the file from stdin.
I want to split this file into two files based on the values of specific fields in each line, and store the resulting outputs in S3 for later use. I cannot find any documentation on how to achieve this using Python. Can you point me in the right direction? I'd greatly appreciate it.

Thank you


Solution

I'm not sure exactly what trouble you are having, but this article is a good starting point: http://aws.amazon.com/articles/2294

For your specific question: write a mapper that takes your file as input and turns each line into a key/value pair, where the key determines which output file the line belongs in. The reducer then only has to emit what it receives, so it is effectively a no-op.

Mapper

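#!/usr/bin/python
import sys

def get_my_key(line):
    # Placeholder: replace with your own rule for choosing the output.
    # Here the first tab-separated field is assumed to be the key.
    return line.split('\t', 1)[0]

def main():
    for line in sys.stdin:
        line = line.rstrip('\n')  # drop the newline so it is not emitted twice
        key = get_my_key(line)
        # Hadoop streaming treats everything before the first tab as the key.
        sys.stdout.write('{}\t{}\n'.format(key, line))

if __name__ == "__main__":
    main()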

Reducer

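#!/usr/bin/python
import sys

def main():
    # Identity reducer: pass each "key<TAB>line" record through unchanged.
    for line in sys.stdin:
        sys.stdout.write(line)  # line already ends with a newline

if __name__ == "__main__":
    main()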
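Before running this on a cluster, it can help to check locally what the mapper and reducer do together. The following is only a rough sketch of the map, sort, and reduce stages in plain Python, not how Hadoop itself is implemented. The comma-separated sample lines and the rule inside `get_my_key` are assumptions made up for illustration.

```python
import itertools

def get_my_key(line):
    # Hypothetical rule: the first comma-separated field decides the partition.
    return line.split(",", 1)[0]

def map_lines(lines):
    """Mimic the mapper: emit 'key<TAB>original line' for every input line."""
    for line in lines:
        line = line.rstrip("\n")
        yield "{}\t{}".format(get_my_key(line), line)

def shuffle_and_group(mapped):
    """Mimic the sort/shuffle stage: sort by key, then group records by key."""
    def key_of(record):
        return record.split("\t", 1)[0]
    ordered = sorted(mapped, key=key_of)
    for key, records in itertools.groupby(ordered, key=key_of):
        # Strip the key again so only the original lines remain.
        yield key, [record.split("\t", 1)[1] for record in records]

def partition(lines):
    """Return a dict mapping each key to the input lines that carry it."""
    return dict(shuffle_and_group(map_lines(lines)))

if __name__ == "__main__":
    sample = ["a,1\n", "b,2\n", "a,3\n"]
    for key, values in partition(sample).items():
        print(key, values)
```

Each key ends up with its own group of lines, which is the property the mapper relies on to separate the input.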

When you add this step, you specify the input location, the output location (an S3 bucket), and these two files as the mapper and the reducer.

Note that a streaming job can also be configured to run with no reducer at all, only the mapper task. I've included both scripts above because you seem to be new to this.
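As a sketch of what adding such a step could look like from Python, the helper below builds a step definition and hands it to boto3's `add_job_flow_steps`. It assumes an EMR release where streaming jobs are launched through `command-runner.jar` with the `hadoop-streaming` command; the function names, bucket paths, and cluster ID are hypothetical. Leaving out the reducer requests a map-only job through the Hadoop property `mapreduce.job.reduces=0`.

```python
def build_streaming_step(name, mapper_uri, input_uri, output_uri, reducer_uri=None):
    """Return a step definition dict for a Hadoop streaming job on EMR."""
    files = [mapper_uri] + ([reducer_uri] if reducer_uri else [])
    args = ["hadoop-streaming"]
    if reducer_uri is None:
        # Map-only job: ask Hadoop for zero reduce tasks.
        # Generic options such as -D go before the streaming options.
        args += ["-D", "mapreduce.job.reduces=0"]
    # -files ships the scripts to the nodes; they are then run by file name.
    args += ["-files", ",".join(files),
             "-mapper", mapper_uri.rsplit("/", 1)[-1]]
    if reducer_uri is not None:
        args += ["-reducer", reducer_uri.rsplit("/", 1)[-1]]
    args += ["-input", input_uri, "-output", output_uri]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

def submit(cluster_id, step):
    """Send the step to a running cluster (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])

if __name__ == "__main__":
    # Hypothetical S3 locations; replace them with your own.
    step = build_streaming_step(
        "split-by-key",
        "s3://my-bucket/code/mapper.py",
        "s3://my-bucket/input/",
        "s3://my-bucket/output/",
        reducer_uri="s3://my-bucket/code/reducer.py",
    )
    print(step["HadoopJarStep"]["Args"])
```

The output location must not already exist when the step runs, otherwise Hadoop refuses to start the job.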

License: CC-BY-SA with attribution
Not affiliated with StackOverflow