Why is my sequence file read twice in my Hadoop Mapper class?

StackOverflow https://stackoverflow.com/questions/9514710

Question

I have a SequenceFile with 1264 records. Each key is unique to each record. My problem is that my mapper seems to read this file twice, or the file is being read twice. As a sanity check, I wrote a small utility class to read the SequenceFile and, indeed, there are only 1264 records (i.e. using SequenceFile.Reader).
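
For reference, a sanity check of that kind can be written as a small stand-alone reader; the sketch below is mine, not the asker's actual utility (the class name and the command-line argument are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Reads a single SequenceFile directly (outside of MapReduce) and counts
// its records, to verify how many key/value pairs it really contains.
public class SequenceFileCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. the part-r-00000 file under the input directory
        FileSystem fs = path.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        // Instantiate key/value objects of whatever types the file declares.
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        long count = 0;
        while (reader.next(key, value)) {
            count++;
        }
        reader.close();
        System.out.println("records = " + count);
    }
}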

In my reducer, I should get only 1 record per Iterable; however, when I iterate over the Iterable (Iterator), I get 2 records per key (always 2 per key, never 1 or 3 or anything else).

The logging output of my job is below. I'm not sure why, but why is "Total input paths to process" 2? When I run my job, I tried both -Dmapred.input.dir=/data and -Dmapred.input.dir=/data/part-r-00000, but either way the total paths to process is 2.
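
One way to see exactly which paths a job is about to process, a debugging sketch rather than something from the original post, is to print what FileInputFormat has recorded in the job configuration before submission:

// Hypothetical debugging snippet, placed in run() just before
// job.waitForCompletion(true): getInputPaths returns every path currently
// stored under mapred.input.dir for this job.
for (Path p : FileInputFormat.getInputPaths(job)) {
    System.out.println("input path: " + p);
}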

Any ideas are appreciated.

12/03/01 05:28:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/01 05:28:30 INFO input.FileInputFormat: Total input paths to process : 2
12/03/01 05:28:31 INFO mapred.JobClient: Running job: job_local_0001
12/03/01 05:28:31 INFO input.FileInputFormat: Total input paths to process : 2
12/03/01 05:28:31 INFO mapred.MapTask: io.sort.mb = 100
12/03/01 05:28:31 INFO mapred.MapTask: data buffer = 79691776/99614720
12/03/01 05:28:31 INFO mapred.MapTask: record buffer = 262144/327680
12/03/01 05:28:31 INFO mapred.MapTask: Starting flush of map output
12/03/01 05:28:31 INFO mapred.MapTask: Finished spill 0
12/03/01 05:28:31 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/03/01 05:28:31 INFO mapred.LocalJobRunner:
12/03/01 05:28:31 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
12/03/01 05:28:31 INFO mapred.MapTask: io.sort.mb = 100
12/03/01 05:28:31 INFO mapred.MapTask: data buffer = 79691776/99614720
12/03/01 05:28:31 INFO mapred.MapTask: record buffer = 262144/327680
12/03/01 05:28:31 INFO mapred.MapTask: Starting flush of map output
12/03/01 05:28:31 INFO mapred.MapTask: Finished spill 0
12/03/01 05:28:31 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
12/03/01 05:28:31 INFO mapred.LocalJobRunner:
12/03/01 05:28:31 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
12/03/01 05:28:31 INFO mapred.LocalJobRunner:
12/03/01 05:28:31 INFO mapred.Merger: Merging 2 sorted segments
12/03/01 05:28:31 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 307310 bytes
12/03/01 05:28:31 INFO mapred.LocalJobRunner:
12/03/01 05:28:32 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/03/01 05:28:32 INFO mapred.LocalJobRunner:
12/03/01 05:28:32 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/03/01 05:28:32 INFO mapred.JobClient:  map 100% reduce 0%
12/03/01 05:28:32 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to results
12/03/01 05:28:32 INFO mapred.LocalJobRunner: reduce > reduce
12/03/01 05:28:32 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
12/03/01 05:28:33 INFO mapred.JobClient:  map 100% reduce 100%
12/03/01 05:28:33 INFO mapred.JobClient: Job complete: job_local_0001
12/03/01 05:28:33 INFO mapred.JobClient: Counters: 12
12/03/01 05:28:33 INFO mapred.JobClient:   FileSystemCounters
12/03/01 05:28:33 INFO mapred.JobClient:     FILE_BYTES_READ=1320214
12/03/01 05:28:33 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1275041
12/03/01 05:28:33 INFO mapred.JobClient:   Map-Reduce Framework
12/03/01 05:28:33 INFO mapred.JobClient:     Reduce input groups=1264
12/03/01 05:28:33 INFO mapred.JobClient:     Combine output records=0
12/03/01 05:28:33 INFO mapred.JobClient:     Map input records=2528
12/03/01 05:28:33 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/03/01 05:28:33 INFO mapred.JobClient:     Reduce output records=2528
12/03/01 05:28:33 INFO mapred.JobClient:     Spilled Records=5056
12/03/01 05:28:33 INFO mapred.JobClient:     Map output bytes=301472
12/03/01 05:28:33 INFO mapred.JobClient:     Combine input records=0
12/03/01 05:28:33 INFO mapred.JobClient:     Map output records=2528
12/03/01 05:28:33 INFO mapred.JobClient:     Reduce input records=2528

My Mapper class is very simple. It reads in a text file. For each line, it appends an "m" to the line.

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final Log _log = LogFactory.getLog(MyMapper.class);

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Append "m" to the input line and emit it under the same key.
        String s = (new StringBuilder()).append(value.toString()).append("m").toString();
        context.write(key, new Text(s));
        _log.debug(key.toString() + " => " + s);
    }
}

My Reducer class is also very simple. It simply appends an "r" to the line.

import java.io.IOException;
import java.util.Iterator;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    private static final Log _log = LogFactory.getLog(MyReducer.class);

    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Append "r" to every value seen for this key and emit each one.
        for (Iterator<Text> it = values.iterator(); it.hasNext();) {
            Text txt = it.next();
            String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
            context.write(key, new Text(s));
            _log.debug(key.toString() + " => " + s);
        }
    }
}

My job class is as follows.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new MyJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Input and output directories are read back from the -D generic options.
        Path input = new Path(conf.get("mapred.input.dir"));
        Path output = new Path(conf.get("mapred.output.dir"));

        System.out.println("input = " + input);
        System.out.println("output = " + output);

        Job job = new Job(conf, "dummy job");
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        job.setJarByClass(MyJob.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }
}

My input data looks like the following.

T, T
T, T
T, T
F, F
F, F
F, F
F, F
T, F
F, T

After running my job, I get output like the following.

0   T, Tmr
0   T, Tmr
6   T, Tmr
6   T, Tmr
12  T, Tmr
12  T, Tmr
18  F, Fmr
18  F, Fmr
24  F, Fmr
24  F, Fmr
30  F, Fmr
30  F, Fmr
36  F, Fmr
36  F, Fmr
42  T, Fmr
42  T, Fmr
48  F, Tmr
48  F, Tmr

Did I do something wrong in setting up my job? I tried the following way of running my job, and with this approach the file is read only once. Why is that? The System.out.println values for the input and output paths are identical in both versions! Help?

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyJob2 {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: MyJob2 <in> <out>");
            System.exit(2);
        }

        // Input and output directories come from the command-line arguments,
        // not from -D properties already present in the configuration.
        String sInput = args[0];
        String sOutput = args[1];

        Path input = new Path(sInput);
        Path output = new Path(sOutput);

        System.out.println("input = " + input);
        System.out.println("output = " + output);

        Job job = new Job(conf, "dummy job");
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        job.setJarByClass(MyJob2.class);

        int result = job.waitForCompletion(true) ? 0 : 1;
        System.exit(result);
    }
}


Solution

I got help from the Hadoop mailing list. My problem was with the line below.

FileInputFormat.addInputPath(job, input);

This line simply appends the input path back into the configuration. After commenting this line out, the input file is now read only once. In fact, I also commented out the other line,

FileOutputFormat.setOutputPath(job, output);

and everything still works.
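
Putting it together: the Configuration handed to the Job already contains mapred.input.dir from the -D generic option, and FileInputFormat.addInputPath appends to that property rather than replacing it, so the same file was listed twice and processed by two map tasks. Below is a minimal sketch of a corrected run() method, assuming the input and output directories are always supplied via -Dmapred.input.dir and -Dmapred.output.dir:

@Override
public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    // mapred.input.dir and mapred.output.dir are already set through the
    // -D generic options parsed by ToolRunner, so they are not added again.
    Job job = new Job(conf, "dummy job");
    job.setJarByClass(MyJob.class);

    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // FileInputFormat.addInputPath(job, input);    // would append the input
    //                                              // path a second time
    // FileOutputFormat.setOutputPath(job, output); // redundant: the output
    //                                              // dir is already in conf

    return job.waitForCompletion(true) ? 0 : 1;
}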

Other tips

I've had a similar problem, but for a different reason: Linux apparently created a hidden copy of my input file (~input.txt), so that's a second way of getting this error.
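
If a stray backup copy is the cause, one defensive option (my own sketch, not part of the tip above; the class name is hypothetical) is to register a PathFilter so that backup files such as ~input.txt are never picked up as input:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: reject editor/OS backup copies whose names start
// with '~', so only the intended input files are processed.
public class NoBackupFilesFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return !path.getName().startsWith("~");
    }
}

// In the job setup:
// FileInputFormat.setInputPathFilter(job, NoBackupFilesFilter.class);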

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow