I'm facing an issue with chaining multiple MapReduce jobs.
The current scenario works as follows: the application processes 2 data sources, and each of them goes to a separate mapper via "MultipleInputs", reading from 2 different directories.
The first job reads the inputs in 2 mappers, processes the data, and writes it to 2 different directories using "MultipleOutputs".
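To make the setup concrete, the job 1 driver looks roughly like this; it is only a sketch, and the class names, paths, and key/value types are placeholders, not my exact code:

// Job 1 driver sketch: two input directories feed two different mappers,
// and two named outputs are registered for the mappers to write to.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Job1Driver {
    public static void main(String[] args) throws Exception {
        Job job1 = Job.getInstance(new Configuration(), "job1");
        job1.setJarByClass(Job1Driver.class);

        // each data source goes to its own mapper
        MultipleInputs.addInputPath(job1, new Path("in/datasource1"),
                TextInputFormat.class, Mapper1_1.class);
        MultipleInputs.addInputPath(job1, new Path("in/datasource2"),
                TextInputFormat.class, Mapper1_2.class);

        // named outputs that the mappers write through MultipleOutputs
        MultipleOutputs.addNamedOutput(job1, "ds1", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job1, "ds2", TextOutputFormat.class, Text.class, Text.class);

        job1.setReducerClass(Reducer1.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job1, new Path("out/job1"));
        System.exit(job1.waitForCompletion(true) ? 0 : 1);
    }
}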
Now, the second job should work on the output of the first job, using the same task IDs that the first job generated.
For example:
Job 1: 2 different mappers, 1 reducer.
Mapper1_1 reads the datasource1 directory; 2 tasks are created to process it, and they output the intermediate files ds1/ds1-m-00000 and ds1/ds1-m-00001.
Mapper1_2 reads the datasource2 directory; 1 task is created to process it, and it outputs the intermediate file ds2/ds2-m-00002.
Reducer1 makes calculations and outputs statistics.
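The intermediate file names come from MultipleOutputs: the mapper writes through a base output path such as "ds1/ds1", and the framework appends the -m-XXXXX task number. A rough sketch of Mapper1_1, with placeholder key/value types:

// Mapper1_1 sketch: writes records to the "ds1" named output; the
// baseOutputPath "ds1/ds1" produces files like ds1/ds1-m-00000.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class Mapper1_1 extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... process the datasource1 record (placeholder output values) ...
        mos.write("ds1", new Text("k"), new Text("v"), "ds1/ds1");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // must close, or the named-output files stay empty
    }
}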
Job 2: 2 different mappers, 1 reducer.
Mapper2_1 reads the ds1 directory; 2 tasks are created to process the 2 intermediate files.
Mapper2_2 reads the ds2 directory; 1 task is created to process the intermediate file.
Reducer2 makes calculations and outputs the final result.
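Job 2 is then pointed at the directories job 1 produced, roughly like this (again, the paths and class names are illustrative):

// Job 2 driver sketch: the named-output directories written by job 1
// become the two inputs of job 2.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Job2Driver {
    public static void main(String[] args) throws Exception {
        Job job2 = Job.getInstance(new Configuration(), "job2");
        job2.setJarByClass(Job2Driver.class);

        // job 1's intermediate directories feed job 2's mappers
        MultipleInputs.addInputPath(job2, new Path("out/job1/ds1"),
                TextInputFormat.class, Mapper2_1.class);
        MultipleInputs.addInputPath(job2, new Path("out/job1/ds2"),
                TextInputFormat.class, Mapper2_2.class);

        job2.setReducerClass(Reducer2.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job2, new Path("out/job2"));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}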
For certain reasons, the same task IDs must be used in the second job as the ones generated in the first job. In practice, task IDs are assigned unpredictably, so the IDs in the second job may or may not match those generated in the first job, and the processing only completes correctly when they match.
Is there any way to control task ID generation so that the same task IDs are produced in both jobs, or some other method to generate a unique number that identifies the same file in both jobs?
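For illustration, the closest thing I can think of to a stable identifier is recovering the originating file name inside the job-2 mapper, since that name carries job 1's task number. This is only a sketch, and it assumes the input split is a plain FileSplit; with MultipleInputs the framework may wrap the split, so the cast may not always hold:

// Sketch: derive an identifier from the input file name rather than
// from job 2's own (unrelated) task IDs.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Mapper2_1 extends Mapper<LongWritable, Text, Text, Text> {
    private String sourceFile = "unknown";

    @Override
    protected void setup(Context context) {
        InputSplit split = context.getInputSplit();
        if (split instanceof FileSplit) {
            // e.g. "ds1-m-00001": the trailing number is job 1's task number
            sourceFile = ((FileSplit) split).getPath().getName();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // sourceFile tells us which job-1 output file this record came from
        context.write(new Text(sourceFile), value);
    }
}

I'm not sure this is reliable across both jobs, hence the question.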