com.datasalt.pangool.tuplemr
Class TupleMRBuilder

java.lang.Object
  extended by com.datasalt.pangool.tuplemr.TupleMRConfigBuilder
      extended by com.datasalt.pangool.tuplemr.TupleMRBuilder

public class TupleMRBuilder
extends TupleMRConfigBuilder

TupleMRBuilder creates Tuple-based Map-Reduce jobs.

One of the key concepts of Tuple-based Map-Reduce is that Hadoop Key-Value pairs are no longer used; they are replaced by tuples.
Tuples (see ITuple) are ordered lists of elements whose types are defined in a Schema. TupleMRBuilder contains several methods to define how tuples will be grouped and sorted, avoiding the complex task of implementing custom binary SortComparator, GroupComparator and TupleHashPartitioner classes.

A Tuple-based Map-Reduce job, in its simplest form, requires defining an intermediate Schema, the fields to group by, a TupleMapper, a TupleReducer, and the job's input and output.

See Also:
ITuple, Schema, TupleMapper, TupleReducer
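As an illustration, a minimal word-count style job might be wired up as below. This is a sketch, not a complete program: `Tokenizer` and `CountReducer` stand for user-written TupleMapper and TupleReducer implementations, and it assumes Pangool's `HadoopInputFormat`/`HadoopOutputFormat` wrappers around the standard Hadoop formats.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.datasalt.pangool.io.Fields;
import com.datasalt.pangool.io.Schema;
import com.datasalt.pangool.tuplemr.TupleMRBuilder;
import com.datasalt.pangool.tuplemr.mapred.lib.input.HadoopInputFormat;
import com.datasalt.pangool.tuplemr.mapred.lib.output.HadoopOutputFormat;

// The intermediate tuple schema: grouping and sorting are declared
// against these field names instead of custom comparators.
Schema schema = new Schema("counts", Fields.parse("word:string, count:int"));

TupleMRBuilder builder = new TupleMRBuilder(new Configuration(), "word-count");
builder.addIntermediateSchema(schema);
builder.setGroupByFields("word");              // replaces a custom GroupComparator
builder.addInput(new Path("in"),
    new HadoopInputFormat(TextInputFormat.class),
    new Tokenizer());                          // Tokenizer: hypothetical TupleMapper
builder.setTupleReducer(new CountReducer());   // CountReducer: hypothetical TupleReducer
builder.setOutput(new Path("out"),
    new HadoopOutputFormat(TextOutputFormat.class),
    Text.class, NullWritable.class);
builder.createJob().waitForCompletion(true);
builder.cleanUpInstanceFiles();
```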

Constructor Summary
TupleMRBuilder(org.apache.hadoop.conf.Configuration conf)
           
TupleMRBuilder(org.apache.hadoop.conf.Configuration conf, String name)
           
 
Method Summary
 void addInput(org.apache.hadoop.fs.Path path, org.apache.hadoop.mapreduce.InputFormat inputFormat, TupleMapper inputProcessor)
          Defines an input as in PangoolMultipleInputs
 void addInput(org.apache.hadoop.fs.Path path, org.apache.hadoop.mapreduce.InputFormat inputFormat, TupleMapper inputProcessor, Map<String,String> specificContext)
           
 void addNamedOutput(String namedOutput, org.apache.hadoop.mapreduce.OutputFormat outputFormat, Class keyClass, Class valueClass)
           
 void addNamedOutput(String namedOutput, org.apache.hadoop.mapreduce.OutputFormat outputFormat, Class keyClass, Class valueClass, Map<String,String> specificContext)
           
 void addNamedTupleOutput(String namedOutput, Schema outputSchema)
           
 void addTupleInput(org.apache.hadoop.fs.Path path, Schema targetSchema, TupleMapper<ITuple,org.apache.hadoop.io.NullWritable> tupleMapper)
          Adds an input file associated with a TupleFile.
 void addTupleInput(org.apache.hadoop.fs.Path path, TupleMapper<ITuple,org.apache.hadoop.io.NullWritable> tupleMapper)
          Adds an input file associated with a TupleFile.
 void cleanUpInstanceFiles()
Run this method after the Job has finished so that its instance files are properly cleaned up.
 org.apache.hadoop.mapreduce.Job createJob()
           
 org.apache.hadoop.conf.Configuration getConf()
           
 void setDefaultNamedOutput(org.apache.hadoop.mapreduce.OutputFormat outputFormat, Class keyClass, Class valueClass)
          Sets the default named output specs.
 void setDefaultNamedOutput(org.apache.hadoop.mapreduce.OutputFormat outputFormat, Class keyClass, Class valueClass, Map<String,String> specificContext)
          Sets the default named output specs.
 void setDefaultNamedOutput(Schema outputSchema)
          Sets the default named output (Tuple format) specs.
 void setJarByClass(Class<?> jarByClass)
          Sets the jar by class, as in Job.setJarByClass(Class)
 void setOutput(org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.mapreduce.OutputFormat outputFormat, Class<?> outputKeyClass, Class<?> outputValueClass)
           
 void setTupleCombiner(TupleReducer tupleCombiner)
           
 void setTupleOutput(org.apache.hadoop.fs.Path outputPath, Schema schema)
           
 void setTupleReducer(TupleReducer tupleReducer)
           
 
Methods inherited from class com.datasalt.pangool.tuplemr.TupleMRConfigBuilder
addIntermediateSchema, buildConf, initializeComparators, setCustomPartitionFields, setFieldAliases, setGroupByFields, setOrderBy, setRollupFrom, setSpecificOrderBy
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TupleMRBuilder

public TupleMRBuilder(org.apache.hadoop.conf.Configuration conf)

TupleMRBuilder

public TupleMRBuilder(org.apache.hadoop.conf.Configuration conf,
                      String name)
Parameters:
conf - Configuration instance
name - Job's name as in Job
Method Detail

getConf

public org.apache.hadoop.conf.Configuration getConf()

setJarByClass

public void setJarByClass(Class<?> jarByClass)
Sets the jar by class, as in Job.setJarByClass(Class)


addTupleInput

public void addTupleInput(org.apache.hadoop.fs.Path path,
                          TupleMapper<ITuple,org.apache.hadoop.io.NullWritable> tupleMapper)
Adds an input file associated with a TupleFile.


addTupleInput

public void addTupleInput(org.apache.hadoop.fs.Path path,
                          Schema targetSchema,
                          TupleMapper<ITuple,org.apache.hadoop.io.NullWritable> tupleMapper)
Adds an input file associated with a TupleFile.

A specific "target schema" is supplied; it must be backwards-compatible with the Schema stored in the TupleFile (new nullable fields may be added, and old fields that are no longer used may be omitted).
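For example (a sketch with illustrative field names; it assumes the `field:type?` syntax for declaring nullable fields in Fields.parse), a file written with an older schema can be read through a backwards-compatible target schema:

```java
// Schema the TupleFile was originally written with:
Schema fileSchema = new Schema("users",
    Fields.parse("id:int, name:string, legacy:string"));

// Backwards-compatible target schema: omits the unused "legacy" field and
// adds a new nullable "email" field, which will be null for old records.
Schema targetSchema = new Schema("users",
    Fields.parse("id:int, name:string, email:string?"));

// UserMapper is a hypothetical TupleMapper<ITuple, NullWritable>.
builder.addTupleInput(new Path("users-data"), targetSchema, new UserMapper());
```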


addNamedOutput

public void addNamedOutput(String namedOutput,
                           org.apache.hadoop.mapreduce.OutputFormat outputFormat,
                           Class keyClass,
                           Class valueClass)
                    throws TupleMRException
Throws:
TupleMRException

setDefaultNamedOutput

public void setDefaultNamedOutput(org.apache.hadoop.mapreduce.OutputFormat outputFormat,
                                  Class keyClass,
                                  Class valueClass)
                           throws TupleMRException
Sets the default named output specs. By using this method one can use an arbitrary number of named outputs without defining them beforehand.

Throws:
TupleMRException

setDefaultNamedOutput

public void setDefaultNamedOutput(org.apache.hadoop.mapreduce.OutputFormat outputFormat,
                                  Class keyClass,
                                  Class valueClass,
                                  Map<String,String> specificContext)
                           throws TupleMRException
Sets the default named output specs. By using this method one can use an arbitrary number of named outputs without defining them beforehand.

The specific (key, value) default context defined here will be applied to ALL named outputs.

Throws:
TupleMRException

setDefaultNamedOutput

public void setDefaultNamedOutput(Schema outputSchema)
                           throws TupleMRException
Sets the default named output (Tuple format) specs. By using this method one can use an arbitrary number of named outputs without defining them beforehand.

Throws:
TupleMRException
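The two styles can be contrasted as follows (a sketch; `statsSchema` and the output names are illustrative, and `HadoopOutputFormat` is Pangool's wrapper around the standard Hadoop output formats):

```java
// Pre-declared named outputs, one builder call per output:
builder.addNamedTupleOutput("stats", statsSchema); // statsSchema: hypothetical Schema
builder.addNamedOutput("raw",
    new HadoopOutputFormat(TextOutputFormat.class),
    Text.class, NullWritable.class);

// Alternatively, a default spec lets the job write to arbitrarily named
// outputs without declaring each one up front:
builder.setDefaultNamedOutput(
    new HadoopOutputFormat(TextOutputFormat.class),
    Text.class, NullWritable.class);
```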

addNamedOutput

public void addNamedOutput(String namedOutput,
                           org.apache.hadoop.mapreduce.OutputFormat outputFormat,
                           Class keyClass,
                           Class valueClass,
                           Map<String,String> specificContext)
                    throws TupleMRException
Throws:
TupleMRException

addNamedTupleOutput

public void addNamedTupleOutput(String namedOutput,
                                Schema outputSchema)
                         throws TupleMRException
Throws:
TupleMRException

addInput

public void addInput(org.apache.hadoop.fs.Path path,
                     org.apache.hadoop.mapreduce.InputFormat inputFormat,
                     TupleMapper inputProcessor)
Defines an input as in PangoolMultipleInputs

See Also:
PangoolMultipleInputs
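For instance, heterogeneous inputs can each be given their own InputFormat and TupleMapper (a sketch; `LogMapper` and `UserMapper` are hypothetical mappers, and every mapper must emit tuples conforming to one of the registered intermediate schemas):

```java
// Text logs, parsed line by line:
builder.addInput(new Path("logs"),
    new HadoopInputFormat(TextInputFormat.class),
    new LogMapper());

// Binary user records from sequence files:
builder.addInput(new Path("users"),
    new HadoopInputFormat(SequenceFileInputFormat.class),
    new UserMapper());
```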

addInput

public void addInput(org.apache.hadoop.fs.Path path,
                     org.apache.hadoop.mapreduce.InputFormat inputFormat,
                     TupleMapper inputProcessor,
                     Map<String,String> specificContext)

setTupleCombiner

public void setTupleCombiner(TupleReducer tupleCombiner)

setOutput

public void setOutput(org.apache.hadoop.fs.Path outputPath,
                      org.apache.hadoop.mapreduce.OutputFormat outputFormat,
                      Class<?> outputKeyClass,
                      Class<?> outputValueClass)

setTupleOutput

public void setTupleOutput(org.apache.hadoop.fs.Path outputPath,
                           Schema schema)

setTupleReducer

public void setTupleReducer(TupleReducer tupleReducer)

cleanUpInstanceFiles

public void cleanUpInstanceFiles()
                          throws IOException
Run this method after the Job has finished so that its instance files are properly cleaned up.

Throws:
IOException
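A typical job lifecycle therefore pairs createJob() with a cleanup call once the job completes (a sketch):

```java
TupleMRBuilder builder = new TupleMRBuilder(conf, "my-job");
// ... configure intermediate schemas, group-by fields, inputs and outputs ...
try {
  org.apache.hadoop.mapreduce.Job job = builder.createJob();
  job.waitForCompletion(true);
} finally {
  // Serialized instance files are staged on the filesystem when the job
  // is built; clean them up once the job has finished.
  builder.cleanUpInstanceFiles();
}
```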

createJob

public org.apache.hadoop.mapreduce.Job createJob()
                                          throws IOException,
                                                 TupleMRException
Throws:
IOException
TupleMRException


Copyright © 2014 Datasalt Systems S.L. All rights reserved.