com.datasalt.pangool.tuplemr.mapred.lib.input
Class CascadingTupleInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<ITuple,org.apache.hadoop.io.NullWritable>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<ITuple,org.apache.hadoop.io.NullWritable>
          extended by com.datasalt.pangool.tuplemr.mapred.lib.input.CascadingTupleInputFormat
All Implemented Interfaces:
Serializable

public class CascadingTupleInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<ITuple,org.apache.hadoop.io.NullWritable>
implements Serializable

A wrapper around a SequenceFile containing Cascading Tuples that implements a Pangool-friendly InputFormat. The Schema is discovered lazily from the first Cascading Tuple seen. The type correspondence is: Integer - INT, Long - LONG, Float - FLOAT, Double - DOUBLE, String - STRING, Boolean - BOOLEAN.

Any other type is unrecognized and an IOException is thrown.
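
For illustration, a minimal sketch of the discovery, assuming hypothetical field names "url" and "count" were passed to the constructor and using Pangool's Fields.parse helper from com.datasalt.pangool.io:

import com.datasalt.pangool.io.Fields;
import com.datasalt.pangool.io.Schema;

public class SchemaDiscoverySketch {
  public static void main(String[] args) {
    // Given a first Cascading Tuple such as ("http://example.com", 42),
    // the lazily discovered Schema would be equivalent to:
    Schema discovered = new Schema("logs", Fields.parse("url:string, count:int"));
    System.out.println(discovered);
  }
}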

Column names must be provided to the InputFormat because Cascading does not save them anywhere. The schemaName is used to instantiate a Pangool Schema.

Note that for this to work, Cascading serialization must be enabled in the Hadoop Configuration. You can do this by calling the static method setSerializations(Configuration).
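
For example, a typical setup might look like the following sketch; the schema name ("logs") and field names are illustrative:

import org.apache.hadoop.conf.Configuration;

import com.datasalt.pangool.tuplemr.mapred.lib.input.CascadingTupleInputFormat;

public class CascadingInputSetup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Register Cascading's serializations in the Configuration so the
    // Cascading Tuples inside the SequenceFile can be deserialized.
    CascadingTupleInputFormat.setSerializations(conf);

    // Column names are supplied explicitly because Cascading does not
    // persist them; "logs" names the resulting Pangool Schema.
    CascadingTupleInputFormat inputFormat =
        new CascadingTupleInputFormat("logs", "url", "date", "count");
  }
}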

See Also:
Serialized Form

Constructor Summary
CascadingTupleInputFormat(String schemaName, String... fieldNames)
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader<ITuple,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext ctx)
           
static void setSerializations(org.apache.hadoop.conf.Configuration conf)
          Like Cascading's TupleSerialization.setSerializations(), but accepting a Hadoop Configuration rather than a JobConf.
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, getSplits, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CascadingTupleInputFormat

public CascadingTupleInputFormat(String schemaName,
                                 String... fieldNames)
Parameters:
schemaName - the name used to instantiate the Pangool Schema
fieldNames - the column names, in the order they appear in each Cascading Tuple
Method Detail

setSerializations

public static void setSerializations(org.apache.hadoop.conf.Configuration conf)
Like Cascading's TupleSerialization.setSerializations(), but accepting a Hadoop Configuration rather than a JobConf.


createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<ITuple,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                             org.apache.hadoop.mapreduce.TaskAttemptContext ctx)
                                                                                                      throws IOException,
                                                                                                             InterruptedException
Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ITuple,org.apache.hadoop.io.NullWritable>
Throws:
IOException
InterruptedException


Copyright © 2014 Datasalt Systems S.L. All rights reserved.