com.datasalt.pangool.tuplemr.mapred.lib.input
Class CascadingTupleInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<ITuple,org.apache.hadoop.io.NullWritable>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<ITuple,org.apache.hadoop.io.NullWritable>
com.datasalt.pangool.tuplemr.mapred.lib.input.CascadingTupleInputFormat
- All Implemented Interfaces:
- Serializable
public class CascadingTupleInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<ITuple,org.apache.hadoop.io.NullWritable>
implements Serializable
A wrapper around a SequenceFile containing Cascading Tuples that exposes them through a Pangool-friendly InputFormat.
The Schema is discovered lazily from the first Cascading Tuple that is read. The type correspondence is:
- Integer - INT
- Long - LONG
- Float - FLOAT
- Double - DOUBLE
- String - STRING
- Short - INT
- Boolean - BOOLEAN
Any other type is unrecognized, and an IOException is thrown.
Column names must be provided to the InputFormat because Cascading doesn't store them anywhere.
The schemaName is used to instantiate a Pangool Schema.
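For illustration: given the field names "id", "name" and "score", and a first Tuple holding (1, "alice", 3.5), the discovered Schema would be equivalent to the following sketch (the schema and field names are hypothetical; Schema and Fields are Pangool classes):

    // Integer -> INT, String -> STRING, Double -> DOUBLE, per the table above.
    Schema schema = new Schema("logs", Fields.parse("id:int, name:string, score:double"));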
Note that, for this to work, Cascading serialization must be enabled in the Hadoop Configuration. You can do this by calling the static method setSerializations(Configuration).
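For example, a minimal sketch, assuming the Configuration instance is the one later handed to the job:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Registers Cascading's tuple serialization in "io.serializations" so the
    // SequenceFiles written by Cascading can be deserialized.
    CascadingTupleInputFormat.setSerializations(conf);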
- See Also:
- Serialized Form
Method Summary
org.apache.hadoop.mapreduce.RecordReader<ITuple,org.apache.hadoop.io.NullWritable>
createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                   org.apache.hadoop.mapreduce.TaskAttemptContext ctx)
static void
setSerializations(org.apache.hadoop.conf.Configuration conf)
Like Cascading's TupleSerialization.setSerializations(), but accepting a Hadoop Configuration rather than a JobConf.
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, getSplits, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CascadingTupleInputFormat
public CascadingTupleInputFormat(String schemaName,
String... fieldNames)
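A hedged sketch of typical usage with Pangool's TupleMRBuilder; the schema name, column names, and input path are hypothetical, and one column name is given per position in the Cascading Tuple:

    Configuration conf = new Configuration();
    CascadingTupleInputFormat.setSerializations(conf);

    TupleMRBuilder builder = new TupleMRBuilder(conf, "read-cascading-output");
    builder.addInput(new Path("cascading-out"),
        new CascadingTupleInputFormat("logs", "day", "month", "count"),
        new TupleMapper<ITuple, NullWritable>() {
          @Override
          public void map(ITuple tuple, NullWritable n, TupleMRContext context,
              Collector collector) throws IOException, InterruptedException {
            // The incoming tuple already carries the lazily discovered Schema.
            collector.write(tuple);
          }
        });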
setSerializations
public static void setSerializations(org.apache.hadoop.conf.Configuration conf)
- Like Cascading's TupleSerialization.setSerializations(), but accepting a Hadoop Configuration rather than a JobConf.
createRecordReader
public org.apache.hadoop.mapreduce.RecordReader<ITuple,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.TaskAttemptContext ctx)
throws IOException,
InterruptedException
- Specified by:
createRecordReader
in class org.apache.hadoop.mapreduce.InputFormat<ITuple,org.apache.hadoop.io.NullWritable>
- Throws:
IOException
InterruptedException