com.datasalt.pangool.tuplemr.mapred.lib.input
Class TupleTextInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<ITuple,org.apache.hadoop.io.NullWritable>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<ITuple,org.apache.hadoop.io.NullWritable>
          extended by com.datasalt.pangool.tuplemr.mapred.lib.input.TupleTextInputFormat
All Implemented Interfaces:
Serializable

public class TupleTextInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<ITuple,org.apache.hadoop.io.NullWritable>
implements Serializable

A special input format that reads text lines into ITuple. It supports CSV-like semantics such as a separator character, a quote character, and an escape character. It uses OpenCSV (http://opencsv.sourceforge.net/) underneath.

See Also:
Serialized Form

Nested Class Summary
static class TupleTextInputFormat.FieldSelector
          When provided, used as a mapping between the text-file columns and the provided Schema.
static class TupleTextInputFormat.TupleTextInputReader
           
 
Field Summary
static char NO_ESCAPE_CHARACTER
           
static String NO_NULL_STRING
           
static char NO_QUOTE_CHARACTER
           
static char NO_SEPARATOR_CHARACTER
           
 
Constructor Summary
TupleTextInputFormat(Schema schema, boolean hasHeader, boolean strictQuotes, Character separator, Character quoteCharacter, Character escapeCharacter, TupleTextInputFormat.FieldSelector fieldSelector, String nullString)
          Character-separated files reader.
TupleTextInputFormat(Schema schema, int[] fields, boolean hasHeader, String nullString)
          Fixed-width fields file reader.
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader<ITuple,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit iS, org.apache.hadoop.mapreduce.TaskAttemptContext context)
           
 char getEscapeCharacter()
           
 int[] getFixedWidthFieldsPositions()
           
 String getNullString()
           
 char getQuoteCharacter()
           
 Schema getSchema()
           
 char getSeparatorCharacter()
           
 com.datasalt.pangool.tuplemr.mapred.lib.input.TupleTextInputFormat.InputType getType()
           
 boolean isHasHeader()
           
protected  boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path file)
           
 boolean isStrictQuotes()
           
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, getSplits, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NO_QUOTE_CHARACTER

public static final char NO_QUOTE_CHARACTER
See Also:
Constant Field Values

NO_ESCAPE_CHARACTER

public static final char NO_ESCAPE_CHARACTER
See Also:
Constant Field Values

NO_SEPARATOR_CHARACTER

public static final char NO_SEPARATOR_CHARACTER
See Also:
Constant Field Values

NO_NULL_STRING

public static final String NO_NULL_STRING
Constructor Detail

TupleTextInputFormat

public TupleTextInputFormat(Schema schema,
                            boolean hasHeader,
                            boolean strictQuotes,
                            Character separator,
                            Character quoteCharacter,
                            Character escapeCharacter,
                            TupleTextInputFormat.FieldSelector fieldSelector,
                            String nullString)
Character-separated files reader. You must specify the Schema that will be used for the Tuples being read, so that automatic type conversions (i.e. parsing) can be applied, as well as the CSV semantics (if any). Use NO_ESCAPE_CHARACTER and NO_QUOTE_CHARACTER if the input files don't have such semantics. If hasHeader is true, the first line of each file will be skipped.

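The separator/quote/escape semantics described above can be sketched in plain Java. This is an illustrative simplification of what OpenCSV handles internally, not Pangool's actual reader; the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of separator/quote/escape/nullString handling,
// loosely mirroring the semantics the reader delegates to OpenCSV.
public class CsvSemanticsSketch {

    static List<String> split(String line, char sep, char quote,
                              char escape, String nullString) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                cur.append(line.charAt(++i));   // escaped character taken literally
            } else if (c == quote) {
                inQuotes = !inQuotes;           // toggle quoted section
            } else if (c == sep && !inQuotes) {
                fields.add(finish(cur, nullString));
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(finish(cur, nullString));
        return fields;
    }

    // A field equal to the configured null string becomes a Java null.
    static String finish(StringBuilder sb, String nullString) {
        String s = sb.toString();
        return s.equals(nullString) ? null : s;
    }

    public static void main(String[] args) {
        System.out.println(split("a,\"b,c\",NULL", ',', '"', '\\', "NULL"));
    }
}
```

For the input line `a,"b,c",NULL` this yields `[a, b,c, null]`: the separator inside quotes is kept as part of the field, and the null string maps to a Java null.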


TupleTextInputFormat

public TupleTextInputFormat(Schema schema,
                            int[] fields,
                            boolean hasHeader,
                            String nullString)
Fixed-width fields file reader. You must specify the Schema that will be used for the Tuples being read, so that automatic type conversions (i.e. parsing) can be applied. If hasHeader is true, the first line of each file will be skipped.
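The fixed-width variant can be sketched similarly. Interpreting the fields array as consecutive (start, end) inclusive offset pairs is an assumption made for illustration, as are the trimming step and every name in this snippet; consult the FieldSelector and reader sources for the actual convention:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of fixed-width field extraction. Assumes `fields`
// holds alternating start/end offsets, end-inclusive, one pair per column.
public class FixedWidthSketch {

    static List<String> extract(String line, int[] fields, String nullString) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fields.length; i += 2) {
            // Trimming padding spaces is an assumption of this sketch.
            String raw = line.substring(fields[i], fields[i + 1] + 1).trim();
            out.add(raw.equals(nullString) ? null : raw);
        }
        return out;
    }

    public static void main(String[] args) {
        // Columns at offsets 0-3, 4-8, and 9-12.
        System.out.println(extract("2014ABCDE  42",
                new int[]{0, 3, 4, 8, 9, 12}, "-"));
        // → [2014, ABCDE, 42]
    }
}
```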

Method Detail

isSplitable

protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context,
                              org.apache.hadoop.fs.Path file)
Overrides:
isSplitable in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<ITuple,org.apache.hadoop.io.NullWritable>

getSchema

public Schema getSchema()

isHasHeader

public boolean isHasHeader()

getSeparatorCharacter

public char getSeparatorCharacter()

getQuoteCharacter

public char getQuoteCharacter()

getEscapeCharacter

public char getEscapeCharacter()

getType

public com.datasalt.pangool.tuplemr.mapred.lib.input.TupleTextInputFormat.InputType getType()

isStrictQuotes

public boolean isStrictQuotes()

getNullString

public String getNullString()

getFixedWidthFieldsPositions

public int[] getFixedWidthFieldsPositions()

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<ITuple,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit iS,
                                                                                                             org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                                      throws IOException,
                                                                                                             InterruptedException
Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ITuple,org.apache.hadoop.io.NullWritable>
Throws:
IOException
InterruptedException


Copyright © 2014 Datasalt Systems S.L. All rights reserved.