Description

An extension to FsDataWriter that writes records in Parquet format, with each record represented as a Parquet Group (Group.java). Users can choose the CodecFactory through the configuration property writer.codec.type. By default, the deflate codec is used.

Usage

writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET

For more information, see ParquetHdfsDataWriter and ParquetDataWriterBuilder.
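
As a rough illustration, the usage properties above could also be assembled in Java with java.util.Properties. This is only a sketch: the class ParquetWriterUsageExample and the method buildWriterProperties are hypothetical names, not part of Gobblin, and a Gobblin job would normally read these keys from its job configuration file.

```java
import java.util.Properties;

// Hypothetical helper, not part of Gobblin: collects the usage keys shown above.
class ParquetWriterUsageExample {

    // Returns a Properties object holding the three writer settings from the Usage section.
    static Properties buildWriterProperties() {
        Properties props = new Properties();
        props.setProperty("writer.builder.class",
                "org.apache.gobblin.writer.ParquetDataWriterBuilder");
        props.setProperty("writer.destination.type", "HDFS");
        props.setProperty("writer.output.format", "PARQUET");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildWriterProperties();
        // Print each key=value pair; iteration order is not guaranteed.
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```

Keeping the keys in one place like this makes it easier to add optional settings such as writer.codec.type later without touching the rest of the job setup.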

Configuration

| Key | Description | Default Value | Required |
|---|---|---|---|
| writer.parquet.page.size | The page size threshold. | 1048576 | No |
| writer.parquet.dictionary.page.size | The block size threshold for the dictionary pages. | 134217728 | No |
| writer.parquet.dictionary | Turns dictionary encoding on. Parquet has a dictionary encoding for data with a small number of unique values (< 10^5) that aids in significant compression and boosts processing speed. | true | No |
| writer.parquet.validate | Turns on validation against the schema. This validation is done by ParquetWriter, not by Gobblin. | false | No |
| writer.parquet.version | Version of the Parquet writer to use. Available versions are v1 and v2. | v1 | No |
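
To make the defaults concrete, here is a minimal sketch of how the settings listed above resolve when a key is absent. The class ParquetWriterSettingsExample and its fields are hypothetical, and this is not Gobblin's actual parsing code. The case-sensitive check on the version value is also an assumption made for the example.

```java
import java.util.Properties;

// Hypothetical holder, not Gobblin's implementation: resolves the keys listed
// above, falling back to the documented default when a key is absent.
class ParquetWriterSettingsExample {
    final int pageSize;
    final int dictionaryPageSize;
    final boolean dictionaryEnabled;
    final boolean validationEnabled;
    final String writerVersion;

    ParquetWriterSettingsExample(Properties props) {
        pageSize = Integer.parseInt(
                props.getProperty("writer.parquet.page.size", "1048576"));
        dictionaryPageSize = Integer.parseInt(
                props.getProperty("writer.parquet.dictionary.page.size", "134217728"));
        dictionaryEnabled = Boolean.parseBoolean(
                props.getProperty("writer.parquet.dictionary", "true"));
        validationEnabled = Boolean.parseBoolean(
                props.getProperty("writer.parquet.validate", "false"));
        writerVersion = props.getProperty("writer.parquet.version", "v1");
        // Assumption for this sketch: only the exact strings v1 and v2 are accepted.
        if (!writerVersion.equals("v1") && !writerVersion.equals("v2")) {
            throw new IllegalArgumentException(
                    "writer.parquet.version must be v1 or v2, got: " + writerVersion);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("writer.parquet.version", "v2"); // override one default
        ParquetWriterSettingsExample s = new ParquetWriterSettingsExample(props);
        System.out.println("pageSize=" + s.pageSize + ", version=" + s.writerVersion);
    }
}
```

Since none of the keys are required, an empty Properties object yields the documented defaults, and only the keys a job overrides need to be set.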