Table of Contents

Overview

The job configuration template is implemented for saving efforts of Gobblin users. For a specific type of job, e.g. Gobblin-Kafka data pulling, there exists quite amount of repetitive options to fill in. We are aiming at moving those repetitive options into a template for specific type of job, only exposing some essential configurable options for user to specify. This does not sacrifice flexibility, users can still specify options that already exist in the template to override the default value.

Here is the .pull file for wikipedia example with template support:

job.template=templates/wikiSample.template
source.page.titles=NASA,LinkedIn,Parris_Cues,Barbara_Corcoran

How to Use Templates

Users need only submit the .pull file above to the specified directory as described in wikipedia example. Although there are far fewer options there are still some mandatory options to specify in .pull file.

In general, to use a template: - Specify which template to use in the key job.template. - All the keys specified in gobblin.template.required_attributes must be provided. - As mentioned before, user can also specify existing options in template to override the default value.

Available Templates

  • wikiSample.template
  • gobblin-kafka.template

Templates above are available on Github repo.

How to Create Your Own Template

To create a template, simply create a file with all the common configurations for that template (recommended to use .template extension). Place this file into Gobblin's classpath, and set job.template to the path to that file in the classpath.

For reference, this is how the Wikipedia template looks:

job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin

source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
source.revisions.cnt=5

wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php?format=json&action=query&prop=revisions&rvprop=content|timestamp|user|userid|size
wikipedia.avro.schema={"namespace": "example.wikipedia.avro","type": "record","name": "WikipediaArticle","fields": [{"name": "pageid", "type": ["double", "null"]},{"name": "title", "type": ["string", "null"]},{"name": "user", "type": ["string", "null"]},{"name": "anon", "type": ["string", "null"]},{"name": "userid",  "type": ["double", "null"]},{"name": "timestamp", "type": ["string", "null"]},{"name": "size",  "type": ["double", "null"]},{"name": "contentformat",  "type": ["string", "null"]},{"name": "contentmodel",  "type": ["string", "null"]},{"name": "content", "type": ["string", "null"]}]}

converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter

extract.namespace=org.apache.gobblin.example.wikipedia

writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=org.apache.gobblin.example.wikipedia.WikipediaPartitioner

data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher

gobblin.template.required_attributes=source.page.titles

How does Template Work in Gobblin

Currently Gobblin stores and loads existing templates as resources in the classpath. Gobblin will then resolve this template with the user-specified .pull file. Note that there is an option in template named gobblin.template.required_attributes which lists all options that are required for users to fill in. If any of options in the required list is absent, the configuration will be detected as invalid by Gobblin throw an runtime excpetion accordingly.

Gobblin provides methods to retrieve all options inside .template file and resolved configuration option list. These interactive funtions will be integrated soon.