Job Configuration Basics
A job configuration file is a text file with the extension .pull or .job that defines the job properties, which can be loaded into a Java Properties object. Gobblin uses commons-configuration to allow variable substitutions in job configuration files. You can find some example Gobblin job configuration files here.
A job configuration file typically includes the following properties, in addition to any mandatory configuration properties required by the custom Gobblin Constructs classes. For a complete reference of all configuration properties supported by Gobblin, please refer to the Configuration Properties Glossary.
- job.name: job name.
- job.group: the group the job belongs to.
- source.class: the Source class the job uses.
- converter.classes: a comma-separated list of Converter classes to use in the job. This property is optional.
- Quality checker related configuration properties: a Gobblin job typically has both row-level and task-level quality checkers specified. Please refer to Quality Checker Properties for configuration properties related to quality checkers.
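As a minimal sketch of how such a file maps onto a Java Properties object, the example below parses a small job definition from a string. The job name, group, and com.example.* class names are hypothetical placeholders, and the sketch uses plain java.util.Properties, so it does not perform the variable substitution that Gobblin gets from commons-configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class JobConfigExample {

    // Hypothetical job file contents. The com.example.* class names are
    // placeholders for illustration, not classes that ship with Gobblin.
    static final String SAMPLE_JOB = String.join("\n",
        "job.name=ExamplePullJob",
        "job.group=Examples",
        "source.class=com.example.ExampleSource",
        "converter.classes=com.example.FirstConverter, com.example.SecondConverter");

    /** Parses job file text into a Properties object. */
    static Properties parse(String contents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(contents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits the optional converter.classes value into individual class names. */
    static String[] converterClasses(Properties props) {
        String value = props.getProperty("converter.classes", "").trim();
        return value.isEmpty() ? new String[0] : value.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE_JOB);
        System.out.println("job.name = " + props.getProperty("job.name"));
        for (String converter : converterClasses(props)) {
            System.out.println("converter: " + converter);
        }
    }
}
```

Because converter.classes is optional, the helper returns an empty array when the property is absent rather than failing.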
Hierarchical Structure of Job Configuration Files
It is often the case that a Gobblin instance runs many jobs and manages the job configuration files corresponding to those jobs. The jobs may belong to different job groups and pull from different data sources. It is also highly likely that jobs for the same data source share many common properties. So it is very useful to support the following features:
- Job configuration files can be grouped by the job groups they belong to and put into different subdirectories under the root job configuration file directory.
- Common job properties shared among multiple jobs can be extracted into a common properties file that is applied to the job configurations of all these jobs.
Gobblin supports the above features using a hierarchical structure to organize job configuration files under the root job configuration file directory. The basic idea is that there can be arbitrarily deep nesting of subdirectories under the root job configuration file directory. Each directory, regardless of how deep it is, can have a single .properties file storing common properties that will be included when loading the job configuration files in the same directory or in any of its subdirectories. Below is an example directory structure.
root_job_config_dir/
  common.properties
  foo/
    foo1.job
    foo2.job
    foo.properties
  bar/
    bar1.job
    bar2.job
    bar.properties
    baz/
      baz1.pull
      baz2.pull
      baz.properties
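In this example, common.properties will be included when loading foo1.job, foo2.job, bar1.job, bar2.job, baz1.pull, and baz2.pull. foo.properties will be included when loading foo1.job and foo2.job, and properties set here are considered more specific and will overwrite the same properties defined in common.properties. Similarly, bar.properties will be included when loading bar1.job and bar2.job, as well as baz1.pull and baz2.pull. baz.properties will be included when loading baz1.pull and baz2.pull and will overwrite the same properties defined in bar.properties and common.properties.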
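The override order described above can be sketched as follows. This is an illustration of the layering rule only, not Gobblin's actual loader: the class and method names are made up for this example, and the owner and region keys in the demo are hypothetical. Directories are visited from the root down to the job file's directory, so properties closer to the job file overwrite those defined higher up.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Properties;

final class HierarchicalConfigSketch {

    /** Resolves a job file by layering common properties from the root directory downward. */
    static Properties resolve(Path rootDir, Path jobFile) {
        Path root = rootDir.toAbsolutePath().normalize();
        Path job = jobFile.toAbsolutePath().normalize();

        // Collect the directories from the job file's directory up to the root.
        // push() adds at the head, so the later iteration runs root-first.
        Deque<Path> dirs = new ArrayDeque<>();
        for (Path dir = job.getParent(); dir != null && dir.startsWith(root); dir = dir.getParent()) {
            dirs.push(dir);
        }

        Properties merged = new Properties();
        try {
            for (Path dir : dirs) {
                // The text above expects at most one .properties file per directory.
                try (DirectoryStream<Path> common = Files.newDirectoryStream(dir, "*.properties")) {
                    for (Path file : common) {
                        loadInto(merged, file);
                    }
                }
            }
            loadInto(merged, job); // the job file itself is loaded last and wins
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return merged;
    }

    private static void loadInto(Properties target, Path file) throws IOException {
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            target.load(reader); // keys already present are overwritten by later files
        }
    }

    /** Builds a small temporary tree (hypothetical keys) and resolves bar/baz/baz1.pull. */
    static Properties demo() {
        try {
            Path root = Files.createTempDirectory("root_job_config_dir");
            Path bar = Files.createDirectories(root.resolve("bar"));
            Path baz = Files.createDirectories(bar.resolve("baz"));
            write(root.resolve("common.properties"), "job.group=Default\nowner=common\n");
            write(bar.resolve("bar.properties"), "owner=bar\nregion=bar\n");
            write(baz.resolve("baz.properties"), "region=baz\n");
            write(baz.resolve("baz1.pull"), "job.name=Baz1\n");
            return resolve(root, baz.resolve("baz1.pull"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void write(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Properties props = demo();
        for (String key : new String[] {"job.name", "job.group", "owner", "region"}) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

In the demo, owner is defined in both common.properties and bar.properties, and region in both bar.properties and baz.properties; in each case the value from the deeper directory is the one that survives.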
Password Encryption
To avoid storing passwords in configuration files in plain text, Gobblin supports encryption of the password configuration properties. All such properties can be encrypted (and decrypted) using a master password. The master password is stored in a file available at runtime. The file can be on a local file system or HDFS and has restricted access.
The URI of the master password file is controlled by the configuration option encrypt.key.loc. By default, Gobblin will use org.jasypt.util.password.BasicPasswordEncryptor. If you have installed the JCE Unlimited Strength Policy, you can set encrypt.use.strong.encryptor=true, which will configure Gobblin to use org.jasypt.util.password.StrongPasswordEncryptor.
Encrypted passwords can be generated using the gobblin_password_encryptor.sh script from the gobblin-utility distribution:

  $ gradle :gobblin-utility:assemble
  $ cd build/gobblin-utility/distributions/
  $ tar -zxf gobblin-utility.tar.gz
  $ bin/gobblin_password_encryptor.sh
  usage:
   -f <master password file>   file that contains the master password used to encrypt the plain password
   -h                          print this message
   -m <master password>        master password used to encrypt the plain password
   -p <plain password>         plain password to be encrypted
   -s                          use strong encryptor
  $ bin/gobblin_password_encryptor.sh -m Hello -p Bye
  ENC(AQWoQ2Ybe8KXDXwPOA1Ziw==)
If you are extending Gobblin and you want some of your configurations (e.g. the ones containing credentials) to support encryption, you can use one of the gobblin.password.PasswordManager.getInstance() methods to get an instance of PasswordManager. You can then use PasswordManager.readPassword(String), which will transparently decrypt the value if needed, i.e. if it is in the form ENC(...) and a master password is provided.
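The "decrypt only if needed" behavior can be illustrated with a small stand-alone sketch. It does not use Gobblin's PasswordManager or jasypt; the class name and the pluggable decryptor parameter are assumptions made for this example, and the decryptor used in main is a toy that merely reverses the text. The point is only the rule from the paragraph above: a value is decrypted when it has the form ENC(...) and a decryptor (standing in for the master password) is available, and is returned unchanged otherwise.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncWrapperSketch {

    // Matches a value that is entirely of the form ENC(<payload>).
    private static final Pattern ENC = Pattern.compile("ENC\\((.*)\\)", Pattern.DOTALL);

    /**
     * Returns the decrypted payload if the value is wrapped in ENC(...) and a
     * decryptor is available; otherwise returns the value unchanged.
     */
    static String readPassword(String value, UnaryOperator<String> decryptor) {
        if (value == null || decryptor == null) {
            return value;
        }
        Matcher m = ENC.matcher(value);
        return m.matches() ? decryptor.apply(m.group(1)) : value;
    }

    public static void main(String[] args) {
        // Toy stand-in that only reverses the text. It is NOT encryption and is
        // unrelated to the encryptors Gobblin actually uses.
        UnaryOperator<String> toy = s -> new StringBuilder(s).reverse().toString();
        System.out.println(readPassword("ENC(terces)", toy)); // wrapped: payload is passed to the decryptor
        System.out.println(readPassword("plain", toy));       // not wrapped: returned as-is
        System.out.println(readPassword("ENC(terces)", null)); // no decryptor: returned as-is
    }
}
```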
Adding or Changing Job Configuration Files
The Gobblin job scheduler in the standalone deployment monitors any changes to the job configuration file directory and reloads any new or updated job configuration files when detected. This allows adding new job configuration files or making changes to existing ones without bringing down the standalone instance. Currently, the following types of changes are monitored and supported:
- Adding a new job configuration file with a .job or .pull extension. The new job configuration file is loaded once it is detected. In the example hierarchical structure above, if a new job configuration file baz3.pull is added under bar/baz, it is loaded with properties included from common.properties, bar.properties, and baz.properties in that order.
- Changing an existing job configuration file with a .job or .pull extension. The job configuration file is reloaded once the change is detected. In the example above, if a change is made to foo2.job, it is reloaded with properties included from common.properties and foo.properties in that order.
- Changing an existing common properties file with a .properties extension. All job configuration files that include properties in the common properties file will be reloaded once the change is detected. In the example above, if bar.properties is updated, job configuration files bar1.job, bar2.job, baz1.pull, and baz2.pull will be reloaded. Properties from bar.properties will be included when loading bar1.job and bar2.job. Properties from bar.properties and baz.properties will be included when loading baz1.pull and baz2.pull in that order.
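The rule for which job files are reloaded after a common properties file changes can be sketched as a simple path check: a job file is affected when it lives in the same directory as the changed file or anywhere below it. The class and method names here are hypothetical, and the sketch works on path values only, without touching the file system or Gobblin's scheduler.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AffectedJobsSketch {

    // The job files from the example directory structure.
    static final List<Path> JOB_FILES = Arrays.asList(
        Paths.get("root_job_config_dir", "foo", "foo1.job"),
        Paths.get("root_job_config_dir", "foo", "foo2.job"),
        Paths.get("root_job_config_dir", "bar", "bar1.job"),
        Paths.get("root_job_config_dir", "bar", "bar2.job"),
        Paths.get("root_job_config_dir", "bar", "baz", "baz1.pull"),
        Paths.get("root_job_config_dir", "bar", "baz", "baz2.pull"));

    /** Returns the job files located in or below the directory of the changed properties file. */
    static List<Path> affectedBy(Path changedPropertiesFile, List<Path> jobFiles) {
        Path dir = changedPropertiesFile.getParent();
        List<Path> affected = new ArrayList<>();
        for (Path job : jobFiles) {
            // Path.startsWith compares whole name elements, so "bar" does not match "barn".
            if (dir != null && job.startsWith(dir)) {
                affected.add(job);
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        Path changed = Paths.get("root_job_config_dir", "bar", "bar.properties");
        for (Path job : affectedBy(changed, JOB_FILES)) {
            System.out.println("reload: " + job);
        }
    }
}
```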
Note that this job configuration file change monitoring mechanism uses the FileAlterationMonitor of Apache's commons-io with a custom FileAlterationListener. Regardless of how close two adjacent file system checks are, there is still a chance that more than one file is changed between two checks. If more than one file, including at least one common properties file, is changed between two adjacent checks, the reloading of affected job configuration files may be intermixed and applied in an undesirable order. This is because the order in which the listener is called on the changes is controlled not by Gobblin but by the monitor itself. So the best practice when using this feature is to avoid making multiple changes within a short period of time.
Scheduled Jobs
Gobblin ships with a job scheduler backed by a Quartz scheduler and supports Quartz's cron triggers. A job that is to be scheduled should have a cron schedule defined using the property job.schedule. Here is an example cron schedule that triggers every two minutes:
job.schedule=0 0/2 * * * ?
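Quartz cron expressions have six required fields (seconds, minutes, hours, day-of-month, month, day-of-week) and an optional seventh field for the year. The sketch below only labels the fields of an expression so the example above is easier to read; it does not validate the values or compute trigger times, and the class name is made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CronFieldsSketch {

    private static final String[] FIELD_NAMES = {
        "seconds", "minutes", "hours", "day-of-month", "month", "day-of-week", "year"
    };

    /** Splits a Quartz-style cron expression into named fields, in order. */
    static Map<String, String> fields(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException("expected 6 or 7 fields but got " + parts.length);
        }
        Map<String, String> named = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            named.put(FIELD_NAMES[i], parts[i]);
        }
        return named;
    }

    public static void main(String[] args) {
        // The schedule from the example: second 0 of every second minute, starting at minute 0.
        for (Map.Entry<String, String> field : fields("0 0/2 * * * ?").entrySet()) {
            System.out.println(field.getKey() + " = " + field.getValue());
        }
    }
}
```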
One Time Jobs
Some Gobblin jobs may only need to be run once. A job without a cron schedule in the job configuration is considered a run-once job and will not be scheduled but run immediately after being loaded. A job with a cron schedule that also has the property job.runonce=true specified in the job configuration is treated as a run-once job as well and will only be run the first time the cron schedule is triggered.
Disabled Jobs
A Gobblin job can be disabled by setting the property job.disabled to true. A disabled job will neither be loaded nor scheduled to run.
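Taken together, the scheduling rules amount to a small decision procedure over three properties. The sketch below is one way to express it, assuming job.disabled is the property that disables a job; the class, enum, and method names are hypothetical and this is not how Gobblin's scheduler is actually structured.

```java
import java.util.Properties;

final class JobModeSketch {

    enum Mode { DISABLED, RUN_ONCE_IMMEDIATELY, RUN_ONCE_ON_FIRST_TRIGGER, SCHEDULED }

    /** Classifies a job configuration according to the rules described above. */
    static Mode classify(Properties job) {
        // Assumption: job.disabled=true is what marks a job as disabled.
        if (Boolean.parseBoolean(job.getProperty("job.disabled", "false").trim())) {
            return Mode.DISABLED;
        }
        String schedule = job.getProperty("job.schedule");
        if (schedule == null || schedule.trim().isEmpty()) {
            return Mode.RUN_ONCE_IMMEDIATELY; // no cron schedule: run right after loading
        }
        if (Boolean.parseBoolean(job.getProperty("job.runonce", "false").trim())) {
            return Mode.RUN_ONCE_ON_FIRST_TRIGGER; // scheduled, but only the first trigger runs
        }
        return Mode.SCHEDULED;
    }

    public static void main(String[] args) {
        Properties job = new Properties();
        job.setProperty("job.schedule", "0 0/2 * * * ?");
        System.out.println(classify(job));

        job.setProperty("job.runonce", "true");
        System.out.println(classify(job));

        job.setProperty("job.disabled", "true");
        System.out.println(classify(job));
    }
}
```

Checking the disabled flag first reflects the statement that a disabled job is never loaded or scheduled, whatever its other properties say.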