Gobblin integrates with Docker by running a Gobblin standalone service inside a Docker container. The Gobblin service inside the container can monitor the host filesystem for new job configuration files, run the jobs, and write the resulting data to the host filesystem. The Gobblin Docker images can be found on Docker Hub at: https://hub.docker.com/u/gobblin/
For more information on Docker, including how to install it, check out the documentation at: https://docs.docker.com/
Gobblin currently has four different repositories, all of which are on Docker Hub at https://hub.docker.com/u/gobblin/.
- The gobblin/gobblin-wikipedia repository contains images that run the Gobblin Wikipedia job found in the getting started guide. These images are useful for users new to Docker or Gobblin; they primarily act as a "Hello World" example for the Gobblin Docker integration.
- The gobblin/gobblin-standalone repository contains images that run a Gobblin standalone service inside a Docker container. These images provide a simple way to set up a Gobblin standalone service on any Docker-compatible machine.
- The gobblin/gobblin-distributions repositories are for internal use only, and are primarily useful for Gobblin developers.
The Docker images for the gobblin/gobblin-wikipedia repository can be found on Docker Hub. These images are mainly meant to act as a "Hello World" example for the Gobblin-Docker integration, and to provide a sanity check that the Gobblin-Docker integration is working on a given machine. The image contains the Gobblin configuration files needed to run the Gobblin Wikipedia job. When a container is launched using the gobblin-wikipedia image, Gobblin starts up, runs the Wikipedia example, and then exits.
Running the gobblin-wikipedia image requires the following steps (let's assume we want to use an Ubuntu-based image):
- Download the images from the gobblin/gobblin-wikipedia repository:
docker pull gobblin/gobblin-wikipedia:ubuntu-gobblin-latest
- Run the gobblin/gobblin-wikipedia:ubuntu-gobblin-latest image in a Docker container:
docker run gobblin/gobblin-wikipedia:ubuntu-gobblin-latest
The logs are printed to the console, and no errors should pop up. This should provide a nice sanity check to ensure that everything is working as expected. The output of the job will be written to a directory inside the container. When the container exits that data will be lost. In order to preserve the output of the job, continue to the next step.
- Preserving the output of a Docker container requires using a data volume. To do this, run the command below:
docker run -v /home/gobblin/work-dir:/home/gobblin/work-dir gobblin/gobblin-wikipedia:ubuntu-gobblin-latest
The output of the Gobblin-Wikipedia job should now be written under /home/gobblin/work-dir on the host. The -v option uses a Docker feature called data volumes. It mounts a host directory into a container and takes the form [host-directory]:[container-directory]. Any modification to the host directory is visible inside the container directory, and any modification to the container directory is visible in the host directory. This is a standard way to ensure data persists after a Docker container finishes. Note that the [host-directory] in the -v option can be changed to any directory (on OSX it must be under the /Users/ directory), but the [container-directory] must remain /home/gobblin/work-dir (at least for now).
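As a minimal sketch of the [host-directory]:[container-directory] mapping described above, the helper below builds the docker run command for a host directory of your choice while keeping the container directory fixed. The function name wikipedia_run_cmd and the example host path are hypothetical; only the image name and the container path come from this page.

```shell
#!/bin/sh
# Container-side path: must stay /home/gobblin/work-dir (per the text above).
CONTAINER_WORK_DIR=/home/gobblin/work-dir
IMAGE=gobblin/gobblin-wikipedia:ubuntu-gobblin-latest

# Hypothetical helper: print the docker command for a given host directory.
# The host directory should be an absolute path; Docker treats a
# non-absolute name in -v as a named volume rather than a host path.
wikipedia_run_cmd() {
    host_dir=$1
    printf 'docker run -v %s:%s %s\n' "$host_dir" "$CONTAINER_WORK_DIR" "$IMAGE"
}

# Preview the command for an example host directory under $HOME.
wikipedia_run_cmd "$HOME/gobblin-output"
```

Printing the command first makes it easy to check the mapping before running anything. Create the host directory (for example with mkdir -p) and then run the printed command.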
The Docker images for the gobblin/gobblin-standalone repository can be found on Docker Hub. These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long-running process that can run Gobblin jobs defined in a .pull file. The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem. The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found in the Gobblin documentation). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back to the host filesystem. All the heavy lifting is done inside the Docker container; the user only needs to define and submit job / pull files. The goal is to provide an environment for the Gobblin standalone service that is easy to set up.
Running the gobblin-standalone image requires the following steps:
- Download the images from the gobblin/gobblin-standalone repository:
docker pull gobblin/gobblin-standalone:ubuntu-gobblin-latest
- Run the gobblin/gobblin-standalone:ubuntu-gobblin-latest image in a Docker container:
docker run -v /home/gobblin/conf:/etc/opt/job-conf \
    -v /home/gobblin/work-dir:/home/gobblin/work-dir \
    -v /home/gobblin/logs:/var/log/gobblin \
    gobblin/gobblin-standalone:ubuntu-gobblin-latest
A data volume needs to be created for the job configuration directory (contains all the job configuration files), the work directory (contains all the job output data), and the logs directory (contains all the Gobblin standalone logs).
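The setup above can be sketched as a small script that creates the three host directories before starting the container. The function names prepare_host_dirs and run_standalone, and the idea of grouping the three directories under one base directory, are assumptions made for illustration; the container-side paths and the image name are taken from this page.

```shell
#!/bin/sh
# Hypothetical helper: create the three host directories under a base directory.
prepare_host_dirs() {
    base=$1
    mkdir -p "$base/conf" "$base/work-dir" "$base/logs"
}

# Hypothetical helper: start the standalone container with the three mounts.
# "base" should be an absolute path, because Docker treats a non-absolute
# name in -v as a named volume rather than a host directory.
run_standalone() {
    base=$1
    prepare_host_dirs "$base" || return 1
    docker run \
        -v "$base/conf:/etc/opt/job-conf" \
        -v "$base/work-dir:/home/gobblin/work-dir" \
        -v "$base/logs:/var/log/gobblin" \
        gobblin/gobblin-standalone:ubuntu-gobblin-latest
}
```

For example, run_standalone /home/gobblin would reproduce the command shown earlier, with the host directories created first if they do not already exist.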
The -v /home/gobblin/conf:/etc/opt/job-conf option makes any new job / pull files added to the /home/gobblin/conf directory on the host filesystem visible to the Gobblin standalone service inside the container. Any job / pull file added to the /home/gobblin/conf directory on the local filesystem will therefore be run by the Gobblin standalone service running inside the Docker container. Note that the container directory (/etc/opt/job-conf) should not be modified, while the host directory (/home/gobblin/conf) can be any directory on the host filesystem that contains job / pull files.
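Submitting a job then amounts to copying a .pull file into the mounted host directory. The following is a minimal sketch under that assumption; the helper name submit_job and the GOBBLIN_CONF_DIR variable are hypothetical and not part of Gobblin.

```shell
#!/bin/sh
# Host directory mounted at /etc/opt/job-conf in the container.
# GOBBLIN_CONF_DIR is a hypothetical override for a different host location.
GOBBLIN_CONF_DIR=${GOBBLIN_CONF_DIR:-/home/gobblin/conf}

# Hypothetical helper: copy a .pull file into the watched directory.
submit_job() {
    pull_file=$1
    case $pull_file in
        *.pull) ;;  # accept only files with the .pull extension
        *) echo "expected a .pull file: $pull_file" >&2; return 1 ;;
    esac
    mkdir -p "$GOBBLIN_CONF_DIR" && cp "$pull_file" "$GOBBLIN_CONF_DIR"/
}
```

A call such as submit_job my-job.pull (with a file name of your own) places the file where the standalone service looks for new jobs.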
The -v /home/gobblin/work-dir:/home/gobblin/work-dir option allows the container to write data to the host filesystem, so that the data persists after the container is shut down. Once again, the container directory (/home/gobblin/work-dir) should not be modified, while the host directory (/home/gobblin/work-dir) can be any directory on the host filesystem.
The -v /home/gobblin/logs:/var/log/gobblin option allows the Gobblin standalone logs to be written to the host filesystem, so that they can be read on the host machine. This is useful for monitoring and debugging purposes. Once again, the container directory (/var/log/gobblin) should not be modified, while the host directory (/home/gobblin/logs) can be any directory on the host filesystem.
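For monitoring, one option is to follow the most recently written log file from the host. This page does not name the log files, so the sketch below simply picks the newest entry in the log directory; the helper name latest_log is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the most recently modified entry
# in a log directory. Returns non-zero if the directory is empty or missing.
# Relies on "ls -t", so it assumes file names without newlines.
latest_log() {
    log_dir=$1
    newest=$(ls -t "$log_dir" 2>/dev/null | head -n 1)
    [ -n "$newest" ] && printf '%s/%s\n' "$log_dir" "$newest"
}
```

With the default mount, tail -f "$(latest_log /home/gobblin/logs)" would follow the newest log file as the standalone service writes to it.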
- Create gobblin-dev images that provide a development environment for Gobblin contributors
- Create gobblin-kafka images that provide an end-to-end service for writing to Kafka and ingesting the Kafka data through Gobblin
- Test and write a tutorial on using gobblin-standalone images to write to an HDFS cluster
- Create images based on Alpine Linux (a lightweight Linux distro)