Introduction

The Gobblin Compliance module allows for data purging to meet regulatory compliance requirements. The module includes purging, retention and restore functionality for datasets.

The purging is performed using Hive meaning that purging of datasets is supported in any format that Hive can read from and write to, including for example ORC and Parquet. Further the purger is built on top of the Gobblin framework which means that the fault-tolerance, scalability and flexibility that Gobblin provides is taken full advantage of.

The User Guide describes how to onboard a dataset for purging.

Design

The elements of the Compliance design are:

The onboarding process
The purge process
The retention process
The restore process

Onboarding

A dataset is onboarded to the Purger with these steps:

The whitelist includes either of the database or table that will be considered for purging
Every table that is to be purged includes the necessary information for purging (dataset descriptor) as a JSON string in its TBLPROPERTIES

The purger iterates over all the tables that are whitelisted, and of those tables further looks for the presence of the dataset descriptor to specify the information required by the purger to proceed with the purge process.

With this information, the purger iterates over the partitions of the table that needs to be purged and proceeds to purge each partition of the table individually.

Purger

The purger code is mostly in the gobblin.compliance.purger package.

The elements of the purger are:

The Gobblin constructs
The Hive operations

Gobblin constructs

The Gobblin constructs that make up the Purger are:

HivePurgerSource generates a WorkUnit per partition that needs to be purged
HivePurgerExtractor instantiates a PurgeableHivePartitionDataset object that encapsulates all the information required to purge the partition
For each partition, HivePurgerConverter populates the purge queries into the PurgeableHivePartitionDataset object
The purge queries are executed by HivePurgerWriter
The HivePurgerPublisher moves successful Workunits to the COMMITTED state

Hive operations

The purging process operates as follows:

The partition information including location and partitioning scheme is determined from the metadata of the partition
A new external staging table is created using the Hive LIKE construct of the current table that is being purged
The location of this staging table on HDFS is a new folder within the table location with the current timestamp
The purge query executes a LEFT OUTER JOIN of the original table against the table containing the ids whose data is to be purged and INSERT OVERWRITEs this data into the staging table, and thereby location. Once this query returns, the location will contain the purged data
Since when we ALTER the original partition location next to the new staging table location, we preserve the location of the current/original location of the partition by creating a backup table pointing to this location. We do not move this immediately to avoid breaking any in-flight queries.
The next step is to ALTER the partition location to the location containing the purged data
The final step is to DROP the staging table, this only drops the metadata and not the data

Taking as an example, a tracking.event table, and the datepartition=2017-02-16-00/is_guest=0 partition, the purge process would be the following:

Let's assume the tracking.event table is located at the location /user/tracking/event/
The full partition name would be tracking@event@datepartition=2017-02-16-00/is_guest=0 per Hive, and let's assume the data is located at /user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0/
A staging table tracking.event_staging_1234567890123 (1234567890123 is the example timestamp we will use for clarity, a real timestamp looks more like '1487154972824') is created LIKE tracking.event with the location /user/tracking/event/1234567890123/datepartition=2017-02-16-00/is_guest=0/. This would be within the original table location
The purge query would be similar to (assuming u_purger.guestids has the ids whose data is to be purged):

INSERT OVERWRITE TABLE tracking@event_staging_1234567890123
PARTITION (datepartition='2017-02-16-00',is_guest='0') 
SELECT /*+MAPJOIN(b) */ a.metadata.guestid, a.col_a, a.col_b 
FROM tracking.event a 
LEFT JOIN u_purger.guestids b
ON a.metadata.guestid=b.guestid
WHERE b.guestid IS NULL AND a.datepartition='2017-02-16-00' AND a.is_guest='0'

A backup table tracking.event_backup_1234567890123 is created with PARTITION datepartition=2017-02-16-00,is_guest=0 pointing to the original location /user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0
The partition location of tracking@event@2017-02-16-00 is updated to be /user/tracking/event/1234567890123/datepartition=2017-02-16-00/is_guest=0
The tracking.event_staging_1234567890123 table is dropped

Retention

The retention code is mostly in the gobblin.compliance.retention package.

The retention process builds on top of Gobblin Retention and performs the following operations:

Cleanup of backup data beyond a specified policy
Cleanup of any staging tables not cleaned up in case of failures
Reaping of backup locations from the original location
Cleanup of trash data from the restore process beyond a specified policy

Restore

The restore code is mostly in the gobblin.compliance.restore package.

The restore process allows for restoration to a backup dataset if required.