Replication Strategies

Ingesting data from third-party sources and databases

Strategies

The primary use case of Pennsieve's metadata store is to make metadata records collected in other applications available for downstream analysis and to integrate them with file-based scientific information. Although Pennsieve Metadata services can serve as a primary store for metadata, that is not what they are mainly designed for.

Therefore, most users will want to develop ingest strategies that copy data from a primary resource (e.g. CSV, REDCap, Postgres, Google Spreadsheet, or others) to Pennsieve. Depending on the use case, there are three valid strategies for data replication on the Pennsieve Platform:

Strategy | Implementation
Append   | Ingest creates new records by default, which are appended to the existing record set.
Merge    | Ingest leverages primary keys to replace old objects and add new records to the record set.
Replace  | Ingest archives all existing records and replaces them with the new set of ingested records.

Append

Users do not define primary keys in the properties of the models in a dataset. When new records are ingested, they are always appended to the existing set of records. This method works if there is a clear incremental source and records never change after they are created. In this case, users can manually archive records that are no longer needed or that need to be removed from the active record set.
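The append behavior can be illustrated with a small in-memory sketch. It is not Pennsieve code: the Record class, the integer IDs, and the append_ingest and archive helpers are hypothetical stand-ins that only model the semantics described above.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical stand-in for platform-assigned record IDs


@dataclass
class Record:
    values: dict
    id: int = field(default_factory=lambda: next(_ids))
    archived: bool = False


def append_ingest(record_set, incoming):
    """Add every incoming row as a new record; nothing is matched or replaced."""
    record_set.extend(Record(dict(row)) for row in incoming)
    return record_set


def archive(record_set, predicate):
    """Manually archive records that should leave the active record set."""
    for rec in record_set:
        if predicate(rec):
            rec.archived = True


records = []
append_ingest(records, [{"subject": "S1", "weight": 20}])
# Ingesting the same row again creates a second record, because no key is defined.
append_ingest(records, [{"subject": "S1", "weight": 20}])
archive(records, lambda r: r.id == 1)
active = [r for r in records if not r.archived]
```

Because nothing deduplicates the rows, this strategy only suits sources that emit each row exactly once.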

Merge

Users declare one or more properties in the model as x-pennsieve-key. When new records are ingested, any existing records that match the compound key are overwritten. That is, the old record is archived and a new record is created as a new version of the original record. New records without a matching existing record are appended to the record set. Merged records still receive a new Pennsieve ID, because the old record is not updated in place; it is archived and replaced by the new record.
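A minimal sketch of the merge semantics, assuming records are plain dictionaries held in memory. The merge_ingest function, the key_properties argument (standing in for the properties declared as x-pennsieve-key), and the integer IDs are illustrative assumptions, not part of the Pennsieve API.

```python
from itertools import count

_next_id = count(1)  # hypothetical stand-in for platform-assigned record IDs


def merge_ingest(record_set, incoming, key_properties):
    """Archive the active record matching each incoming row's compound key,
    then add the row as a new record with a new ID."""

    def key_of(values):
        return tuple(values[p] for p in key_properties)

    active_by_key = {
        key_of(r["values"]): r for r in record_set if not r["archived"]
    }
    for row in incoming:
        old = active_by_key.get(key_of(row))
        if old is not None:
            old["archived"] = True  # the old version is kept, but archived
        new = {"id": next(_next_id), "values": dict(row), "archived": False}
        record_set.append(new)
        active_by_key[key_of(row)] = new
    return record_set


keys = ["site", "subject"]  # compound key made of two properties
records = []
merge_ingest(records, [{"site": "A", "subject": "S1", "weight": 20}], keys)
merge_ingest(
    records,
    [
        {"site": "A", "subject": "S1", "weight": 22},  # key matches: replaces record 1
        {"site": "B", "subject": "S1", "weight": 19},  # no match: appended
    ],
    keys,
)
active = [r for r in records if not r["archived"]]
```

After the second ingest the record set holds three records: the archived original and two active ones, each with its own ID.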

Replace

Users can optionally define properties as x-pennsieve-key but do not use them to merge records. Instead, prior to ingesting new records, they archive all records in the current record set and replace them with the newly ingested record set.
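The replace strategy can be sketched in the same in-memory style. The replace_ingest function and the dictionary-based records are hypothetical and only model the archive-then-ingest order described above; any declared key is deliberately ignored.

```python
def replace_ingest(record_set, incoming):
    """Archive every existing record, then add all incoming rows as new records."""
    for rec in record_set:
        rec["archived"] = True
    start = len(record_set) + 1  # illustrative IDs; real IDs are assigned by the platform
    record_set.extend(
        {"id": start + i, "values": dict(row), "archived": False}
        for i, row in enumerate(incoming)
    )
    return record_set


records = []
replace_ingest(records, [{"subject": "S1"}, {"subject": "S2"}])
# A full snapshot arrives: everything ingested earlier is archived first.
replace_ingest(records, [{"subject": "S2"}, {"subject": "S3"}])
active = [r for r in records if not r["archived"]]
```

This fits sources that can only be exported as a complete snapshot, at the cost of creating a new record for every row on every ingest.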

Tools

While it is possible to add and archive records manually through the web application, most users will want to automate this process programmatically. Pennsieve provides two ways to do this:

Pennsieve API

You can use the Pennsieve API to insert, merge, and archive records in bulk. The Pennsieve API documentation can be found here: https://docs.pennsieve.io/reference/insertrecords

It includes methods to:

  1. Create/List/Update/Archive Models and Templates
  2. Create/List/Merge/Archive Records
  3. Create/List/Archive Relationships
  4. Create/List/Archive Package Relationships

Singer.io (beta)

The open-source standard for writing scripts that move data.

Pennsieve enables you to use the Singer.io ecosystem for ETL processes. You can use any of the available taps on singer.io and use the Pennsieve_target to stream the records from the tap into the Pennsieve metadata store.

"Singer describes how data extraction scripts—called “taps” —and data loading scripts—called “targets”— should communicate, allowing them to be used in any combination to move data from any source to any destination. Send data between databases, web APIs, files, queues, and just about anything else you can think of." (https://singer.io).
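To make the tap/target contract concrete, here is a minimal sketch of a tap written against the Singer specification, which has taps write SCHEMA, RECORD, and STATE messages as one JSON object per line on standard output. The stream name "subjects", its fields, and the shape of the STATE value are made up for illustration; how the Pennsieve target interprets key_properties is not stated here.

```python
import json
import sys


def singer_messages(rows):
    """Yield Singer messages for a hypothetical 'subjects' stream."""
    yield {
        "type": "SCHEMA",
        "stream": "subjects",
        "schema": {
            "type": "object",
            "properties": {
                "subject_id": {"type": "string"},
                "weight": {"type": "number"},
            },
        },
        "key_properties": ["subject_id"],
    }
    for row in rows:
        yield {"type": "RECORD", "stream": "subjects", "record": row}
    # STATE carries a bookmark that lets a later run resume; its value is free-form.
    yield {"type": "STATE", "value": {"subjects_emitted": len(rows)}}


def run_tap(rows, out=None):
    """Write each message as one line of JSON, as Singer targets expect on stdin."""
    out = out or sys.stdout
    for message in singer_messages(rows):
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    run_tap([{"subject_id": "S1", "weight": 20.5}])
```

A tap like this is normally piped into a target on the command line, in the form tap | target. The exact command name and configuration of the Pennsieve target are not given in this section, so check its own documentation before wiring the two together.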