Structure of published datasets

This page describes the publishing schema for datasets on Discover, specifically schema version 4.

Publishing schema

The published dataset contains three core components that are further described in this document:

  • Exported information about the dataset itself
  • Exported files
  • Exported metadata schema and records

Together, these components mirror the dataset as it exists on the platform and can be used to re-import the data at a later stage. Note that the current schema does not export:

  • Annotations on files (such as timeseries annotations and microscopy annotations)
  • Any discussions that are captured as part of the dataset
  • Permissions associated with the dataset prior to publishing
  • Files that are remote resources

File Structure of a Published Dataset

Each dataset is exported using a standardized file/folder structure:

  • {Dataset & Version ID} (Folder)

Dataset Information

Readme.md
The Readme.md file is created from the “Description” section of the dataset overview page and is formatted as a Markdown file.

Banner.jpg
The Banner.jpg file is the banner image associated with the dataset. Users can upload this image on the dataset settings page.

Manifest.json
The manifest file contains metadata about the dataset and a manifest of all the files that are part of the published dataset. It is formatted as JSON-LD and adheres to the schema.org ontology where possible.

The contents of the file are as follows:

  • blackfynnDatasetId: Int

  • version: Int

  • name: String

  • description: String

  • creator: PublishedContributor

  • contributors: List[PublishedContributor]

  • sourceOrganization: String

  • keywords: List[String]

  • datePublished: LocalDate

  • license: License

  • @id: String

  • publisher = "Blackfynn, Inc"

  • @context = "http://schema.org/"

  • @type = "Dataset"

  • schemaVersion = "http://schema.org/version/3.7/"

  • files: List[FileManifest]

  • blackfynnSchemaVersion = "4.0"

where:

  • @id: The DOI of the dataset.

  • PublishedContributor: JSON object:

  • License: The license associated with the dataset. We limit this to the following values:

  • FileManifest: JSON object:
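
As an illustration of how the top-level fields listed above might be read, here is a minimal Python sketch. It assumes the manifest is stored as manifest.json (lowercase) at the root of the exported dataset folder; the function names load_manifest and summarize_manifest are hypothetical and not part of any official client. It deliberately does not look inside the PublishedContributor, License, or FileManifest objects.

```python
import json
from pathlib import Path


def load_manifest(dataset_root):
    """Read manifest.json from the root of an exported dataset folder.

    Assumption: the file is named "manifest.json" (lowercase).
    """
    path = Path(dataset_root) / "manifest.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize_manifest(manifest):
    """Collect a few of the top-level manifest fields into a plain dict."""
    return {
        "doi": manifest.get("@id"),  # @id holds the DOI of the dataset
        "name": manifest.get("name"),
        "version": manifest.get("version"),
        "schema": manifest.get("blackfynnSchemaVersion"),
        "file_count": len(manifest.get("files", [])),
    }
```

Calling summarize_manifest(load_manifest(path_to_dataset)) would then give a quick overview of an export before any files are processed.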

Files Folder

The Files folder contains all files included in the dataset, organized in the same way that users interact with them on the platform.

Because the exported data is file based rather than package based, a package with a single source file is flattened: the file is exported directly, without a separate package level. If a package on the platform has multiple source files, we create a folder with the package name and place the source files in this folder.
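
The flattening rule can be sketched in Python as follows. This is only an illustration of the rule as described here, not the platform's export code. The function name export_paths, the "files" parent folder name, and the choice to keep the source file's own name in the single-file case are all assumptions.

```python
from pathlib import PurePosixPath


def export_paths(package_name, source_files, parent="files"):
    """Return relative export paths for the source files of one package.

    Assumptions: `parent` is the exported files folder, and a single
    source file keeps its own file name when the package is flattened.
    """
    base = PurePosixPath(parent)
    if len(source_files) == 1:
        # Single source file: the package level is dropped.
        return [str(base / source_files[0])]
    # Multiple source files: group them in a folder named after the package.
    return [str(base / package_name / name) for name in source_files]
```

For example, a package "scan" with the single source file "scan.dcm" maps to one path directly under the parent folder, while a package with two source files maps to two paths inside a "scan" folder.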

All file locations are captured in the manifest.json file at the root level of the published dataset.

If users added “External Files” in the platform, these are not included in the published data.

Revisions Folder

The Revisions folder contains folders for each revision that is published for a specific version of the dataset. A revision of a dataset includes:

  • Updated dataset metadata
  • Updated Readme.md file
  • Updated Banner image

The updated assets are included in the revision folder while the original assets remain in the root folder. The revised manifest.json file does not include an updated file manifest as that information is static and not updated during a revision.

Metadata Folder

The Metadata folder contains a serialized version of all metadata records and their relationships in the dataset as well as any links to files that are associated with these records.

Schema.json
The contents of the schema.json file are as follows:

  • relationships [Object]

  • models [Object]

where:

  • relationships/to: Model name of the target of the relationship

  • relationships/name: Name of the relationship

  • relationships/file: relative path to the file that contains the instances of this relationship

  • relationships/from: Model name of the source of the relationship

  • models/properties/dataType/items: required if dataType is “Array”

  • models/properties/dataType/to: required if dataType is “Model”. Indicates the name of the model that the property points to. The “Model” datatype represents linked properties.

  • models/properties/dataType/file: required if dataType is “Model”. Indicates the file that contains the records that the property points to.

  • models/file: Relative path to the file that contains the instances of this model.
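
To show how these fields could be used to locate the CSV files, here is a hedged Python sketch. It assumes that models and relationships are JSON arrays of objects and that each model object carries a name key; neither detail is confirmed by the field list above. The sample document, its paths, and the function name index_schema are made up for illustration.

```python
import json

# Hypothetical schema.json content; the names and paths are invented.
SAMPLE_SCHEMA = """
{
  "models": [
    {"name": "patient", "file": "records/patient.csv"},
    {"name": "visit", "file": "records/visit.csv"}
  ],
  "relationships": [
    {"name": "attends", "from": "patient", "to": "visit",
     "file": "relationships/attends.csv"}
  ]
}
"""


def index_schema(schema):
    """Map model names and relationship names to their instance files."""
    # Assumption: each model object has a "name" key next to "file".
    model_files = {m["name"]: m["file"] for m in schema.get("models", [])}
    relationship_files = {
        r["name"]: r["file"] for r in schema.get("relationships", [])
    }
    return model_files, relationship_files


model_files, relationship_files = index_schema(json.loads(SAMPLE_SCHEMA))
```

With such an index, a reader can open the file for a given model or relationship without hard-coding paths.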

{relation}.csv
Relationship instances are serialized in CSV files and grouped by relationship name. Each type of relationship has its own CSV file.

Each file has the following columns:

  • From: String with the id of the source record
  • To: String with the id of the target record
  • Relationship: String with the name of the relationship (same as file-name)

{model_name}.csv
Model instances are serialized in CSV files and grouped by model name. Each type of model has its own CSV file.

Each file has columns reflecting the property names of the model. The first column in the CSV file reflects the ID of the record.
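
The two CSV layouts described above can be read with Python's standard csv module. The sketch below is illustrative only: the sample data is invented, the header spellings From and To are assumed to match the column names listed for relationship files, and the function names are hypothetical.

```python
import csv
import io


def index_records(handle):
    """Index model rows by the first column, which holds the record ID."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return {}
    # Remaining columns are the model's properties, keyed by header name.
    return {row[0]: dict(zip(header[1:], row[1:])) for row in reader if row}


def targets_of(handle, record_id):
    """Return the IDs that `record_id` points to in one relationship file."""
    return [row["To"] for row in csv.DictReader(handle) if row["From"] == record_id]


# Invented sample data standing in for {model_name}.csv and {relation}.csv.
PATIENTS = "id,name,age\np1,Alice,34\np2,Bob,51\n"
ATTENDS = "From,To,Relationship\np1,v1,attends\np1,v2,attends\np2,v3,attends\n"

records = index_records(io.StringIO(PATIENTS))
visits = targets_of(io.StringIO(ATTENDS), "p1")
```

All values are returned as strings; converting them to the types declared in schema.json is left out of this sketch.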

files.csv
The Record folder has an optional files.csv file. This file exists when files are associated with records.

Each file has the following columns:

  • id: String with the id of the record
  • path: String with path to the file
  • sourcePackageId: Original package ID of the referenced file
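
As a small illustration of how files.csv links records to files, the following Python sketch groups file paths by record ID. The sample rows are invented and files_by_record is a hypothetical helper name; only the id and path columns listed above are used.

```python
import csv
import io
from collections import defaultdict


def files_by_record(handle):
    """Group the `path` values in files.csv by record `id`."""
    paths = defaultdict(list)
    for row in csv.DictReader(handle):
        paths[row["id"]].append(row["path"])
    return dict(paths)


# Invented sample rows using the three documented columns.
SAMPLE_FILES = (
    "id,path,sourcePackageId\n"
    "r1,files/a.txt,pkg-1\n"
    "r1,files/b.txt,pkg-2\n"
    "r2,files/c.txt,pkg-3\n"
)

linked = files_by_record(io.StringIO(SAMPLE_FILES))
```

A record can appear on several rows, so each record ID maps to a list of paths rather than a single value.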