Compute Node Resources and Infrastructure
Pennsieve is a data management platform for biomedical research. Researchers upload datasets — imaging files, time series recordings, tabular data — and organize them into packages within workspaces. Processors are the compute layer that transforms this data.
A workflow defines a pipeline of one or more processors arranged as a directed acyclic graph (DAG). When a user triggers a workflow on a dataset, the platform downloads the selected files, runs each processor in dependency order, and makes the results available. Processors can be chained: the output of one becomes the input of the next, enabling multi-step pipelines such as format conversion followed by feature extraction followed by quality scoring.
Pennsieve Dataset
        │
        ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ Processor A  │ ──►  │ Processor B  │ ──►  │ Processor C  │
│  (convert)   │      │  (extract)   │      │  (score)     │
└──────────────┘      └──────────────┘      └──────────────┘
                                                   │
                                                   ▼
                                           Results uploaded
                                          back to Pennsieve
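The linear chain above is the simplest case of a DAG: each processor depends only on the one before it. The sketch below is purely illustrative and is not Pennsieve's workflow registration format; the names convert, extract, and score are placeholders. It only shows that a valid execution order is any topological ordering of the dependency graph, which for a linear chain is simply left to right.

```python
from graphlib import TopologicalSorter

# Illustration-only representation of the workflow above:
# each key lists the processors whose output it consumes.
workflow = {
    "convert": set(),            # Processor A: no upstream dependencies
    "extract": {"convert"},      # Processor B consumes A's output
    "score":   {"extract"},      # Processor C consumes B's output
}

# Any topological ordering of the DAG is a valid execution order.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['convert', 'extract', 'score']
```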
Each processor runs in its own isolated container with access to a shared file system. Processors do not need to know about AWS, Step Functions, or the orchestration layer — they simply read files from a directory, do their work, and write results to another directory. The platform handles everything else: downloading data from Pennsieve, chaining processors together, passing credentials, tracking status, and archiving logs.
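As a concrete illustration of that contract, here is a minimal processor sketch in Python. The INPUT_DIR and OUTPUT_DIR environment variables and the /data fallback paths are assumptions made for this example, not the platform's documented interface; the point is the shape of a processor: read files from one directory, write results to another, and exit.

```python
import json
import os
from pathlib import Path

# Assumed environment variables; the platform injects the actual
# input/output locations when it launches the container.
INPUT_DIR = Path(os.environ.get("INPUT_DIR", "/data/input"))
OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "/data/output"))


def main() -> None:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    for src in INPUT_DIR.iterdir():
        if not src.is_file():
            continue
        # Placeholder "work": record the size of each input file.
        result = {"file": src.name, "bytes": src.stat().st_size}
        dest = OUTPUT_DIR / f"{src.stem}.json"
        dest.write_text(json.dumps(result))


if __name__ == "__main__":
    main()
```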
Processors are reusable across workflows and datasets. A single processor image can be registered once and used in many different pipelines. Because processors communicate only through files on disk, they can be written in any language and combined freely regardless of implementation.
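Because the contract is nothing more than files in and files out, chaining amounts to pointing one processor's output directory at the next processor's input directory. The sketch below simulates that wiring locally with three hypothetical scripts (processor_a.py, processor_b.py, processor_c.py); on the platform, the orchestration layer performs this step on the shared file system, and the directory layout here is an assumption for the example.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical scripts that follow the read-dir/write-dir contract.
steps = ["processor_a.py", "processor_b.py", "processor_c.py"]

workspace = Path(tempfile.mkdtemp())
current_input = workspace / "step0"
current_input.mkdir()
(current_input / "example.txt").write_text("seed data")

for i, script in enumerate(steps, start=1):
    current_output = workspace / f"step{i}"
    current_output.mkdir()
    env = {**os.environ,
           "INPUT_DIR": str(current_input),
           "OUTPUT_DIR": str(current_output)}
    subprocess.run([sys.executable, script], env=env, check=True)
    current_input = current_output  # this step's output feeds the next step
```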