Introduction to Pennsieve Analytics
Pennsieve Analytics allows scientists to run analysis workflows at scale within the Pennsieve Data Ecosystem.
This functionality is in beta and tested with a few partner efforts. Please reach out to our team if you are interested in learning more.
Sustainable analytic pipelines
The Pennsieve platform supports scalable and sustainable data workflow management and deployment. To manage costs and to de-risk enabling compute on the Pennsieve Data Ecosystem, the platform requires users to bring their own compute resources (BYOC). That means that any cost associated with running an analysis is paid by the user rather than the Pennsieve team. This allows us to provide a scalable solution without artificially limiting analysis access or throttling speeds to minimize costs.
The goal of Pennsieve Analytics is to provide a seamless way for users to submit and run analysis pipelines without having to worry about infrastructure, cloud deployments, or software engineering. We aim to make this functionality available to anyone who currently runs analyses over scientific data on their own machines using either Python or R. Researchers upload datasets (imaging files, time series recordings, tabular data) and organize them into packages within workspaces. Processors are the compute layer that transforms this data.
A workflow defines a pipeline of one or more processors arranged as a directed acyclic graph (DAG). When a user triggers a workflow on a dataset, the platform downloads the selected files, runs each processor in order, and makes the results available. Processors can be chained: the output of one becomes the input of the next, enabling multi-step pipelines such as format conversion followed by feature extraction followed by quality scoring.
Pennsieve Dataset
        │
        ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│  Processor A  │ ───▶ │  Processor B  │ ───▶ │  Processor C  │
│   (convert)   │      │   (extract)   │      │    (score)    │
└───────────────┘      └───────────────┘      └───────────────┘
                                                      │
                                                      ▼
                                              Results uploaded
                                              back to Pennsieve
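The chaining model above can be sketched as a toy orchestrator in Python. The processor functions here are stand-ins for container images, and the directory-handoff convention is an illustration of the idea, not the platform's actual API:

```python
import tempfile
from pathlib import Path

def convert(inp: Path, out: Path) -> None:
    # Stand-in for a format-conversion processor: copy each file as .txt.
    for f in inp.iterdir():
        (out / (f.stem + ".txt")).write_text(f.read_text())

def extract(inp: Path, out: Path) -> None:
    # Stand-in for a feature-extraction processor: write a word count per file.
    for f in inp.iterdir():
        (out / f.name).write_text(str(len(f.read_text().split())))

def run_pipeline(steps, dataset_dir: Path) -> Path:
    """Run each processor in order: the output directory of one step
    becomes the input directory of the next (a linear DAG)."""
    current = dataset_dir
    for step in steps:
        nxt = Path(tempfile.mkdtemp(prefix=step.__name__ + "_"))
        step(current, nxt)
        current = nxt
    return current  # final step's output directory
```

Because each step communicates only through files on disk, swapping a processor for one written in R (or any other language) would not change the orchestration logic.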
Each processor runs in its own isolated container with access to a shared file system. Processors do not need to know about AWS, Step Functions, or the orchestration layer: they simply read files from a directory, do their work, and write results to another directory. The platform handles everything else: downloading data from Pennsieve, chaining processors together, passing credentials, tracking status, and archiving logs.
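A minimal processor following this read-transform-write contract might look like the sketch below. The environment-variable names, default paths, and the CSV transformation are assumptions for illustration, not part of the platform's actual interface:

```python
import os
from pathlib import Path

def run_processor(input_dir: Path, output_dir: Path) -> list[str]:
    """Read every CSV in input_dir, uppercase its header row, and write
    the result to output_dir. Returns the names of the files written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(input_dir.glob("*.csv")):
        lines = src.read_text().splitlines()
        if lines:
            lines[0] = lines[0].upper()  # placeholder for real analysis work
        (output_dir / src.name).write_text("\n".join(lines) + "\n")
        written.append(src.name)
    return written

# Hypothetical convention: the platform mounts the shared file system and
# exposes the directory paths via environment variables (names assumed here).
if __name__ == "__main__" and "INPUT_DIR" in os.environ:
    run_processor(Path(os.environ["INPUT_DIR"]),
                  Path(os.environ["OUTPUT_DIR"]))
```

Packaged into a container image, a script like this needs no cloud-specific code; the orchestration layer decides which files appear in the input directory and what happens to the output.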
Processors are reusable across workflows and datasets. A single processor image can be registered once and used in many different pipelines. Because processors communicate only through files on disk, they can be written in any language and combined freely regardless of implementation.
There are costs associated with running analysis using Pennsieve. Make sure you understand how running compute on your cloud resource is invoiced.