Introduction to Pennsieve Analytics

Pennsieve Analytics allows scientists to run analysis workflows at scale within the Pennsieve Data Ecosystem.

📘

This functionality is in beta and is currently being tested with a small number of partner efforts. Please reach out to our team if you are interested in learning more.

Sustainable analytics

The Pennsieve platform supports scalable and sustainable data workflow management and deployment. To manage costs, and to de-risk enabling compute on the Pennsieve Data Ecosystem, the Pennsieve platform requires users to bring their own compute resources (BYOC). This means that any cost associated with running an analysis is paid by the user rather than by the Pennsieve team. It allows us to provide a scalable solution without artificially limiting analysis access or throttling speeds to minimize costs.

The goal of Pennsieve Analytics is to provide a seamless solution for users to submit and run analysis pipelines without having to worry about infrastructure, cloud deployments, or software engineering. We aim to make this functionality available to anyone who currently runs analyses over scientific data on their own machine using either Python or R.

Simplified data infrastructure setup

Simplified diagram of the Pennsieve infrastructure for Analytics

Running analytics using the Pennsieve Data Platform requires the following steps:

1. Registering a compute resource: Users will need to register a compute resource within a Pennsieve workspace. Currently, we only support AWS as a compute resource. Registering a compute resource is done through the Pennsieve Agent. When you register a compute resource, you grant the Pennsieve platform permission to deploy infrastructure on your compute resource. The platform is only granted the minimum permissions on your compute resource necessary to: a) run Dockerized containers, b) move data to and from the compute resource, and c) set up a node manager that can interact with the Pennsieve platform.

2. Registering analytic workflows: Users will need to register analytic workflow components within a Pennsieve workspace. Analytic workflow components are managed through GitHub and the Pennsieve GitHub integration. There are some requirements for an analytic workflow component to qualify and run within Pennsieve Analytics, but in general, if a workflow can be Dockerized, reads its data from an input folder, and saves its results to an output folder, it is likely to run within Pennsieve Analytics (see the sketch after this list).

3. Granting workflows access to data: Users need to explicitly grant workflows access to specific datasets. This step ensures that workflows do not automatically have access to all data within a workspace, and it allows for secure internal sharing and management of dataset privacy.

4. Selecting a dataset or files and initiating the workflow: Users can use the Pennsieve App, or the API, to select files within a dataset and initiate a workflow. Depending on the type of workflow, this can result in new files being generated in the dataset, changes to the dataset's metadata graph, file annotations, and other components of the dataset.
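
As a rough illustration of the input/output contract described in step 2, a Dockerized workflow component might look like the Python sketch below. The `/input` and `/output` paths and the `INPUT_DIR`/`OUTPUT_DIR` environment variables are illustrative assumptions, not fixed Pennsieve conventions; consult the workflow component requirements for the exact layout expected by the platform.

```python
import csv
import os
from pathlib import Path

# Illustrative assumption: the container is started with the staged input data
# and a writable results folder exposed through environment variables
# (falling back to fixed paths).
INPUT_DIR = Path(os.environ.get("INPUT_DIR", "/input"))
OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "/output"))


def summarize_csv(csv_path: Path) -> dict:
    """Compute a trivial per-file summary: number of rows and columns."""
    with csv_path.open(newline="") as f:
        rows = list(csv.reader(f))
    return {
        "file": csv_path.name,
        "rows": len(rows),
        "columns": len(rows[0]) if rows else 0,
    }


def main() -> None:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    # Read everything the platform staged in the input folder...
    summaries = [summarize_csv(p) for p in sorted(INPUT_DIR.glob("*.csv"))]

    # ...and write all results to the output folder so they can be moved back
    # into the dataset when the workflow finishes.
    out_file = OUTPUT_DIR / "summary.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "rows", "columns"])
        writer.writeheader()
        writer.writerows(summaries)


if __name__ == "__main__":
    main()
```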

While a workflow is in progress, users can check status logs and cancel the process if necessary.
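
For programmatic use, initiating a workflow run and polling its status might look roughly like the sketch below. The endpoint paths, payload fields, response keys, and the `PENNSIEVE_API_HOST`/`PENNSIEVE_API_TOKEN` variables are hypothetical placeholders, not the documented Pennsieve Analytics API; refer to the Pennsieve API documentation for the actual endpoints and authentication flow.

```python
import os
import time

import requests

# Hypothetical placeholders; the real host, endpoints, and auth flow are
# described in the Pennsieve API documentation.
API_HOST = os.environ.get("PENNSIEVE_API_HOST", "https://api.pennsieve.io")
API_TOKEN = os.environ["PENNSIEVE_API_TOKEN"]
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}


def start_workflow(workflow_id: str, dataset_id: str, file_ids: list[str]) -> str:
    """Initiate a workflow run over selected files (illustrative endpoint)."""
    resp = requests.post(
        f"{API_HOST}/analytics/workflows/{workflow_id}/runs",  # hypothetical path
        json={"datasetId": dataset_id, "fileIds": file_ids},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["runId"]


def wait_for_completion(run_id: str, poll_seconds: int = 30) -> str:
    """Poll the run's status until it reaches a terminal state."""
    while True:
        resp = requests.get(
            f"{API_HOST}/analytics/runs/{run_id}",  # hypothetical path
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        status = resp.json()["status"]
        if status in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            return status
        time.sleep(poll_seconds)
```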


🚧

There are costs associated with running analyses using Pennsieve. Make sure you understand how running compute on your cloud resource is invoiced.