Uploading files programmatically
An outline of the workflow for the second iteration of the upload process.
Pennsieve Agent - Version 2
This document describes workflows using the new Pennsieve Agent, which provides significant improvements for uploading data to the platform.
It outlines the steps to upload files to the Pennsieve platform with the new Agent.
Overview
The general flow for uploading files to a dataset is as follows:
- Select the dataset to target.
- Run the Pennsieve Agent server.
- Create a local manifest containing all files that should be uploaded to the dataset.
- Synchronize the manifest with the Pennsieve server (this happens automatically when the upload is started).
- Initiate the upload of the manifest.
- Subscribe to events while data is uploaded.
- Verify the upload status from the server.
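The steps above can be sketched as a small shell script. This is a minimal sketch, not an official script: the function names prepare_upload and start_upload and the PENNSIEVE variable are hypothetical helpers introduced for illustration, and only the pennsieve subcommands themselves come from this document. The manifest ID is passed in by hand, because this document does not describe how to read it from the CLI output.

```shell
# Command used to invoke the CLI; overridable for dry runs (e.g. PENNSIEVE=echo).
PENNSIEVE="${PENNSIEVE:-pennsieve}"

# Select the dataset, start the agent in the background, create a manifest.
# $1 = dataset ID, $2 = local folder to upload
prepare_upload() {
    "$PENNSIEVE" dataset use "$1" || return 1
    "$PENNSIEVE" agent || return 1
    "$PENNSIEVE" manifest create "$2"
}

# Upload the manifest; synchronization with the server happens automatically.
# $1 = manifest ID
start_upload() {
    "$PENNSIEVE" upload manifest "$1"
}
```

Setting PENNSIEVE=echo before calling either function prints the commands instead of running them, which is a cheap way to check the sequence before touching a real dataset.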
Installing the Pennsieve Agent
The new Pennsieve Agent can be downloaded from:
https://github.com/Pennsieve/pennsieve-agent/releases/latest
You should be able to run the installer to install the Pennsieve Agent (the Windows installer is currently not working). To check whether the Pennsieve Agent is installed, open your terminal and run pennsieve. This should print the help documentation for the Pennsieve Agent.
$ pennsieve
The Pennsieve-Agent can be used to interact with the Pennsieve Platform.
Usage:
pennsieve [command]
Available Commands:
agent Starts the Agent gRPC server
completion Generate the autocompletion script for the specified shell
config Show the current Pennsieve configuration file.
dataset Set your current working dataset.
help Help about any command
manifest Lists upload sessions.
profile Manage Pennsieve profiles
upload Upload files to the Pennsieve platform.
whoami Displays information about the logged in user.
Flags:
--db string db file (default is $HOME/.pennsieve/db.ini)
-h, --help help for pennsieve-server
-t, --toggle Help message for toggle
Use "pennsieve [command] --help" for more information about a command.
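In a script, the same installation check can be done without reading the help text. The sketch below assumes only that the installer puts a pennsieve executable on your PATH; the function name agent_installed is a hypothetical helper.

```shell
# Return 0 if the given command is on PATH, 1 otherwise.
# $1 = command name (defaults to "pennsieve")
agent_installed() {
    command -v "${1:-pennsieve}" >/dev/null 2>&1
}

if agent_installed; then
    echo "Pennsieve Agent found at: $(command -v pennsieve)"
else
    echo "Pennsieve Agent not found on PATH" >&2
fi
```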
Running the Pennsieve-Agent
The Pennsieve Agent contains two components:
- The Pennsieve Agent
- The Pennsieve Command Line Interface (CLI)
The Pennsieve Agent is an application that runs in the background and listens for commands from the CLI or any of the other Pennsieve clients (Python, MATLAB, or JavaScript). Many functions in the CLI and the clients, such as uploading files, require the agent to be running on the local machine.
You can run the agent as a background process by running the following in the terminal:
$ pennsieve agent
This runs the process in the background and allows you to use the same terminal for subsequent commands. You can also run the agent as a regular process by running:
$ pennsieve agent start
This will run the agent in the current session and will output any logging to the terminal window.
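If you want the logging but also want to keep using the terminal, ordinary shell redirection can send the output of the regular process to a file. This is a sketch using standard shell features only, and it assumes the agent writes its logging to standard output or standard error. The function name run_agent_logged and the variables PENNSIEVE and AGENT_PID are hypothetical.

```shell
# Command used to invoke the CLI; overridable for dry runs (e.g. PENNSIEVE=echo).
PENNSIEVE="${PENNSIEVE:-pennsieve}"

# Start the agent as a regular process, send its log output to a file,
# and put it in the background of this shell. The process ID is stored
# in AGENT_PID so the job can be waited on or stopped later.
# $1 = path of the log file
run_agent_logged() {
    "$PENNSIEVE" agent start >"$1" 2>&1 &
    AGENT_PID=$!
}
```

For example, run_agent_logged agent.log starts the agent, after which tail -f agent.log follows its output.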
Subscribing to events from the agent
To subscribe to messages from the agent, use the pennsieve agent subscribe command. This opens a long-lasting connection to the agent and retrieves messages about ongoing processes. You can use this to track the upload status of files during upload sessions. Multiple terminal windows can subscribe to messages from the agent at the same time.
Creating an upload manifest
In contrast to the previous version of the agent, uploading files is now a two-step process. Users first create a local manifest and then initiate uploading the manifest. The upload of all files within a single manifest is considered a single upload session.
Setting the active dataset
To specify where to upload data, you need to identify the dataset that you'll be using for the upload session. You can do this by running the pennsieve dataset use <DatasetID> command. The dataset ID should be of the form "N:dataset:xxxx...". You can find the dataset ID in the URL when you navigate to the dataset on the Pennsieve platform. (We will add other mechanisms to select the dataset going forward.)
$ pennsieve dataset use N:dataset:44ad6ead-bd8e-48a2-a249-a3fa3261cb43
This sets the active dataset for the CLI. Any commands interacting with a dataset going forward will be run against this dataset. Setting the active dataset is persistent and the dataset will remain active until the user changes the active dataset manually.
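Because the active dataset persists, a mistyped ID is worth catching before it is set. The sketch below checks the ID against the shape of the example above. It assumes that the part after the "N:dataset:" prefix is always a lowercase UUID, which this document shows but does not guarantee. The function names is_dataset_id and use_dataset and the PENNSIEVE variable are hypothetical.

```shell
# Return 0 if $1 looks like "N:dataset:<UUID>", 1 otherwise.
is_dataset_id() {
    printf '%s\n' "$1" |
        grep -Eq '^N:dataset:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

# Set the active dataset only if the ID passes the check.
# PENNSIEVE can be overridden for dry runs (e.g. PENNSIEVE=echo).
use_dataset() {
    if is_dataset_id "$1"; then
        "${PENNSIEVE:-pennsieve}" dataset use "$1"
    else
        echo "Not a dataset ID: $1" >&2
        return 1
    fi
}
```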
Creating a manifest
Next, you create a manifest by calling the pennsieve manifest create <PATH> command. When you specify a path, all files under that path will be added recursively to the manifest.
$ pennsieve manifest create ~/Desktop/testUpload
After creating the manifest, you can optionally add files by calling the pennsieve manifest add command. Each time you add files, you can use optional flags to specify directly which folder on the Pennsieve platform the files should be added to. You can leverage this functionality to create custom file-location mappings between the local file paths and those on the Pennsieve platform.
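Since a manifest picks up files recursively, it can help to count the files under a folder before creating the manifest. The sketch below uses only standard tools, and the function name count_files is hypothetical. Treat the result as an estimate: this document does not say whether the agent skips hidden files or follows symbolic links.

```shell
# Print the number of regular files under $1, searching recursively.
# File names containing newlines would be counted more than once.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

For example, count_files ~/Desktop/testUpload prints how many files the manifest is expected to contain.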
Uploading files from a manifest
Once you have created a manifest, you can initiate uploading the manifest using the pennsieve upload manifest <ManifestID> command. This directs the agent to start uploading the files in the background. The agent uses multiple threads to upload the files efficiently.
To check the progress of the upload session, use the pennsieve agent subscribe command. In the CLI, running this command shows a dynamic list of files and their progress.
You can also check the status of the files for a manifest using the pennsieve manifest list command. This will show a list of all files in a manifest and their current status. Each file can have one of the following statuses associated with it. These identify where the file is in the import process for the Pennsieve platform:
- LOCAL: This means the file is added to a local manifest, but the Pennsieve platform has not been informed that it will be uploaded.
- REGISTERED: The local file status and the remote file status are synchronized. The Pennsieve platform is expecting this file to be uploaded.
- UPLOADED: The file is successfully uploaded to the Pennsieve platform and is currently queued to be imported into a dataset and moved to the right storage bucket.
- IMPORTED: The file is successfully uploaded to the Pennsieve platform and has been registered in the Pennsieve database. It is currently scheduled to be post-processed and moved to its final storage location.
- FINALIZED: The file has successfully been imported and is stored in the final storage location.
- VERIFIED: A client was successfully notified that the file was finalized. This is the final state of the upload pipeline.
- CANCELLED: The upload of the file was started but then cancelled by the user. Synchronizing the manifest with the server will place the file in REGISTERED status again.
- FAILED: The file failed to be imported correctly. Rerunning upload will try to upload the file again.
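The order of the successful path through these statuses can be written down as a small lookup. This is a sketch based only on the descriptions above, and the function name next_status is hypothetical. CANCELLED and FAILED are left out because they sit outside the successful path.

```shell
# Print the status that follows $1 on the successful upload path.
# Prints nothing and returns 1 for VERIFIED (the final state) and for
# any status that is not on the successful path.
next_status() {
    case "$1" in
        LOCAL)      echo REGISTERED ;;
        REGISTERED) echo UPLOADED ;;
        UPLOADED)   echo IMPORTED ;;
        IMPORTED)   echo FINALIZED ;;
        FINALIZED)  echo VERIFIED ;;
        *)          return 1 ;;
    esac
}
```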
Finally
There are a number of improvements that we will be adding to the agent going forward, but the agent should be fully functional if used in the outlined manner.