Uploading files using the Pennsieve Agent

Programmatically upload files to the Pennsieve platform

Prerequisites

1. Installing the Pennsieve Agent

The latest version of the Pennsieve Agent can be downloaded from:
https://github.com/Pennsieve/pennsieve-agent/releases/latest

Instructions
Windows: download the file with the .msi extension. This is a Windows installer and will guide you through the steps to install the Pennsieve Agent.
Mac: download the file with the .pkg extension. This is a Mac installer and will guide you through the steps to install the Pennsieve Agent. If you can't open the installer because of security settings on your computer, right-click on the file in Finder and select 'Open'.
Linux: download the file with the .deb extension and use apt or dpkg to install the package.

2. Configuring the Pennsieve Agent

A detailed page describing how to configure the Pennsieve Agent and how to set up profiles associated with your Pennsieve user account can be found here: Configuring client credentials.

3. [OPTIONAL] Installing the Pennsieve Python client

In order to work with the Pennsieve Python library, install the latest version with the following command (Anaconda console for Windows, terminal for Linux/Mac):

pip install -U pennsieve2

This command should download and install the latest release of the Pennsieve package for Python (temporarily called pennsieve2).

Verifying the installation

You should be able to run the installer to install the Pennsieve Agent. To check if the Pennsieve Agent is installed, open your terminal and run pennsieve. This should return some help documentation for the Pennsieve Agent.

For Python, open Jupyter Notebook or start python and import the Pennsieve class from the pennsieve2 package.

$ pennsieve

The Pennsieve-Agent can be used to interact with the Pennsieve Platform.

Usage:
  pennsieve [command]

Available Commands:
  agent       Starts the Agent gRPC server
  completion  Generate the autocompletion script for the specified shell
  config      Show the current Pennsieve configuration file.
  dataset     Set your current working dataset.
  help        Help about any command
  manifest    Lists upload sessions.
  profile     Manage Pennsieve profiles
  upload      Upload files to the Pennsieve platform.
  whoami      Displays information about the logged in user.

Flags:
      --db string   db file (default is $HOME/.pennsieve/db.ini)
  -h, --help        help for pennsieve-server
  -t, --toggle      Help message for toggle

Use "pennsieve [command] --help" for more information about a command.

from pennsieve2 import Pennsieve

If this import succeeds, the Pennsieve Python client has been installed successfully. Please note that in order to use the Python package, you need to have the Pennsieve Agent installed and properly configured on your system.
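
Both checks described in this section can also be scripted. The following is a minimal sketch that uses only the Python standard library to report whether a pennsieve executable is on the PATH and whether the pennsieve2 package can be found. The helper name check_installation is hypothetical and is not part of any Pennsieve package.

```python
import importlib.util
import shutil


def check_installation(cli_name="pennsieve", package_name="pennsieve2"):
    """Report whether the CLI executable and the Python package can be found.

    Hypothetical helper for illustration. It only looks things up; it never
    starts the agent and never imports the package.
    """
    return {
        "cli_on_path": shutil.which(cli_name) is not None,
        "python_package_found": importlib.util.find_spec(package_name) is not None,
    }


if __name__ == "__main__":
    for check, ok in check_installation().items():
        print(f"{check}: {'yes' if ok else 'no'}")
```

A "no" for cli_on_path usually points to the installer step; a "no" for python_package_found points to the pip step.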

Running the Pennsieve Agent

The Pennsieve Agent contains two components:

  • The Pennsieve Agent
  • The Pennsieve Command Line Interface (CLI)

The Pennsieve Agent is an application that runs in the background and listens to commands from the CLI or any of the other Pennsieve clients (currently supported are command line, Python, and JavaScript). Most of the functions that are available in the CLI or clients require that the agent is running on the local machine, because the Python/JavaScript clients rely on the Pennsieve Agent to establish the connection with the Pennsieve platform.

You can run the agent as a background process by calling in the terminal:

$ pennsieve agent

This runs the process in the background and allows you to use the same terminal for subsequent commands. You can also run the agent as a regular process by running:

$ pennsieve agent start

This will run the agent in the current session and will output any logging to the terminal window.

Using the Pennsieve Agent

Before you can use the Pennsieve Agent CLI or one of the clients, the Pennsieve Agent needs to be running.
The most common workflow includes the following:

  1. [Python/JavaScript] Initiating the Pennsieve client
  2. Selecting the dataset.
  3. Creating an upload manifest and adding files.
  4. Uploading the files.
  5. Monitoring the status of the manifest.
  6. Verifying upload status from the server.

1. [Python/JavaScript] Initiating Pennsieve client

Using the Pennsieve Agent from the Python/JavaScript clients requires starting the Pennsieve Agent first (please refer to the instructions above).

For Python, pennsieve2 users need to import the library and create an instance of the Pennsieve class. The Python client automatically reads the Pennsieve Agent configuration file and attempts to connect to the Pennsieve Agent instance, which by default listens on port 9000. This default can be overridden by specifying a custom host and port number (please refer to the package).

from pennsieve2 import Pennsieve  # to be substituted in future
p = Pennsieve()

If the p=Pennsieve() command times out or results in an error message, the most probable cause is that the Pennsieve Agent has not been started in a separate terminal. Please refer to the Running the Pennsieve Agent instructions above.
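
To tell a stopped agent apart from other problems before creating the client, you can first test whether anything is listening on the agent's port. This is a minimal sketch using only the standard library; the function name agent_port_is_open is hypothetical, and host "localhost" with port 9000 is assumed from the default mentioned above.

```python
import socket


def agent_port_is_open(host="localhost", port=9000, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    A successful connection only shows that a process is listening there;
    it does not prove that the process is the Pennsieve Agent.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    if agent_port_is_open():
        print("Port 9000 accepts connections; the client may be able to connect.")
    else:
        print("Nothing is listening on port 9000; try starting the agent first.")
```

If you configured a custom host or port for the agent, pass those values instead of the defaults.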

2. Setting the active dataset

In order to specify where to upload data, you need to identify the dataset that you'll be using for the upload session.
For the Pennsieve Agent CLI, this can be accomplished by running the pennsieve dataset use <DatasetID> command, where the dataset ID needs to be in the form "N:dataset:xxxx...". You can find the dataset ID as part of the URL if you navigate to the dataset on the Pennsieve platform.
For the Pennsieve Python client, you can use either the long name of the dataset (its dataset ID) or the name you gave it. For example, if you called your dataset xyz, you can select it by referring either to that name or to the long name.

$ pennsieve dataset use N:dataset:44ad6ead-bd8e-48a2-a249-a3fa3261cb43
p.use_dataset('xyz')  # or p.use_dataset('N:dataset:44ad6ead-bd8e-48a2-a249-a3fa3261cb43')

This command sets the active dataset for the CLI/Python. Any commands interacting with a dataset going forward will be run against this dataset. Setting the active dataset is persistent and the dataset will remain active until the user changes the active dataset manually.
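
Because the active dataset persists, a mistyped ID is worth catching before it is set. The sketch below checks a string against the shape of the example ID in this guide. The pattern (the prefix "N:dataset:" followed by a UUID, matched case-insensitively) is an assumption inferred from that single example, not a documented specification, and the helper name is hypothetical.

```python
import re

# Assumed pattern, inferred from the one example ID shown in this guide:
# the prefix "N:dataset:" followed by a UUID. Case is ignored to stay lenient.
DATASET_ID_PATTERN = re.compile(
    r"N:dataset:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def looks_like_dataset_id(value):
    """Return True if value matches the assumed 'N:dataset:<uuid>' shape."""
    return DATASET_ID_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("N:dataset:44ad6ead-bd8e-48a2-a249-a3fa3261cb43", "xyz"):
        kind = "dataset ID" if looks_like_dataset_id(candidate) else "dataset name"
        print(f"{candidate!r} would be treated as a {kind}")
```

A check like this only validates the format; whether the dataset exists and is accessible is decided by the platform.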

3. Creating an upload manifest and adding files

In contrast to the previous version of the agent, uploading files is now a two-step process. Users first create a local manifest and then initiate the upload of that manifest. Uploading all files within a single manifest is considered a single upload session.

Creating a manifest
In order to prepare files for an upload to the platform, users need to provide a path, either to a file or to a directory.
When you specify a path to a directory, all files under that path will be added recursively to the manifest.
For example, to index all the files within the ~/Desktop/testUpload folder, execute the following command:

$ pennsieve manifest create ~/Desktop/testUpload
p.manifest.create('~/Desktop/testUpload')  #add directory or file
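
Since a directory is added recursively, it can help to preview which local files sit under a path before creating the manifest. This is a minimal sketch using only pathlib; it approximates the recursion on the local file system and is not the agent's own logic (for example, how the agent treats hidden files or symbolic links is not covered here). The function name is hypothetical.

```python
from pathlib import Path


def preview_manifest_files(path):
    """List the regular files found at or below path, sorted by path.

    This walks the local file system only; it approximates what a recursive
    add would pick up and does not contact the Pennsieve Agent.
    """
    root = Path(path).expanduser()
    if root.is_file():
        return [root]
    return sorted(p for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    for file_path in preview_manifest_files("~/Desktop/testUpload"):
        print(file_path)
```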

Adding files to the manifest
After creating the manifest, you can optionally add files by calling the following command:

$ pennsieve manifest add ~/Desktop/test.txt
p.manifest.add('~/Desktop/test.txt')

Each time you add files, you can use optional flags to specify directly which folder on the Pennsieve platform the files should be added to (target_base_path). You can leverage this functionality to create custom mappings between local file paths and file locations on the Pennsieve platform. You can also work with more than one manifest by providing a manifest_id.
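
A custom mapping between local paths and platform folders could be driven from a dictionary, as in the sketch below. It assumes that the Python client's p.manifest.add accepts target_base_path as a keyword argument, matching the option name above; verify this against your installed pennsieve2 version. The helper name and the folder names in the usage comment are hypothetical.

```python
def add_with_targets(client, targets):
    """Add each local path to the manifest under its chosen platform folder.

    client  -- object exposing manifest.add(path, target_base_path=...),
               assumed here to be a pennsieve2 Pennsieve instance
    targets -- dict mapping a local path to a folder on the platform
    """
    for local_path, platform_folder in targets.items():
        # Assumption: target_base_path is accepted as a keyword argument.
        client.manifest.add(local_path, target_base_path=platform_folder)


# Usage, with p created as shown earlier and hypothetical folder names:
# add_with_targets(p, {"~/Desktop/test.txt": "raw", "~/Desktop/notes.txt": "docs"})
```

Passing the client in as an argument keeps the helper independent of how the connection to the agent was set up.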

4. Uploading files to Pennsieve platform

Once you have created a manifest, you can initiate uploading the manifest using the following command:

$ pennsieve upload manifest <ManifestID>
p.manifest.upload()  # e.g. p.manifest.upload(manifest_id=57) if the manifest identifier is 57

Note that the Python client uses the most recent manifest_id if the user does not provide one.
The upload command directs the agent to start uploading the files in the background. The agent uses multiple threads to upload the files efficiently.
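
Steps 2 to 4 can be chained in a small helper. The sketch below uses only the Python calls shown in this guide (use_dataset, manifest.create, manifest.add, manifest.upload) and relies on the upload defaulting to the most recent manifest. The function name upload_directory and its parameters are hypothetical.

```python
def upload_directory(client, dataset, directory, extra_files=()):
    """Select a dataset, build a manifest, and start the upload.

    client      -- assumed to be a pennsieve2 Pennsieve instance
    dataset     -- dataset name or dataset ID
    directory   -- local directory (or file) used to create the manifest
    extra_files -- optional additional paths to add to the same manifest
    """
    client.use_dataset(dataset)           # step 2: set the active dataset
    client.manifest.create(directory)     # step 3: create the manifest
    for path in extra_files:
        client.manifest.add(path)         # step 3: add further files
    client.manifest.upload()              # step 4: upload the latest manifest


# Usage, with the agent already running:
# from pennsieve2 import Pennsieve
# upload_directory(Pennsieve(), "xyz", "~/Desktop/testUpload", ["~/Desktop/test.txt"])
```

The helper returns as soon as the upload has been requested; the transfer itself continues in the agent.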

5. Monitoring the progress

In order to check the progress of the upload session, use the subscribe command.
In the CLI, running this command shows a dynamic list of files and their progress:

$ pennsieve agent subscribe
p.subscribe(34) #enter any number

Subscribing lets you monitor live updates of the uploads in the form of progress bars.

6. Verifying upload status

The status of the files can be checked by listing the manifests.
For the CLI, you can provide a manifest ID to inquire about a specific manifest.
In order to verify the upload status of the files, use the following command:

$ pennsieve manifest list <ManifestID>
p.manifest.list_files()  
#Alternatively: p.manifest.list_files(manifest_id=57) if the requested ManifestID is 57

This will show a list of all files in a manifest and their current status. Each file can have one of the following statuses associated with it. These identify where the file is in the import process for the Pennsieve platform:

  1. LOCAL: This means the file is added to a local manifest, but the Pennsieve platform has not been informed that it will be uploaded.
  2. REGISTERED: The local file status and the remote file status are synchronized. The Pennsieve Platform is expecting this file to be uploaded.
  3. UPLOADED: The file is successfully uploaded to the Pennsieve platform and is currently queued to be imported in a dataset and moved to the right storage bucket.
  4. IMPORTED: The file is successfully uploaded to the Pennsieve platform and has been registered in the Pennsieve database. It is currently scheduled to be post-processed and moved to its final storage location.
  5. FINALIZED: The file has successfully been imported and is stored in the final storage location.
  6. VERIFIED: A client was successfully notified that the file was finalized. This is the final state of the upload pipeline.
  7. CANCELLED: The upload of the file was started but then cancelled by the user. Synchronizing the manifest with the server will place the file in REGISTERED (synchronized) status again.
  8. FAILED: The file failed to be imported correctly. Rerunning upload will try to upload the file again.
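
When a manifest contains many files, a per-status tally is easier to read than the full list. The sketch below groups the statuses described above into three buckets. The grouping is an interpretation of those descriptions (only VERIFIED is called the final state), and the input is assumed to be a plain list of status strings that you have extracted yourself; the return format of p.manifest.list_files() is not assumed here.

```python
from collections import Counter

# Grouping based on the status descriptions in this guide (an interpretation).
IN_PROGRESS = {"LOCAL", "REGISTERED", "UPLOADED", "IMPORTED", "FINALIZED"}
DONE = {"VERIFIED"}
NEEDS_ATTENTION = {"CANCELLED", "FAILED"}


def summarize_statuses(statuses):
    """Count how many files are done, in progress, or need attention.

    Statuses are compared case-insensitively; anything not listed in this
    guide is counted as 'unknown'.
    """
    counts = Counter(status.upper() for status in statuses)
    summary = {
        "done": sum(counts[s] for s in DONE),
        "in_progress": sum(counts[s] for s in IN_PROGRESS),
        "needs_attention": sum(counts[s] for s in NEEDS_ATTENTION),
    }
    summary["unknown"] = sum(counts.values()) - sum(summary.values())
    return summary


if __name__ == "__main__":
    print(summarize_statuses(["LOCAL", "UPLOADED", "VERIFIED", "FAILED"]))
```

A non-zero needs_attention count suggests rerunning the upload for the affected manifest.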

Finally

There are a number of improvements that we will be adding to the agent going forward, but the agent should be fully functional if used in the outlined manner.