This document contains information on how to work with the Sensitive Data Archive (SDA) from the GDI user perspective.
The storage and interfaces software stack for the GDI-starter-kit consists of the following services:
| Component | Description |
|---|---|
| broker | RabbitMQ-based message broker, SDA-MQ. |
| database | PostgreSQL database, SDA-DB. |
| storage | S3 object store; the demo uses MinIO S3. |
| auth | OpenID Connect relying party and authentication service, SDA-auth. |
| s3inbox | Proxy inbox to the S3 backend store, SDA-S3Inbox. |
| download | Data out solution for downloading files from the SDA, SDA-download. |
| SDA-pipeline | The ingestion pipeline of the SDA, SDA-pipeline. It comprises the following core components: ingest, verify, finalize and mapper. |
Detailed documentation on the sda-pipeline can be found at: https://neic-sda.readthedocs.io/en/latest/services/pipeline.
NeIC Sensitive Data Archive documentation can be found at: https://neic-sda.readthedocs.io/en/latest/ .
Before deploying the stack, please make sure that all configuration files are in place. The following files need to be created from their respective examples:
```sh
cp ./config/config.yaml.example ./config/config.yaml
cp ./config/iss.json.example ./config/iss.json
cp ./.env.example ./.env
```

No further editing of the above files is required for running the stack locally.
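As a quick sanity check before bringing the stack up, the presence of the three generated files can be verified with a short loop. This is only a sketch; from a real checkout you would run just the for-loop from the repo root, whereas here the files are first created in a scratch directory so the example is self-contained:

```sh
# Demonstrate the check in a scratch directory with the files created first;
# in a real checkout, run only the for-loop from the repo root.
tmp=$(mktemp -d) && cd "$tmp"
mkdir config
touch config/config.yaml config/iss.json .env

ok=1
for f in ./config/config.yaml ./config/iss.json ./.env; do
    if [ ! -e "$f" ]; then
        echo "missing: $f (create it from ${f}.example)"
        ok=0
    fi
done
[ "$ok" -eq 1 ] && echo "all configuration files in place"
```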
The storage and interfaces stack can be deployed with the provided docker-compose.yml file by running the following from the root of this repo:

```sh
docker compose -f docker-compose.yml up -d
```

Please note that in its current form the compose file configures services to work out-of-the-box with the LS-AAI-mock service, which itself needs some configuration beforehand; see here for step-by-step instructions. The rationale behind this setup is to allow for a seamless transition to an environment with a live LS-AAI service, as discussed briefly below.
Configuration can be further customized by changing the files listed at the top of the ./.env file along with the .env file itself. Please bear in mind that environment variables take precedence over the config.yaml file.
Lastly, this repo includes a docker-compose-demo.yml file which deploys a standalone stack along with a demo of the sda services' functionality with test data. Details can be found here.
Internet-facing services, such as s3inbox, download and auth, need to be secured via TLS certificates. This can most conveniently be achieved by using Let's Encrypt as the Certificate Authority. Assuming shell access to your web host, a convenient way to set this up is to install Certbot (or any other ACME client supported by Let's Encrypt). Detailed instructions on setting up Certbot for different systems can be found here.
To interact with SDA services, users need to provide a JSON Web Token (JWT) for authorization. Tokens are ultimately issued by LS-AAI upon user login through an OpenID Connect (OIDC) relying party (RP) service that is registered with LS-AAI. An example of such an RP service is sda-auth, which is included in the present stack.
Assuming users have a valid LS-AAI ID, they can obtain a JWT by logging in to the sda-auth service. This can be done by navigating to the sda-auth service URL (e.g. https://localhost:8085 for a local deployment or https://login.gdi.nbis.se for a live one) and clicking on the Login button. This will redirect the user to the LS-AAI login page where they can enter their credentials. Once authenticated, the user will be redirected back to the sda-auth service and a JWT will be issued. This is an access token which can be copied from sda-auth's page and used to interact with the SDA services, e.g. for authorizing calls to sda-download's API as described in the Downloading data section below.
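Since the access token is a standard JWT, its claims (subject, expiry, and so on) can be inspected locally by base64-decoding the payload. A minimal sketch, using a made-up token in place of a real one from sda-auth (real tokens are base64url-encoded, hence the tr step, and their payloads carry more claims than shown here):

```sh
# Build a made-up token for illustration; a real one is copied from sda-auth's page.
header=$(printf '{"alg":"RS256","typ":"JWT"}' | base64 | tr -d '=\n')
claims=$(printf '{"sub":"jd123@lifescience-ri.eu","exp":1700000000}' | base64 | tr -d '=\n')
token="${header}.${claims}.dummy-signature"

# The payload is the second dot-separated field of the token;
# map base64url characters back to the standard base64 alphabet.
payload=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
# Restore base64 padding before decoding.
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
printf '%s' "$payload" | base64 -d
```

This is handy for checking whether a token has expired before blaming a failing upload or download on the services themselves.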
From sda-auth's page users can also download a configuration file for accessing the s3inbox service. This s3cmd.conf file contains the aforementioned access token along with other necessary information, and is described in detail in the Uploading data section below.
The sda-pipeline only ingests files encrypted with the archive's c4gh public key. For instance, using the Go implementation of the crypt4gh utility, a file can be encrypted simply by running:

```sh
crypt4gh encrypt -f <file-to-encrypt> -p <sda-c4gh-public-key>
```

where <sda-c4gh-public-key> is the archive's public key. Note that docker-compose.yml stores the archive's c4gh public key in a volume named shared; see below for how to extract it.
Users can upload data to the SDA by transferring them directly to the archive's s3inbox with an S3 client tool such as s3cmd:
```sh
s3cmd -c s3cmd.conf put <path-to-file.c4gh> s3://<USER_LS-AAI_ID>/<target-path-to-file.c4gh>
```

where USER_LS-AAI_ID is the user's LS-AAI ID with the @ replaced by a _, and s3cmd.conf is a configuration file with the following content:

```
[default]
access_key = <USER_LS-AAI_ID>
secret_key = <USER_LS-AAI_ID>
access_token=<JW_TOKEN>
check_ssl_certificate = False
check_ssl_hostname = False
encoding = UTF-8
encrypt = False
guess_mime_type = True
host_base = <S3_INBOX_DOMAIN_NAME>
host_bucket = <S3_INBOX_DOMAIN_NAME>
human_readable_sizes = true
multipart_chunk_size_mb = 50
use_https = True
socket_timeout = 30
```

It is possible to download the s3cmd.conf file from the sda-auth service as described in the Authentication for users with LS-AAI (mock or live) section above. However, do note that the s3cmd.conf downloaded from this service lacks the section header [default], which needs to be added manually if one wishes to use the file directly with s3cmd.
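Adding the missing header can be done with a one-line GNU sed edit. The sketch below demonstrates this on a mock two-line file standing in for the downloaded one:

```sh
# Mock downloaded file without the [default] header, standing in for the real one.
cat > s3cmd.conf <<'EOF'
access_key = jd123_lifescience-ri.eu
secret_key = jd123_lifescience-ri.eu
EOF

# Prepend the section header that s3cmd expects (GNU sed in-place edit).
sed -i '1i [default]' s3cmd.conf
head -n 1 s3cmd.conf
```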
For example, a s3cmd.conf file downloaded from auth after deploying the stack locally (with LS-AAI-mock as OIDC) would look like this:
```
access_key = jd123_lifescience-ri.eu
secret_key = jd123_lifescience-ri.eu
access_token=eyJraWQiOiJyc2ExIiwidH...
check_ssl_certificate = False
check_ssl_hostname = False
encoding = UTF-8
encrypt = False
guess_mime_type = True
host_base = localhost:8000
host_bucket = localhost:8000
human_readable_sizes = true
multipart_chunk_size_mb = 50
socket_timeout = 30
```

where the access token has been truncated for brevity. Please note that the option use_https = True is missing from the above file (and is therefore implicitly set to False), since the local deployment of the stack does not use TLS.
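For reference, the access_key above is derived from the user's LS-AAI ID (here the hypothetical jd123@lifescience-ri.eu) by replacing the @ with an underscore, which can be done in the shell:

```sh
# Hypothetical LS-AAI ID; the inbox access_key/secret_key use _ in place of @.
lsaai_id='jd123@lifescience-ri.eu'
access_key=$(printf '%s' "$lsaai_id" | tr '@' '_')
echo "$access_key"    # prints jd123_lifescience-ri.eu
```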
Instead of the tools above, users are encouraged to use sda-cli, which is a tool specifically developed to perform all common SDA user-related tasks in a convenient and unified manner. It is recommended to use precompiled executables for sda-cli which can be found at https://github.com/NBISweden/sda-cli/releases
To start using the tool run:
```sh
./sda-cli help
```

- Encrypt and upload a file to the SDA in one go:

```sh
./sda-cli upload -config s3cmd.conf --encrypt-with-key <sda-c4gh-public-key> <unencrypted_file_to_upload>
```

- Encrypt and upload a whole folder recursively to a specified path, which can be different from the source, in one go:

```sh
./sda-cli upload -config s3cmd.conf --encrypt-with-key <sda-c4gh-public-key> -r <folder_1_to_upload> -targetDir <upload_folder>
```

- List all uploaded files in the user's bucket recursively:

```sh
./sda-cli list -config s3cmd.conf
```

For detailed documentation on the tool's capabilities and usage please see here.
Users can directly download data from the SDA via sda-download; for more details see the service's API reference. In short, given a valid JWT, $token, a user can download the file with file ID $fileID by issuing the following command:
```sh
curl --cacert <path-to-certificate-file> -H "Authorization: Bearer $token" https://<sda-download_DOMAIN_NAME>/files/$fileID -o <output-filename>
```

where, for example, sda-download_DOMAIN_NAME can be login.gdi.nbis.se or localhost:8443 depending on the deployment. In the case of a local deployment, the certificate file can be obtained by running:

```sh
docker cp download:/shared/cert/ca.crt .
```

The fileID is a unique file identifier that can be obtained by calls to sda-download's /datasets endpoint. For details and a concrete example on how to use sda-download with demo data please see here.
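To illustrate the shape of such a call, the sketch below extracts a fileID from a trimmed-down sample response; the field names here are assumptions modeled on sda-download's API reference, and with jq available `jq -r '.[0].fileId'` does the same job:

```sh
# Trimmed-down stand-in for the JSON returned when listing a dataset's files
# (field names are an assumption; check sda-download's API reference).
response='[{"fileId":"urn:neic:001-002","displayFileName":"test_file.c4gh"}]'

# Pull the first fileId out without requiring jq.
fileID=$(printf '%s' "$response" | sed -n 's/.*"fileId":"\([^"]*\)".*/\1/p')
echo "$fileID"    # prints urn:neic:001-002
```

The extracted value can then be fed straight into the curl command above.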
In order for a user to access a file, permission to access the dataset that the file belongs to is needed. This is granted through REMS in the form of GA4GH visas. For details see starter-kit documentation on REMS and the links therein.
Within the scope of the starter-kit, it is up to the system administrator to curate incoming uploads to the Sensitive Data Archive. To ease this task, we have created the sda-admin tool, a shell script that performs all the necessary steps for an unencrypted file to end up properly ingested and archived by the SDA stack. The script can be found under scripts/ and can be used to upload and ingest files as well as to assign accession IDs to archived files and link them to a dataset.
In the background it utilizes sda-cli for encrypting and uploading files, and automates generating and sending broker messages between the SDA services. Detailed documentation on its usage, along with examples, can be retrieved by running:

```sh
./sda-admin help
```

Below we provide a step-by-step example of sda-admin usage.
Create a test file:
```sh
dd if=/dev/random of=test_file count=1 bs=$(( 1024 * 1024 * 1 )) iflag=fullblock
```

Fetch the archive's c4gh public key (assuming shell access to the host machine):

```sh
docker cp ingest:/shared/c4gh.pub.pem .
```

To encrypt and upload test_file to the s3inbox, first get a token and prepare a s3cmd configuration file as described in the section Uploading data above. Then run the following:

```sh
./sda-admin --sda-config s3cmd.conf --sda-key c4gh.pub.pem upload test_file
```

One can verify that the encrypted file has been uploaded to the archive's inbox with the following command:

```sh
sda-cli list --config s3cmd.conf
```

To list the filenames currently in the "inbox" queue waiting to be ingested, run:

```sh
./sda-admin ingest
```

If test_file.c4gh is in the returned list, run:

```sh
./sda-admin --user <s3_access_key> ingest test_file
```

to trigger ingestion of the file.
In brief, accession IDs are unique identifiers that are assigned to files in order to be able to reference them in the future. Check that the file has been ingested by listing the filenames currently in the "verified" queue waiting to have accession IDs assigned to them:

```sh
./sda-admin accession
```

If test_file.c4gh is in the returned list, we can proceed with accession:

```sh
./sda-admin accession MYID001 test_file
```

where MYID001 is the accession ID we wish to assign to the file.

Check that the file got an accession ID by listing the filenames currently in the "completed" queue waiting to be associated with a dataset ID:

```sh
./sda-admin dataset
```

Lastly, associate the file with a dataset ID:

```sh
./sda-admin dataset MYSET001 test_file
```

Note that all of the above steps can be done for multiple files at a time, except for assigning accession IDs, which needs to be done one file at a time.
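The whole flow above can be chained for a single file. The sketch below defaults to a dry run that merely prints each sda-admin command (with a hypothetical user ID and the example accession/dataset IDs from above); set SDA_ADMIN=./sda-admin to execute the steps for real:

```sh
# Dry run by default: each sda-admin invocation is echoed instead of executed.
SDA_ADMIN=${SDA_ADMIN:-"echo ./sda-admin"}
FILE=test_file

log=$(
    $SDA_ADMIN --sda-config s3cmd.conf --sda-key c4gh.pub.pem upload "$FILE"
    $SDA_ADMIN --user jd123_lifescience-ri.eu ingest "$FILE"
    $SDA_ADMIN accession MYID001 "$FILE"
    $SDA_ADMIN dataset MYSET001 "$FILE"
)
printf '%s\n' "$log"
```

Remember that in a real run the ingest and accession steps should only be triggered after the file shows up in the corresponding queue, as described above.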
Assuming access to a terminal session on the host machine of the deployed docker compose stack, the status of all running containers can be checked as usual with:

```sh
docker ps
```

All logs from the deployed services can be monitored in real time with:

```sh
docker compose -f docker-compose.yml logs -f
```

or per service with:

```sh
docker compose -f docker-compose.yml logs <container-name> -f
```

Note that, where applicable, periodic healthchecks are in place to ensure that services are running normally. All containers are configured to always restart upon failure.
As stated, we use RabbitMQ as the message broker between the different services in this stack. Monitoring the status of the broker service can most conveniently be done via the web interface, which is accessible at http://localhost:15672/ (use https if TLS is enabled). By default, user:password credentials with values test:test are created upon deployment; these can be changed by editing the docker-compose.yml file. There are two ways to create a password hash for RabbitMQ, as described here.
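One portable way is to compute the hash yourself: per RabbitMQ's password documentation, the stored value is base64(salt + SHA-256(salt + password)) with a 4-byte salt; recent RabbitMQ versions can also produce it directly with rabbitmqctl hash_password <password>. A sketch using a fixed salt so the output is reproducible:

```sh
password=test
# Fixed 4-byte salt 0xcafebabe (written with octal escapes) for a reproducible
# example; in practice use random bytes, e.g. head -c4 /dev/urandom.
salt() { printf '\312\376\272\276'; }

# base64( salt + sha256( salt + password ) )
hash_b64=$( { salt; { salt; printf '%s' "$password"; } | openssl dgst -sha256 -binary; } | base64 )
echo "$hash_b64"
```

The resulting string is what goes into the broker's user definition in place of the default test:test credentials.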
Broker messages are most conveniently generated by scripts/sda-admin as described above. If for some reason one wants to send MQ messages manually instead, there exist step-by-step examples here.