
Ethica Telemetry

Jefferson Smith edited this page Jun 22, 2021 · 7 revisions

This file was copied from the SenseDoc page, to serve as a template for the Ethica version. I'll remove this note once it has been adapted to Ethica.

Ethica is a research-grade smartphone app, made by Ethica Data, and used for mobility (GPS) and physical activity (accelerometer) tracking. These data are collected in a 1-in-5 duty cycle (1 minute active, 4 minutes idle) and allow us to measure location-based physical activity and infer transportation mode.

The data fields collected by the Ethica app can be found in INTERACT's Data Dictionary.

Collection/Extraction

Data are uploaded from the individual phones multiple times per day and cached on Ethica Data's servers until the end of the study.

Migration

At the close of a study, the cached data is exported by Ethica Data into zipped CSV files and downloaded to Compute Canada servers by our data manager. The zip files are then validated (command: unzip -t ZIPFILE) to ensure that the contents of the downloaded files still conform to the checksum computed when the files were initially created.
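This integrity check can also be scripted; below is a minimal Python sketch of the same CRC verification that unzip -t performs (the download folder location is an assumption):

```python
import zipfile
from pathlib import Path

def validate_zip(path):
    """Return True if every member of the archive passes its CRC check,
    mirroring what `unzip -t` reports."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first corrupt member, or None if all pass
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False

# Example: validate every archive in the current folder (actual path is an assumption)
for zip_path in sorted(Path(".").glob("*.zip")):
    print(f"{zip_path.name}: {'OK' if validate_zip(zip_path) else 'FAILED'}")
```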

Important Note: Once we notify Ethica Data that the files have been received and verified, they purge all the associated data tables from their systems. For this reason, we take two additional steps before giving such notification:

  1. Wait at least 48 hours after receiving and validating the files, to ensure that the new files have been captured by Compute Canada's nightly backup system twice.
  2. Conduct a random secondary verification of the downloaded zip files by unzipping a few of them and viewing the contained CSV files to ensure that they are indeed CSV files and appear to contain credible-looking data. (In the case of Wave 2, the zip files correctly contained CSV files, but those files were erroneously given .zip extensions. This suggests that Ethica Data's export process is ad hoc, so we should be certain we have real data before it is purged from their systems.)
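The spot check in step 2 can likewise be partially automated; the sketch below opens each member of an archive and confirms that the first few rows parse as comma-separated data with a consistent column count (the sample of archives to inspect would be chosen at run time):

```python
import csv
import io
import zipfile

def spot_check(zip_path, max_rows=5):
    """Return True if every member of the archive looks like a plausible CSV:
    at least one row, and a consistent column count across the sampled rows."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                text = io.TextIOWrapper(member, encoding="utf-8")
                # Read at most max_rows rows from the member
                rows = [row for _, row in zip(range(max_rows), csv.reader(text))]
                widths = {len(row) for row in rows if row}
                if not rows or len(widths) != 1:
                    return False
    return True
```

In practice this would be run on a small random sample of the downloaded archives rather than all of them, matching the "random secondary verification" described above.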

With SenseDoc data, the files are uploaded by project coordinators and often do not conform to the normalized naming conventions, so they are staged in the /def-dfuller/interact/incoming_data folder. Later they are copied to the /def-dfuller/interact/permanent_archive folder, normalized, and filtered by the data manager, as the opening step of the ingest cycle. But since Ethica data is migrated directly by the data manager, no such staging area is required. Ethica files are downloaded directly to the permanent_archive area.

After the zip files have been validated, they are each given a "provenance sidecar." This sidecar file then moves around the system along with the data file itself and can be used at any time to verify that the file has not been altered since it was first collected. Additionally, if a change is made to the data file, the sidecar can be updated with a new checksum, as well as an explanation of what changed. These sidecars are handled by our ProvLog system, and provide provenance tracking for the lifecycle of all the data files that are part of our pipeline.
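ProvLog is an internal tool, so the following is only an illustration of the sidecar idea, not ProvLog's actual format; a sketch using a hypothetical JSON sidecar:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path):
    """Stream the file through SHA-256 so large telemetry files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_sidecar(path):
    """Create a provenance sidecar next to the data file (format is hypothetical)."""
    sidecar = Path(str(path) + ".prov.json")
    record = {"file": Path(path).name, "sha256": sha256_of(path), "history": []}
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar

def verify(path):
    """Return True if the file still matches the checksum recorded in its sidecar."""
    record = json.loads(Path(str(path) + ".prov.json").read_text())
    return record["sha256"] == sha256_of(path)
```

If a legitimate change is made to a data file, the sidecar's checksum would be recomputed and an entry appended to its history, which is the behaviour the ProvLog system provides.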

Manual Ingest Prep

Note: Due to the way they are managed and distributed, SenseDoc devices capture data from our project coordinators as well as from the participant they are assigned to, so a wear date window was tracked to allow us to filter out the telemetry not contributed by the participants themselves. But since Ethica data comes directly from the user's phone, there is no need to filter out extraneous contributors, so there are no wear dates associated with Ethica participants and no pre-filtering done at ingest time.

Before data is actually ingested, a number of verification and normalization steps are conducted to ensure that the ingest can proceed successfully.

  1. The raw data files are downloaded directly to the permanent archive area
    • The folder path should be: /projects/def-dfuller/interact/permanent_archive/{CITYNAME}/Wave{WAVENUM}/Ethica/raw
  2. Some cities are organized with multiple studies (e.g. Montreal has one English study and one French study, conducted simultaneously), so the files within the raw folder are organized by study number and data table name
    • Files are named {STUDYID}_{TABLENAME}.zip
  3. The normalized permanent_archive files are then added to our ProvLog system, which scans every night to ensure that all files always match the checksums they were uploaded with and have not been deleted or altered on disk during the course of working with them
    • An entire directory can be added to the monitoring system with the command: provlog -T {ROOTDIR}
    • If any changes are ever detected in any logged files on disk, a message is sent to the data manager, who investigates and either restores the data files from backup, or updates the ProvLog record to explain the change, thus ensuring a complete manifest of data changes is attached to each contributing file
  4. The participant metadata (which usually encompasses both SenseDoc and Ethica users) is also placed into the permanent_archive folder
    • The file path should be: /projects/def-dfuller/interact/permanent_archive/{CITYNAME}/Wave{WAVENUM}/linkage.csv
    • It's called this because it provides the linkage between the participants and the devices they were issued
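The layout rules in steps 1–4 can be checked mechanically before ingest; a sketch follows (the assumption that study IDs are numeric, and the exact regex, are illustrative):

```python
import re
from pathlib import Path

# Expected file name pattern inside the raw folder: {STUDYID}_{TABLENAME}.zip
RAW_NAME = re.compile(r"^(?P<study>\d+)_(?P<table>[A-Za-z0-9_]+)\.zip$")

def check_wave_layout(archive_root, city, wave):
    """Return a list of problems found in one city/wave's Ethica folder."""
    problems = []
    wave_dir = Path(archive_root) / city / f"Wave{wave}" / "Ethica" / "raw"
    if not wave_dir.is_dir():
        problems.append(f"missing raw folder: {wave_dir}")
        return problems
    for entry in sorted(wave_dir.iterdir()):
        if not RAW_NAME.match(entry.name):
            problems.append(f"unexpected file name: {entry.name}")
    # The participant linkage file sits beside the Ethica folder
    linkage = Path(archive_root) / city / f"Wave{wave}" / "linkage.csv"
    if not linkage.is_file():
        problems.append(f"missing linkage file: {linkage}")
    return problems
```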


Guided Ingest

Once the data files are all in the correct place, with the expected names, the data manager can launch the guided process to complete the ingest. The first step is to set up the Jupyter Notebook that will govern the process.

Setting up the Jupyter Notebook

  1. Getting Jupyter Notebooks set up properly is outside the scope of this document, but we have explored two different configurations to date:
    • Running Jupyter Lab directly on the Compute Canada cluster
      • Faster run-times
      • Harder to set up
      • Prone to frequent delays that can impede efficient workflow
    • Running Jupyter Lab on a local machine, with remote SSH mounts to the CC file system and PostgreSQL instance
      • Easier to set up
      • More responsive workflow
      • Slower execution of data-heavy operations
  2. Set up your environment variables
    • $SQL_LOCAL_SERVER and $SQL_LOCAL_PORT will depend on which configuration you chose above for Jupyter Lab
    • $SQL_USER should be set to your Compute Canada userid
    • $INGEST_CITY should be the integer code for the city you'll be ingesting (Victoria=1, Vancouver=2, Saskatoon=3, Montreal=4)
    • $INGEST_WAVE should be the integer wave number that you'll be ingesting
  3. Once everything is configured, launch Jupyter Lab and open a copy of Ingest-SenseDoc-Wave2-Protocol.ipynb

The guided ingest is an iterative process: work your way down the series of code blocks in the document, executing each one until it runs cleanly, then moving on to the next block. Note that every block of executable code is followed by an "after running the block" section that explains what you should see in the output of the previous block, and what to do if problems are reported.

Block 1: Parameters

The whole point of normalizing the filenames, paths, and data bundles was to allow the same code to be used each time, regardless of which wave or city is being ingested. This block is where those values are initialized from the environment variables, and a few other frequently used variables are set up.

As a rule, you will not have to change anything here, but there may be special cases. In particular, there may be cases where the file structures do not conform precisely to the standard laid out above, so you might have to tweak file paths here.

Important: Never, under any circumstances, code passwords or userids directly into this notebook. Remember that this document is hosted publicly on GitHub. Sharing security codes in this way would be a breach of our privacy protocol.

Run the block and then read the note that follows. Confirm that everything ran as expected before moving on.

  1. Edit the parameter assignments in the first code block of the notebook to set the wave_id and city_id being ingested
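Assuming the environment variables described above, the parameter block might resolve its values along these lines (the helper name and the returned fields are assumptions for illustration, not the notebook's actual code):

```python
import os

# Integer city codes, as documented in the environment variable setup
CITY_NAMES = {1: "victoria", 2: "vancouver", 3: "saskatoon", 4: "montreal"}

def load_parameters(env=os.environ):
    """Resolve ingest parameters from environment variables, so that no
    credentials or hardcoded targets ever appear in the notebook itself."""
    city_id = int(env["INGEST_CITY"])   # Victoria=1, Vancouver=2, Saskatoon=3, Montreal=4
    wave_id = int(env["INGEST_WAVE"])
    return {
        "city_id": city_id,
        "wave_id": wave_id,
        "sql_server": env.get("SQL_LOCAL_SERVER", "localhost"),
        "sql_port": int(env.get("SQL_LOCAL_PORT", "5432")),
        "sql_user": env["SQL_USER"],
        # Frequently used path, derived from the parameters (exact casing is an assumption)
        "archive_root": f"/projects/def-dfuller/interact/permanent_archive/"
                        f"{CITY_NAMES[city_id]}/Wave{wave_id}",
    }
```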

The next few blocks of the notebook will conduct some additional analyses to find gaps and conflicts in the data so they can be fixed prior to ingest.

  • All expected files are confirmed to be present and named correctly
  • The incoming linkage data is confirmed to be well-formed
  • Each expected participant has corresponding telemetry data in the permanent_archive folder
  • All telemetry data found in the permanent_archive folder corresponds to a known participant

Any problems found by these tests are reported to the data manager, who then consults with the regional coordinator to fix the discrepancies. The most common problem is a user in the linkage table who produced no data in the telemetry folders. These are usually cases where a coordinator created a dummy account for testing; since no data was ever collected under that account, there is no telemetry to go with it. In these cases, the user record must be marked by putting the word 'ignore' in the data_disposition field of the linkage table, which instructs the ingest system to skip that user record entirely.
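The reconciliation between the linkage table and the telemetry folders can be sketched as follows; column names other than data_disposition are assumptions:

```python
import csv

def reconcile(linkage_path, telemetry_user_ids):
    """Compare linkage records against the user ids seen in the telemetry files,
    honouring the 'ignore' disposition used for dummy/test accounts.
    Returns (users with no telemetry, telemetry with no known user)."""
    missing_telemetry, unknown_users = [], set(telemetry_user_ids)
    with open(linkage_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("data_disposition", "").strip() == "ignore":
                continue  # dummy/test account: skip entirely
            uid = row["user_id"]
            if uid in unknown_users:
                unknown_users.discard(uid)
            else:
                missing_telemetry.append(uid)
    return missing_telemetry, sorted(unknown_users)
```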

Once the validation block of the Jupyter Notebook passes cleanly, reporting no unexpected conditions in the data, the actual ingest blocks can be run.

Ingest

First, the linkage data will be loaded into the DB table (portal_dev.sensedoc_assignments). This is a straightforward process of...

Then the notebook will proceed to loading the telemetry files.

Loading telemetry files is a bit more complicated...

  • The last few sections perform the actual ingest
    • In the first pass, the raw telemetry files are loaded into a temporary DB table
    • In the next block, that temporary table is cross-linked with the proper IID, based on the mapping from the device id found in the linkage table
    • Finally, the cross-linked telemetry data is added to the final telemetry tables (sd_gps, sd_accel, '''and others?''')
  • Once the ingest has completed successfully, a few housekeeping tasks are required:
    • Delete the temporary tables '''(called?)'''
    • Export the Jupyter Notebook as a PDF, which provides a complete record of the ingest process as it happened.
    • If any substantive code was changed in the notebook (aside from setting parameters), clear all the output blocks, save the notebook, and commit the changes to the git repo, describing what improvements or corrections were made to the code
  • Congratulations, you have now completed an ingest cycle.
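The cross-linking pass described above can be illustrated with a small sketch; the field names device_id and iid follow the linkage-table description, but the exact schema is an assumption:

```python
def crosslink(telemetry_rows, linkage_rows):
    """Attach the participant's interact id (iid) to each telemetry row,
    using the device assignment recorded in the linkage table.
    Rows with no matching assignment are returned separately as orphans."""
    device_to_iid = {row["device_id"]: row["iid"] for row in linkage_rows}
    linked, orphans = [], []
    for row in telemetry_rows:
        iid = device_to_iid.get(row["device_id"])
        if iid is None:
            orphans.append(row)                 # telemetry with no known participant
        else:
            linked.append({**row, "iid": iid})  # annotated copy for the final tables
    return linked, orphans
```

Orphan rows would surface the same class of discrepancy caught in the earlier validation step, and would be referred back to the data manager rather than ingested.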
