
feat(metadata,cli,frontend): ingest JSON-Lines (JSONL) data + synthesize participant_id join key #115

Merged
jodeleeuw merged 3 commits into main from feat/jsonl-ingestion
Jun 18, 2026


@Mandyx22

@Mandyx22 Mandyx22 commented Jun 17, 2026

Contributor

Problem

Several jsPsych labs — and JATOS — export experiment data as JSON-Lines (.jsonl): one JSON value per line, typically one participant's full trial array per line, rather than a single JSON array. Two things broke on these files:

  1. generate() ran JSON.parse on the whole string, so every .jsonl file failed with Unexpected non-whitespace character after JSON and produced no metadata.
  2. The CLI and frontend filter data files by extension (.json/.csv), so .jsonl files were skipped before they ever reached generate().
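Both failure modes can be reproduced in a few lines. This is a minimal sketch, not code from the repository: the sample data and the names `parsesAsSingleDocument`, `OLD_DATA_EXTS`, and `passesOldFilter` are assumptions made for illustration, and the exact error message varies by JavaScript engine.

```typescript
// Two participants, one trial array per line (JSON-Lines).
const jsonl = '[{"trial_index":0},{"trial_index":1}]\n[{"trial_index":0}]\n';

// Failure 1: whole-string parsing sees the second line as trailing content
// after the first JSON value and throws a SyntaxError.
function parsesAsSingleDocument(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Failure 2: a hypothetical extension filter that only knows .json and .csv.
const OLD_DATA_EXTS = [".json", ".csv"];

function passesOldFilter(filename: string): boolean {
  const lower = filename.toLowerCase();
  return OLD_DATA_EXTS.some((ext) => lower.endsWith(ext));
}

console.log(parsesAsSingleDocument(jsonl)); // false
console.log(passesOldFilter("session.json")); // true
console.log(passesOldFilter("session.jsonl")); // false
```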

Found while running the real raw exports in vucml/online_experiments through the tool — all 15 raw .jsonl files failed at the parse step.

Change

  • @jspsych/metadata — new exported parseJsonData() helper. A well-formed single document is returned unchanged (no behaviour change for existing single-array callers); only when whole-string parsing fails does it fall back to line-by-line parsing, flattening per-line arrays into one observation stream. Wired into generate().
  • @jspsych/metadata-cli: .jsonl is now treated exactly like .json (isDataExt/isJsonDataExt helpers): directory reader, join-key pre-pass, filename-normalization pre-pass, raw-original preservation, and CSV conversion.
  • frontend: .jsonl uploads are normalised to the JSON path; join-key pre-flight and the Psych-DS file builder use parseJsonData.
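The parse-then-fall-back behaviour described for parseJsonData() could look roughly like the sketch below. It is an approximation under stated assumptions, not the library's implementation: the name `parseJsonDataSketch`, the `JsonValue` type, the handling of blank lines, and rethrowing the original error for input with no non-empty lines are all choices made here for illustration.

```typescript
type JsonValue =
  | null
  | boolean
  | number
  | string
  | JsonValue[]
  | { [key: string]: JsonValue };

function parseJsonDataSketch(text: string): JsonValue {
  try {
    // Fast path: a well-formed single document is returned unchanged.
    return JSON.parse(text) as JsonValue;
  } catch (wholeDocError) {
    // Fallback: treat the input as JSON-Lines, one JSON value per line.
    const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
    // Assumption: input with no content at all keeps the original error.
    if (lines.length === 0) throw wholeDocError;

    const rows: JsonValue[] = [];
    for (const line of lines) {
      // Throws if any individual line is not valid JSON.
      const value = JSON.parse(line) as JsonValue;
      if (Array.isArray(value)) {
        rows.push(...value); // flatten a per-line trial array
      } else {
        rows.push(value); // keep a per-line scalar or object as one row
      }
    }
    return rows;
  }
}
```

Trying the whole-document parse first is what keeps existing single-array callers unaffected: the line-by-line path only runs for input that would previously have thrown.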

Verification

  • Full suite: 322/322 pass across all three packages, including new tests in metadata (8), cli (1), and frontend (1). All three packages build/typecheck clean.
  • Real data: assembled a full Psych-DS dataset for each file and ran psychds-validator: all 15 .jsonl exports now validate with 0 errors (they failed at parse time before).

Also in this PR: synthesize a participant_id join key (commit be70c00)

Parsing JSONL surfaced a follow-on problem: raw jsPsych exports carry no per-row participant identifier. Once JSONL is flattened (one participant per line), trial_index repeats across participants and can't uniquely key the extracted array/object sidecar CSVs. Every participant's trial 0 collapses onto the same (trial_index, element_index) key, so a sidecar row can't be joined back to a single parent trial.

  • parseJsonData gains an opt-in { tagParticipantId } flag: in the JSON-Lines path it stamps each line's object rows with a 0-based participant_id (a no-op on the single-array fast path; never overwrites an existing value) and reports via an optional stats out-param whether it actually synthesized one.
  • generate() enables this for JSON input and promotes participant_id to the leading join key (['participant_id', 'trial_index']) when rows carry one, so sidecars join unambiguously. CSV inputs are unaffected.
  • Honest labelling: only when the id was actually synthesized (absent from the source) does it get an explicit description ("Synthetic participant identifier … NOT a real subject ID from the experiment …"), so a downstream user can't mistake it for a real subject ID. This also avoids serializing an empty {} description (an object with no @type trips OBJECT_TYPE_MISSING). A participant_id already present in the data is left untouched.
  • The CLI's join-key pre-analysis/prompt and the frontend's pre-flight mirror this promotion, so multi-participant JSONL is no longer falsely flagged as having a non-unique join key.
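The tagging and promotion steps above can be sketched as follows. This is an illustration under assumptions, not the library code: `tagParticipantIds`, `chooseJoinKeys`, and the `TagStats` shape are hypothetical, the fallback key `['trial_index']` is assumed, and non-object values are simply skipped here for brevity. It uses the participant_id name from this description; a later commit in this PR renames the synthesized column to source_record_id.

```typescript
type Row = Record<string, unknown>;

// Hypothetical shape for the optional stats out-param.
interface TagStats {
  synthesizedParticipantId: boolean;
}

// Stamp each line's object rows with that line's 0-based index,
// never overwriting a participant_id that is already present.
function tagParticipantIds(lines: string[], stats?: TagStats): Row[] {
  const rows: Row[] = [];
  lines.forEach((line, lineIndex) => {
    const value: unknown = JSON.parse(line);
    const items: unknown[] = Array.isArray(value) ? value : [value];
    for (const item of items) {
      // Assumption for brevity: only plain object rows are kept.
      if (item === null || typeof item !== "object" || Array.isArray(item)) continue;
      const row = item as Row;
      if (!("participant_id" in row)) {
        row.participant_id = lineIndex;
        if (stats) stats.synthesizedParticipantId = true;
      }
      rows.push(row);
    }
  });
  return rows;
}

// Promote participant_id to the leading join key when rows carry one.
function chooseJoinKeys(rows: Row[]): string[] {
  const hasId = rows.some((row) => "participant_id" in row);
  return hasId ? ["participant_id", "trial_index"] : ["trial_index"];
}
```

Reporting synthesis through the stats object is what lets a caller attach the "synthetic" description only when the column was actually added, rather than whenever a participant_id column exists.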

Verification (this commit): new tests in the metadata JSONL suite (tagging, promotion, synthetic-vs-real description, single-array no-op). End-to-end on the raw .jsonl exports in githubpsyche/homophily: all three files generate metadata, pass psychds-validator, and write sidecars whose (participant_id, trial_index, element_index) keys are fully unique — e.g. view_history at 385/385 rows (vs. 7 colliding keys without participant_id).
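A uniqueness check of the kind reported above (distinct keys versus total rows) can be sketched like this. The helper `countDistinctKeys` and the four sample rows are made up for illustration; they are not the homophily data or the project's test code.

```typescript
type SidecarRow = Record<string, unknown>;

// Count distinct composite keys; the key set is unique when the
// result equals rows.length.
function countDistinctKeys(rows: SidecarRow[], keys: string[]): number {
  const seen = new Set<string>();
  for (const row of rows) {
    // JSON.stringify keeps 1 and "1" distinct and avoids separator clashes.
    seen.add(JSON.stringify(keys.map((key) => row[key] ?? null)));
  }
  return seen.size;
}

// Made-up sidecar rows: two participants, each with a 2-element array at trial 0.
const sidecar: SidecarRow[] = [
  { participant_id: 0, trial_index: 0, element_index: 0, value: "a" },
  { participant_id: 0, trial_index: 0, element_index: 1, value: "b" },
  { participant_id: 1, trial_index: 0, element_index: 0, value: "c" },
  { participant_id: 1, trial_index: 0, element_index: 1, value: "d" },
];

const withoutId = countDistinctKeys(sidecar, ["trial_index", "element_index"]);
const withId = countDistinctKeys(sidecar, [
  "participant_id",
  "trial_index",
  "element_index",
]);

console.log(`${withoutId}/${sidecar.length} unique without participant_id`); // 2/4
console.log(`${withId}/${sidecar.length} unique with participant_id`); // 4/4
```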

Out of scope (pre-existing, not touched here)

  • Non-jsPsych CSVs (e.g. bonuses.csv, sortablerank analysis CSVs) fail VARIABLE_MISSING_FROM_CSV_COLUMNS because JsPsychMetadata seeds trial_type/trial_index/time_elapsed into variableMeasured even when the CSV lacks them.
  • OBJECT_TYPE_MISSING / INVALID_SCHEMAORG_PROPERTY warnings on the JSONL datasets (warnings only). Note: the synthesized participant_id no longer contributes to OBJECT_TYPE_MISSING, but other sources remain.

Stacked on fix/unnamed-columns-shared-builder (#114).

🤖 Generated with Claude Code

@changeset-bot

changeset-bot Bot commented Jun 17, 2026


🦋 Changeset detected

Latest commit: 5a905a9

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

| Name | Type |
| --- | --- |
| @jspsych/metadata | Patch |
| @jspsych/metadata-cli | Patch |
| frontend | Patch |


@Mandyx22 Mandyx22 changed the title feat(metadata,cli,frontend): ingest JSON-Lines (JSONL) experiment data feat(metadata,cli,frontend): ingest JSON-Lines (JSONL) data + synthesize participant_id join key Jun 17, 2026
@jodeleeuw jodeleeuw force-pushed the fix/unnamed-columns-shared-builder branch from 9770ac0 to 840dc2b Compare June 18, 2026 12:37
Mandyx22 and others added 3 commits June 18, 2026 10:43
Several jsPsych labs and JATOS export data as newline-delimited JSON — one
JSON value per line (typically one participant's trial array per line) —
rather than a single JSON array. generate() ran JSON.parse on the whole
string, so every such file failed with "Unexpected non-whitespace character
after JSON" and produced no metadata. The CLI and frontend also filter data
files by extension, so .jsonl files were skipped before reaching generate().

- metadata: new exported parseJsonData() helper accepts both a single JSON
  document (returned unchanged — no behaviour change for existing callers)
  and JSON-Lines, flattening per-line arrays into one observation stream.
  Wired into generate().
- cli: treat .jsonl as JSON everywhere (isDataExt/isJsonDataExt) — directory
  reader, join-key pre-pass, filename-normalization, and CSV conversion.
- frontend: normalise .jsonl uploads to the JSON path; join-key pre-flight
  and file builder use parseJsonData.

Verified against the raw .jsonl exports in vucml/online_experiments: all 15
files now generate metadata and pass the Psych-DS validator with zero errors.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…SONL

Multi-participant JSON-Lines exports carry no per-row participant id, so
after flattening, trial_index repeats across participants and can't uniquely
key the extracted array/object sidecars. parseJsonData now opt-in tags each
line's rows with a 0-based participant_id (reporting whether it synthesized
one), and generate() promotes it to the leading join key for JSON input so
sidecars join unambiguously. When the id is actually synthesized it gets an
explicit "not a real subject ID" description (also avoiding an empty {} that
trips OBJECT_TYPE_MISSING); a pre-existing participant_id is left untouched.
The CLI pre-analysis/prompt and frontend pre-flight mirror the promotion so
multi-participant JSONL isn't falsely flagged as non-unique.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e_record_id

The per-line join key synthesized for JSON-Lines input was named
participant_id, which overclaims: a JSONL line is only guaranteed to be one
source record, not necessarily one participant. Rename it to source_record_id
across the library, CLI, and frontend.

- parseJsonData: option tagParticipantId -> tagSourceRecordId, stat
  synthesizedParticipantId -> synthesizedSourceRecordId; stamps source_record_id.
  Synthesis now defers to a real participant_id (or an existing source_record_id)
  already in the data, so a genuine subject id is never duplicated or mislabeled.
- generate(): promotes the identifier as the leading join key, preferring the
  synthesized source_record_id and falling back to a real participant_id. The
  synthetic-origin description now describes a "source record" (usually but not
  always one participant).
- CLI: emits a one-line info log when it adds the column ("Detected JSON-Lines
  input; added synthetic source_record_id ..."), surfaced via a new optional
  out-param on preAnalyzeDirectory (no extra parse pass, return contract
  unchanged). data.ts/index.ts pre-analysis mirror the new id.
- frontend: pre-flight + builder use source_record_id.
- Tests + changeset updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jodeleeuw jodeleeuw force-pushed the feat/jsonl-ingestion branch from b81483d to 5a905a9 Compare June 18, 2026 22:01
@jodeleeuw jodeleeuw changed the base branch from fix/unnamed-columns-shared-builder to main June 18, 2026 22:03
@jodeleeuw jodeleeuw closed this Jun 18, 2026
@jodeleeuw jodeleeuw reopened this Jun 18, 2026
@jodeleeuw

Member

Rebased onto current main (now includes #103, #110, #105, #102, #111, #114). Conflicts resolved in cli/src/index.ts (combined #111's canPrompt headless join-key gating with this PR's source_record_id promotion — both branches now resolve off preResult.keys) and the @jspsych/metadata import lines in cli/src/index.ts and frontend/DataUpload.tsx (unioned).

Note: the "Out of scope" bullet about VARIABLE_MISSING_FROM_CSV_COLUMNS from JsPsychMetadata seeding trial_type/trial_index/time_elapsed is now stale: #110 (merged) made system-variable registration lazy, so those columns are only declared when present.

Local: metadata 181/181, frontend 44/44, cli 135/136 (the one failure is the pre-existing node-pty pty e2e that needs a native build; the two new headless tests pass). CI green.

@jodeleeuw jodeleeuw merged commit 3c7d1f7 into main Jun 18, 2026
2 checks passed
jodeleeuw added a commit that referenced this pull request Jun 18, 2026
…changes

The DataUpload component gained a parseJsonData-based join-key preflight (#115) and a
buildPsychDSDataFiles conversion step (#103/#114) since these tests were written. Expand
the @jspsych/metadata mock (parseJsonData, parseCSV, buildPsychDSDataFiles, deriveFallbackBase,
isValidPsychDSDataFilename, PSYCHDS_IGNORE_*) and the metadata stub (getExtractedArrays/Objects,
getArrayJoinKeys) so the join-key chooser and session-sync paths exercise correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jodeleeuw added a commit that referenced this pull request Jun 18, 2026
* test(frontend): add unit tests for remaining page/UI components

Covers DataUpload, Variables, Authors, JsonViewer, PageHeader,
PreviewDrawer, Sidebar, ProjectInfo, and AppShell (201 tests total).
Closes #9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(frontend): update DataUpload tests for merged preflight/builder changes

The DataUpload component gained a parseJsonData-based join-key preflight (#115) and a
buildPsychDSDataFiles conversion step (#103/#114) since these tests were written. Expand
the @jspsych/metadata mock (parseJsonData, parseCSV, buildPsychDSDataFiles, deriveFallbackBase,
isValidPsychDSDataFilename, PSYCHDS_IGNORE_*) and the metadata stub (getExtractedArrays/Objects,
getArrayJoinKeys) so the join-key chooser and session-sync paths exercise correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Josh de Leeuw <josh.deleeuw@gmail.com>