feat(metadata,cli,frontend): ingest JSON-Lines (JSONL) data + synthesize participant_id join key by Mandyx22 · Pull Request #115 · jspsych/metadata

Mandyx22 · 2026-06-17T14:37:26Z

Problem

Several jsPsych labs — and JATOS — export experiment data as JSON-Lines (.jsonl): one JSON value per line, typically one participant's full trial array per line, rather than a single JSON array. Two things broke on these files:

- generate() ran JSON.parse on the whole string, so every .jsonl file failed with "Unexpected non-whitespace character after JSON" and produced no metadata.
- The CLI and frontend filter data files by extension (.json/.csv), so .jsonl files were skipped before they ever reached generate().

Found while running the real raw exports in vucml/online_experiments through the tool — all 15 raw .jsonl files failed at the parse step.
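The parse failure is easy to reproduce outside the tool. The snippet below is a minimal sketch, not code from this PR: `parsesAsSingleDocument` is a hypothetical helper, and the two-line sample only imitates the shape of a jsPsych/JATOS export.

```typescript
// Two participants, one JSON array per line (JSON-Lines). Sample data is made up.
const jsonl: string = [
  '[{"trial_type":"html-keyboard-response","trial_index":0}]',
  '[{"trial_type":"html-keyboard-response","trial_index":0}]',
].join("\n");

// Hypothetical helper: true when `text` is one well-formed JSON document.
function parsesAsSingleDocument(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    // JSON.parse signals malformed input with a SyntaxError.
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// The whole file is not one document, because a second value follows the first...
const wholeFileOk: boolean = parsesAsSingleDocument(jsonl);
// ...but every individual line is valid JSON on its own.
const eachLineOk: boolean = jsonl.split("\n").every(parsesAsSingleDocument);
```

Here `wholeFileOk` is false while `eachLineOk` is true, which is the gap the change below closes.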

Change

- @jspsych/metadata: new exported parseJsonData() helper. A well-formed single document is returned unchanged (no behaviour change for existing single-array callers); only when whole-string parsing fails does it fall back to line-by-line parsing, flattening per-line arrays into one observation stream. Wired into generate().
- @jspsych/metadata-cli: .jsonl is now treated exactly like .json (isDataExt/isJsonDataExt helpers) in the directory reader, join-key pre-pass, filename-normalization pre-pass, raw-original preservation, and CSV conversion.
- frontend: .jsonl uploads are normalised to the JSON path; join-key pre-flight and the Psych-DS file builder use parseJsonData.
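The fallback described for parseJsonData() can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `parseJsonDataSketch` is a hypothetical name, and skipping blank lines and letting a malformed line throw are guesses about behaviour that the PR text does not spell out.

```typescript
// Hypothetical stand-in for the exported helper; the real signature may differ.
function parseJsonDataSketch(text: string): unknown {
  // Fast path: a well-formed single document is returned unchanged.
  try {
    return JSON.parse(text);
  } catch {
    // Whole-string parsing failed; fall through to line-by-line parsing.
  }

  const rows: unknown[] = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // assumption: blank lines are ignored

    // Assumption: a malformed line is an error, so this parse is not guarded.
    const value: unknown = JSON.parse(trimmed);

    // Per-line arrays are flattened into one observation stream;
    // a bare object on a line contributes a single row.
    if (Array.isArray(value)) rows.push(...value);
    else rows.push(value);
  }
  return rows;
}

// Usage: two per-participant arrays become one flat list of three trials.
const flattenedTrials = parseJsonDataSketch(
  '[{"trial_index":0},{"trial_index":1}]\n[{"trial_index":0}]\n'
);
```

Trying the whole-string parse first is what keeps existing single-array callers on their old code path.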

Verification

Full suite: 322/322 pass across all three packages, including new tests in metadata (8), cli (1), and frontend (1). All three packages build/typecheck clean.
Real data: assembled a full Psych-DS dataset for each file and ran psychds-validator — all 15 .jsonl exports now validate with 0 errors (they failed at parse time before).

Also in this PR: synthesize a `participant_id` join key (commit `be70c00`)

Parsing JSONL surfaced a follow-on problem: raw jsPsych exports carry no per-row participant identifier. Once JSONL is flattened (one participant per line), trial_index repeats across participants and can't uniquely key the extracted array/object sidecar CSVs. Every participant's trial 0 collapses onto the same (trial_index, element_index) key, so a sidecar row can't be joined back to a single parent trial.
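A small illustration of the collision, assuming made-up rows and a hypothetical `countCollisions` helper (neither appears in this PR): with trial_index as the only key, rows from different lines share keys, while a per-line identifier in front of it makes every key distinct.

```typescript
interface TrialRow {
  participant_id?: number;
  trial_index: number;
}

// Hypothetical helper: counts rows whose key was already used by an earlier row.
function countCollisions(rows: TrialRow[], keyColumns: (keyof TrialRow)[]): number {
  const seen = new Set<string>();
  let collisions = 0;
  for (const row of rows) {
    const key = JSON.stringify(keyColumns.map((column) => row[column] ?? null));
    if (seen.has(key)) collisions += 1;
    else seen.add(key);
  }
  return collisions;
}

// Two JSONL lines (two participants), two trials each, after flattening.
const untagged: TrialRow[] = [
  { trial_index: 0 }, { trial_index: 1 }, // line 1
  { trial_index: 0 }, { trial_index: 1 }, // line 2
];

// The same rows with a 0-based per-line identifier attached.
const tagged: TrialRow[] = [
  { participant_id: 0, trial_index: 0 }, { participant_id: 0, trial_index: 1 },
  { participant_id: 1, trial_index: 0 }, { participant_id: 1, trial_index: 1 },
];

const collisionsWithoutId = countCollisions(untagged, ["trial_index"]);
const collisionsWithId = countCollisions(tagged, ["participant_id", "trial_index"]);
```

In this toy data `collisionsWithoutId` is 2 and `collisionsWithId` is 0.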

- parseJsonData gains an opt-in { tagParticipantId } flag: in the JSON-Lines path it stamps each line's object rows with a 0-based participant_id (a no-op on the single-array fast path; never overwrites an existing value) and reports via an optional stats out-param whether it actually synthesized one.
- generate() enables this for JSON input and promotes participant_id to the leading join key (['participant_id', 'trial_index']) when rows carry one, so sidecars join unambiguously. CSV inputs are unaffected.
- Honest labelling: only when the id was actually synthesized (absent from the source) does it get an explicit description ("Synthetic participant identifier … NOT a real subject ID from the experiment …"), so a downstream user can't mistake it for a real subject ID. This also avoids serializing an empty {} description (an object with no @type → OBJECT_TYPE_MISSING). A participant_id already present in the data is left untouched.
- The CLI's join-key pre-analysis/prompt and the frontend's pre-flight mirror this promotion, so multi-participant JSONL is no longer falsely flagged as having a non-unique join key.
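The tagging and promotion described above might look roughly like this. It is a sketch only: `tagLines` and `chooseJoinKeys` are hypothetical names, the input is assumed to be already split into per-line row arrays, and only the `synthesizedParticipantId` stat name is taken from this PR's commit messages.

```typescript
type DataRow = Record<string, unknown>;

// Out-param reporting whether any id was actually synthesized.
interface TagStats {
  synthesizedParticipantId: boolean;
}

// Hypothetical sketch: stamp each line's rows with a 0-based participant_id,
// never overwriting a value that is already present in the data.
function tagLines(lines: DataRow[][], stats?: TagStats): DataRow[] {
  const out: DataRow[] = [];
  let synthesized = false;
  lines.forEach((rows, lineIndex) => {
    for (const row of rows) {
      if ("participant_id" in row) {
        out.push(row); // a real id is left untouched
      } else {
        out.push({ participant_id: lineIndex, ...row });
        synthesized = true;
      }
    }
  });
  if (stats) stats.synthesizedParticipantId = synthesized;
  return out;
}

// Hypothetical sketch: promote participant_id to the leading join key
// only when rows actually carry one.
function chooseJoinKeys(rows: DataRow[]): string[] {
  return rows.some((row) => "participant_id" in row)
    ? ["participant_id", "trial_index"]
    : ["trial_index"];
}

// Usage: two lines, one trial each.
const tagStats: TagStats = { synthesizedParticipantId: false };
const taggedRows = tagLines([[{ trial_index: 0 }], [{ trial_index: 0 }]], tagStats);
const joinKeys = chooseJoinKeys(taggedRows);
```

Reporting through the stats object is what lets a caller attach the "synthetic" description only when an id was truly added rather than found in the source.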

Verification (this commit): new tests in the metadata JSONL suite (tagging, promotion, synthetic-vs-real description, single-array no-op). End-to-end on the raw .jsonl exports in githubpsyche/homophily: all three files generate metadata, pass psychds-validator, and write sidecars whose (participant_id, trial_index, element_index) keys are fully unique — e.g. view_history at 385/385 rows (vs. 7 colliding keys without participant_id).

Out of scope (pre-existing, not touched here)

- Non-jsPsych CSVs (e.g. bonuses.csv, sortablerank analysis CSVs) fail VARIABLE_MISSING_FROM_CSV_COLUMNS because JsPsychMetadata seeds trial_type/trial_index/time_elapsed into variableMeasured even when the CSV lacks them.
- OBJECT_TYPE_MISSING / INVALID_SCHEMAORG_PROPERTY warnings on the JSONL datasets (warnings only). Note: the synthesized participant_id no longer contributes to OBJECT_TYPE_MISSING, but other sources remain.

Stacked on fix/unnamed-columns-shared-builder (#114).

🤖 Generated with Claude Code

changeset-bot · 2026-06-17T14:41:02Z

🦋 Changeset detected

Latest commit: 5a905a9

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

| Name | Type |
| --- | --- |
| @jspsych/metadata | Patch |
| @jspsych/metadata-cli | Patch |
| frontend | Patch |


Several jsPsych labs and JATOS export data as newline-delimited JSON, with one JSON value per line (typically one participant's trial array per line), rather than a single JSON array. generate() ran JSON.parse on the whole string, so every such file failed with "Unexpected non-whitespace character after JSON" and produced no metadata. The CLI and frontend also filter data files by extension, so .jsonl files were skipped before reaching generate().

- metadata: new exported parseJsonData() helper accepts both a single JSON document (returned unchanged, so no behaviour change for existing callers) and JSON-Lines, flattening per-line arrays into one observation stream. Wired into generate().
- cli: treat .jsonl as JSON everywhere (isDataExt/isJsonDataExt): directory reader, join-key pre-pass, filename-normalization, and CSV conversion.
- frontend: normalise .jsonl uploads to the JSON path; join-key pre-flight and file builder use parseJsonData.

Verified against the raw .jsonl exports in vucml/online_experiments: all 15 files now generate metadata and pass the Psych-DS validator with zero errors.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…SONL

Multi-participant JSON-Lines exports carry no per-row participant id, so after flattening, trial_index repeats across participants and can't uniquely key the extracted array/object sidecars. parseJsonData now opt-in tags each line's rows with a 0-based participant_id (reporting whether it synthesized one), and generate() promotes it to the leading join key for JSON input so sidecars join unambiguously. When the id is actually synthesized it gets an explicit "not a real subject ID" description (also avoiding an empty {} that trips OBJECT_TYPE_MISSING); a pre-existing participant_id is left untouched. The CLI pre-analysis/prompt and frontend pre-flight mirror the promotion so multi-participant JSONL isn't falsely flagged as non-unique.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…e_record_id

The per-line join key synthesized for JSON-Lines input was named participant_id, which overclaims: a JSONL line is only guaranteed to be one source record, not necessarily one participant. Rename it to source_record_id across the library, CLI, and frontend.

- parseJsonData: option tagParticipantId -> tagSourceRecordId, stat synthesizedParticipantId -> synthesizedSourceRecordId; stamps source_record_id. Synthesis now defers to a real participant_id (or an existing source_record_id) already in the data, so a genuine subject id is never duplicated or mislabeled.
- generate(): promotes the identifier as the leading join key, preferring the synthesized source_record_id and falling back to a real participant_id. The synthetic-origin description now describes a "source record" (usually but not always one participant).
- CLI: emits a one-line info log when it adds the column ("Detected JSON-Lines input; added synthetic source_record_id ..."), surfaced via a new optional out-param on preAnalyzeDirectory (no extra parse pass, return contract unchanged). data.ts/index.ts pre-analysis mirror the new id.
- frontend: pre-flight + builder use source_record_id.
- Tests + changeset updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
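The precedence rules in this commit message can be read as two small decisions. The sketch below is one interpretation, not the merged code: `shouldSynthesizeSourceRecordId` and `leadingJoinKeys` are hypothetical names, and checking at the level of the whole row set (rather than per row) is an assumption.

```typescript
type RecordRow = Record<string, unknown>;

// Hypothetical: synthesis defers to a real participant_id or an existing
// source_record_id already present in the data.
function shouldSynthesizeSourceRecordId(rows: RecordRow[]): boolean {
  return !rows.some((row) => "participant_id" in row || "source_record_id" in row);
}

// Hypothetical: prefer source_record_id as the leading join key,
// fall back to a real participant_id, else keep trial_index alone.
function leadingJoinKeys(rows: RecordRow[]): string[] {
  if (rows.some((row) => "source_record_id" in row)) {
    return ["source_record_id", "trial_index"];
  }
  if (rows.some((row) => "participant_id" in row)) {
    return ["participant_id", "trial_index"];
  }
  return ["trial_index"];
}

// Usage: rows that already carry a genuine subject id are not re-tagged,
// and that id leads the join key.
const rowsWithRealId: RecordRow[] = [{ participant_id: "sub-01", trial_index: 0 }];
const needsSynthesis = shouldSynthesizeSourceRecordId(rowsWithRealId);
const keysForRealId = leadingJoinKeys(rowsWithRealId);
```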

jodeleeuw · 2026-06-18T22:16:38Z

Rebased onto current main (now includes #103, #110, #105, #102, #111, #114). Conflicts resolved in cli/src/index.ts (combined #111's canPrompt headless join-key gating with this PR's source_record_id promotion — both branches now resolve off preResult.keys) and the @jspsych/metadata import lines in cli/src/index.ts and frontend/DataUpload.tsx (unioned).

Note: the "Out of scope" bullet about VARIABLE_MISSING_FROM_CSV_COLUMNS from JsPsychMetadata seeding trial_type/trial_index/time_elapsed is now stale — #110 (merged) made system-variable registration lazy, so those columns are only declared when present.

Local: metadata 181/181, frontend 44/44, cli 135/136 (the one failure is the pre-existing node-pty pty e2e that needs a native build; the two new headless tests pass). CI green.

…changes

The DataUpload component gained a parseJsonData-based join-key preflight (#115) and a buildPsychDSDataFiles conversion step (#103/#114) since these tests were written. Expand the @jspsych/metadata mock (parseJsonData, parseCSV, buildPsychDSDataFiles, deriveFallbackBase, isValidPsychDSDataFilename, PSYCHDS_IGNORE_*) and the metadata stub (getExtractedArrays/Objects, getArrayJoinKeys) so the join-key chooser and session-sync paths exercise correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(frontend): add unit tests for remaining page/UI components

  Covers DataUpload, Variables, Authors, JsonViewer, PageHeader, PreviewDrawer, Sidebar, ProjectInfo, and AppShell (201 tests total). Closes #9.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(frontend): update DataUpload tests for merged preflight/builder changes

  The DataUpload component gained a parseJsonData-based join-key preflight (#115) and a buildPsychDSDataFiles conversion step (#103/#114) since these tests were written. Expand the @jspsych/metadata mock (parseJsonData, parseCSV, buildPsychDSDataFiles, deriveFallbackBase, isValidPsychDSDataFilename, PSYCHDS_IGNORE_*) and the metadata stub (getExtractedArrays/Objects, getArrayJoinKeys) so the join-key chooser and session-sync paths exercise correctly.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Josh de Leeuw <josh.deleeuw@gmail.com>

Mandyx22 changed the title ~~feat(metadata,cli,frontend): ingest JSON-Lines (JSONL) experiment data~~ feat(metadata,cli,frontend): ingest JSON-Lines (JSONL) data + synthesize participant_id join key Jun 17, 2026

jodeleeuw force-pushed the fix/unnamed-columns-shared-builder branch from 9770ac0 to 840dc2b Compare June 18, 2026 12:37

Mandyx22 and others added 3 commits June 18, 2026 10:43

jodeleeuw force-pushed the feat/jsonl-ingestion branch from b81483d to 5a905a9 Compare June 18, 2026 22:01

jodeleeuw changed the base branch from fix/unnamed-columns-shared-builder to main June 18, 2026 22:03

jodeleeuw closed this Jun 18, 2026

jodeleeuw reopened this Jun 18, 2026

jodeleeuw merged commit 3c7d1f7 into main Jun 18, 2026
2 checks passed

jodeleeuw merged 3 commits into main from feat/jsonl-ingestion
