feat(metadata,cli,frontend): ingest JSON-Lines (JSONL) data + synthesize participant_id join key#115
🦋 Changeset detected. Latest commit: 5a905a9. The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages.
9770ac0 to
840dc2b
Several jsPsych labs and JATOS export data as newline-delimited JSON (one JSON value per line, typically one participant's trial array per line) rather than a single JSON array. generate() ran JSON.parse on the whole string, so every such file failed with "Unexpected non-whitespace character after JSON" and produced no metadata. The CLI and frontend also filter data files by extension, so .jsonl files were skipped before reaching generate().
- metadata: new exported parseJsonData() helper accepts both a single JSON document (returned unchanged, so no behaviour change for existing callers) and JSON-Lines, flattening per-line arrays into one observation stream. Wired into generate().
- cli: treat .jsonl as JSON everywhere (isDataExt/isJsonDataExt): directory reader, join-key pre-pass, filename-normalization, and CSV conversion.
- frontend: normalise .jsonl uploads to the JSON path; join-key pre-flight and file builder use parseJsonData.
Verified against the raw .jsonl exports in vucml/online_experiments: all 15 files now generate metadata and pass the Psych-DS validator with zero errors.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
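The fallback described in this commit message (try the whole string first, parse line by line only when that fails) can be sketched roughly as follows. This is an illustration, not the actual parseJsonData() from @jspsych/metadata: the function name parseJsonOrJsonLines and its return shape are assumptions made for this example.

```typescript
// Hypothetical sketch; parseJsonOrJsonLines is NOT the library's parseJsonData().
function parseJsonOrJsonLines(text: string): unknown {
  try {
    // Fast path: a well-formed single JSON document is returned unchanged.
    return JSON.parse(text);
  } catch {
    // Fallback: treat the input as JSON-Lines, one JSON value per line.
    const rows: unknown[] = [];
    for (const line of text.split(/\r?\n/)) {
      if (line.trim() === "") continue; // tolerate blank lines
      const value: unknown = JSON.parse(line); // a malformed line still throws
      if (Array.isArray(value)) {
        rows.push(...value); // flatten a per-line trial array
      } else {
        rows.push(value); // keep a non-array value as a single row
      }
    }
    return rows;
  }
}

// A single JSON array takes the fast path.
const single = parseJsonOrJsonLines('[{"trial_index":0}]');

// Two lines, each holding one participant's trial array, are flattened.
const lines = parseJsonOrJsonLines(
  '[{"trial_index":0}]\n[{"trial_index":0},{"trial_index":1}]\n'
);
```

Trying the whole-string parse first is what keeps existing single-array callers on an unchanged code path; the line-by-line branch is only reached for input that JSON.parse already rejects.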
…SONL
Multi-participant JSON-Lines exports carry no per-row participant id, so
after flattening, trial_index repeats across participants and can't uniquely
key the extracted array/object sidecars. parseJsonData now opt-in tags each
line's rows with a 0-based participant_id (reporting whether it synthesized
one), and generate() promotes it to the leading join key for JSON input so
sidecars join unambiguously. When the id is actually synthesized it gets an
explicit "not a real subject ID" description (also avoiding an empty {} that
trips OBJECT_TYPE_MISSING); a pre-existing participant_id is left untouched.
The CLI pre-analysis/prompt and frontend pre-flight mirror the promotion so
multi-participant JSONL isn't falsely flagged as non-unique.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
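To see why trial_index alone stops being a usable key after flattening, here is a small sketch that counts colliding keys with and without the per-line id. The Trial type, the countCollidingKeys helper, and the sample rows are all made up for illustration and are not part of the library.

```typescript
// Hypothetical illustration; none of these names come from @jspsych/metadata.
type Trial = { trial_index: number; participant_id?: number };

// Count rows whose composite key was already seen on an earlier row.
function countCollidingKeys(rows: Trial[], keyColumns: (keyof Trial)[]): number {
  const seen = new Set<string>();
  let collisions = 0;
  for (const row of rows) {
    const key = JSON.stringify(keyColumns.map((column) => row[column] ?? null));
    if (seen.has(key)) {
      collisions++;
    } else {
      seen.add(key);
    }
  }
  return collisions;
}

// Two participants (two JSONL lines), two trials each, after flattening and tagging.
const flattened: Trial[] = [
  { participant_id: 0, trial_index: 0 },
  { participant_id: 0, trial_index: 1 },
  { participant_id: 1, trial_index: 0 },
  { participant_id: 1, trial_index: 1 },
];

const collisionsWithoutId = countCollidingKeys(flattened, ["trial_index"]);
const collisionsWithId = countCollidingKeys(flattened, ["participant_id", "trial_index"]);
```

With only trial_index as the key, the second participant's rows collide with the first participant's; putting participant_id first makes every key distinct.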
…e_record_id
The per-line join key synthesized for JSON-Lines input was named
participant_id, which overclaims: a JSONL line is only guaranteed to be one
source record, not necessarily one participant. Rename it to source_record_id
across the library, CLI, and frontend.
- parseJsonData: option tagParticipantId -> tagSourceRecordId, stat
synthesizedParticipantId -> synthesizedSourceRecordId; stamps source_record_id.
Synthesis now defers to a real participant_id (or an existing source_record_id)
already in the data, so a genuine subject id is never duplicated or mislabeled.
- generate(): promotes the identifier as the leading join key, preferring the
synthesized source_record_id and falling back to a real participant_id. The
synthetic-origin description now describes a "source record" (usually but not
always one participant).
- CLI: emits a one-line info log when it adds the column ("Detected JSON-Lines
input; added synthetic source_record_id ..."), surfaced via a new optional
out-param on preAnalyzeDirectory (no extra parse pass, return contract
unchanged). data.ts/index.ts pre-analysis mirror the new id.
- frontend: pre-flight + builder use source_record_id.
- Tests + changeset updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
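The deferral rule and the join-key preference described in this commit can be sketched as below. Only the names source_record_id, participant_id, trial_index, and synthesizedSourceRecordId come from the commit message; the functions tagSourceRecords and leadingJoinKeys, their signatures, and the choice to make the deferral decision once per file rather than per row are assumptions for this example.

```typescript
// Hypothetical sketch; function names and signatures are assumptions.
type Row = Record<string, unknown>;

interface TagStats {
  synthesizedSourceRecordId: boolean;
}

function isPlainObject(value: unknown): value is Row {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// records: one entry per JSONL line, each already parsed into its rows.
function tagSourceRecords(records: unknown[][], stats: TagStats): unknown[] {
  // Assumption: defer for the whole file if any row already carries an id.
  const hasExistingId = records.some((rows) =>
    rows.some(
      (row) => isPlainObject(row) && ("participant_id" in row || "source_record_id" in row)
    )
  );
  const out: unknown[] = [];
  records.forEach((rows, lineIndex) => {
    for (const row of rows) {
      if (!hasExistingId && isPlainObject(row)) {
        out.push({ ...row, source_record_id: lineIndex }); // 0-based line index
        stats.synthesizedSourceRecordId = true;
      } else {
        out.push(row); // leave real ids and non-object rows untouched
      }
    }
  });
  return out;
}

// Prefer the synthesized source_record_id, fall back to a real participant_id.
function leadingJoinKeys(rows: unknown[]): string[] {
  const has = (column: string) => rows.some((row) => isPlainObject(row) && column in row);
  if (has("source_record_id")) return ["source_record_id", "trial_index"];
  if (has("participant_id")) return ["participant_id", "trial_index"];
  return ["trial_index"];
}

const syntheticStats: TagStats = { synthesizedSourceRecordId: false };
const tagged = tagSourceRecords([[{ trial_index: 0 }], [{ trial_index: 0 }]], syntheticStats);

const realStats: TagStats = { synthesizedSourceRecordId: false };
const untouched = tagSourceRecords([[{ participant_id: "p01", trial_index: 0 }]], realStats);
```

Reporting through a stats object rather than the return value mirrors the out-param approach the commit describes, so callers that ignore it see no change in what the function returns.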
b81483d to
5a905a9
Rebased onto current Note: the "Out of scope" bullet about Local: metadata 181/181, frontend 44/44, cli 135/136 (the one failure is the pre-existing |
…changes
The DataUpload component gained a parseJsonData-based join-key preflight (#115) and a buildPsychDSDataFiles conversion step (#103/#114) since these tests were written. Expand the @jspsych/metadata mock (parseJsonData, parseCSV, buildPsychDSDataFiles, deriveFallbackBase, isValidPsychDSDataFilename, PSYCHDS_IGNORE_*) and the metadata stub (getExtractedArrays/Objects, getArrayJoinKeys) so the join-key chooser and session-sync paths exercise correctly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(frontend): add unit tests for remaining page/UI components
Covers DataUpload, Variables, Authors, JsonViewer, PageHeader, PreviewDrawer, Sidebar, ProjectInfo, and AppShell (201 tests total). Closes #9.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(frontend): update DataUpload tests for merged preflight/builder changes
The DataUpload component gained a parseJsonData-based join-key preflight (#115) and a buildPsychDSDataFiles conversion step (#103/#114) since these tests were written. Expand the @jspsych/metadata mock (parseJsonData, parseCSV, buildPsychDSDataFiles, deriveFallbackBase, isValidPsychDSDataFilename, PSYCHDS_IGNORE_*) and the metadata stub (getExtractedArrays/Objects, getArrayJoinKeys) so the join-key chooser and session-sync paths exercise correctly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Josh de Leeuw <josh.deleeuw@gmail.com>
Problem
Several jsPsych labs, and JATOS, export experiment data as JSON-Lines (.jsonl): one JSON value per line, typically one participant's full trial array per line, rather than a single JSON array. Two things broke on these files:
- generate() ran JSON.parse on the whole string, so every .jsonl file failed with "Unexpected non-whitespace character after JSON" and produced no metadata.
- The CLI and frontend filter data files by extension (.json/.csv), so .jsonl files were skipped before they ever reached generate().
Found while running the real raw exports in vucml/online_experiments through the tool: all 15 raw .jsonl files failed at the parse step.
Change
- @jspsych/metadata: new exported parseJsonData() helper. A well-formed single document is returned unchanged (no behaviour change for existing single-array callers); only when whole-string parsing fails does it fall back to line-by-line parsing, flattening per-line arrays into one observation stream. Wired into generate().
- @jspsych/metadata-cli: .jsonl is now treated exactly like .json (isDataExt/isJsonDataExt helpers): directory reader, join-key pre-pass, filename-normalization pre-pass, raw-original preservation, and CSV conversion.
- frontend: .jsonl uploads are normalised to the JSON path; join-key pre-flight and the Psych-DS file builder use parseJsonData.
Verification
psychds-validator: all 15 .jsonl exports now validate with 0 errors (they failed at parse time before).
Also in this PR: synthesize a participant_id join key (commit be70c00)
Parsing JSONL surfaced a follow-on problem: raw jsPsych exports carry no per-row participant identifier, so once JSONL is flattened (one participant per line) trial_index repeats across participants and can't uniquely key the extracted array/object sidecar CSVs. Every participant's trial 0 collapses onto the same (trial_index, element_index) key, so a sidecar row can't be joined back to a single parent trial.
- parseJsonData gains an opt-in { tagParticipantId } flag: in the JSON-Lines path it stamps each line's object rows with a 0-based participant_id (a no-op on the single-array fast path; never overwrites an existing value) and reports via an optional stats out-param whether it actually synthesized one.
- generate() enables this for JSON input and promotes participant_id to the leading join key (['participant_id', 'trial_index']) when rows carry one, so sidecars join unambiguously. CSV inputs are unaffected.
- When the id is actually synthesized it gets an explicit "not a real subject ID" description, which also avoids an empty {} description (an object with no @type → OBJECT_TYPE_MISSING). A participant_id already present in the data is left untouched.
Verification (this commit): new tests in the metadata JSONL suite (tagging, promotion, synthetic-vs-real description, single-array no-op). End-to-end on the raw .jsonl exports in githubpsyche/homophily: all three files generate metadata, pass psychds-validator, and write sidecars whose (participant_id, trial_index, element_index) keys are fully unique, e.g. view_history at 385/385 rows (vs. 7 colliding keys without participant_id).
Out of scope (pre-existing, not touched here)
- … bonuses.csv, sortablerank analysis CSVs) fail VARIABLE_MISSING_FROM_CSV_COLUMNS because JsPsychMetadata seeds trial_type/trial_index/time_elapsed into variableMeasured even when the CSV lacks them.
- OBJECT_TYPE_MISSING/INVALID_SCHEMAORG_PROPERTY warnings on the JSONL datasets (warnings only). Note: the synthesized participant_id no longer contributes to OBJECT_TYPE_MISSING, but other sources remain.
🤖 Generated with Claude Code
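For reviewers, the extension handling summarised under Change (treating .jsonl exactly like .json) might look roughly like the sketch below. The helper names isDataExt and isJsonDataExt appear in this PR, but their real signatures are not shown here, so the filename-based signatures, the extensionOf helper, and the case-insensitive comparison are all assumptions.

```typescript
// Hypothetical sketch; the real isDataExt/isJsonDataExt signatures may differ.

// Lower-cased extension including the dot, or "" when there is none.
function extensionOf(filename: string): string {
  const dot = filename.lastIndexOf(".");
  return dot <= 0 ? "" : filename.slice(dot).toLowerCase();
}

// .jsonl is accepted wherever .json is.
function isJsonDataExt(filename: string): boolean {
  const ext = extensionOf(filename);
  return ext === ".json" || ext === ".jsonl";
}

// Any supported data file: JSON, JSON-Lines, or CSV.
function isDataExt(filename: string): boolean {
  return isJsonDataExt(filename) || extensionOf(filename) === ".csv";
}

const jsonlIsJson = isJsonDataExt("sub-01_data.jsonl");
const csvIsData = isDataExt("bonuses.csv");
const txtIsData = isDataExt("notes.txt");
```

Routing every extension check through one pair of helpers means a newly supported format only has to be added in one place, instead of in each reader and pre-pass.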