Fix double-encoding of PostgreSQL json/jsonb columns; add step.json_parse#297
Conversation
Co-authored-by: intel352 <77607+intel352@users.noreply.github.com>
Pull request overview
Fixes PostgreSQL json/jsonb values being double-encoded by ensuring DB row scanners parse JSON []byte into structured Go values, and adds a step.json_parse escape hatch for non-DB JSON strings.
Changes:
- Added `parseJSONBytesOrString` and applied it across DB row scanning (`step.db_query`, `step.db_query_cached`, and `WorkflowDatabase.Query`) to avoid returning JSON as opaque strings.
- Introduced `step.json_parse` (with tests) to explicitly parse JSON strings/bytes at a dot-path into structured values.
- Registered the new step in the pipelinesteps plugin and updated documentation.
Reviewed changes
Copilot reviewed 9 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `module/pipeline_step_db_query.go` | Parses `[]byte` columns as JSON when possible; adds shared helper. |
| `module/pipeline_step_db_query_cached.go` | Uses the shared helper for `[]byte` columns. |
| `module/database.go` | Uses the shared helper so `WorkflowDatabase.Query` returns structured JSON too. |
| `module/pipeline_step_db_query_test.go` | Adds unit tests for `parseJSONBytesOrString`. |
| `module/pipeline_step_json_parse.go` | New `step.json_parse` implementation. |
| `module/pipeline_step_json_parse_test.go` | Unit tests for `step.json_parse`. |
| `plugins/pipelinesteps/plugin.go` | Registers the `step.json_parse` factory/type. |
| `plugins/pipelinesteps/plugin_test.go` | Ensures the step factory list includes `step.json_parse`. |
| `DOCUMENTATION.md` | Documents `step.json_parse` in the step table. |
| `module/scan_provider_test.go` | Formatting-only changes (gofmt). |
| `module/platform_do_database.go` | Comment alignment/formatting (gofmt). |
| `module/platform_do_database_test.go` | Formatting-only changes (gofmt). |
```go
// Execute resolves the source path, parses the value as JSON if it is a string,
// and stores the result under the configured target key.
func (s *JSONParseStep) Execute(_ context.Context, pc *PipelineContext) (*StepResult, error) {
	raw := resolveBodyFrom(s.source, pc)
```
json_parse resolves the source via resolveBodyFrom, which returns nil for missing/unresolvable paths. That means a typo in source will silently produce {target: nil} and the step will appear to succeed. Consider using a strict resolver (e.g. build a data map like JQStep.resolveInput and call resolveDottedPath) or otherwise detecting “path not found” and returning an error so misconfigurations fail fast.
Suggested change:

```go
raw := resolveBodyFrom(s.source, pc)
if raw == nil {
	return nil, fmt.Errorf("json_parse step %q: source %q not found or resolved to nil", s.name, s.source)
}
```
`module/pipeline_step_db_query.go` (Outdated)
```go
func parseJSONBytesOrString(b []byte) any {
	if len(b) == 0 {
		return string(b)
	}
	var v any
	if err := json.Unmarshal(b, &v); err == nil {
		return v
```
parseJSONBytesOrString attempts json.Unmarshal on every non-empty []byte column. For large bytea/blob fields that happen to be mostly text, json.Unmarshal can scan a lot of data before failing and add avoidable CPU overhead. Consider a cheap pre-check before unmarshalling (e.g. trim leading whitespace and require the first non-space byte to be one of {, [, ", t, f, n, -, or a digit; optionally also require utf8.Valid) so non-JSON byte slices skip the full JSON parser.
`module/pipeline_step_db_query.go` (Outdated)
```go
// parseJSONBytesOrString attempts to unmarshal b as JSON. If successful the
// parsed Go value is returned (map[string]any, []any, string, float64, bool,
// or nil). This transparently handles PostgreSQL json/jsonb columns, which the
// pgx driver delivers as raw JSON bytes rather than pre-typed Go values.
//
// If b is not valid JSON (e.g. PostgreSQL bytea binary data), string(b) is
// returned so that the existing string-fallback behaviour is preserved.
func parseJSONBytesOrString(b []byte) any {
	if len(b) == 0 {
		return string(b)
	}
	var v any
	if err := json.Unmarshal(b, &v); err == nil {
		return v
	}
	return string(b)
}
```
parseJSONBytesOrString is defined in pipeline_step_db_query.go but is now used by WorkflowDatabase.Query and DBQueryCachedStep as well. To avoid coupling core database code to a pipeline-step file (and to make the helper easier to discover/reuse), consider moving this helper to a shared location (e.g. module/database_scan_helpers.go or similar) and keep the pipeline steps calling into that shared helper.
@copilot apply changes based on the comments in this thread. Build failing, lint failing, test failing, validation failing.
Co-authored-by: intel352 <77607+intel352@users.noreply.github.com>
All three reviewer suggestions applied in 50bca53:
⏱ Benchmark Results: ✅ No significant performance regressions detected. (benchstat comparison, baseline → PR)
The pgx driver returns `json`/`jsonb` columns as raw `[]byte` JSON text. All three row-scanning sites (`DBQueryStep`, `DBQueryCachedStep`, `WorkflowDatabase.Query`) converted `[]byte` → `string`, making these values opaque strings that downstream steps (`step.json_response`, `step.jq`, `step.transform`) would JSON-encode a second time.

Changes

- `parseJSONBytesOrString` helper (`module/pipeline_step_db_query.go`) — attempts `json.Unmarshal` on any `[]byte` column value; falls back to `string(b)` for non-JSON bytes (e.g. `bytea`). Idempotent and driver-agnostic.
- `DBQueryStep`, `DBQueryCachedStep`, and `WorkflowDatabase.Query` all use the same helper, so json/jsonb values arrive in the pipeline context as `map[string]any`/`[]any` rather than strings.
- `step.json_parse` — explicit escape hatch for pipelines that source JSON strings from non-DB steps (HTTP calls, legacy drivers, text columns). Parses a string or `[]byte` value at a dot-path into a structured object; pass-through for already-parsed values.

For `json`/`jsonb` columns via pgx no YAML change is needed — `step.db_query` now delivers structured values automatically.

Original prompt
This section details the original issue you should resolve.
<issue_title>step.json_response: double-encoding of PostgreSQL json/jsonb types via pgx driver</issue_title>
<issue_description>## Problem
When `step.db_query` returns columns of PostgreSQL `json` or `jsonb` type, the pgx driver returns them as raw JSON strings (e.g., `"[{\"id\": 1}]"`), not parsed Go maps/slices. When `step.json_response` with `_from` copies these values into the response body, they are treated as strings and JSON-encoded again, resulting in double-encoded output:

```json
{ "data": "[{\"id\": 1, \"type\": \"follow-ups\"}]" }
```

Instead of:

```json
{ "data": [{"id": 1, "type": "follow-ups"}] }
```

This affects any pipeline that constructs complex JSON structures in PostgreSQL (using `json_build_object`, `json_agg`, etc.) and tries to return them via `step.json_response`.

Use Case
APIs that need to return structured responses (e.g., JSON:API format with nested `data`, `included`, `meta` objects) where the response structure is built in SQL for efficiency. This is common when:

Example: a follow-ups endpoint returning JSON:API format with relationships and included resources, all computed in a single SQL query using `json_build_object` and `json_agg`.

Proposed Solution
Option A: JSON-aware `_from` resolution

When `step.json_response` resolves a `_from` value, detect if the value is a string that looks like JSON (starts with `{` or `[`) and parse it before embedding. This could be opt-in:

Option B: `step.raw_response` with `body_from` (already exists)

`step.raw_response` with `body_from` works as a complete workaround when the entire response body is a single JSON column cast to `::text`. However, this only works for the whole response — you can't compose a response from multiple `_from` fields that are individually JSON-typed.

Option C: `step.json_parse` intermediate step

A step that parses a JSON string value back into a structured object in the pipeline context, so downstream steps see it as a map/slice rather than a string:
Current Workaround

Cast the entire response to `::text` in PostgreSQL and use `step.raw_response` with `body_from`:

This works for simple cases but forces the entire response to be a single SQL column and prevents composing responses from multiple pipeline steps or mixing SQL-generated JSON with static fields.</issue_description>
Comments on the Issue (you are @copilot in this section)
@intel352 **Additional context**: `step.transform` (Go templates) and `step.jq` could both serve as response mappers to build JSON:API / HAL / custom envelope formats from flat DB rows — but they're blocked by this same issue. When `step.db_query` returns rows with json/jsonb columns, the values arrive as strings in the pipeline context. `step.transform` templates and `step.jq` expressions then treat them as strings rather than objects, making it impossible to compose structured responses from DB-generated JSON fragments.

Resolving this would unlock using `step.jq` or `step.transform` to handle response shaping entirely in YAML, eliminating the need to push JSON envelope construction into SQL with `json_build_object`.</comment_new>

<comment_new>@intel352 **Partial workaround via foreach + transform**: For responses built from scalar DB columns (strings, ints, timestamps), `step.foreach` + `step.transform` can reshape rows without hitting the double-encoding issue:

This...