Allow regex variable extraction to run against the original filename in addition to the extracted document text. Add a source field to ExtractedVariable.
Example Config
variables:
  extracted:
    - name: account_id
      source: filename  # NEW: extract from filename instead of text
      pattern: "[0-9]{8}_.*(?P<account_id>[0-9]{4}\\.[0-9]{4}\\.[0-9]{4})"
    - name: invoice_num
      source: text  # default behavior
      pattern: "INV-(?P<invoice_num>\\d+)"
Implementation
Schema change:
- File: crates/paporg/src/config/schema.rs
- Add source: Option<VariableSource> to ExtractedVariable (line ~65)
- Add enum: enum VariableSource { Text, Filename }
- Default: Text
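A minimal sketch of the schema change, assuming ExtractedVariable currently holds at least the name and pattern fields shown in the example config. The struct below is a simplified stand-in, not the real definition in schema.rs, and the effective_source() helper is hypothetical. If the config is deserialized with serde, the lowercase YAML values (filename, text) would presumably also need something like #[serde(rename_all = "lowercase")] on the enum; that part is left out to keep the sketch dependency-free.

```rust
/// Which input a variable's regex is matched against.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VariableSource {
    /// Extracted document text (the current behavior).
    #[default]
    Text,
    /// Original filename of the document.
    Filename,
}

/// Simplified stand-in for the real struct; only the fields from the
/// example config are shown.
#[derive(Debug, Clone)]
pub struct ExtractedVariable {
    pub name: String,
    pub pattern: String,
    /// `None` means `source` was omitted in the config.
    pub source: Option<VariableSource>,
}

impl ExtractedVariable {
    /// Hypothetical helper: resolves an omitted `source` to `Text`.
    pub fn effective_source(&self) -> VariableSource {
        self.source.unwrap_or_default()
    }
}
```

Keeping the field as Option<VariableSource> means existing configs without a source key keep parsing, and the default is applied in one place.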
Variable engine change:
- File: crates/paporg/src/config/variables.rs
- Update extract_variables() (line ~36) to accept both text: &str and filename: &str
- For each pattern, check source to decide which string to match against
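The per-pattern dispatch could look roughly like the sketch below. It is not the real extract_variables(): VarSpec and select_input() are hypothetical names, the return type is assumed to be a name-to-value map, and the regex engine is replaced by a capture closure so the example compiles with the standard library alone. In the real code the closure would presumably be a named-capture lookup on the compiled pattern.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum VariableSource {
    #[default]
    Text,
    Filename,
}

/// Hypothetical, simplified view of one configured variable.
struct VarSpec {
    name: String,
    pattern: String,
    source: Option<VariableSource>,
}

/// Picks the string a pattern should be matched against.
/// An omitted source falls back to the document text.
fn select_input<'a>(source: Option<VariableSource>, text: &'a str, filename: &'a str) -> &'a str {
    match source.unwrap_or_default() {
        VariableSource::Text => text,
        VariableSource::Filename => filename,
    }
}

/// Runs every pattern against its selected input.
/// `capture(pattern, input)` stands in for the regex match and returns
/// the captured value, if any.
fn extract_variables<F>(
    vars: &[VarSpec],
    text: &str,
    filename: &str,
    capture: F,
) -> HashMap<String, String>
where
    F: Fn(&str, &str) -> Option<String>,
{
    let mut out = HashMap::new();
    for var in vars {
        let input = select_input(var.source, text, filename);
        if let Some(value) = capture(&var.pattern, input) {
            out.insert(var.name.clone(), value);
        }
    }
    out
}
```

Selecting the input once per variable keeps the matching code itself unchanged, so variables with source: text behave exactly as before.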
Pipeline change:
- File: crates/paporg/src/pipeline/runner.rs
- In step_extract_variables() (line ~222): pass the original filename alongside the text
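How the runner obtains the filename is not specified here. Assuming it has the path of the original input file available, a small helper along these lines would produce the string to pass through. original_filename() is a hypothetical name, and falling back to an empty string for a missing or non-UTF-8 name is an assumption, not a decided behavior.

```rust
use std::path::Path;

/// Returns the final path component as UTF-8.
/// Falls back to "" when the path has no file name or the name is not
/// valid UTF-8, so filename patterns simply fail to match.
fn original_filename(path: &Path) -> &str {
    path.file_name().and_then(|n| n.to_str()).unwrap_or("")
}
```

Passing only the final component, rather than the full path, keeps patterns such as the account_id example independent of where the file was picked up.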
Acceptance Criteria
- source: filename extracts variables from the original filename
- source: text (default) preserves current behavior
- source defaults to text