VistA Documentation Library (VDL) — Inventory Enrichment Pipeline

This repository produces vdl_inventory_enriched.csv, a normalized, machine-processable metadata inventory of all documents in the VA's VistA Documentation Library (VDL). The enriched CSV is the authoritative reference for YAML frontmatter in docx→markdown conversions, gap analysis, and reporting.

What Is the VDL?

The VistA Documentation Library (VDL) is the VA's official documentation catalog for VistA, the Department of Veterans Affairs' legacy MUMPS-based electronic health record system. The VDL lists 8,834 document files across 221 applications, grouped into 5 functional sections.

Directory Structure

~/projects/vista-docs/
  src/vista_docs/          ← canonical ETL pipeline (stages 1-5: crawl → sync)
  pipeline/                ← post-ingest stages 6-6.7 over md-img/frontmatter.db
  scripts/                 ← ad-hoc / one-off tools (enrich_inventory.py lives here)
  tests/
  guides/                  ← synthesised reference docs
  README.md                ← this file

~/data/vista-docs/inventory/
  vdl_inventory.csv             ← source (raw crawler output, 12 columns)
  vdl_inventory_enriched.csv    ← output (enriched, 27 columns) ← USE THIS
  vdl_inventory_schema.json     ← field type manifest (generated by script)
  vdl_inventory.json            ← JSON version of enriched inventory
  snapshots/                    ← dated CSV snapshots

Regenerating from Scratch

Requirements: Python 3.10+, no external dependencies (stdlib only).

cd ~/projects/vista-docs
python3 scripts/enrich_inventory.py

Output:

Writes vdl_inventory_enriched.csv (27 columns, UTF-8)
Writes vdl_inventory_schema.json (field type manifest)
Prints a fill-rate and distribution summary to stdout

The script is idempotent: running it twice on the same input produces identical output.
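
One way to check the idempotence claim is to hash the output file, rerun the step, and hash it again. The following is a minimal sketch under stated assumptions: `file_sha256` and `is_idempotent` are hypothetical helpers that do not exist in the repository, and `run` stands in for however you invoke the script.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_idempotent(run: Callable[[], object], output: Path) -> bool:
    """Run the step twice and report whether the output bytes are unchanged."""
    run()
    first = file_sha256(output)
    run()
    return file_sha256(output) == first
```

In practice `run` could wrap `subprocess.run(['python3', 'scripts/enrich_inventory.py'], check=True)` and `output` could point at the enriched CSV. Comparing bytes rather than parsed rows also catches changes in column order or line endings.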

Corpus Overview

Metric	Value
Total records	8,834
Genuine VistA docs	7,493
Active app docs (clean)	~5,231
Unique document titles	3,646
VistA applications	221
Functional sections	5

5 Sections

section_code	section_name	Records	Apps
CLI	Clinical	5,790	111
FIN	Financial-Administrative	1,485	38
GUI	VistA/GUI Hybrids (formerly HealtheVet)	780	23
INF	Infrastructure	777	48
MON	Monograph	2	1

Document Formats

.pdf: 5,097 (57.7%)
.docx: 3,730 (42.2%)
.doc (legacy): 7 (0.1%)

Nearly every document exists in both PDF and DOCX; pairs share identical metadata except for the URL and filename fields.

App Status

Active: 5,330 records / 192 apps
Archive: 3,380 records / 188 apps
Decommissioned: 124 records / 16 apps

Enriched Schema (27 columns)

Column order: section identity → app identity → patch identity → document identity → flags → URLs.

Section identity

Column	Fill	Description
`section_name`	100%	Full VDL section name
`section_code`	100%	`CLI` / `FIN` / `GUI` / `INF` / `MON` — use for grouping/filtering

App identity

Column	Fill	Description
`app_name_full`	100%	Full application name (parenthetical abbreviation stripped) — display only
`app_name_abbrev`	100%	Primary app grouping key. VDL-assigned for 166 apps; curated fallback for 58 apps with no VDL code
`app_status`	100%	`active` / `archive` / `decommissioned`
`decommission_date`	1.3%	Normalized to `YYYY-MM`. Only 115 rows populated — do not use as a coverage signal

Patch identity

Column	Fill	Description
`pkg_ns`	56.4%	VistA MUMPS package namespace (e.g. `DG`, `PSO`). Differs from `app_name_abbrev` in 27 apps (16%)
`patch_ver`	61.3%	Version string (e.g. `5.3`). Do not sort as string — use `patch_ver_major`/`patch_ver_minor`
`patch_ver_major`	61.3%	Major version integer
`patch_ver_minor`	61.3%	Minor version integer; `0` when major-only
`patch_num`	40.6%	Patch number integer (leading zeros stripped)
`patch_id`	56.4%	Canonical VistA patch ID: `NS*V*P` for patch docs; `NS*V` for anchor docs
`patch_id_full`	0.5%	Full raw multi-namespace prefix (e.g. `DG*5.3*554/TIU*1*184`); set only when `multi_ns=1`
`multi_ns`	n/a	`1` if title contains multiple `NS*V*P` segments; `0` otherwise
`group_key`	61.3%	Functional group key: `app_name_abbrev:pkg_ns:patch_ver` — groups anchor + patch docs together
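
The relationships between the patch columns can be sketched as follows. This is an illustration, not the script's actual code: `split_version`, `make_patch_id`, and `make_group_key` are hypothetical names, the asterisk-separated form follows the usual VistA patch ID convention (namespace, version, patch number), and versions are assumed to contain at most one dot.

```python
from typing import Optional, Tuple


def split_version(patch_ver: str) -> Tuple[int, int]:
    """Split '5.3' into (5, 3); a major-only version such as '7' gives (7, 0)."""
    major, _, minor = patch_ver.partition('.')
    return int(major), int(minor) if minor else 0


def make_patch_id(pkg_ns: str, patch_ver: str, patch_num: Optional[int]) -> str:
    """Build NS*V*P for a patch doc, or NS*V for an anchor doc (no patch number)."""
    if patch_num is None:
        return f'{pkg_ns}*{patch_ver}'
    return f'{pkg_ns}*{patch_ver}*{patch_num}'


def make_group_key(app_name_abbrev: str, pkg_ns: str, patch_ver: str) -> str:
    """Join the three parts with ':' as described for group_key."""
    return f'{app_name_abbrev}:{pkg_ns}:{patch_ver}'
```

Because `group_key` omits the patch number, an anchor doc and every patch doc for the same app, namespace, and version share one key.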

Document identity

Column	Fill	Description
`doc_code`	91.6%	Normalized doc type abbreviation (e.g. `TM`, `RN`, `DIBR`, `UG`) — use for filtering/grouping
`doc_label`	91.6%	Full canonical doc type label — use for display only
`doc_layer`	100%	Structural role: `anchor` / `patch` / `plain` — see below
`doc_title`	100%	Original VDL title — do not parse/normalize; enrichment fields extract structured parts
`doc_filename`	100%	Document filename. Named `doc_filename` to avoid confusion with VistA FileMan database "files"
`doc_slug`	100%	URL-safe stable identifier from filename stem (lowercase, non-alnum→`_`). PDF/DOCX pairs share slug
`doc_format`	100%	`pdf` / `docx` / `doc` — file format without dot
`doc_subject`	~68%	Best-effort qualifier from title (subsystem, role, GUI sub-version). Empty = no residual after stripping — not missing data
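
The `doc_slug` rule (filename stem, lowercased, non-alphanumerics replaced with `_`) might look like the sketch below. `make_doc_slug` is a hypothetical name, and the one-for-one replacement is an assumption: the real script may collapse runs of separators or trim them, so treat the stored column as the source of truth.

```python
import re
from pathlib import PurePosixPath


def make_doc_slug(doc_filename: str) -> str:
    """Lowercase the filename stem and replace each non-alphanumeric character with '_'."""
    stem = PurePosixPath(doc_filename).stem  # drops only the final extension
    return re.sub(r'[^a-z0-9]', '_', stem.lower())
```

Since the extension is dropped before slugging, a PDF and a DOCX with the same stem produce the same slug, which is what lets pairs share one identifier.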

Flags

Column	Description
`noise_type`	`''` = genuine VistA doc; `vba_form` = VBA benefits form (1,192 rows); `va_ref` = non-VistA VA reference (149 rows). Exclude non-empty from all analysis.

URLs

Column	Fill	Description
`app_url`	100%	URL to the app's VDL page
`doc_url`	100%	Direct URL to the document file
`companion_url`	84%	URL of the paired format (PDF↔DOCX). Empty if no pair
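
To illustrate how `companion_url` relates paired rows, here is a minimal sketch that fills it from rows sharing a slug. It assumes the pairing key is `(app_name_abbrev, doc_slug)` and that only `pdf` and `docx` are paired; `add_companion_urls` is a hypothetical helper, and the script's real pairing logic may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each paired format points at the other; legacy 'doc' rows get no companion.
OTHER_FORMAT = {'pdf': 'docx', 'docx': 'pdf'}


def add_companion_urls(rows: List[Dict[str, str]]) -> None:
    """Fill companion_url in place by pairing rows that share app and slug."""
    by_key: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for row in rows:
        key = (row['app_name_abbrev'], row['doc_slug'])
        # If a format appears twice under one key, the last URL wins.
        by_key[key][row['doc_format']] = row['doc_url']
    for row in rows:
        key = (row['app_name_abbrev'], row['doc_slug'])
        other = OTHER_FORMAT.get(row['doc_format'])
        row['companion_url'] = by_key[key].get(other, '') if other else ''
```

Rows without a partner end up with an empty string, matching the "Empty if no pair" convention in the table.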

doc_layer — Structural Role

Derived from patch_ver and patch_num presence. Use directly — do not re-derive at query time.

Value	Rule	Count	Meaning
`anchor`	`patch_ver` set AND `patch_num` empty	1,918	Stable versioned base doc (Technical Manual v5.3, User Guide v7.0)
`patch`	`patch_num` set	3,584	Per-patch change doc (Release Notes, DIBR for a specific patch number)
`plain`	Neither set	3,332	No version or patch identity parsed — plain reference docs, forms, unclassified
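
The rules in the table can be written down as a small function, which is useful as a one-off consistency check on the CSV rather than something to run at query time. This is a sketch: `expected_doc_layer` and `layer_mismatches` are hypothetical names, and an empty string is taken to mean "not set".

```python
from typing import Dict, Iterable, List


def expected_doc_layer(patch_ver: str, patch_num: str) -> str:
    """Apply the table's rules; an empty string means the field is not set."""
    if patch_num:
        return 'patch'
    if patch_ver:
        return 'anchor'
    return 'plain'


def layer_mismatches(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return rows whose stored doc_layer disagrees with the rule."""
    return [
        row for row in rows
        if row['doc_layer'] != expected_doc_layer(row['patch_ver'], row['patch_num'])
    ]
```

Feeding it the rows from `csv.DictReader` over the enriched CSV should return an empty list if the column follows the table.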

doc_code Vocabulary

doc_code	doc_label	Role
TM	Technical Manual	Server internals, options, files
UM	User Manual	End-user interface
UG	User Guide	End-user interface (alternate label)
IG	Installation Guide	Install/upgrade instructions
IG-IMP	Implementation Guide	Site setup and configuration
DIBR	Deployment, Installation, Back-Out, and Rollback Guide	Modern patch deployment
RN	Release Notes	Per-patch change summary
PDD	Patch Description Document	Alternate per-patch change doc
VDD	Version Description Document	Version-level change doc
SG	Security Guide	Security configuration
POM	Production Operations Manual	Ops procedures
CFG	Configuration Guide	Configuration reference
QRG	Quick Reference Guide	Quick reference
API	API Manual	Programmatic interface
INT	Interface Specification	HL7 / integration specs
TG	Technical Guide	Technical reference
TRG	Training Guide/Manual	Training material
CRU	Clinical Reminder Update	PXRM update docs
CVG	Conversion Guide	Data migration
FAQ	Frequently Asked Questions	FAQ document
APX	Appendix	Supplemental appendix
DESC	Description Document	Generic description
REF	Reference	Miscellaneous reference
FORM	VBA Form	VBA benefits form — noise, exclude from analysis

Standard Analysis Filters

Apply to every analysis unless explicitly studying the excluded rows:

import pandas as pd

df = pd.read_csv('vdl_inventory_enriched.csv', dtype=str).fillna('')

# Exclude noise (VBA forms + VA reference docs)
df = df[df['noise_type'] == '']

# Exclude archived/decommissioned apps — unless studying history
df = df[df['app_status'] == 'active']

Version sorting — always use integer columns, never the string:

# Cast the version columns to int (empty string becomes 0), then sort on the integers
cols = ['patch_ver_major', 'patch_ver_minor', 'patch_num']
key = df[cols].replace('', '0').astype(int)
df = df.loc[key.sort_values(cols).index]

Count unique documents (deduplicate PDF/DOCX pairs):

# Counts PDF rows only, so documents published without a PDF are not included
df[df['doc_format'] == 'pdf'].shape[0]

YAML Frontmatter Template

For docx→markdown conversion, use the enriched CSV as the source for all frontmatter:

row = df[(df['doc_slug'] == 'your_slug') & (df['doc_format'] == 'docx')].iloc[0]
frontmatter = {
    'title':        row['doc_title'],
    'app':          row['app_name_abbrev'],
    'app_name':     row['app_name_full'],
    'section':      row['section_code'],
    'status':       row['app_status'],
    'pkg_ns':       row['pkg_ns'] or None,
    'patch_id':     row['patch_id'] or None,
    'patch_num':    int(row['patch_num']) if row['patch_num'] else None,
    'patch_ver':    row['patch_ver'] or None,
    'doc_type':     row['doc_code'] or None,
    'doc_layer':    row['doc_layer'],
    'doc_subject':  row['doc_subject'] or None,
    'doc_slug':     row['doc_slug'],
    'group_key':    row['group_key'] or None,
    'pdf_url':      row['companion_url'] or None,
    'doc_url':      row['doc_url'],
    'app_url':      row['app_url'],
}
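
To turn a dict like the one above into a frontmatter block without a third-party YAML library, one option is to write each value with `json.dumps`, relying on JSON scalars being valid YAML. This is a minimal sketch for flat dicts with plain identifier keys; `render_frontmatter` is a hypothetical helper and not part of the pipeline.

```python
import json
from typing import Any, Dict


def render_frontmatter(fields: Dict[str, Any]) -> str:
    """Serialize a flat dict as a YAML frontmatter block.

    Each value is written with json.dumps, so strings are double-quoted,
    None becomes null, and integers stay bare.
    """
    lines = ['---']
    for key, value in fields.items():
        lines.append(f'{key}: {json.dumps(value, ensure_ascii=False)}')
    lines.append('---')
    return '\n'.join(lines) + '\n'
```

Quoting every string avoids surprises with titles that contain colons or leading asterisks, at the cost of slightly noisier output than a dedicated YAML emitter would produce.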

Known Data Quality Issues

VBA Forms (noise_type = 'vba_form')

1,192 rows (8 VBA benefit forms × 149 app pages) are not VistA documentation. The forms all share VDL app node appid=373, which the VDL replicates across every app page. The crawler did not malfunction — the VDL itself lists these links. URLs point to vba.va.gov or benefits.va.gov.

Always exclude rows where noise_type != ''.

VA Strategic Plan (noise_type = 'va_ref')

149 rows (the VA 2014–2020 Strategic Plan) are similarly attached to every app page via a shared URL. The document is on va.gov but not under the /vdl/ path. Classified as va_ref.
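
The two noise categories described above can be approximated from the document URL alone. The sketch below is an assumption about how such a classifier could work, not the script's actual logic: `classify_noise` is a hypothetical name, and the host and path tests are taken only from the descriptions in this section.

```python
from urllib.parse import urlparse

# Hosts named in the VBA Forms description above.
VBA_HOSTS = {'vba.va.gov', 'benefits.va.gov'}


def classify_noise(doc_url: str) -> str:
    """Return 'vba_form', 'va_ref', or '' (genuine VistA doc) for a document URL."""
    parsed = urlparse(doc_url)
    host = (parsed.hostname or '').lower()
    if host.startswith('www.'):
        host = host[4:]
    if host in VBA_HOSTS:
        return 'vba_form'
    # On va.gov but outside the /vdl/ path: treated as a non-VistA VA reference.
    if host == 'va.gov' and not parsed.path.lower().startswith('/vdl/'):
        return 'va_ref'
    return ''
```

A URL-based rule like this is cheap to audit, but it would also flag any other va.gov document outside `/vdl/`, so the counts should be compared against the 1,192 and 149 rows reported here.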

58 Apps with No VDL-Assigned Abbreviation

These apps never had a parenthetical code in app_name in the source CSV. Resolved via:

APP_ABBREV_FALLBACK dict in the script (curated for known apps)
pkg_ns as last resort

Fallback abbreviations are derived, not VDL-assigned — do not treat them as authoritative VDL identifiers. Affected apps include: CPRS: Clinical Reminder Updates (→PXRM), KAAJEE, HL7 (VistA Messaging), Laboratory sub-packages, Patient Record Flags (→PRF), etc.
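
The resolution order described above (VDL code, then the curated dict, then `pkg_ns`) can be sketched like this. Only `APP_ABBREV_FALLBACK` is a name from the script; its two entries here come from the examples in this section, while `resolve_abbrev` and the pattern for recognizing a VDL code are assumptions made for illustration.

```python
import re
from typing import Tuple

# Two entries taken from the examples in this README; the real dict is larger.
APP_ABBREV_FALLBACK = {
    'CPRS: Clinical Reminder Updates': 'PXRM',
    'Patient Record Flags': 'PRF',
}

# Assumption: a VDL code is a trailing parenthetical of capitals and digits, e.g. "(ADT)".
_CODE_RE = re.compile(r'^(?P<name>.*\S)\s*\((?P<code>[A-Z][A-Z0-9]*)\)\s*$')


def resolve_abbrev(app_name: str, pkg_ns: str = '') -> Tuple[str, str]:
    """Return (app_name_full, app_name_abbrev): VDL code, then fallback, then pkg_ns."""
    name = app_name.strip()
    match = _CODE_RE.match(name)
    if match:
        return match['name'], match['code']
    return name, APP_ABBREV_FALLBACK.get(name) or pkg_ns
```

Restricting the pattern to capitals and digits keeps a descriptive parenthetical such as the one in "HL7 (VistA Messaging)" from being mistaken for an abbreviation.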

patch_ver String Sort Is Broken

As strings, '5.9' > '5.10', so a string sort places 5.10 before 5.9. Always sort by the patch_ver_major / patch_ver_minor integers.
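
A short demonstration of the difference, using made-up version strings and a hypothetical `version_key` helper:

```python
versions = ['5.9', '5.10', '5.3']

# String sort compares character by character, so '5.10' lands before '5.3'.
string_sorted = sorted(versions)


def version_key(version: str):
    """Turn 'major.minor' into a tuple of ints for numeric ordering."""
    major, _, minor = version.partition('.')
    return int(major), int(minor or 0)


# Sorting on integer tuples gives the intended order.
numeric_sorted = sorted(versions, key=version_key)

print(string_sorted)   # ['5.10', '5.3', '5.9']
print(numeric_sorted)  # ['5.3', '5.9', '5.10']
```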

`pkg_ns` ≠ `app_name_abbrev` in 27 Apps (16%)

VDL "application" and VistA "package" are not 1:1. Examples:

ACR (Ambulatory Care Reporting) → pkg_ns=SD (delivered inside Scheduling)
ADT → pkg_ns is both ADT and DG
ASU → pkg_ns=USR

Use app_name_abbrev for VDL-level grouping; use pkg_ns for patch history linkage.

Fields Dropped from Source CSV

Dropped Field	Reason
`doc_type`	File format (PDF/DOCX) — identical to `doc_file_ext`; had 1,341 nulls vs 0 in `doc_file_ext`
`doc_file_ext`	Redundant with `doc_format` (same values with a leading dot); `doc_format` is cleaner for YAML
`doc_date`	Only 26/8,834 rows populated; free-form text from title parentheticals; not sortable
`app_code`	Identical to `app_name_abbrev` in all 166 named apps; `app_name_abbrev` fills 2 additional gaps

Source Field Renames

Source name	Enriched name	Reason
`filename`	`doc_filename`	Avoid confusion with VistA FileMan database "files"
`file_ext`	`doc_file_ext`	Consistency with `doc_` prefix (then dropped for `doc_format`)
`app_name`	`app_name_full`	Clarifies this is the full name after stripping the abbreviation
`vista_pkg_ns`	`pkg_ns`	Shorter; namespace concept is clear from context

doc_subject Cleaning Rules

doc_subject is derived entirely from doc_title by stripping the patch prefix and doc_label. The following artifact patterns are cleared automatically by the script (clean_doc_subject()):

Equals app_name_abbrev (case-insensitive) — redundant app echo
Equals doc_title exactly — full title echo (parsing failure)
Equals doc_label exactly — redundant type echo
Starts with / — multi-namespace title continuation fragment (e.g. /WEBP*1*1 PCMM Web)
Bare 4-digit year — e.g. 2019
Bare digits/dots only — version fragments (e.g. .01, 2.0)
Pure punctuation/whitespace — e.g. * -
Starts with *digits — patch artifact (e.g. *1064)
Entirely a NS*V*P or NS*V*P/NS*V*P pattern — patch ID residue
Length ≤ 2 AND no alphabetic characters — clears .1, .7 but keeps AP, IO

Empty doc_subject means no residual text after stripping structured parts — not missing data.
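
The rules above translate fairly directly into code. The version below is a sketch reconstructed from the list, not the script's `clean_doc_subject()` itself: the signature, the regular expressions, and the shape assumed for a patch ID (`NS*V*P`, optionally followed by `/NS*V*P`) are all assumptions.

```python
import re

_DIGITS_DOTS = re.compile(r'[\d.]+')   # bare years and version fragments: '2019', '.01', '2.0'
_PUNCT_ONLY = re.compile(r'[\W_]*')    # punctuation or whitespace only, including empty
_STAR_DIGITS = re.compile(r'\*\d')     # patch artifact such as '*1064'
_NSVP = r'[A-Z][A-Z0-9]*\*\d+(?:\.\d+)?\*\d+'
_NSVP_ONLY = re.compile(rf'{_NSVP}(?:/{_NSVP})?')


def clean_doc_subject(subject: str, doc_title: str, doc_label: str,
                      app_name_abbrev: str) -> str:
    """Return '' when the subject matches an artifact pattern, else the stripped subject."""
    s = subject.strip()
    is_artifact = (
        s.lower() == app_name_abbrev.lower()      # redundant app echo
        or s == doc_title                         # full title echo
        or s == doc_label                         # redundant type echo
        or s.startswith('/')                      # multi-namespace continuation
        or _DIGITS_DOTS.fullmatch(s) is not None
        or _PUNCT_ONLY.fullmatch(s) is not None
        or _STAR_DIGITS.match(s) is not None
        or _NSVP_ONLY.fullmatch(s) is not None    # patch ID residue
        or (len(s) <= 2 and not any(c.isalpha() for c in s))
    )
    return '' if is_artifact else s
```

For example, `clean_doc_subject('AP', 'title', 'label', 'LR')` keeps `'AP'` because it contains letters, while `'.1'` and `'*1064'` are cleared.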
