This repository produces vdl_inventory_enriched.csv, a normalized, machine-processable metadata
inventory of all documents in the VA's VistA Digital Library (VDL). The enriched CSV is the gold
standard reference for YAML frontmatter in docx→markdown conversions, gap analysis, and reporting.
The VistA Digital Library (VDL) is the VA's official documentation catalog for VistA, the Department of Veterans Affairs' legacy MUMPS-based electronic health record system. The VDL lists 8,834 document files across 221 applications, grouped into 5 functional sections.

    ~/projects/vista-docs/
      src/vista_docs/              ← canonical ETL pipeline (stages 1-5: crawl → sync)
      pipeline/                    ← post-ingest stages 6-6.7 over md-img/frontmatter.db
      scripts/                     ← ad-hoc / one-off tools (enrich_inventory.py lives here)
      tests/
      guides/                      ← synthesised reference docs
      README.md                    ← this file

    ~/data/vista-docs/inventory/
      vdl_inventory.csv            ← source (raw crawler output, 12 columns)
      vdl_inventory_enriched.csv   ← output (enriched, 27 columns) ← USE THIS
      vdl_inventory_schema.json    ← field type manifest (generated by script)
      vdl_inventory.json           ← JSON version of enriched inventory
      snapshots/                   ← dated CSV snapshots

Requirements: Python 3.10+, no external dependencies (stdlib only).

    cd ~/projects/vista-docs
    python3 scripts/enrich_inventory.py

Output:

- Writes `vdl_inventory_enriched.csv` (27 columns, UTF-8)
- Writes `vdl_inventory_schema.json` (field type manifest)
- Prints a fill-rate and distribution summary to stdout

The script is idempotent: running it twice on the same input produces identical output.
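One way to check the idempotency claim is to compare a hash of the output from two consecutive runs. The sketch below is illustrative only: `file_sha256` and `outputs_match` are hypothetical helper names, not part of `enrich_inventory.py`.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def outputs_match(first_run: Path, second_run: Path) -> bool:
    """True when two output files are byte-for-byte identical."""
    return file_sha256(first_run) == file_sha256(second_run)
```

A typical use would be to copy `vdl_inventory_enriched.csv` aside after the first run, run the script again, and pass both paths to `outputs_match`.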
| Metric | Value |
|---|---|
| Total records | 8,834 |
| Genuine VistA docs | 7,493 |
| Active app docs (clean) | ~5,231 |
| Unique document titles | 3,646 |
| VistA applications | 221 |
| Functional sections | 5 |
| section_code | section_name | Records | Apps |
|---|---|---|---|
| CLI | Clinical | 5,790 | 111 |
| FIN | Financial-Administrative | 1,485 | 38 |
| GUI | VistA/GUI Hybrids (formerly HealtheVet) | 780 | 23 |
| INF | Infrastructure | 777 | 48 |
| MON | Monograph | 2 | 1 |
By file format:

- `.pdf`: 5,097 (57.7%)
- `.docx`: 3,730 (42.2%)
- `.doc` (legacy): 7 (0.1%)
- Nearly every document exists in both PDF and DOCX; pairs share identical metadata except the URL and filename fields.

By app status:

- Active: 5,330 records / 192 apps
- Archive: 3,380 records / 188 apps
- Decommissioned: 124 records / 16 apps
Column order: section identity → app identity → patch identity → document identity → flags → URLs.
Section identity:

| Column | Fill | Description |
|---|---|---|
| `section_name` | 100% | Full VDL section name |
| `section_code` | 100% | CLI / FIN / GUI / INF / MON; use for grouping/filtering |

App identity:

| Column | Fill | Description |
|---|---|---|
| `app_name_full` | 100% | Full application name (parenthetical abbreviation stripped); display only |
| `app_name_abbrev` | 100% | Primary app grouping key. VDL-assigned for 166 apps; curated fallback for 58 apps with no VDL code |
| `app_status` | 100% | active / archive / decommissioned |
| `decommission_date` | 1.3% | Normalized to YYYY-MM. Only 115 rows populated; do not use as a coverage signal |

Patch identity:

| Column | Fill | Description |
|---|---|---|
| `pkg_ns` | 56.4% | VistA MUMPS package namespace (e.g. `DG`, `PSO`). Differs from `app_name_abbrev` in 27 apps (16%) |
| `patch_ver` | 61.3% | Version string (e.g. `5.3`). Do not sort as string; use `patch_ver_major`/`patch_ver_minor` |
| `patch_ver_major` | 61.3% | Major version integer |
| `patch_ver_minor` | 61.3% | Minor version integer; 0 when major-only |
| `patch_num` | 40.6% | Patch number integer (leading zeros stripped) |
| `patch_id` | 56.4% | Canonical VistA patch ID: `NS*V*P` for patch docs; `NS*V` for anchor docs |
| `patch_id_full` | 0.5% | Full raw multi-namespace prefix (e.g. `DG*5.3*554/TIU*1*184`); set only when `multi_ns=1` |
| `multi_ns` | n/a | 1 if title contains multiple `NS*V*P` segments; 0 otherwise |
| `group_key` | 61.3% | Functional group key: `app_name_abbrev:pkg_ns:patch_ver`; groups anchor + patch docs together |
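
To make the `patch_id` and `group_key` formats concrete, here is a minimal sketch that assembles both from their parts. The function names are hypothetical and the script's own construction may differ; in particular, the sketch assumes `patch_num` arrives as a digit string.

```python
def build_patch_id(pkg_ns: str, patch_ver: str, patch_num: str = "") -> str:
    """Return NS*V*P when a patch number is present, NS*V otherwise."""
    parts = [pkg_ns, patch_ver]
    if patch_num:
        # int() drops leading zeros, matching the patch_num column description
        parts.append(str(int(patch_num)))
    return "*".join(parts)


def build_group_key(app_name_abbrev: str, pkg_ns: str, patch_ver: str) -> str:
    """Join the three parts as app_name_abbrev:pkg_ns:patch_ver."""
    return ":".join([app_name_abbrev, pkg_ns, patch_ver])
```

Because `group_key` omits the patch number, an anchor doc (`DG*5.3`) and its patch docs (`DG*5.3*554`) end up under the same key.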

Document identity:

| Column | Fill | Description |
|---|---|---|
| `doc_code` | 91.6% | Normalized doc type abbreviation (e.g. TM, RN, DIBR, UG); use for filtering/grouping |
| `doc_label` | 91.6% | Full canonical doc type label; use for display only |
| `doc_layer` | 100% | Structural role: anchor / patch / plain (see below) |
| `doc_title` | 100% | Original VDL title. Do not parse or normalize it; the enrichment fields extract its structured parts |
| `doc_filename` | 100% | Document filename. Named `doc_filename` to avoid confusion with VistA FileMan database "files" |
| `doc_slug` | 100% | URL-safe stable identifier from filename stem (lowercase, non-alnum→`_`). PDF/DOCX pairs share slug |
| `doc_format` | 100% | pdf / docx / doc; file format without dot |
| `doc_subject` | ~68% | Best-effort qualifier from title (subsystem, role, GUI sub-version). Empty = no residual after stripping, not missing data |
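
The `doc_slug` rule (filename stem, lowercased, non-alphanumerics mapped to `_`) can be sketched as follows. This is an approximation under stated assumptions: `make_doc_slug` is a hypothetical name, and the sketch replaces each non-alphanumeric character individually, which may not match how the script treats runs of such characters.

```python
import re
from pathlib import PurePosixPath


def make_doc_slug(doc_filename: str) -> str:
    """Lowercase the filename stem and map every non-alphanumeric character to '_'."""
    stem = PurePosixPath(doc_filename).stem.lower()
    return re.sub(r"[^a-z0-9]", "_", stem)
```

Since only the stem is used, a PDF and its DOCX companion produce the same slug, which is what makes the slug usable as a pairing key.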

Flags:

| Column | Description |
|---|---|
| `noise_type` | `''` = genuine VistA doc; `vba_form` = VBA benefits form (1,192 rows); `va_ref` = non-VistA VA reference (149 rows). Exclude non-empty from all analysis. |

URLs:

| Column | Fill | Description |
|---|---|---|
| `app_url` | 100% | URL to the app's VDL page |
| `doc_url` | 100% | Direct URL to the document file |
| `companion_url` | 84% | URL of the paired format (PDF↔DOCX). Empty if no pair |

`doc_layer` is derived from the presence of `patch_ver` and `patch_num`. Use the column directly; do not re-derive it at query time.

| Value | Rule | Count | Meaning |
|---|---|---|---|
| `anchor` | `patch_ver` set AND `patch_num` empty | 1,918 | Stable versioned base doc (Technical Manual v5.3, User Guide v7.0) |
| `patch` | `patch_num` set | 3,584 | Per-patch change doc (Release Notes, DIBR for a specific patch number) |
| `plain` | Neither set | 3,332 | No version or patch identity parsed: plain reference docs, forms, unclassified |
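
The three rules in the table reduce to a short decision function. The sketch below only makes the rule explicit (for example, to spot-check a few rows); `derive_doc_layer` is a hypothetical name, and analyses should still read the precomputed `doc_layer` column.

```python
def derive_doc_layer(patch_ver: str, patch_num: str) -> str:
    """Apply the table's rules to string-valued columns ('' means not set)."""
    if patch_num:
        return "patch"   # a patch number wins, whether or not a version is present
    if patch_ver:
        return "anchor"  # versioned, but not tied to a specific patch
    return "plain"       # neither version nor patch identity was parsed
```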
| doc_code | doc_label | Role |
|---|---|---|
| TM | Technical Manual | Server internals, options, files |
| UM | User Manual | End-user interface |
| UG | User Guide | End-user interface (alternate label) |
| IG | Installation Guide | Install/upgrade instructions |
| IG-IMP | Implementation Guide | Site setup and configuration |
| DIBR | Deployment, Installation, Back-Out, and Rollback Guide | Modern patch deployment |
| RN | Release Notes | Per-patch change summary |
| PDD | Patch Description Document | Alternate per-patch change doc |
| VDD | Version Description Document | Version-level change doc |
| SG | Security Guide | Security configuration |
| POM | Production Operations Manual | Ops procedures |
| CFG | Configuration Guide | Configuration reference |
| QRG | Quick Reference Guide | Quick reference |
| API | API Manual | Programmatic interface |
| INT | Interface Specification | HL7 / integration specs |
| TG | Technical Guide | Technical reference |
| TRG | Training Guide/Manual | Training material |
| CRU | Clinical Reminder Update | PXRM update docs |
| CVG | Conversion Guide | Data migration |
| FAQ | Frequently Asked Questions | FAQ document |
| APX | Appendix | Supplemental appendix |
| DESC | Description Document | Generic description |
| REF | Reference | Miscellaneous reference |
| FORM | VBA Form | VBA benefits form — noise, exclude from analysis |
Apply to every analysis unless explicitly studying the excluded rows:

    import pandas as pd

    df = pd.read_csv('vdl_inventory_enriched.csv', dtype=str).fillna('')

    # Exclude noise (VBA forms + VA reference docs)
    df = df[df['noise_type'] == '']

    # Exclude archived/decommissioned apps, unless studying history
    df = df[df['app_status'] == 'active']

Version sorting: always use the integer columns, never the string:

    # Columns are read as strings, so cast to int before sorting ('' becomes 0)
    ver_cols = ['patch_ver_major', 'patch_ver_minor', 'patch_num']
    sort_keys = df[ver_cols].replace('', '0').astype(int)
    df_sorted = df.loc[sort_keys.sort_values(ver_cols).index]

Count unique documents (deduplicate PDF/DOCX pairs):

    # Counts one row per PDF; a document published only as DOCX is not counted
    df[df['doc_format'] == 'pdf'].shape[0]

For docx→markdown conversion, use the enriched CSV as the source for all frontmatter:

    # Select the DOCX row for one document; .iloc[0] raises IndexError if nothing matches
    row = df[(df['doc_slug'] == 'your_slug') & (df['doc_format'] == 'docx')].iloc[0]

    # Empty strings become None so optional keys can be omitted from the YAML
    frontmatter = {
        'title': row['doc_title'],
        'app': row['app_name_abbrev'],
        'app_name': row['app_name_full'],
        'section': row['section_code'],
        'status': row['app_status'],
        'pkg_ns': row['pkg_ns'] or None,
        'patch_id': row['patch_id'] or None,
        'patch_num': int(row['patch_num']) if row['patch_num'] else None,
        'patch_ver': row['patch_ver'] or None,
        'doc_type': row['doc_code'] or None,
        'doc_layer': row['doc_layer'],
        'doc_subject': row['doc_subject'] or None,
        'doc_slug': row['doc_slug'],
        'group_key': row['group_key'] or None,
        'pdf_url': row['companion_url'] or None,  # for a DOCX row, the companion is the PDF
        'doc_url': row['doc_url'],
        'app_url': row['app_url'],
    }

1,192 rows (8 VBA benefit forms × 149 app pages) are not VistA documentation. The forms all share VDL app node appid=373, which the VDL replicates across every app page. The crawler did not malfunction; the VDL itself lists these links. URLs point to vba.va.gov or benefits.va.gov.
Always exclude: noise_type != ''
149 rows (the VA 2014–2020 Strategic Plan) are similarly attached to every app page via a shared
URL. The document is on va.gov but not under the /vdl/ path. Classified as va_ref.
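
For pipelines that follow the script's stdlib-only convention, the noise exclusion can be done without pandas. The sketch below is one possible shape, not code from the repository; `load_rows` and `split_noise` are hypothetical names.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read the enriched CSV into a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def split_noise(rows):
    """Return (genuine_rows, noise_counts), keeping rows whose noise_type is empty."""
    genuine = [row for row in rows if not row.get("noise_type")]
    noise_counts = Counter(
        row["noise_type"] for row in rows if row.get("noise_type")
    )
    return genuine, noise_counts
```

Run against the full inventory, `noise_counts` would be expected to show the `vba_form` and `va_ref` totals described above, which gives a quick sanity check before any analysis.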
Apps with no VDL-assigned abbreviation never had a parenthetical code in `app_name` in the source CSV. Their `app_name_abbrev` is resolved via:

- the `APP_ABBREV_FALLBACK` dict in the script (curated for known apps)
- `pkg_ns` as a last resort

Fallback abbreviations are derived, not VDL-assigned; do not treat them as authoritative VDL identifiers. Affected apps include: CPRS: Clinical Reminder Updates (→PXRM), KAAJEE, HL7 (VistA Messaging), Laboratory sub-packages, Patient Record Flags (→PRF), etc.
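
The resolution order described above might look like the sketch below. Everything here is an assumption for illustration: `resolve_abbrev` is a hypothetical name, `EXAMPLE_FALLBACK` holds only the two mappings quoted in this README (the real `APP_ABBREV_FALLBACK` lives in the script), and the regex assumes VDL codes are uppercase letters and digits so that a descriptive parenthetical such as "HL7 (VistA Messaging)" is not mistaken for a code.

```python
import re

# Illustrative subset only; not the script's actual APP_ABBREV_FALLBACK contents
EXAMPLE_FALLBACK = {
    "CPRS: Clinical Reminder Updates": "PXRM",
    "Patient Record Flags": "PRF",
}

# Assumption: a VDL code is a trailing parenthetical of uppercase letters/digits
_TRAILING_CODE = re.compile(r"\(([A-Z0-9]+)\)\s*$")


def resolve_abbrev(app_name, pkg_ns="", fallback=EXAMPLE_FALLBACK):
    """Return (abbrev, source) where source is 'vdl', 'fallback' or 'pkg_ns'."""
    match = _TRAILING_CODE.search(app_name)
    if match:
        return match.group(1), "vdl"
    name = app_name.strip()
    if name in fallback:
        return fallback[name], "fallback"
    return pkg_ns, "pkg_ns"
```

Returning the source alongside the abbreviation keeps derived values distinguishable from VDL-assigned ones downstream.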
Version strings do not sort numerically: compared as strings, '5.9' > '5.10', so version 5.10 lands before 5.9. Always sort by the patch_ver_major / patch_ver_minor integers.
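
The difference is easy to reproduce with plain Python. In this small demonstration the version list is made up for illustration, and `version_key` is a hypothetical helper that assumes at most one dot per version string.

```python
versions = ["5.10", "5.9", "5.3", "10.0"]


def version_key(patch_ver):
    """Turn '5.10' into (5, 10); a major-only version gets minor 0."""
    major, _, minor = patch_ver.partition(".")
    return int(major), int(minor or 0)


as_strings = sorted(versions)                   # ['10.0', '5.10', '5.3', '5.9']
as_numbers = sorted(versions, key=version_key)  # ['5.3', '5.9', '5.10', '10.0']
```

The integer columns in the enriched CSV play the role of `version_key`, so there is no need to re-parse `patch_ver` in practice.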
VDL "application" and VistA "package" are not 1:1. Examples:

- `ACR` (Ambulatory Care Reporting) → `pkg_ns` = `SD` (delivered inside Scheduling)
- `ADT` → `pkg_ns` is both `ADT` and `DG`
- `ASU` → `pkg_ns` = `USR`

Use `app_name_abbrev` for VDL-level grouping; use `pkg_ns` for patch history linkage.

| Dropped Field | Reason |
|---|---|
| `doc_type` | File format (PDF/DOCX), identical to `doc_file_ext`; had 1,341 nulls vs 0 in `doc_file_ext` |
| `doc_file_ext` | Redundant with `doc_format` (same values with a leading dot); `doc_format` is cleaner for YAML |
| `doc_date` | Only 26/8,834 rows populated; free-form text from title parentheticals; not sortable |
| `app_code` | Identical to `app_name_abbrev` in all 166 named apps; `app_name_abbrev` fills 2 additional gaps |

| Source name | Enriched name | Reason |
|---|---|---|
| `filename` | `doc_filename` | Avoid confusion with VistA FileMan database "files" |
| `file_ext` | `doc_file_ext` | Consistency with `doc_` prefix (then dropped for `doc_format`) |
| `app_name` | `app_name_full` | Clarifies this is the full name after stripping the abbreviation |
| `vista_pkg_ns` | `pkg_ns` | Shorter; namespace concept is clear from context |

`doc_subject` is derived entirely from `doc_title` by stripping the patch prefix and `doc_label`. The following artifact patterns are cleared automatically by the script (`clean_doc_subject()`):

- Equals `app_name_abbrev` (case-insensitive): redundant app echo
- Equals `doc_title` exactly: full title echo (parsing failure)
- Equals `doc_label` exactly: redundant type echo
- Starts with `/`: multi-namespace title continuation fragment (e.g. `/WEBP*1*1 PCMM Web`)
- Bare 4-digit year, e.g. `2019`
- Bare digits/dots only: version fragments (e.g. `.01`, `2.0`)
- Pure punctuation/whitespace, e.g. `* -`
- Starts with `*digits`: patch artifact (e.g. `*1064`)
- Entirely a `NS*V*P` or `NS*V*P/NS*V*P` pattern: patch ID residue
- Length ≤ 2 AND no alphabetic characters: clears `.1`, `.7` but keeps `AP`, `IO`

Empty `doc_subject` means no residual text after stripping structured parts, not missing data.
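
The artifact rules above can be approximated with a handful of regular expressions. This is a sketch of the listed patterns, not the script's `clean_doc_subject()`: the name `clear_subject_artifacts`, the exact regexes, and the assumed shape of a patch ID (uppercase namespace, numeric version, numeric patch) are all assumptions.

```python
import re

# Assumed NS*V*P shape, e.g. DG*5.3*554
_PATCH_ID = r"[A-Z][A-Z0-9]*\*\d+(?:\.\d+)?\*\d+"

# Patterns that must match the whole candidate string
_WHOLE = [
    re.compile(r"\d{4}"),                          # bare 4-digit year
    re.compile(r"[\d.]+"),                         # digits and dots only
    re.compile(r"[\W_]+"),                         # punctuation/whitespace only
    re.compile(rf"{_PATCH_ID}(?:/{_PATCH_ID})*"),  # patch ID residue
]

# Patterns that only need to match at the start
_PREFIX = [
    re.compile(r"/"),       # multi-namespace continuation fragment
    re.compile(r"\*\d+"),   # "*digits" patch artifact
]


def clear_subject_artifacts(subject, app_name_abbrev="", doc_title="", doc_label=""):
    """Return '' when the candidate subject is one of the listed artifacts."""
    s = subject.strip()
    if not s:
        return ""
    if app_name_abbrev and s.lower() == app_name_abbrev.lower():
        return ""
    if s in (doc_title, doc_label):
        return ""
    if any(p.match(s) for p in _PREFIX) or any(p.fullmatch(s) for p in _WHOLE):
        return ""
    if len(s) <= 2 and not any(ch.isalpha() for ch in s):
        return ""
    return s
```

Short alphabetic qualifiers such as `AP` and `IO` pass through unchanged, while numeric fragments like `.1` are cleared, matching the last rule in the list.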