rafael5/vista-docs

VistA Documentation Library (VDL) — Inventory Enrichment Pipeline

This repository produces vdl_inventory_enriched.csv, a normalized, machine-processable metadata inventory of all documents in the VA's VistA Documentation Library (VDL). The enriched CSV is the gold-standard reference for YAML frontmatter in docx→markdown conversions, gap analysis, and reporting.


What Is the VDL?

The VistA Documentation Library (VDL) is the VA's official documentation catalog for VistA, the Department of Veterans Affairs' legacy MUMPS-based electronic health record system. The VDL lists 8,834 document files across 221 applications, grouped into 5 functional sections.


Directory Structure

~/projects/vista-docs/
  src/vista_docs/          ← canonical ETL pipeline (stages 1-5: crawl → sync)
  pipeline/                ← post-ingest stages 6-6.7 over md-img/frontmatter.db
  scripts/                 ← ad-hoc / one-off tools (enrich_inventory.py lives here)
  tests/
  guides/                  ← synthesised reference docs
  README.md                ← this file

~/data/vista-docs/inventory/
  vdl_inventory.csv             ← source (raw crawler output, 12 columns)
  vdl_inventory_enriched.csv    ← output (enriched, 27 columns) ← USE THIS
  vdl_inventory_schema.json     ← field type manifest (generated by script)
  vdl_inventory.json            ← JSON version of enriched inventory
  snapshots/                    ← dated CSV snapshots

Regenerating from Scratch

Requirements: Python 3.10+, no external dependencies (stdlib only).

cd ~/projects/vista-docs
python3 scripts/enrich_inventory.py

Output:

  • Writes vdl_inventory_enriched.csv (27 columns, UTF-8)
  • Writes vdl_inventory_schema.json (field type manifest)
  • Prints a fill-rate and distribution summary to stdout

The script is idempotent: running it twice on the same input produces identical output.
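
One way to check the idempotence claim is to hash the output file before and after a second run and compare the digests. The following is a minimal sketch, assuming the default output location shown in the directory listing above; `file_sha256` is an illustrative helper and is not part of the repository.

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Assumed output location, taken from the directory listing above.
ENRICHED = Path.home() / 'data/vista-docs/inventory/vdl_inventory_enriched.csv'

# Usage: record the digest, rerun scripts/enrich_inventory.py, then compare.
#   before = file_sha256(ENRICHED)
#   ... rerun the script ...
#   assert file_sha256(ENRICHED) == before
```

Comparing digests rather than row counts also catches changes in column order or quoting.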


Corpus Overview

| Metric | Value |
| --- | --- |
| Total records | 8,834 |
| Genuine VistA docs | 7,493 |
| Active app docs (clean) | ~5,231 |
| Unique document titles | 3,646 |
| VistA applications | 221 |
| Functional sections | 5 |

5 Sections

| section_code | section_name | Records | Apps |
| --- | --- | --- | --- |
| CLI | Clinical | 5,790 | 111 |
| FIN | Financial-Administrative | 1,485 | 38 |
| GUI | VistA/GUI Hybrids (formerly HealtheVet) | 780 | 23 |
| INF | Infrastructure | 777 | 48 |
| MON | Monograph | 2 | 1 |

Document Formats

  • .pdf: 5,097 (57.7%)
  • .docx: 3,730 (42.2%)
  • .doc (legacy): 7 (0.1%)
  • Nearly every document exists in both PDF and DOCX — pairs share identical metadata except URL/filename fields.

App Status

  • Active: 5,330 records / 192 apps
  • Archive: 3,380 records / 188 apps
  • Decommissioned: 124 records / 16 apps

Enriched Schema (27 columns)

Column order: section identity → app identity → patch identity → document identity → flags → URLs.

Section identity

| Column | Fill | Description |
| --- | --- | --- |
| section_name | 100% | Full VDL section name |
| section_code | 100% | CLI / FIN / GUI / INF / MON; use for grouping/filtering |

App identity

| Column | Fill | Description |
| --- | --- | --- |
| app_name_full | 100% | Full application name (parenthetical abbreviation stripped); display only |
| app_name_abbrev | 100% | Primary app grouping key. VDL-assigned for 166 apps; curated fallback for 58 apps with no VDL code |
| app_status | 100% | active / archive / decommissioned |
| decommission_date | 1.3% | Normalized to YYYY-MM. Only 115 rows populated; do not use as a coverage signal |

Patch identity

| Column | Fill | Description |
| --- | --- | --- |
| pkg_ns | 56.4% | VistA MUMPS package namespace (e.g. DG, PSO). Differs from app_name_abbrev in 27 apps (16%) |
| patch_ver | 61.3% | Version string (e.g. 5.3). Do not sort as string; use patch_ver_major/patch_ver_minor |
| patch_ver_major | 61.3% | Major version integer |
| patch_ver_minor | 61.3% | Minor version integer; 0 when major-only |
| patch_num | 40.6% | Patch number integer (leading zeros stripped) |
| patch_id | 56.4% | Canonical VistA patch ID: NS*V*P for patch docs; NS*V for anchor docs |
| patch_id_full | 0.5% | Full raw multi-namespace prefix (e.g. DG*5.3*554/TIU*1*184); set only when multi_ns=1 |
| multi_ns | | 1 if title contains multiple NS*V*P segments; 0 otherwise |
| group_key | 61.3% | Functional group key: app_name_abbrev:pkg_ns:patch_ver; groups anchor + patch docs together |
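
The patch identity columns can be pictured as simple compositions of namespace, version, and patch number. The sketch below follows the column descriptions above and is not taken from enrich_inventory.py. The function names are hypothetical, and it assumes version strings contain at most one dot and that group_key is empty when patch_ver is empty.

```python
def split_version(patch_ver):
    """Split '5.3' into (5, 3); the minor part defaults to 0 for major-only versions."""
    major, _, minor = patch_ver.partition('.')
    return int(major), int(minor) if minor else 0

def build_patch_id(pkg_ns, patch_ver, patch_num=''):
    """Return NS*V*P for patch docs, NS*V for anchor docs, '' when unparsed."""
    if not (pkg_ns and patch_ver):
        return ''
    parts = [pkg_ns, patch_ver]
    if patch_num:
        parts.append(str(int(patch_num)))  # int() strips leading zeros
    return '*'.join(parts)

def build_group_key(app_name_abbrev, pkg_ns, patch_ver):
    """Join the three grouping parts with ':' (assumed empty without a version)."""
    if not patch_ver:
        return ''
    return ':'.join([app_name_abbrev, pkg_ns, patch_ver])
```

For example, `build_patch_id('DG', '5.3', '554')` yields `DG*5.3*554`, and `build_patch_id('DG', '5.3')` yields the anchor form `DG*5.3`.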

Document identity

| Column | Fill | Description |
| --- | --- | --- |
| doc_code | 91.6% | Normalized doc type abbreviation (e.g. TM, RN, DIBR, UG); use for filtering/grouping |
| doc_label | 91.6% | Full canonical doc type label; use for display only |
| doc_layer | 100% | Structural role: anchor / patch / plain; see below |
| doc_title | 100% | Original VDL title; do not parse/normalize, since the enrichment fields extract the structured parts |
| doc_filename | 100% | Document filename. Named doc_filename to avoid confusion with VistA FileMan database "files" |
| doc_slug | 100% | URL-safe stable identifier from filename stem (lowercase, non-alnum→_). PDF/DOCX pairs share slug |
| doc_format | 100% | pdf / docx / doc; file format without dot |
| doc_subject | ~68% | Best-effort qualifier from title (subsystem, role, GUI sub-version). Empty = no residual after stripping, not missing data |
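
The doc_slug rule (filename stem, lowercased, non-alphanumerics mapped to `_`) can be sketched in a few lines. This is an illustration of the stated rule, not the script's code. `make_doc_slug` is a hypothetical name, and the sketch assumes each non-alphanumeric character becomes its own underscore (runs are not collapsed).

```python
import re
from pathlib import PurePosixPath

def make_doc_slug(doc_filename):
    """Lowercase the filename stem and replace every non-alphanumeric char with '_'."""
    stem = PurePosixPath(doc_filename).stem  # drops only the final extension
    return re.sub(r'[^a-z0-9]', '_', stem.lower())
```

Because only the extension is dropped, the PDF and DOCX versions of a file with the same stem get the same slug.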

Flags

| Column | Description |
| --- | --- |
| noise_type | '' = genuine VistA doc; vba_form = VBA benefits form (1,192 rows); va_ref = non-VistA VA reference (149 rows). Exclude rows with a non-empty value from all analysis. |

URLs

| Column | Fill | Description |
| --- | --- | --- |
| app_url | 100% | URL to the app's VDL page |
| doc_url | 100% | Direct URL to the document file |
| companion_url | 84% | URL of the paired format (PDF↔DOCX). Empty if no pair |

doc_layer — Structural Role

Derived from patch_ver and patch_num presence. Use directly — do not re-derive at query time.

| Value | Rule | Count | Meaning |
| --- | --- | --- | --- |
| anchor | patch_ver set AND patch_num empty | 1,918 | Stable versioned base doc (Technical Manual v5.3, User Guide v7.0) |
| patch | patch_num set | 3,584 | Per-patch change doc (Release Notes, DIBR for a specific patch number) |
| plain | Neither set | 3,332 | No version or patch identity parsed: plain reference docs, forms, unclassified |
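
The three rules can be written as one small function. The sketch below is meant for validating the stored doc_layer column (for example in a test), not for re-deriving it at query time. `derive_doc_layer` is a hypothetical name, and empty strings stand for unset fields, as they do when the CSV is read with dtype=str.

```python
def derive_doc_layer(patch_ver, patch_num):
    """Apply the doc_layer rules: patch_num wins, then patch_ver, else plain."""
    if patch_num:
        return 'patch'
    if patch_ver:
        return 'anchor'
    return 'plain'

# Usage: flag rows whose stored value disagrees with the rules.
#   bad = [r for r in rows
#          if r['doc_layer'] != derive_doc_layer(r['patch_ver'], r['patch_num'])]
```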

doc_code Vocabulary

| doc_code | doc_label | Role |
| --- | --- | --- |
| TM | Technical Manual | Server internals, options, files |
| UM | User Manual | End-user interface |
| UG | User Guide | End-user interface (alternate label) |
| IG | Installation Guide | Install/upgrade instructions |
| IG-IMP | Implementation Guide | Site setup and configuration |
| DIBR | Deployment, Installation, Back-Out, and Rollback Guide | Modern patch deployment |
| RN | Release Notes | Per-patch change summary |
| PDD | Patch Description Document | Alternate per-patch change doc |
| VDD | Version Description Document | Version-level change doc |
| SG | Security Guide | Security configuration |
| POM | Production Operations Manual | Ops procedures |
| CFG | Configuration Guide | Configuration reference |
| QRG | Quick Reference Guide | Quick reference |
| API | API Manual | Programmatic interface |
| INT | Interface Specification | HL7 / integration specs |
| TG | Technical Guide | Technical reference |
| TRG | Training Guide/Manual | Training material |
| CRU | Clinical Reminder Update | PXRM update docs |
| CVG | Conversion Guide | Data migration |
| FAQ | Frequently Asked Questions | FAQ document |
| APX | Appendix | Supplemental appendix |
| DESC | Description Document | Generic description |
| REF | Reference | Miscellaneous reference |
| FORM | VBA Form | VBA benefits form; noise, exclude from analysis |

Standard Analysis Filters

Apply to every analysis unless explicitly studying the excluded rows:

import pandas as pd

df = pd.read_csv('vdl_inventory_enriched.csv', dtype=str).fillna('')

# Exclude noise (VBA forms + VA reference docs)
df = df[df['noise_type'] == '']

# Exclude archived/decommissioned apps — unless studying history
df = df[df['app_status'] == 'active']

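Version sorting: always use the integer columns, never the patch_ver string. Because the CSV is loaded with dtype=str, cast inside the sort key:

ver_cols = ['patch_ver_major', 'patch_ver_minor', 'patch_num']
# Empty strings become 0, so rows with no version or patch number sort first
df = df.sort_values(ver_cols, key=lambda s: s.replace('', '0').astype(int))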

Count unique documents (deduplicate PDF/DOCX pairs):

# Approximate: counts PDF rows only, so documents with no PDF version are missed
df[df['doc_format'] == 'pdf'].shape[0]

YAML Frontmatter Template

For docx→markdown conversion, use the enriched CSV as the source for all frontmatter:

row = df[(df['doc_slug'] == 'your_slug') & (df['doc_format'] == 'docx')].iloc[0]
frontmatter = {
    'title':        row['doc_title'],
    'app':          row['app_name_abbrev'],
    'app_name':     row['app_name_full'],
    'section':      row['section_code'],
    'status':       row['app_status'],
    'pkg_ns':       row['pkg_ns'] or None,
    'patch_id':     row['patch_id'] or None,
    'patch_num':    int(row['patch_num']) if row['patch_num'] else None,
    'patch_ver':    row['patch_ver'] or None,
    'doc_type':     row['doc_code'] or None,
    'doc_layer':    row['doc_layer'],
    'doc_subject':  row['doc_subject'] or None,
    'doc_slug':     row['doc_slug'],
    'group_key':    row['group_key'] or None,
    'pdf_url':      row['companion_url'] or None,
    'doc_url':      row['doc_url'],
    'app_url':      row['app_url'],
}
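
A frontmatter dict of this shape can be written out without a YAML library, which fits the stdlib-only approach of the enrichment script. The sketch below is one possible approach, not the repository's converter. It relies on JSON scalars (double-quoted strings, integers, null) also being valid YAML, and `to_frontmatter` is a hypothetical helper name.

```python
import json

def to_frontmatter(fields):
    """Render a flat dict as a YAML front matter block using JSON-style scalars."""
    lines = ['---']
    for key, value in fields.items():
        # json.dumps quotes strings safely, writes None as null, ints as-is
        lines.append(f'{key}: {json.dumps(value, ensure_ascii=False)}')
    lines.append('---')
    return '\n'.join(lines) + '\n'

example = {'title': 'Registration: Release Notes', 'patch_num': 554, 'pkg_ns': None}
print(to_frontmatter(example))
```

Quoting every string avoids YAML surprises with titles that contain colons, asterisks, or leading digits. Nested values would need a real YAML emitter.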

Known Data Quality Issues

VBA Forms (noise_type = 'vba_form')

1,192 rows (8 VBA benefit forms × 149 app pages) are not VistA documentation. The forms all share VDL app node appid=373, which the VDL replicates across every app page. The crawler did not malfunction — the VDL itself lists these links. URLs point to vba.va.gov or benefits.va.gov.

Always exclude rows where noise_type != ''.

VA Strategic Plan (noise_type = 'va_ref')

149 rows (the VA 2014–2020 Strategic Plan) are similarly attached to every app page via a shared URL. The document is on va.gov but not under the /vdl/ path. Classified as va_ref.
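
The two noise classes described above can be approximated from the document URL alone. The sketch below is a heuristic under stated assumptions and does not reproduce the script's logic, which may instead key on the shared app node or on known URLs. `classify_noise` and `VBA_HOSTS` are hypothetical names.

```python
from urllib.parse import urlparse

# Hosts named in the VBA forms description above.
VBA_HOSTS = {'vba.va.gov', 'benefits.va.gov'}

def classify_noise(doc_url):
    """Return 'vba_form', 'va_ref', or '' (genuine VistA doc) from the URL."""
    parsed = urlparse(doc_url)
    host = (parsed.hostname or '').lower()
    if host.startswith('www.'):
        host = host[4:]
    if host in VBA_HOSTS:
        return 'vba_form'
    # Assumption: anything on va.gov outside /vdl/ is a non-VistA reference.
    if host == 'va.gov' and not parsed.path.lower().startswith('/vdl/'):
        return 'va_ref'
    return ''
```

In analysis code, prefer the stored noise_type column. A heuristic like this is mainly useful for spot-checking it.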

58 Apps with No VDL-Assigned Abbreviation

These apps never had a parenthetical code in app_name in the source CSV. Resolved via:

  1. APP_ABBREV_FALLBACK dict in the script (curated for known apps)
  2. pkg_ns as last resort

Fallback abbreviations are derived, not VDL-assigned — do not treat them as authoritative VDL identifiers. Affected apps include: CPRS: Clinical Reminder Updates (→PXRM), KAAJEE, HL7 (VistA Messaging), Laboratory sub-packages, Patient Record Flags (→PRF), etc.
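
The resolution order amounts to a chain of fallbacks: the VDL code if present, then the curated dict, then pkg_ns. The sketch below illustrates that order only. The dict holds just the two mappings mentioned above, keying it by the full app name is an assumption, and `resolve_abbrev` is a hypothetical name.

```python
# Illustrative entries only; the full APP_ABBREV_FALLBACK lives in the script.
APP_ABBREV_FALLBACK = {
    'CPRS: Clinical Reminder Updates': 'PXRM',
    'Patient Record Flags': 'PRF',
}

def resolve_abbrev(vdl_abbrev, app_name, pkg_ns=''):
    """Prefer the VDL-assigned code, then the curated dict, then pkg_ns."""
    return vdl_abbrev or APP_ABBREV_FALLBACK.get(app_name) or pkg_ns
```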

patch_ver String Sort Is Broken

As strings, '5.9' sorts after '5.10' because comparison is character by character ('9' > '1'). Always sort by the patch_ver_major / patch_ver_minor integers.
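
A short demonstration of the problem, using made-up version strings and plain Python rather than the inventory itself. `version_key` is a hypothetical helper that assumes versions have at most one dot.

```python
versions = ['5.9', '5.10', '5.3']

def version_key(patch_ver):
    """Turn '5.10' into (5, 10) so tuples compare numerically."""
    major, _, minor = patch_ver.partition('.')
    return (int(major), int(minor or 0))

as_strings = sorted(versions)                     # ['5.10', '5.3', '5.9']
as_versions = sorted(versions, key=version_key)   # ['5.3', '5.9', '5.10']
print(as_strings, as_versions)
```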

pkg_ns ≠ app_name_abbrev in 27 Apps (16%)

VDL "application" and VistA "package" are not 1:1. Examples:

  • ACR (Ambulatory Care Reporting) → pkg_ns=SD (delivered inside Scheduling)
  • ADT → pkg_ns is both ADT and DG
  • ASU → pkg_ns=USR

Use app_name_abbrev for VDL-level grouping; use pkg_ns for patch history linkage.
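
To see which apps are affected, collect the set of pkg_ns values per app_name_abbrev and keep the apps whose set is anything other than the abbreviation itself. The sketch below uses only the standard library. `namespace_mismatches` is a hypothetical helper, and its definition of "differs" may not match the one behind the figure of 27.

```python
from collections import defaultdict

def namespace_mismatches(rows):
    """Map app_name_abbrev -> set of pkg_ns values, for apps where they differ."""
    seen = defaultdict(set)
    for row in rows:
        if row['pkg_ns']:  # skip rows with no parsed namespace
            seen[row['app_name_abbrev']].add(row['pkg_ns'])
    return {app: ns for app, ns in seen.items() if ns != {app}}

# Usage with the enriched CSV:
#   import csv
#   with open('vdl_inventory_enriched.csv', newline='', encoding='utf-8') as fh:
#       print(namespace_mismatches(csv.DictReader(fh)))
```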


Fields Dropped from Source CSV

| Dropped Field | Reason |
| --- | --- |
| doc_type | File format (PDF/DOCX); identical to doc_file_ext; had 1,341 nulls vs 0 in doc_file_ext |
| doc_file_ext | Redundant with doc_format (same values with a leading dot); doc_format is cleaner for YAML |
| doc_date | Only 26/8,834 rows populated; free-form text from title parentheticals; not sortable |
| app_code | Identical to app_name_abbrev in all 166 named apps; app_name_abbrev fills 2 additional gaps |

Source Field Renames

| Source name | Enriched name | Reason |
| --- | --- | --- |
| filename | doc_filename | Avoid confusion with VistA FileMan database "files" |
| file_ext | doc_file_ext | Consistency with doc_ prefix (then dropped for doc_format) |
| app_name | app_name_full | Clarifies this is the full name after stripping the abbreviation |
| vista_pkg_ns | pkg_ns | Shorter; namespace concept is clear from context |

doc_subject Cleaning Rules

doc_subject is derived entirely from doc_title by stripping the patch prefix and doc_label. The following artifact patterns are cleared automatically by the script (clean_doc_subject()):

  1. Equals app_name_abbrev (case-insensitive) — redundant app echo
  2. Equals doc_title exactly — full title echo (parsing failure)
  3. Equals doc_label exactly — redundant type echo
  4. Starts with / — multi-namespace title continuation fragment (e.g. /WEBP*1*1 PCMM Web)
  5. Bare 4-digit year — e.g. 2019
  6. Bare digits/dots only — version fragments (e.g. .01, 2.0)
  7. Pure punctuation/whitespace — e.g. * -
  8. Starts with *digits — patch artifact (e.g. *1064)
  9. Entirely a NS*V*P or NS*V*P/NS*V*P pattern — patch ID residue
  10. Length ≤ 2 AND no alphabetic characters — clears .1, .7 but keeps AP, IO

Empty doc_subject means no residual text after stripping structured parts — not missing data.
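
The ten rules can be expressed as a handful of comparisons and two regular expressions. The sketch below is a reimplementation from the list above, not the script's actual clean_doc_subject(). Its signature and the exact patch ID pattern (uppercase namespace, version with optional decimal, numeric patch) are assumptions.

```python
import re

# Assumed shape of one NS*V*P segment, e.g. DG*5.3*554.
PATCH_ID = r'[A-Z][A-Z0-9]*\*\d+(?:\.\d+)?\*\d+'

# Rules 4 and 8: the subject starts with '/' or with '*digits'.
PREFIX_ARTIFACT = re.compile(r'/|\*\d')

# Rules 5, 6, 7, 9: the whole subject is a year, digits/dots,
# punctuation/whitespace, or one or more slash-joined patch IDs.
FULL_ARTIFACT = re.compile(
    r'\d{4}'
    r'|[\d.]+'
    r'|[\W_]*'
    rf'|{PATCH_ID}(?:/{PATCH_ID})*'
)

def clean_doc_subject(subject, app_name_abbrev='', doc_title='', doc_label=''):
    """Return '' when the subject is a known artifact, else the stripped subject."""
    s = subject.strip()
    if (
        s.lower() == app_name_abbrev.lower()    # rule 1: app echo
        or s == doc_title                       # rule 2: full title echo
        or s == doc_label                       # rule 3: type echo
        or PREFIX_ARTIFACT.match(s)             # rules 4, 8
        or FULL_ARTIFACT.fullmatch(s)           # rules 5, 6, 7, 9
        or (len(s) <= 2 and not any(c.isalpha() for c in s))  # rule 10
    ):
        return ''
    return s
```

With these patterns, the examples from the list (`/WEBP*1*1 PCMM Web`, `2019`, `.01`, `* -`, `*1064`) are cleared, while two-letter subjects such as `AP` and `IO` are kept.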
