Requires Python 3.14+ and uv for dependency management.
git clone https://github.com/Jybbs/chalkline.git
cd chalkline
uv sync

Chalkline operates in two stages. First, fit encodes the posting corpus with a sentence transformer, clusters the embeddings into career families, assigns O*NET occupations, and builds a stepwise career graph with credential enrichment. Results are cached to disk so that subsequent runs with unchanged code and config serve instantly.
uv run chalkline fit # fit the pipeline, print a summary
uv run chalkline fit -v # same, with diagnostic logs

Then launch starts the Marimo reactive notebook, where you upload a resume and receive a personalized career report.
uv run chalkline launch # open the career report in your browser

The posting corpus from AGC Maine is proprietary and not included in the repository. Place posting data in data/postings/ before fitting.
The Green Buildings Career Map organized 55 jobs across 4 sectors with 300+ advancement routes, demonstrating that structured career maps change how workers navigate trades1. Chalkline asks whether the same kind of structure can be constructed algorithmically from job postings, complementing expert-curated maps with a data-driven approach that can be re-fitted as the labor market shifts.
The premise is that postings encode implicit structure about how occupations relate to one another, which skills bridge adjacent roles, and what credentials separate one career level from the next. Occupational modeling at scale has confirmed this, showing that millions of unstructured postings yield taxonomies comparable to expert-curated frameworks2, and network models built from skill overlap reveal the same latent mobility structure34. Data-driven taxonomies extracted directly from online adverts have reached similar conclusions at smaller scale5, reinforcing that the signal is in the postings themselves.
Chalkline works with 922 postings from Maine's construction industry, scraped from AGC Maine's listings and covering 21 O*NET SOC codes across three sectors. A sentence transformer6 encodes each posting into a 768-dimensional embedding, Ward-linkage HAC7 clusters those embeddings into 20 career families, and a stepwise k-NN graph maps 96 advancement and lateral edges enriched by 325 credentials. Upload a resume, and the system projects it into the same space for personalized skill gap analysis8.
A chalk line snaps a straight reference path between two points. Chalkline does the same for careers.
Chalkline is a single-track embedding pipeline orchestrated by Hamilton9, where each processing step is a DAG node whose parameter names declare dependencies. Hamilton resolves execution order automatically, caches every node result to disk, and serves from cache on subsequent calls with unchanged code and config. The pipeline draws on recent work in job ad segmentation via NLP and clustering10 and end-to-end transformer pipelines for resume matching11.
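The signature-driven dependency idea can be sketched in a few lines of plain Python. This is a toy resolver, not Hamilton's actual implementation, and the node names are invented; it only shows how execution order falls out of parameter names:

```python
# Toy sketch of signature-driven DAG resolution (NOT Hamilton itself):
# each function is a node, and its parameter names name the nodes it
# depends on, so dependencies are read straight from the signature.
import inspect

def resolve(funcs: dict, name: str, cache: dict):
    """Compute node `name`, recursively computing its dependencies first."""
    if name in cache:
        return cache[name]
    fn = funcs[name]
    args = {p: resolve(funcs, p, cache) for p in inspect.signature(fn).parameters}
    cache[name] = fn(**args)
    return cache[name]

# Hypothetical nodes, not Chalkline's real DAG:
def corpus() -> list[str]:
    return ["carpenter posting", "welder posting"]

def embeddings(corpus: list[str]) -> list[int]:
    return [len(text) for text in corpus]  # stand-in for real encoding

def clusters(embeddings: list[int]) -> int:
    return len(set(embeddings))

nodes = {f.__name__: f for f in (corpus, embeddings, clusters)}
print(resolve(nodes, "clusters", cache={}))  # 2
```

The `cache` dict plays the role of Hamilton's node-level result cache: each node is computed once per run, no matter how many downstream nodes depend on it.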
| Step | Node | Technique | Module |
|---|---|---|---|
| 1 | Corpus Loading | Filter and key postings from JobSpy collection | collection.collector |
| 2 | Sentence Encoding | Alibaba-NLP/gte-base-en-v1.5 via ONNX with CLS pooling | pipeline.encoder |
| 3 | Dimensionality Reduction | L2-normalize embeddings, then TruncatedSVD to 10 components | pipeline.steps |
| 4 | Clustering | Ward-linkage HAC cut at 20 career families | pipeline.steps |
| 5 | SOC Assignment | Top-3 median cosine similarity against O*NET Task+DWA embeddings | pipeline.steps |
| 6 | Career Graph | Stepwise k-NN backbone with per-edge dual-threshold credential enrichment | pathways.graph |
| 7 | Resume Matching | SVD projection, nearest centroid, per-task cosine gap analysis | matching.matcher |
The SentenceEncoder in pipeline/encoder.py downloads the ONNX model from HuggingFace on first use (cached locally thereafter), runs inference via onnxruntime in fixed-size batches with CLS pooling, and L2-normalizes the output. Because the ~430 MB model file should not be serialized into Hamilton's disk cache, the orchestrator in pipeline/orchestrator.py instantiates the encoder outside the DAG and passes it as an input alongside the PipelineConfig, so that all encoding node outputs (NumPy arrays) cache normally while the encoder itself is excluded.
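The pooling and normalization steps described above look roughly like this in NumPy. Shapes and values are illustrative stand-ins, not real ONNX model output:

```python
# Sketch of the post-inference steps: take the CLS token's hidden state
# as the sentence vector, then L2-normalize each row to unit length.
import numpy as np

def cls_pool_and_normalize(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: (batch, seq_len, dim) token embeddings."""
    cls = hidden_states[:, 0, :]                       # CLS token per sequence
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / norms                                 # unit-length rows

batch = np.random.default_rng(0).normal(size=(4, 12, 768))
emb = cls_pool_and_normalize(batch)
print(emb.shape)                                       # (4, 768)
print(np.allclose(np.linalg.norm(emb, axis=1), 1.0))   # True
```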
The fitted pipeline is assembled into a Chalkline dataclass that exposes a single match(pdf_bytes) method. This method extracts text from an uploaded PDF via pdfplumber, encodes it with the same sentence transformer, projects through the fitted SVD, assigns the nearest career family, computes per-task gap analysis, and returns a MatchResult with reach exploration and credential metadata.
Each posting description is fed through a sentence transformer (gte-base-en-v1.5) that converts text into a 768-dimensional vector capturing its semantic meaning. Every vector is then scaled to unit length (L2-normalized) so that cosine similarity between any two vectors reduces to a simple dot product.
768 dimensions are far more than the pipeline needs, and high-dimensional spaces introduce a well-documented problem where all pairwise distances converge toward the same value12, making it harder to tell similar postings apart from dissimilar ones. TruncatedSVD13 compresses the space by decomposing the embedding matrix into its most informative components and discarding the rest.

The pipeline retains 10 components, reducing each posting to a compact vector that preserves the directions of greatest variance while restoring contrast between near and far neighbors.
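The reduction can be sketched with plain NumPy SVD. The pipeline itself uses scikit-learn's TruncatedSVD; the random matrix here stands in for the real embeddings:

```python
# NumPy sketch of truncated SVD: decompose, keep the top-k components,
# and project each row onto them.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 768))                  # 50 toy "posting embeddings"
X /= np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalize first

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 10
X_reduced = U[:, :k] * S[:k]                    # coordinates in top-k space

print(X_reduced.shape)                          # (50, 10)
```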
The pipeline groups postings into career families using Ward-linkage hierarchical agglomerative clustering7. Starting with each posting as its own cluster, the algorithm repeatedly merges the two clusters whose combination increases total within-cluster variance the least. The cost of merging clusters A and B is the Ward criterion, Δ(A, B) = |A||B| / (|A| + |B|) · ‖μ_A − μ_B‖², which grows with both cluster sizes and the distance between their centroids, so tight, nearby clusters merge first.

This builds a full dendrogram (a tree of every possible merge), which is then cut at 20 clusters to yield the career families.
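A minimal scipy sketch of the same build-then-cut procedure, on toy blobs with an illustrative cluster count:

```python
# Ward-linkage HAC: build the full dendrogram, then cut it at a fixed
# number of clusters. Two well-separated blobs stand in for embeddings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 5)),
               rng.normal(3, 0.1, (10, 5))])

Z = linkage(X, method="ward")                    # tree of every merge
labels = fcluster(Z, t=2, criterion="maxclust")  # cut at 2 clusters

print(len(set(labels)))                          # 2
```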
Each cluster needs an occupational identity. The pipeline computes the mean posting embedding for each cluster (in the full 768-dimensional space, L2-normalized), then compares it against sentence embeddings of all 21 O*NET occupations' Task and DWA descriptions. The most similar occupation by cosine similarity becomes the cluster's label.
Job Zone assignment (ranging from 1 for minimal preparation to 5 for extensive) uses a smoothed vote from the three most similar occupations rather than relying on a single nearest neighbor: the three occupations' Job Zones are combined with weights derived from their similarity scores, so a single noisy match cannot skew the assigned zone.
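One plausible form of such a vote, purely illustrative since the exact weighting is not spelled out here, is a similarity-weighted average of the top three Job Zones:

```python
# Hypothetical smoothed top-3 vote: weight each of the three most similar
# occupations' Job Zones by its cosine similarity, then round.
import numpy as np

def smoothed_job_zone(similarities: np.ndarray, zones: np.ndarray) -> int:
    top3 = np.argsort(similarities)[-3:]          # three most similar
    w = similarities[top3] / similarities[top3].sum()
    return int(round(float(np.dot(w, zones[top3]))))

sims = np.array([0.82, 0.79, 0.40, 0.77])   # similarity to each occupation
zones = np.array([3, 2, 5, 3])              # each occupation's Job Zone
print(smoothed_job_zone(sims, zones))       # 3
```

Note how the Zone 5 occupation is ignored entirely: with only 0.40 similarity it never enters the top three, which is exactly the noise resistance the vote is meant to provide.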
Cosine similarity17 is the central metric throughout the pipeline. It measures how closely two embeddings point in the same direction, regardless of their magnitude. For any two vectors u and v, it is the dot product divided by the product of their magnitudes: cos(u, v) = (u · v) / (‖u‖ ‖v‖).
A score of 1.0 means identical direction (maximally similar), 0 means unrelated, and negative values mean opposing. Since all vectors in the pipeline are L2-normalized, this simplifies to a dot product. The same metric drives SOC assignment, per-task gap analysis, and credential filtering. Recent work on O*NET-enriched transformer matching18 and shared embedding space approaches19 have validated cosine similarity as an effective signal for occupational proximity when both job descriptions and resumes are encoded by the same model.
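Because every vector is unit length, the pipeline can skip the denominator entirely:

```python
# For unit-length vectors, cosine similarity is just the dot product.
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=768), rng.normal(size=768)

# Full formula: dot product over the product of magnitudes.
cos_full = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Same value after normalizing both vectors, with no division needed.
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
cos_unit = np.dot(u_hat, v_hat)

print(np.isclose(cos_full, cos_unit))   # True
```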
When a user uploads a resume, the system encodes it with the same sentence transformer and projects it through the fitted SVD into the 10-dimensional space shared with all postings11. The resume is then assigned to whichever career family's centroid is closest:
The full distance ranking across all 20 clusters is preserved, letting the career report show proximity to every family rather than only the assigned one. After assignment, each of the matched cluster's O*NET tasks is individually compared against the resume embedding in the full 768-dimensional space. A median-split threshold separates tasks the resume demonstrates from tasks it does not8: tasks scoring above the median cosine similarity count as demonstrated, and the rest are reported as gaps, ordered by how far they fall below the threshold.
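A median-split sketch, with task names and scores invented for illustration:

```python
# Median-split gap analysis: score each O*NET task against the resume,
# split at the median score, and rank gaps by deficit magnitude.
import numpy as np

tasks = ["read blueprints", "operate crane", "frame walls", "estimate costs"]
scores = np.array([0.71, 0.22, 0.65, 0.38])   # toy cosine similarities

threshold = np.median(scores)                  # 0.515 for these scores
demonstrated = [t for t, s in zip(tasks, scores) if s > threshold]
gaps = sorted(
    ((t, float(threshold - s)) for t, s in zip(tasks, scores) if s <= threshold),
    key=lambda pair: -pair[1],                 # largest deficit first
)
print(demonstrated)   # ['read blueprints', 'frame walls']
print(gaps)
```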
The career graph connects the 20 career families with directed, weighted edges representing plausible career moves3. Graph-based representations of occupational transitions capture mobility patterns that flat taxonomies miss2021, and the stepwise constraint here ensures edges only link clusters at the same Job Zone (lateral pivots) or one level apart (upward advancement), preventing unrealistic tier-skipping jumps22. Each cluster is connected to its nearest eligible neighbors in embedding space, yielding 96 advancement and lateral edges in total.
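The stepwise eligibility rule in miniature, with cluster names and Job Zones invented for illustration (the real pipeline also weights edges and limits each cluster to its nearest neighbors):

```python
# Stepwise constraint: only allow directed edges between clusters at the
# same Job Zone (lateral) or exactly one zone up (advancement).
zones = {"laborer": 1, "carpenter": 2, "foreman": 3, "estimator": 3}

def allowed_edges(zones: dict[str, int]) -> list[tuple[str, str, str]]:
    out = []
    for src, zs in zones.items():
        for dst, zd in zones.items():
            if src == dst:
                continue
            if zd == zs:
                out.append((src, dst, "lateral"))        # same-zone pivot
            elif zd == zs + 1:
                out.append((src, dst, "advancement"))    # one zone up
    return out

edges = allowed_edges(zones)
for e in edges:
    print(e)
```

A laborer-to-foreman edge never appears: zones 1 and 3 are two levels apart, so the move has to pass through a Zone 2 family first.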
Each edge is then annotated with relevant credentials (19 apprenticeships, 276 certifications, 30 educational programs) using a dual-threshold filter. For an edge from a worker's current cluster to a destination cluster, a credential is attached only when its embedding is sufficiently similar to both endpoints of the move.

The destination threshold ensures the credential is relevant to the role the worker is moving toward; anything that fails either check is dropped from the edge.
The Marimo notebook opens to a splash page showing the fitted landscape at a glance with a drag-and-drop upload zone. Drop a PDF resume, and the system encodes it with the same sentence transformer, projects through the fitted SVD, and matches to the nearest career family.
The resulting report is an eight-panel accordion that expands lazily as you open each section:
- Career Landscape: Scatter plot of every career family in the SVD coordinate space, with node sizes scaled by betweenness centrality23 and the resume overlaid as a gold star showing where you sit relative to the full landscape
- Skill Analysis: Demonstrated competencies and skill gaps ranked by cosine similarity against the cluster's O*NET tasks, with gaps ordered by deficit magnitude
- Career Pathways: Spring-layout network of advancement and lateral edges from a target cluster, with apprenticeship programs and hour requirements annotated on each edge
- Dendrogram: Ward-linkage hierarchical tree over all career families, with the matched cluster highlighted to show where it sits in the broader hierarchy
- Education & Training: Registered apprenticeships with RAPIDS codes and educational programs reachable through career graph edges from the target cluster
- Employer Connections: Posting companies fuzzy-matched against the AGC Maine member directory with career page URLs
- Job Boards: Maine and national boards filtered by sector relevance
- Pipeline Details: Underlying DAG visualization, cluster profiles, and model metadata for technical audiences
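The fuzzy matching in the Employer Connections panel can be approximated with the standard library's difflib; the actual matcher, cutoff, and directory entries here are assumptions:

```python
# Toy fuzzy match of a posting company against a member directory,
# using difflib's ratio as the similarity score.
from difflib import SequenceMatcher

members = ["Cianbro Corporation", "Sargent Corporation", "Reed & Reed"]

def best_member(company: str, members: list[str], cutoff: float = 0.6):
    """Return the most similar directory entry, or None below the cutoff."""
    scored = [(SequenceMatcher(None, company.lower(), m.lower()).ratio(), m)
              for m in members]
    score, match = max(scored)
    return match if score >= cutoff else None

print(best_member("Cianbro Corp.", members))   # Cianbro Corporation
```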
A downloadable plain-text report is available from the sidebar at any point.
Chalkline's CLI is built on Typer with Rich markup. Running chalkline with no arguments prints help.
uv run chalkline --help

Encode postings, cluster into career families, build the career graph, and cache the fitted pipeline. All directory flags are optional, defaulting to sensible project-relative paths that work when running from the repository root.
uv run chalkline fit # fit with default paths
uv run chalkline fit --verbose # same, with debug-level logs

| Option | Short | Default | Description |
|---|---|---|---|
| --postings-dir | | data/postings | Custom path to posting corpus (optional) |
| --lexicon-dir | | data/lexicons | Custom path to lexicon JSONs (optional) |
| --output-dir | | .cache/pipeline | Custom cache directory (optional) |
| --verbose | -v | False | Show diagnostic logs |
Start marimo run on the career report notebook. Must be run from the project root where app/main.py exists.
uv run chalkline launch

| Component | Technology | Role |
|---|---|---|
| Sentence Encoding | onnxruntime + tokenizers | ONNX inference for gte-base-en-v1.5 with HuggingFace fast tokenization |
| Machine Learning | scikit-learn | TruncatedSVD, Ward HAC, L2 normalization, cosine similarity |
| Pipeline Orchestration | sf-hamilton | DAG resolution from function signatures with node-level disk caching |
| Career Graph | NetworkX | Directed weighted graph for stepwise k-NN backbone and reach queries |
| Corpus Collection | python-jobspy | Multi-board job aggregation from Indeed and other sources |
| PDF Extraction | pdfplumber | Resume text extraction with layout-aware parsing |
| UI | Marimo | Reactive notebook with drag-and-drop resume upload |
| Visualization | Plotly | Interactive career landscape scatters and graph visualizations |
| CLI | Typer | fit and launch subcommands with Rich markup |
| Configuration | Pydantic | PipelineConfig with extra="forbid" and tuned defaults |
| Logging | Loguru | Structured pipeline progress and debug output |
chalkline/
├── app/
│   └── main.py                  Marimo reactive notebook (career report)
│
├── data/
│   ├── lexicons/                Curated domain knowledge (committed)
│   │   ├── credentials.json     Apprenticeships, certifications, programs
│   │   └── onet.json            21 SOC codes with Tasks, DWAs, Technology Skills, KSAs
│   ├── postings/                Scraped AGC corpus (gitignored)
│   └── stakeholder/             AGC Maine reference data (gitignored)
│       └── reference/           7 JSON files: members, apprenticeships, programs, etc.
│
├── scripts/                     Repeatable data curation (not part of the package)
│   ├── curate_onet.py           Fetch O*NET tasks, DWAs, technology skills
│   ├── curate_osha.py           Fetch OSHA standards from eCFR API
│   └── parse_stakeholder.py     Extract AGC workbook into reference JSONs
│
├── src/chalkline/
│   ├── cli/                     Typer CLI with fit and launch subcommands
│   │   ├── __init__.py          App registration and subcommand wiring
│   │   ├── fit.py               Pipeline fitting with cache-or-compute
│   │   └── launch.py            Marimo notebook launcher
│   │
│   ├── collection/              Corpus loading and posting schemas
│   │   ├── collector.py         Filter and key postings from storage
│   │   ├── schemas.py           Posting Pydantic models
│   │   └── storage.py           File-backed posting persistence
│   │
│   ├── display/                 Notebook presentation layer
│   │   ├── figures.py           Plotly figure builders (landscape, pathways, dendrogram)
│   │   ├── layout.py            Marimo layout helpers (stat rows, filtered accordions)
│   │   └── tables.py            Row builders for panel tables and text reports
│   │
│   ├── matching/                Resume-to-career matching
│   │   ├── matcher.py           SVD projection, nearest centroid, cosine gap analysis
│   │   ├── reader.py            PDF text extraction via pdfplumber
│   │   └── schemas.py           MatchResult and gap/demonstrated models
│   │
│   ├── pathways/                Career graph construction
│   │   ├── graph.py             NetworkX stepwise k-NN backbone with credential edges
│   │   ├── loaders.py           Credential and cluster data loading
│   │   └── schemas.py           Clusters, ClusterProfile, Reach, Edge models
│   │
│   └── pipeline/                Orchestration and shared types
│       ├── encoder.py           ONNX sentence transformer wrapper
│       ├── orchestrator.py      Hamilton DAG driver → fitted Chalkline dataclass
│       ├── progress.py          Loguru progress reporting
│       ├── schemas.py           PipelineConfig (Pydantic, extra="forbid")
│       └── steps.py             Hamilton node functions (the full DAG)
│
├── tests/                       Pytest suite mirroring src/ structure
├── pyproject.toml               Build config, dependencies, CLI entry point
└── uv.lock                      Locked dependency versions
Each domain subpackage (collection/, matching/, pathways/, pipeline/, display/) owns its schemas and logic. The pipeline/ subpackage orchestrates the others through Hamilton, where each function in steps.py is a DAG node whose parameter names declare its dependencies. The display/ subpackage separates figure construction, table building, and layout composition so that app/main.py stays thin, wiring Marimo cells to display methods without inline chart logic.
AGC Maine (Associated General Contractors of Maine) represents 222 member companies and has been the state's primary construction trade association since 1951. The association operates the Maine Construction Academy with tuition-free pre-apprenticeship programs expanding to five community colleges in 2026 and manages 19 registered apprenticeship pathways spanning trades from carpentry and welding to crane operation and solar installation.
AGC provided the posting corpus, the stakeholder reference data defining the project's 21 SOC codes and three sectors, and the credential records (apprenticeships, certifications, educational programs) that enrich the career graph. The collaboration connects algorithmic career mapping to a real training pipeline2418, where outputs directly inform which programs AGC recommends to workers entering or advancing through the trades.
Footnotes
1. Hamilton. 2012. "Career Pathway and Cluster Skill Development: Promising Models from the United States." OECD Local Economic and Employment Development (LEED) Papers 2012/14. https://doi.org/10.1787/5k94g1s6f7td-en
2. Dixon, et al. 2023. "Occupational Models from 42 Million Unstructured Job Postings." Patterns 4 (7): 100757. https://doi.org/10.1016/j.patter.2023.100757
3. del Rio-Chanona, et al. 2021. "Occupational Mobility and Automation: A Data-Driven Network Model." Journal of the Royal Society Interface 18 (174): 20200898. https://doi.org/10.1098/rsif.2020.0898
4. Alabdulkareem, et al. 2018. "Unpacking the Polarization of Workplace Skills." Science Advances 4 (7): eaao6030. https://doi.org/10.1126/sciadv.aao6030
5. Djumalieva & Sleeman. 2018. "An Open and Data-driven Taxonomy of Skills Extracted from Online Job Adverts." ESCoE Discussion Paper 2018-13. https://www.escoe.ac.uk/publications/an-open-and-data-driven-taxonomy-of-skills-extracted-from-online-job-adverts/
6. Ortakci. 2024. "Revolutionary Text Clustering: Investigating Transfer Learning Capacity of SBERT Models through Pooling Techniques." Engineering Science and Technology, an International Journal 55: 101730. https://doi.org/10.1016/j.jestch.2024.101730
7. Ward. 1963. "Hierarchical Grouping to Optimize an Objective Function." Journal of the American Statistical Association 58 (301): 236-244. https://doi.org/10.1080/01621459.1963.10500845
8. de Groot, et al. 2021. "Job Posting-Enriched Knowledge Graph for Skills-based Matching." RecSys in HR '21 Workshop, CEUR Workshop Proceedings, Vol. 2967. https://arxiv.org/abs/2109.02554
9. Krawczyk, et al. 2022. "Hamilton: Enabling Software Engineering Best Practices for Data Transformations via Generalized Dataflow Graphs." 1st International Workshop on Data Ecosystems (DEco@VLDB 2022), CEUR Workshop Proceedings, Vol. 3306: 41-50. https://ceur-ws.org/Vol-3306/paper5.pdf
10. Lukauskas, et al. 2023. "Enhancing Skills Demand Understanding through Job Ad Segmentation Using NLP and Clustering Techniques." Applied Sciences 13 (10): 6119. https://doi.org/10.3390/app13106119
11. Khelkhal & Lanasri. 2025. "Smart-Hiring: An Explainable End-to-End Pipeline for CV Information Extraction and Job Matching." arXiv preprint arXiv:2511.02537. https://doi.org/10.48550/arXiv.2511.02537
12. Aggarwal, Hinneburg & Keim. 2001. "On the Surprising Behavior of Distance Metrics in High Dimensional Space." Database Theory (ICDT 2001), Lecture Notes in Computer Science 1973: 420-434. https://doi.org/10.1007/3-540-44503-X_27
13. Halko, Martinsson & Tropp. 2011. "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions." SIAM Review 53 (2): 217-288. https://doi.org/10.1137/090771806
14. Deerwester, Dumais, Furnas, Landauer & Harshman. 1990. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science 41 (6): 391-407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
15. Zhang, Zhou & Bollegala. 2024. "Evaluating Unsupervised Dimensionality Reduction Methods for Pretrained Sentence Embeddings." Proceedings of LREC-COLING 2024: 6530-6543. https://aclanthology.org/2024.lrec-main.579/
16. Rousseeuw. 1987. "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis." Journal of Computational and Applied Mathematics 20: 53-65. https://doi.org/10.1016/0377-0427(87)90125-7
17. Levy, Shalom & Chalamish. 2025. "A Guide to Similarity Measures and Their Data Science Applications." Journal of Big Data 12: 188. https://doi.org/10.1186/s40537-025-01227-1
18. Alonso, et al. 2025. "A Novel Approach for Job Matching and Skill Recommendation Using Transformers and the O*NET Database." Big Data Research 39: 100509. https://doi.org/10.1016/j.bdr.2025.100509
19. Rosenberger, et al. 2025. "CareerBERT: Matching Resumes to ESCO Jobs in a Shared Embedding Space for Generic Job Recommendations." Expert Systems with Applications 275: 127043. https://doi.org/10.1016/j.eswa.2025.127043
20. Avlonitis, et al. 2023. "Career Path Recommendations for Long-term Income Maximization: A Reinforcement Learning Approach." RecSys in HR '23 Workshop, CEUR Workshop Proceedings, Vol. 3490. https://ceur-ws.org/Vol-3490/RecSysHR2023-paper_2.pdf
21. Boškoski, et al. 2024. "Career Path Discovery through Bipartite Graphs." Journal of Decision Systems 33 (sup1): 140-153. https://doi.org/10.1080/12460125.2024.2354585
22. Senger, et al. 2025. "Toward More Realistic Career Path Prediction: Evaluation and Methods." Frontiers in Big Data 8: 1564521. https://doi.org/10.3389/fdata.2025.1564521
23. Freeman. 1977. "A Set of Measures of Centrality Based on Betweenness." Sociometry 40 (1): 35-41. https://doi.org/10.2307/3033543
24. Frej, et al. 2024. "Course Recommender Systems Need to Consider the Job Market." Proceedings of the 47th ACM SIGIR Conference. https://doi.org/10.1145/3626772.3657847