Data Processing


This repository provides reference data-processing pipelines and examples for Open Data Hub / Red Hat OpenShift AI. It focuses on document conversion and chunking with the Docling toolkit, packaged as Kubeflow Pipelines (KFP), example Jupyter notebooks, and helper scripts.

📦 Repository Structure

data-processing
|
|- kubeflow-pipelines
|   |- docling-standard
|   |- docling-vlm
|
|- notebooks
|   |- tutorials
|   |- use-cases
|
|- scripts
    |- subset_selection

✨ Getting Started

Kubeflow Pipelines

Refer to the Data Processing Kubeflow Pipelines documentation for instructions on how to install, run, and customize the Standard and VLM pipelines.

Notebooks

Data-processing Jupyter notebooks are organized into tutorials and use-cases.

Scripts

Curated data-processing scripts live in the scripts directory.

For example, the subset selection scripts let users identify representative samples from large training datasets.
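To illustrate the idea behind subset selection, here is a minimal sketch using greedy farthest-point sampling to pick diverse, representative rows from a dataset. The algorithm and the `select_subset` helper are illustrative assumptions, not the actual implementation in `scripts/subset_selection`:

```python
# Illustrative sketch only: the repository's subset_selection scripts may use
# a different algorithm. Greedy farthest-point sampling picks rows that are
# spread out over the dataset, so each cluster tends to be represented.
import numpy as np

def select_subset(X: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Return indices of k rows chosen to cover the dataset."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]              # random starting row
    # distance of every row to its nearest already-chosen row
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                    # farthest remaining row
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

# Two well-separated clusters of 50 points each; k=2 picks one from each.
X = np.vstack([np.zeros((50, 8)), np.ones((50, 8))])
idx = select_subset(X, k=2)
print(sorted(i // 50 for i in idx))  # → [0, 1], one sample per cluster
```

Real subset-selection methods trade off diversity against other criteria (density, labels, embedding quality); this sketch only shows the coverage intuition.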

🤝 Contributing

We welcome issues and pull requests. Please:

  • Open an issue describing the change.
  • Include testing instructions.
  • For pipeline/component changes, recompile the pipeline and update generated YAML if applicable.
  • Keep parameter names and docs consistent between code and README.
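For the recompilation step above, a minimal KFP v2 sketch looks like the following. The pipeline, component, and output path are placeholders, not the repository's actual pipeline code:

```python
# Hypothetical example of recompiling a KFP v2 pipeline to YAML after a
# component change; names and the output path are illustrative.
from kfp import compiler, dsl

@dsl.component
def hello(name: str) -> str:
    return f"Hello, {name}"

@dsl.pipeline(name="example-pipeline")
def my_pipeline(name: str = "Docling"):
    hello(name=name)

if __name__ == "__main__":
    # Regenerate the YAML checked into the repo whenever the pipeline changes.
    compiler.Compiler().compile(my_pipeline, "pipeline.yaml")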

📄 License

Apache License 2.0
