From 06e8f2a05f4afb6d3d018d70c7e9b361c3eff779 Mon Sep 17 00:00:00 2001 From: carlengh Date: Mon, 20 Apr 2026 12:37:04 +0200 Subject: [PATCH 01/14] Create 2026-04-20-unix-style-cpp-pipelines.md Added new blog post with optional email form --- blog/2026-04-20-unix-style-cpp-pipelines.md | 270 ++++++++++++++++++++ 1 file changed, 270 insertions(+) create mode 100644 blog/2026-04-20-unix-style-cpp-pipelines.md diff --git a/blog/2026-04-20-unix-style-cpp-pipelines.md b/blog/2026-04-20-unix-style-cpp-pipelines.md new file mode 100644 index 0000000..70337d9 --- /dev/null +++ b/blog/2026-04-20-unix-style-cpp-pipelines.md @@ -0,0 +1,270 @@ +--- +title: Building Unix-Style C++ Processing Pipelines Using the Pipe Operator +authors: reeshabh +tags: [c++, pipeline, parsing, sdk] +--- + +**Summary:** How to build a C++ processing pipeline using the pipe operator (`|`). To build a C++ processing pipeline using the pipe operator, you must define a base chain element class with virtual processing methods and overload the bitwise OR operator to couple these elements together. This object-oriented approach ensures that the output of one processing node seamlessly feeds into the next, closely mimicking the behavior of Unix terminal pipes. + +--- + +## Why Read This Article? + +If you have ever used a Unix terminal, you are likely familiar with the elegance of the pipe operator (`|`)—it takes the output of one command and seamlessly feeds it as the input to the next. In C++, we can replicate this exact pattern to build data processing pipelines. By overloading the `|` operator, developers can transform highly coupled, nested function calls into a clean, declarative conveyor belt of operations. In this article, we will explore the engineering behind how DocWire implements this pattern to process over 100+ file formats cleanly and efficiently. + +DocWire is a data extraction tool, developed in Modern C++, that converts text from various file formats into searchable and editable data. 
Using the Tesseract OCR engine, DocWire digitizes text from image types, MS Office files, emails, or email attachments. DocWire outputs data to plain text that may be transmitted for further processing. + +One aspect of the Docwire SDK is its ability to process documents locally (or make an OpenAI API call) through a series of customizable steps that can be added or removed as required. For example, consider the following code: + +```cpp +std::filesystem::path("data_processing_definition.doc") | content_type::detector{} | office_formats_parser{} | PlainTextExporter() | out_stream; +``` + +In the above pipeline processing, a document is selected, its content type is detected, the required parser is applied, and the text output is exported. Additional steps can be added to the pipeline, for example: + +```cpp +std::filesystem::path("data_processing_definition.doc") | content_type::detector{} | office_formats_parser{} | PlainTextExporter() | local_ai::model_chain_element("Translate to spanish:\n\n") | out_stream; +``` + +With this addition, a local model translates the document's text to Spanish and streams the output. The necessary customizations can be applied such that the output of the previous step acts as an input for the next, precisely how a pipeline chain functions. In software terms, this emulates how the Unix pipe operator `|` works in the terminal. + +## Defining Core Entities and Message Types + +Before examining the exact implementation, let us establish the intuition for this structure. We are building a processing pipeline. The element to be processed can be of various types, as Docwire supports more than 100 file formats. For brevity, we use a simple example to focus on how the pipe chaining operates. 
+ +We start by defining the entity we want to process: + +```cpp +/** + * A simple Message struct or `class` + */ +struct Message { + virtual ~Message() = default; +}; + +struct StartMessage : Message {}; +struct TextMessage : Message { + std::string text; + TextMessage(std::string t) : text(std::move(t)) {} +}; +struct EndMessage : Message {}; +``` + +We have defined a base entity and various types of such entities. Based on the types, the parsing steps will decide how to act. + +> **Note:** In C++, classes are equivalent to structs aside from default visibility. The struct approach is used here to keep the implementation minimal. + +## Structuring Pipeline Callbacks and Chain Elements + +The intended behavior is as follows: a message entity is passed around. At each stage of processing in the pipeline, based on the processing result, the program decides what output to forward to the next step, or whether to propagate something upstream (such as errors or cancellations). + +First, we define the structure to capture these behaviors, and then we define the structure for chain elements. This base entity ensures the necessary behaviors are inherited by different chain elements while parsing. 
+ +```cpp +// Whether to continue or not +enum class Continue { Yes, No }; + +//Aliases +using Msg = std::shared_ptr<Message>; +using Callback = std::function<Continue(Msg)>; + +// Whether to forward a message or bubble out +struct MessageCallbacks { + Callback front; + Callback back; +}; + +// Structure of a basic pipeline chain element +struct ChainElement { + // the main processing function which will be custom implemented by respective Chain Elements + virtual Continue process(Msg msg, MessageCallbacks next) = 0; + // If yes: element consumes message and propagate + virtual bool is_generator() const { return false; } + // If yes: element consumes message but does not propagate + virtual bool is_leaf() const { return false; } + // Destructor + virtual ~ChainElement() = default; +}; +``` + +## Creating Custom Parsing, Filtering, and Exporting Nodes + +Next, we define different parsing chain elements: + +```cpp +struct SimpleParser : ChainElement { + + bool is_generator() const override { return true; } + Continue process(Msg msg, MessageCallbacks next) override { + if (dynamic_cast<StartMessage *>(msg.get())) { + std::cout << "Parser reading file...\n"; + + next.front(std::make_shared<TextMessage>("Hello ")); + next.front(std::make_shared<TextMessage>("DocWire ")); + next.front(std::make_shared<TextMessage>("Pipeline!")); + next.front(std::make_shared<EndMessage>()); + return Continue::Yes; + } + return next.front(msg); + } +}; + +struct TextFilter : ChainElement { + bool is_generator() const override { return false; } + Continue process(Msg msg, MessageCallbacks next) override { + if (!dynamic_cast<TextMessage *>(msg.get())) + return Continue::No; + return next.front(msg); + } +}; + +struct TextExporter : ChainElement { + bool is_leaf() const override { return true; } + Continue process(Msg msg, MessageCallbacks) override { + if (auto t = dynamic_cast<TextMessage *>(msg.get())) + std::cout << "Exported: " << t->text << "\n"; + return Continue::Yes; + ; + } +}; +``` + +> **Note:** `TextExporter` is a leaf node in this chain; it does not propagate the message forward. 
It acts as the final step of the pipeline processing. +> +> **Developer Note:** For the sake of brevity and clarity, this example relies on `dynamic_cast` for message type checking. In a highly optimized production environment, this could be refactored using `std::variant` and `std::visit`, or a custom type-tagging system to avoid the overhead of Run-Time Type Information (RTTI). + +## Managing Object Ownership with a Reference Template + +One remaining requirement is how to chain the pipeline through the `|` operator, specifically whether we use references of elements in the processing chain or take ownership of them. + +```cpp +/** + * A Class template to own or borrow references + */ +template <typename T> class ref_or_owned { + std::shared_ptr<T> owned; + T *ref = nullptr; + + // move ownership of a heap object into owned, + // and we store a raw pointer alias (ref) for fast, uniform access. +public: + // reference + ref_or_owned(T &t) : ref(&t) {} + + // owned + ref_or_owned(std::shared_ptr<T> t) : owned(std::move(t)), ref(owned.get()) {} + + T &get() { return *ref; } + const T &get() const { return *ref; } +}; +``` + +In C++, objects can come from different places: + +**Example 1: Owned Objects** +```cpp +auto parser = std::make_shared<SimpleParser>(); +``` +This means the program is responsible for keeping the object alive. Multiple parts of the program can safely share it. + +**Example 2: Borrowed Objects** +```cpp +SimpleParser parser; +``` +The object lives elsewhere, and the pipeline is simply borrowing it. + +Our pipeline must support both cases. The helper class template `ref_or_owned` manages objects regardless of whether they are borrowed or owned. For a borrowed object, it stores a reference, and for an owned object, it takes ownership and keeps it alive. + +> **Developer Note:** This implementation uses `std::shared_ptr` for both message passing and chain elements to maximize flexibility in a shared ownership model. 
If strict zero-cost abstraction is required, developers could adapt this pattern to utilize `std::unique_ptr` where exclusive ownership is guaranteed. + +## Coupling Elements and Overloading the Pipe Operator in C++ + +We define the structure for a basic parsing engine that inherits the properties of a `ChainElement`. Its purpose is to couple two chain elements: `lhs` (left-side element of the processing chain) and `rhs` (right-side element). + +For example, `parser 1 | parser 2` means that the output of parser 1 will be fed to parser 2 for further processing. + +```cpp +// Shared object pointer +using Element = std::shared_ptr<ChainElement>; + +// Basic Parsing engine +struct ParsingChain : ChainElement { + // should handle elements whether borrowed or owned + ref_or_owned<ChainElement> lhs; + ref_or_owned<ChainElement> rhs; + + // Constructors + ParsingChain(Element a, Element b) : lhs(std::move(a)), rhs(std::move(b)) {} + ParsingChain(ChainElement &a, ChainElement &b) : lhs(a), rhs(b) {} + ParsingChain(ref_or_owned<ChainElement> a, ref_or_owned<ChainElement> b) + : lhs(a), rhs(b) {} + + bool is_generator() const override { return lhs.get().is_generator(); } + + bool is_leaf() const override { return rhs.get().is_leaf(); } + + // Processes the `msg` arriving at the chain, and passes it to `lhs` + // When `lhs` wants to propagate the message, it redirects to `rhs` + Continue process(Msg msg, MessageCallbacks cb) override { + MessageCallbacks lhs_cb{ + // front of lhs → rhs + [&](Msg m) { return rhs.get().process(m, cb); }, + // back of lhs → back of chain + cb.back}; + return lhs.get().process(msg, lhs_cb); + } +}; +``` + +Beyond handling chain elements in its constructor, this structure facilitates the processing and routing of elements from one stage to another. 
+ +We then make use of the `|` operator to perform the chaining and execute the pipeline once it is complete: + +```cpp +Element operator|(Element a, Element b) { + return std::make_shared<ParsingChain>(a, b); +} + +Element operator|(ChainElement &a, ChainElement &b) { + return std::make_shared<ParsingChain>(a, b); +} + +ParsingChain operator|(ref_or_owned<ChainElement> a, + ref_or_owned<ChainElement> b) { + ParsingChain chain{a, b}; + + if (chain.is_generator() && chain.is_leaf()) { + chain.process(std::make_shared<StartMessage>(), + MessageCallbacks{[](Msg) { return Continue::Yes; }, + [](Msg) { return Continue::Yes; }}); + } + + return chain; +} +``` + +The third overload of the `|` operator checks if the pipeline has both generator and leaf nodes. If affirmative, it automatically starts execution. + +## Executing the C++ Pipeline Chain + +The implementation can be tested via the following program: + +```cpp +int main() { + SimpleParser parser; + TextFilter filter; + TextExporter exporter; + // auto chain = parser | filter | exporter; + + // chain.process(std::make_shared<StartMessage>(), + // [](Msg) { return Continue::Yes; }); + + auto chain = std::make_shared<SimpleParser>() | + std::make_shared<TextFilter>() | + std::make_shared<TextExporter>(); + + MessageCallbacks root{[](Msg) { return Continue::Yes; }, + [](Msg) { return Continue::Yes; }}; + + chain->process(std::make_shared<StartMessage>(), root); +} From eae914d531e246c44ad0588fccca2601bd71d059 Mon Sep 17 00:00:00 2001 From: carlengh Date: Mon, 20 Apr 2026 17:22:42 +0200 Subject: [PATCH 02/14] Update 2026-04-20-unix-style-cpp-pipelines.md Added some missing text --- blog/2026-04-20-unix-style-cpp-pipelines.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/blog/2026-04-20-unix-style-cpp-pipelines.md b/blog/2026-04-20-unix-style-cpp-pipelines.md index 70337d9..1d9ef82 100644 --- a/blog/2026-04-20-unix-style-cpp-pipelines.md +++ b/blog/2026-04-20-unix-style-cpp-pipelines.md @@ -268,3 +268,23 @@ int main() { chain->process(std::make_shared<StartMessage>(), root); } +``` + +## A Note on Concurrency + +One of the hidden benefits of 
this message-passing architecture is how naturally it lends itself to multithreading. Because each node operates independently on the messages it receives, this pattern lays the groundwork to run individual elements on separate threads, passing messages via concurrent, thread-safe queues. + +*The actual Docwire implementation of this feature can be found in the Docwire source repository under the `src/parsing_chain.h` and `src/parsing_chain.cpp` files.* + +
+
+ + From fcbcb5ff2effccd343a0d1823c528744e739fb2b Mon Sep 17 00:00:00 2001 From: carlengh Date: Mon, 20 Apr 2026 18:05:35 +0200 Subject: [PATCH 03/14] Update blog/2026-04-20-unix-style-cpp-pipelines.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- blog/2026-04-20-unix-style-cpp-pipelines.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/blog/2026-04-20-unix-style-cpp-pipelines.md b/blog/2026-04-20-unix-style-cpp-pipelines.md index 1d9ef82..e50d061 100644 --- a/blog/2026-04-20-unix-style-cpp-pipelines.md +++ b/blog/2026-04-20-unix-style-cpp-pipelines.md @@ -112,9 +112,9 @@ struct SimpleParser : ChainElement { struct TextFilter : ChainElement { bool is_generator() const override { return false; } Continue process(Msg msg, MessageCallbacks next) override { - if (!dynamic_cast(msg.get())) - return Continue::No; - return next.front(msg); + if (dynamic_cast(msg.get()) || dynamic_cast(msg.get())) + return next.front(msg); + return Continue::Yes; } }; From b990c36eedd024b623161988f312af23c512ab74 Mon Sep 17 00:00:00 2001 From: carlengh Date: Mon, 20 Apr 2026 18:05:53 +0200 Subject: [PATCH 04/14] Update blog/2026-04-20-unix-style-cpp-pipelines.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- blog/2026-04-20-unix-style-cpp-pipelines.md | 1 - 1 file changed, 1 deletion(-) diff --git a/blog/2026-04-20-unix-style-cpp-pipelines.md b/blog/2026-04-20-unix-style-cpp-pipelines.md index e50d061..b2eea6d 100644 --- a/blog/2026-04-20-unix-style-cpp-pipelines.md +++ b/blog/2026-04-20-unix-style-cpp-pipelines.md @@ -124,7 +124,6 @@ struct TextExporter : ChainElement { if (auto t = dynamic_cast(msg.get())) std::cout << "Exported: " << t->text << "\n"; return Continue::Yes; - ; } }; ``` From a819d8834a7af9c0b65e75b52473d5c54c6f6b85 Mon Sep 17 00:00:00 2001 From: carlengh Date: Mon, 20 Apr 2026 19:23:17 +0200 Subject: [PATCH 05/14] Update 
blog/2026-04-20-unix-style-cpp-pipelines.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- blog/2026-04-20-unix-style-cpp-pipelines.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/2026-04-20-unix-style-cpp-pipelines.md b/blog/2026-04-20-unix-style-cpp-pipelines.md index b2eea6d..09e89aa 100644 --- a/blog/2026-04-20-unix-style-cpp-pipelines.md +++ b/blog/2026-04-20-unix-style-cpp-pipelines.md @@ -207,7 +207,7 @@ struct ParsingChain : ChainElement { Continue process(Msg msg, MessageCallbacks cb) override { MessageCallbacks lhs_cb{ // front of lhs → rhs - [&](Msg m) { return rhs.get().process(m, cb); }, + [rhs = rhs, cb](Msg m) { return rhs.get().process(m, cb); }, // back of lhs → back of chain cb.back}; return lhs.get().process(msg, lhs_cb); From 1409cbe4c720407438c537aa6e493785bd06e4ae Mon Sep 17 00:00:00 2001 From: carlengh Date: Mon, 20 Apr 2026 19:48:00 +0200 Subject: [PATCH 06/14] Update 2026-04-20-unix-style-cpp-pipelines.md --- blog/2026-04-20-unix-style-cpp-pipelines.md | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/blog/2026-04-20-unix-style-cpp-pipelines.md b/blog/2026-04-20-unix-style-cpp-pipelines.md index 09e89aa..ed15bce 100644 --- a/blog/2026-04-20-unix-style-cpp-pipelines.md +++ b/blog/2026-04-20-unix-style-cpp-pipelines.md @@ -1,9 +1,10 @@ --- title: Building Unix-Style C++ Processing Pipelines Using the Pipe Operator -authors: reeshabh -tags: [c++, pipeline, parsing, sdk] +tags: ["c++", "pipeline", "parsing", "sdk"] --- +**By Reeshabh Choudhary, DocWire.io** + **Summary:** How to build a C++ processing pipeline using the pipe operator (`|`). To build a C++ processing pipeline using the pipe operator, you must define a base chain element class with virtual processing methods and overload the bitwise OR operator to couple these elements together. 
This object-oriented approach ensures that the output of one processing node seamlessly feeds into the next, closely mimicking the behavior of Unix terminal pipes. --- @@ -112,9 +113,9 @@ struct SimpleParser : ChainElement { struct TextFilter : ChainElement { bool is_generator() const override { return false; } Continue process(Msg msg, MessageCallbacks next) override { - if (dynamic_cast(msg.get()) || dynamic_cast(msg.get())) - return next.front(msg); - return Continue::Yes; + if (!dynamic_cast(msg.get())) + return Continue::No; + return next.front(msg); } }; @@ -124,6 +125,7 @@ struct TextExporter : ChainElement { if (auto t = dynamic_cast(msg.get())) std::cout << "Exported: " << t->text << "\n"; return Continue::Yes; + ; } }; ``` @@ -207,7 +209,7 @@ struct ParsingChain : ChainElement { Continue process(Msg msg, MessageCallbacks cb) override { MessageCallbacks lhs_cb{ // front of lhs → rhs - [rhs = rhs, cb](Msg m) { return rhs.get().process(m, cb); }, + [&](Msg m) { return rhs.get().process(m, cb); }, // back of lhs → back of chain cb.back}; return lhs.get().process(msg, lhs_cb); @@ -275,11 +277,10 @@ One of the hidden benefits of this message-passing architecture is how naturally *The actual Docwire implementation of this feature can be found in the Docwire source repository under the `src/parsing_chain.h` and `src/parsing_chain.cpp` files.* -
-
+
+ title="DocWire Engineering Updates" +/> From 4bb2dd4c6b02b8d00aadd02752fca78b63d06245 Mon Sep 17 00:00:00 2001 From: carlengh Date: Mon, 20 Apr 2026 20:15:49 +0200 Subject: [PATCH 09/14] Update 2026-04-20-unix-style-cpp-pipelines.md Another syntax change. --- blog/2026-04-20-unix-style-cpp-pipelines.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/2026-04-20-unix-style-cpp-pipelines.md b/blog/2026-04-20-unix-style-cpp-pipelines.md index 00a1f45..c761194 100644 --- a/blog/2026-04-20-unix-style-cpp-pipelines.md +++ b/blog/2026-04-20-unix-style-cpp-pipelines.md @@ -278,7 +278,7 @@ One of the hidden benefits of this message-passing architecture is how naturally *The actual Docwire implementation of this feature can be found in the Docwire source repository under the `src/parsing_chain.h` and `src/parsing_chain.cpp` files.* From 4bf95d714d43c8e837e47778947ff35e9d82ede4 Mon Sep 17 00:00:00 2001 From: carlengh Date: Wed, 22 Apr 2026 09:53:44 +0200 Subject: [PATCH 11/14] Update 2026-04-20-unix-style-cpp-pipelines.md --- blog/2026-04-20-unix-style-cpp-pipelines.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/blog/2026-04-20-unix-style-cpp-pipelines.md b/blog/2026-04-20-unix-style-cpp-pipelines.md index 2ae2371..dfd2154 100644 --- a/blog/2026-04-20-unix-style-cpp-pipelines.md +++ b/blog/2026-04-20-unix-style-cpp-pipelines.md @@ -1,9 +1,10 @@ --- title: Building Unix-Style C++ Processing Pipelines Using the Pipe Operator -authors: reeshabh tags: ["cpp", "pipeline", "parsing", "sdk"] --- +**By Reeshabh Choudhary, DocWire.io** + **Summary:** How to build a C++ processing pipeline using the pipe operator (`|`). To build a C++ processing pipeline using the pipe operator, you must define a base chain element class with virtual processing methods and overload the bitwise OR operator to couple these elements together. 
This object-oriented approach ensures that the output of one processing node seamlessly feeds into the next, closely mimicking the behavior of Unix terminal pipes. ## Why Read This Article? @@ -274,4 +275,4 @@ One of the hidden benefits of this message-passing architecture is how naturally *The actual Docwire implementation of this feature can be found in the Docwire source repository under the `src/parsing_chain.h` and `src/parsing_chain.cpp` files.* - +