CppOpenScraper

CppOpenScraper is a C++ web scraping library that fetches web pages, cleans the HTML (removing scripts, ads, navigation, and boilerplate), and converts the result to Markdown while preserving structure (tables, headings, links, images).

Architecture

Raw HTML (from HTTP or headless browser)
    |
    v
LexborDocument::parse()          -- HTML5 DOM tree
    |
    v
ContentExtractor::extract()      -- Clean DOM:
    |  1. removeUnwantedElements  (script, style, noscript, iframe)
    |  2. removeBoilerplate       (nav, header, footer, aside, menu)
    |  3. removeAdElements        (rel="sponsored", adsbygoogle, data-ad*, dialogs, cookie/consent)
    |
    v
LexborDocument::serializeHtml()  -- Cleaned HTML string
    |
    v
html2md::Converter               -- Markdown with tables, headings, links
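
The first two cleaning passes in the diagram amount to removing whole subtrees by tag name. The sketch below illustrates that idea on a toy tree. The `Node` type and the functions `removeByTag`, `cleanTree`, `makeNode`, and `collectText` are hypothetical names chosen for this example; the library itself works on lexbor's DOM, and its real implementation may differ.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy DOM node (hypothetical; stands in for a lexbor element).
struct Node {
    std::string tag;                              // lowercase element name
    std::string text;                             // text directly inside the element
    std::vector<std::unique_ptr<Node>> children;  // child elements in document order
};

// Remove every descendant whose tag is in `tags`, together with its subtree.
void removeByTag(Node& root, const std::unordered_set<std::string>& tags) {
    auto& kids = root.children;
    kids.erase(std::remove_if(kids.begin(), kids.end(),
                              [&](const std::unique_ptr<Node>& child) {
                                  return tags.count(child->tag) > 0;
                              }),
               kids.end());
    for (auto& child : kids) {
        removeByTag(*child, tags);  // recurse into the survivors
    }
}

// Passes 1 and 2 from the diagram, applied in order.
void cleanTree(Node& root) {
    removeByTag(root, {"script", "style", "noscript", "iframe"});
    removeByTag(root, {"nav", "header", "footer", "aside", "menu"});
}

// Convenience constructor for building test trees.
std::unique_ptr<Node> makeNode(std::string tag, std::string text = "") {
    auto node = std::make_unique<Node>();
    node->tag = std::move(tag);
    node->text = std::move(text);
    return node;
}

// Concatenate the remaining text in document order.
std::string collectText(const Node& node) {
    std::string out = node.text;
    for (const auto& child : node.children) {
        out += collectText(*child);
    }
    return out;
}
```

Calling `cleanTree` on a `body` that holds a `nav`, an `article` with a `p` and a `script`, and a `footer` leaves only the `article` and its `p`. Nothing is scored or ranked; elements survive unless they are on one of the two tag lists.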

Design decisions

Why not text-density scoring? Earlier versions used heuristic scoring to pick a single "main content" container. This failed on data-rich pages (financial tables, anime databases) where content spans multiple containers. The current approach removes only unambiguous noise and keeps everything else.
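
To make the failure mode concrete, here is a minimal sketch of single-winner scoring next to the keep-everything-but-noise approach. The density formula (characters per tag) is one common heuristic and is an assumption here, not necessarily what earlier versions used. `Block`, `density`, `pickMain`, and `keepNonNoise` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Summary of one top-level container (hypothetical type).
struct Block {
    std::string tag;
    std::size_t textChars;  // characters of visible text
    std::size_t tagCount;   // number of descendant tags
};

// Assumed heuristic: visible characters per tag.
double density(const Block& b) {
    return static_cast<double>(b.textChars) / static_cast<double>(b.tagCount + 1);
}

// Single-winner selection: only the densest container is kept.
const Block* pickMain(const std::vector<Block>& blocks) {
    const Block* best = nullptr;
    for (const auto& b : blocks) {
        if (best == nullptr || density(b) > density(*best)) {
            best = &b;
        }
    }
    return best;
}

// Denylist selection: drop only unambiguous noise, keep the rest.
std::vector<const Block*> keepNonNoise(const std::vector<Block>& blocks) {
    std::vector<const Block*> kept;
    for (const auto& b : blocks) {
        if (b.tag != "nav" && b.tag != "footer" && b.tag != "script") {
            kept.push_back(&b);
        }
    }
    return kept;
}
```

With an `article` (1200 characters, 10 tags), a `table` (800 characters, 200 tags), and a `nav`, `pickMain` returns only the article, because a table's many cell tags drag its density down. `keepNonNoise` returns both the article and the table.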

Why not class/id pattern matching? Substring matching on CSS class names (e.g. removing elements with "nav" in class) was too aggressive -- class="row-nav-main" matched and removed entire content sections. Only semantic HTML tags and standard attributes (rel, role, data-ad) are used for filtering.
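
The false positive described above is easy to reproduce. The following sketch contrasts a substring match on `class` with a check that looks only at the tag name and the `role` attribute. The `Element` type and both function names are hypothetical, and treating `role="navigation"` as a navigation marker is an assumption about which `role` values matter.

```cpp
#include <map>
#include <string>

// Minimal element description (hypothetical type for illustration only).
struct Element {
    std::string tag;
    std::map<std::string, std::string> attrs;
};

// The rejected heuristic: any class value containing "nav" counts as navigation.
bool classContainsNav(const Element& e) {
    auto it = e.attrs.find("class");
    return it != e.attrs.end() && it->second.find("nav") != std::string::npos;
}

// Semantic check: the <nav> tag, or role="navigation" (assumed value).
bool isSemanticNav(const Element& e) {
    if (e.tag == "nav") {
        return true;
    }
    auto it = e.attrs.find("role");
    return it != e.attrs.end() && it->second == "navigation";
}
```

For `<div class="row-nav-main">`, `classContainsNav` returns true and the content section would be removed, while `isSemanticNav` returns false and the section is kept. A real `<nav>` element is still caught by the semantic check.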

Why Markdown? Markdown preserves document structure (tables, headings, links, emphasis) while being compact and readable by both humans and LLMs. Plain text extraction loses all formatting.
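
As a small illustration of what "preserving structure" means for tables, the sketch below renders rows as a Markdown pipe table. It is a standalone toy, not html2md's implementation. `Row` and `toMarkdownTable` are hypothetical names, and cell contents are assumed to contain no `|` characters or newlines.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Render a header row plus body rows as a Markdown pipe table.
std::string toMarkdownTable(const Row& header, const std::vector<Row>& rows) {
    // Format one row as "| a | b |\n".
    auto line = [](const Row& cells) {
        std::string out = "|";
        for (const auto& cell : cells) {
            out += " " + cell + " |";
        }
        return out + "\n";
    };

    std::string md = line(header);
    md += "|";
    for (std::size_t i = 0; i < header.size(); ++i) {
        md += " --- |";  // separator row marks the line above as the header
    }
    md += "\n";
    for (const auto& row : rows) {
        md += line(row);
    }
    return md;
}
```

A plain-text dump of the same table would flatten the cells into a single run of words, and the column boundaries could not be recovered afterwards.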

Dependencies

| Dependency | Type | License | Purpose |
| --- | --- | --- | --- |
| Qt6 (Core, Network) | System | LGPL-3 | HTTP networking, URL handling |
| lexbor | Submodule | Apache-2.0 | HTML5 parsing and DOM |
| html2md | Submodule | MIT | HTML to Markdown conversion |
| poppler-cpp | Optional system | GPL-2 | PDF text extraction |

For JavaScript-rendered pages

CppOpenScraper processes HTML strings -- it does not run a browser. For pages that require JavaScript rendering (SPAs, dynamically loaded content), the caller must provide the rendered HTML. A common approach is headless Chromium via the Chrome DevTools Protocol (CDP):

1. Launch chromium --headless --remote-debugging-port=9222
2. Navigate and wait for the networkAlmostIdle lifecycle event
3. Extract document.documentElement.outerHTML via Runtime.evaluate
4. Pass the HTML string to CppOpenScraper
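
The steps above map onto a short sequence of CDP commands. The sketch below only builds the JSON messages. The WebSocket transport, which connects to the `webSocketDebuggerUrl` listed at `http://localhost:9222/json`, is left out, as is response handling. `jsonEscape`, `cdpCommand`, and `renderCommands` are hypothetical helper names. The CDP method names are from the protocol itself.

```cpp
#include <string>
#include <vector>

// Escape a string for embedding inside a JSON string literal.
std::string jsonEscape(const std::string& s) {
    static const char* hex = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
            out += static_cast<char>(c);
        } else if (c < 0x20) {
            out += "\\u00";
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Build one CDP command message; `paramsJson` must already be valid JSON.
std::string cdpCommand(int id, const std::string& method, const std::string& paramsJson) {
    return "{\"id\":" + std::to_string(id) + ",\"method\":\"" + method +
           "\",\"params\":" + paramsJson + "}";
}

// Commands for: enable page events, navigate, then read the rendered HTML.
std::vector<std::string> renderCommands(const std::string& url) {
    return {
        cdpCommand(1, "Page.enable", "{}"),
        cdpCommand(2, "Page.setLifecycleEventsEnabled", "{\"enabled\":true}"),
        cdpCommand(3, "Page.navigate", "{\"url\":\"" + jsonEscape(url) + "\"}"),
        // Send this one only after a Page.lifecycleEvent whose name is
        // "networkAlmostIdle" has arrived for the navigated frame.
        cdpCommand(4, "Runtime.evaluate",
                   "{\"expression\":\"document.documentElement.outerHTML\","
                   "\"returnByValue\":true}"),
    };
}
```

The HTML string arrives in the `Runtime.evaluate` response under `result.result.value`. How that string is then handed to CppOpenScraper depends on the library's API, which is not shown here.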

Build

git submodule update --init --recursive
mkdir -p build && cd build
cmake ..
cmake --build . -j$(nproc)

CMake options

| Option | Default | Effect |
| --- | --- | --- |
| `CPPSCRAPER_BUILD_EXAMPLES` | `OFF` | Build example programs |
| `CPPSCRAPER_BUILD_TESTS` | `OFF` | Build unit tests |

Running tests

cmake .. -DCPPSCRAPER_BUILD_TESTS=ON
cmake --build . -j$(nproc)
ctest --output-on-failure

Usage

#include <iostream>

#include <Scraper.hpp>

CppScrap::Scraper scraper;
auto page = scraper.scrape("https://example.com");

if (page.ok())
{
    // page.text contains the cleaned page as Markdown
    std::cout << page.text << std::endl;
}

License

LGPL-3.0-or-later
