CppOpenScraper is a C++ web scraping library that fetches web pages, cleans the HTML (removing scripts, ads, navigation, and boilerplate), and converts the result to Markdown while preserving structure (tables, headings, links, images).

```
Raw HTML (from HTTP or headless browser)
    |
    v
LexborDocument::parse()          -- HTML5 DOM tree
    |
    v
ContentExtractor::extract()      -- Clean DOM:
    |   1. removeUnwantedElements (script, style, noscript, iframe)
    |   2. removeBoilerplate      (nav, header, footer, aside, menu)
    |   3. removeAdElements       (rel="sponsored", adsbygoogle, data-ad*, dialogs, cookie/consent)
    |
    v
LexborDocument::serializeHtml()  -- Cleaned HTML string
    |
    v
html2md::Converter               -- Markdown with tables, headings, links
```

Why not text-density scoring? Earlier versions used heuristic scoring to pick a single "main content" container. This failed on data-rich pages (financial tables, anime databases) where content spans multiple containers. The current approach removes only unambiguous noise and keeps everything else.
Why not class/id pattern matching? Substring matching on CSS class names (e.g. removing elements with "nav" in class) was too aggressive -- class="row-nav-main" matched and removed entire content sections. Only semantic HTML tags and standard attributes (rel, role, data-ad) are used for filtering.
Why Markdown? Markdown preserves document structure (tables, headings, links, emphasis) while being compact and readable by both humans and LLMs. Plain text extraction loses all formatting.
| Dependency | Type | License | Purpose |
|---|---|---|---|
| Qt6 (Core, Network) | System | LGPL-3 | HTTP networking, URL handling |
| lexbor | Submodule | Apache-2.0 | HTML5 parsing and DOM |
| html2md | Submodule | MIT | HTML to Markdown conversion |
| poppler-cpp | Optional system | GPL-2 | PDF text extraction |
CppOpenScraper processes HTML strings -- it does not run a browser. For pages that require JavaScript rendering (SPAs, dynamically loaded content), the caller must provide the rendered HTML. A common approach is headless Chromium via the Chrome DevTools Protocol (CDP):

- Launch `chromium --headless --remote-debugging-port=9222`
- Navigate and wait for the `networkAlmostIdle` lifecycle event
- Extract `document.documentElement.outerHTML` via `Runtime.evaluate`
- Pass the HTML string to CppOpenScraper


```shell
git submodule update --init --recursive
mkdir -p build && cd build
cmake ..
cmake --build . -j$(nproc)
```

| Option | Default | Effect |
|---|---|---|
| `CPPSCRAPER_BUILD_EXAMPLES` | `OFF` | Build example programs |
| `CPPSCRAPER_BUILD_TESTS` | `OFF` | Build unit tests |

```shell
cmake .. -DCPPSCRAPER_BUILD_TESTS=ON
cmake --build . -j$(nproc)
ctest --output-on-failure
```

```cpp
#include <iostream>

#include <Scraper.hpp>

CppScrap::Scraper scraper;
auto page = scraper.scrape("https://example.com");
if (page.ok())
{
    // page.text contains the cleaned page content as Markdown
    std::cout << page.text << std::endl;
}
```

License: LGPL-3.0-or-later