html_cleaner

Matthew Harris edited this page Jan 21, 2016 · 2 revisions

A script that strips the HTML markup from a document and leaves the raw text.

> python html_cleaner.py -h
usage: html_cleaner.py [-h] [--preserve PRESERVE] infile outfile

extracts text from body of an HTML document

positional arguments:
  infile                path to html (raw) files
  outfile               path to desired output file

optional arguments:
  -h, --help            show this help message and exit
  --preserve PRESERVE, -p PRESERVE
                        repeatable parameter that adds a tag name to preserve
                        (e.g. htmlcleaner.py -p a -p img)
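
The help text above describes the interface but not how the extraction works. As an illustration only, here is a minimal sketch of a script with the same command-line interface, built on the standard library's `html.parser`. It is not the actual source of `html_cleaner.py`. The names `BodyTextExtractor`, `clean_html`, and `main` are hypothetical, and the handling of `<script>`/`<style>` content and the UTF-8 file encoding are assumptions.

```python
import argparse
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>; tags named in `preserve` are kept as markup.

    Hypothetical helper for illustration, not taken from html_cleaner.py.
    """

    def __init__(self, preserve=()):
        super().__init__(convert_charrefs=True)
        self.preserve = {name.lower() for name in preserve}
        self.in_body = False
        self.skip_depth = 0  # > 0 while inside <script> or <style> (assumed behavior)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1
        elif self.in_body and tag in self.preserve:
            # Re-emit the tag exactly as it appeared in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />: emit once, with no separate end tag.
        if self.in_body and tag in self.preserve:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.in_body and tag in self.preserve:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.parts.append(data)


def clean_html(html, preserve=()):
    """Return the body text of `html`, keeping only the tags listed in `preserve`."""
    extractor = BodyTextExtractor(preserve)
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts)


def main(argv=None):
    # Mirrors the usage shown in the help output above.
    parser = argparse.ArgumentParser(
        description="extracts text from body of an HTML document")
    parser.add_argument("infile", help="path to html (raw) files")
    parser.add_argument("outfile", help="path to desired output file")
    parser.add_argument("--preserve", "-p", action="append", default=[],
                        help="repeatable parameter that adds a tag name to preserve")
    args = parser.parse_args(argv)

    # UTF-8 is an assumption; the original script may handle encodings differently.
    with open(args.infile, encoding="utf-8") as src:
        text = clean_html(src.read(), args.preserve)
    with open(args.outfile, "w", encoding="utf-8") as dst:
        dst.write(text)
```

In this sketch, `clean_html('<body><p>Hello <a href="x">link</a></p></body>', ["a"])` returns `Hello <a href="x">link</a>`, and with no preserved tags it returns `Hello link`. Calling `main()` from an `if __name__ == "__main__":` guard would make the sketch usable from the command line in the same way as the usage line above.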
