html_cleaner
Matthew Harris edited this page Jan 21, 2016
·
2 revisions
A script that strips the HTML markup from a document and leaves the raw text.
> python html_cleaner.py -h
usage: html_cleaner.py [-h] [--preserve PRESERVE] infile outfile

extracts text from body of an HTML document

positional arguments:
  infile                path to html (raw) files
  outfile               path to desired output file

optional arguments:
  -h, --help            show this help message and exit
  --preserve PRESERVE, -p PRESERVE
                        repeatable parameter that adds a tag name to preserve
                        (e.g. htmlcleaner.py -p a -p img)
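
This page does not show the script's source, so the following is only a minimal sketch of how a command-line interface matching the help text above could be built with the Python standard library. The names build_parser, TextExtractor, clean_html and main are hypothetical, and the choices to skip script/style contents and to re-emit preserved tags verbatim are assumptions, not the actual behavior of html_cleaner.py.

```python
import argparse
from html.parser import HTMLParser


def build_parser():
    # Mirrors the help text above; action="append" is what makes -p repeatable.
    parser = argparse.ArgumentParser(
        prog="html_cleaner.py",
        description="extracts text from body of an HTML document",
    )
    parser.add_argument("infile", help="path to html (raw) files")
    parser.add_argument("outfile", help="path to desired output file")
    parser.add_argument(
        "--preserve", "-p", action="append", default=[],
        help="repeatable parameter that adds a tag name to preserve",
    )
    return parser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text inside <body> plus any preserved tags."""

    def __init__(self, preserve=()):
        super().__init__(convert_charrefs=True)
        self.preserve = {name.lower() for name in preserve}
        self.in_body = False
        self.skip_depth = 0  # > 0 while inside <script> or <style> (assumption)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1
        elif self.in_body and tag in self.preserve:
            # Re-emit the preserved tag exactly as it appeared in the input.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />: emit once, no closing tag.
        if self.in_body and tag in self.preserve:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.in_body and tag in self.preserve:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.parts.append(data)


def clean_html(html, preserve=()):
    extractor = TextExtractor(preserve)
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts)


def main(argv=None):
    # argv=None lets argparse fall back to sys.argv[1:].
    args = build_parser().parse_args(argv)
    with open(args.infile, encoding="utf-8") as src:
        text = clean_html(src.read(), args.preserve)
    with open(args.outfile, "w", encoding="utf-8") as dst:
        dst.write(text)
```

Under these assumptions, main(["-p", "a", "page.html", "page.txt"]) would write the body text of page.html to page.txt while keeping its a tags; calling main() with no arguments reads the command line instead.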