python-tika - Python bindings for Apache Tika

Requirements

  • Java >= 1.5
  • JCC

Installation

$ python setup.py build
$ python setup.py install

Or,

$ pip install git+https://github.com/sudharsh/python-tika.git

Usage

To use the AutoDetectParser,

import tika
tika.initVM()

from tika import parser

print(parser.from_buffer("<html><body>Hello World</body></html>"))
# Or directly from a file,
# print(parser.from_file("/tmp/foo.doc"))

This returns a dict with 'content' and 'metadata' keys, for example,

{'content': u'Hello Cruel World',
 'metadata': {u'Content-Encoding': u'ISO-8859-1',
              u'Content-Type': u'text/html',
              u'title': u'Hello world'}
}
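
A minimal sketch of reading fields out of the returned dict, assuming it has the shape shown above. The `summarize` helper and its output keys are hypothetical and not part of python-tika. The sample dict is copied from the output above so the snippet runs without a JVM; in real use it would come from parser.from_buffer or parser.from_file.

```python
# Sample result copied from the README output above. In real use this
# would be the return value of parser.from_buffer(...) or
# parser.from_file(...).
result = {
    'content': u'Hello Cruel World',
    'metadata': {
        u'Content-Encoding': u'ISO-8859-1',
        u'Content-Type': u'text/html',
        u'title': u'Hello world',
    },
}


def summarize(parsed):
    """Hypothetical helper: pick a few fields out of a parse result.

    Missing keys are tolerated, on the assumption that not every
    document type yields a title or a Content-Type entry.
    """
    metadata = parsed.get('metadata') or {}
    content = parsed.get('content') or u''
    return {
        'title': metadata.get(u'title'),
        'mime_type': metadata.get(u'Content-Type'),
        'word_count': len(content.split()),
    }


summary = summarize(result)
print(summary['title'])
print(summary['mime_type'])
print(summary['word_count'])
```

Using .get() with fallbacks keeps the helper from raising KeyError when a field is absent from the result.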

Thanks

The setup.py script is derived from aptivate/python-tika.