forked from DigitalPebble/behemoth
Behemoth is an open source platform for large-scale document analysis based on Apache Hadoop. It lets you deploy GATE or UIMA applications and uses a simple representation format that serves as common ground between UIMA-generated and GATE-generated annotations, making the two systems compatible. Because it is Hadoop-based, it benefits from Hadoop's features, namely scalability, fault tolerance and, most notably, the backing of a thriving open source community. Behemoth already interacts, or will interact, with several open source projects such as Nutch, Tika, Mahout and HBase.
One of the main goals of Behemoth is to simplify the deployment of document analysers on a large scale. It also provides converters from common data formats (WARC, Nutch, etc.) and a sandbox where users can share applications that process the annotations with Hadoop MapReduce.