# Forked from myleott/ark-twokenize-py
from setuptools import setup

setup(
    name="twokenize",
    packages=["twokenize"],
    version="1.0.0",
    description="Word segmentation / tokenization focussed on Twitter",
    author="Richard Townsend",
    author_email="richard@sentimentron.co.uk",
    keywords=["tokenizer"],
    # Trove classifiers describing the package on PyPI.
    classifiers=[
        "Programming Language :: Python",
        "Programming Language :: Python :: 3",
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
        "Environment :: Console",
    ],
    long_description="""\
ark-twokenize-py
================
This is a crude Python port of the [Twokenize class from ark-tweet-nlp](https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java).
It produces nearly identical output to the original Java tokenizer, except in a
few infrequent situations. In particular, Python does not support partial
case-insensitivity in regular expressions and this causes some tokenization
differences for "Eastern"-style emoticons, particularly when the left and right
halves are of different cases. For example:

    Java (original): v.V
    Python (port):   v . V

Emoticons of this kind are seemingly pretty rare. Nevertheless, I have included
a fix for one special case:

    Java (original):        o.O
    Python (port, w/o fix): o . O
    Python (port, w/ fix):  o.O

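
As a rough illustration of what partial case-insensitivity means, the sketch
below compares a fully case-sensitive pattern with one where only a single
group ignores case. It relies on the scoped inline flag syntax `(?i:...)`,
which the standard `re` module accepts from Python 3.6 onwards. The pattern and
the names `EYES`, `CENTRE`, `strict` and `scoped` are assumptions made up for
this example; they are not taken from twokenize or from the Java tokenizer.

```python
import re

# Hypothetical, heavily simplified "Eastern" face: eye, centre, eye.
# This is NOT the pattern used by twokenize; it only illustrates the idea.
EYES = "[ov]"
CENTRE = "[._]"

# Fully case-sensitive: both eyes must be lower case.
strict = re.compile(EYES + CENTRE + EYES)

# Scoped flag (Python 3.6+): only the face group ignores case,
# while the trailing " rt" must still be lower case.
scoped = re.compile("(?i:" + EYES + CENTRE + EYES + ") rt")

print(strict.fullmatch("v.V") is None)         # True: upper-case eye rejected
print(scoped.fullmatch("v.V rt") is not None)  # True: face matched ignoring case
print(scoped.fullmatch("v.V RT") is None)      # True: suffix stays case-sensitive
```

Whether the port's own patterns could adopt this syntax without other changes
has not been checked here.
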
Evaluation
----------
A comparison on 1 million tweets found 83 instances (0.0083%) where tokenization
differed between the original Java version and this Python port. The differences
were primarily related to the emoticon issue discussed above, and it was not
clear in general which output was more desirable. For example:

    Text:
    Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
    Java (original):
    Profi t-T aking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
    Python (port):
    Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
"""
)