Added Stanford tokenizer, sentence splitter, and part of speech tagging to varaha.text#4

Open
rjurney wants to merge 18 commits into alienrobotwizard:master from rjurney:master


rjurney commented Dec 24, 2013

Contributor

register ../../lib/stanford-postagger-withModel.jar
register ../../target/varaha-1.0-SNAPSHOT.jar

reviews = LOAD 'data/ten.avro' USING AvroStorage();
foo = FOREACH reviews GENERATE business_id, varaha.text.StanfordTokenize(text) AS tokens;
DUMP foo;

(41J1FgfIsmsLRCZ3QILG6w,{(truly),(impressive),(facility),(came),(for),(two),(books),(not),(knowing),(this),(location),(-LRB-),(normally),(Appaloosa),(-RRB-),(The),(staff),(was),(very),(helpful),(and),(found),(what),(wanted),(very),(quickly),(was),(there),(minutes),(tops),(would),(highly),(recommend),(this),(Library),(anyone),(interested),('ll),(coming),(back),(very),(soon),(for),(next),(batch)})
(4YX4ZtUqs6xtcc4AdjbpeQ,{(Other),(circle),(are),(much),(cleaner),(than),(this),(one),(The),(best),(thing),(about),(this),(store),(the),(Employees),(are),(friendly),(and),(nice),('ve),(been),(this),(location),(the),(morning),(and),(the),(evening),(and),(there),(must),(point),(where),(the),(shift),(changes),(and),(they),(stop),(cleaning),(the),(bathrooms),(and),(emptying),(the),(trash),(the),(morning),(everything),(clean),(the),(time),(evening),(rolls),(around),(there),(are),(odd),(smells),(all),(over),(the),(store),(shame),(since),(larger),(newer),(looking),(store),(that),(n't),(cleaner),('ll),(back),(hopes),(they),(clean),(little),(more)})
(5kRug3bEienrpovtPRVVwg,{(Went),(with),(husband),(Richardson),(Rokerij),(for),(the),(first),(time),(raved),(about),(this),(place),(went),(Wednesday),(night),(with),(reservation),(The),(wait),(was),(about),(hour),(Luckily),(there),(were),(bar),(seats),(that),(became),(available),(took),(them),(ordered),(the),(cheese),(flatbread),(appetizer),(and),(was),(delicious),(had),(large),(salad),(for),(dinner),(which),(was),(perfect),(was),(not),(very),(hungry),(husband),(had),(the),(chicken),(enchiladas),(that),(tasted),(and),(were),(very),(good),(The),(food),(cooked),(order),(did),(take),(while),(get),(our),(meal),(but),(was),(worth),(the),(wait),(and),(service),(was),(excellent),(While),(waiting),(chatted),(with),(several),(people),(the),(bar),(and),(one),(couple),(offered),(taste),(their),(appetizer),(returned),(the),(favor),(when),(flatbread),(came),(One),(more),(thing),(not),(leave),(without),(getting),(the),(decadent),(truffle),(dessert),(Heavenly),(but),(not),(over),(done),(any),(way),(All),(all),(great),(experience),(recommend),(reservations)})

reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, FLATTEN(varaha.text.SentenceTokenize(text)) AS tokenized_sentences;
bar = FOREACH bar GENERATE business_id, varaha.text.StanfordPOSTagger(tokenized_sentences) AS tagged;
DUMP bar;

(6VRbbNQe5ouWmwsMebUMkg,{(My,PRP$),(friend,NN),(added,VBD),(some,DT),(sugar,NN),(to,TO),(it,PRP),(and,CC),(it,PRP),(turned,VBD),(okay/good,NN),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Entrees,NNS),(average,VBP),(about,IN),($,$),(10,CD),(-,:),($,$),(13,CD),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Naan,NN),(ranges,NNS),(from,IN),(about,IN),($,$),(1.50,CD),(-,:),($,$),(3,CD),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Appetizers,NNS),(during,IN),(happy,JJ),(hour,NN),(range,NN),(from,IN),($,$),(3,CD),(-,:),($,$),(8,CD),(+,CC),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Add,VB),(in,IN),(alcohol,NN),(and,CC),(you,PRP),('re,VBP),(looking,VBG),(at,IN),(a,DT),(not,RB),(inexpensive,JJ),(meal,NN),(but,CC),(definitely,RB),(good,JJ),(quality,NN),(.,.)})
(6oRAC4uyJCsJl1X0WZpVSA,{(love,VB),(the,DT),(gyro,NN),(plate,NN),(.,.)})
(6oRAC4uyJCsJl1X0WZpVSA,{(Rice,NNP),(is,VBZ),(so,RB),(good,JJ),(and,CC),(I,PRP),(also,RB),(dig,VBP),(their,PRP$),(candy,NN),(selection,NN),(:,:),(-RRB-,-RRB-)})

reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, varaha.text.StanfordPOSTagger(varaha.text.StanfordTokenize(text)) AS tagged;
DUMP bar;

(-UnYs8XvV1M983xZoREdng,{(have,VB),(say,VB),(loved,NN),(Vino,NNP),(First,NNP),(off,RB),(very,RB),(unpretentious,JJ),(not,RB),(very,RB),(knowledgeable,JJ),(about,IN),(wine,NN),(tend,VBP),(shy,JJ),(away,RB),(from,IN),(places,NNS),(that,WDT),(have,VBP),(attitude,NN),(also,RB),(had,VBD),(one,CD),(the,DT),(1000,CD),(outstanding,JJ),(Groupons,NNS),(about,IN),(expire,VBP),(And,CC),(spite,NN),(the,DT),(fact,NN),(that,IN),(just,RB),(about,IN),(everyone,NN),(coming,VBG),(that,IN),(evening,NN),(had,VBD),(Groupon,NNP),(the,DT),(staff,NN),(was,VBD),(fantastic,JJ),(they,PRP),(not,RB),(have,VBP),(kitchen,NN),(all,DT),(appetizers,NNS),(are,VBP),(cold,JJ),(but,CC),(had,VBD),(nice,JJ),(cheese,NN),(plate,NN),(which,WDT),(included,VBD),(cheeses,NNS),(olives,NNS),(nuts,NNS),(grapes,NNS),(and,CC),(dried,VBD),(fruit,NN),(only,RB),(complaint,NN),(was,VBD),(that,IN),(the,DT),(lahvosh-like,JJ),(crackers,NNS),(were,VBD),(really,RB),(oily,JJ),(and,CC),(not,RB),(good,JJ),(all,DT),(Lose,VB),(those,DT),(and,CC),(would,MD),(have,VB),(been,VBN),(much,RB),(better,RBR),(for,IN),(the,DT),(wine,NN),(was,VBD),(actually,RB),(better,JJR),(than,IN),(expected,VBN),(Although,IN),(n't,RB),(generally,RB),(care,VB),(for,IN),(really,RB),(sweet,JJ),(wines,NNS),(both,CC),(the,DT),(Summer,NN),(Rain,NN),(and,CC),(Peachy,JJ),(Keen,JJ),(were,VBD),(really,RB),(enjoyable,JJ),(just,RB),(think,VB),(them,PRP),(more,RBR),(crisp,JJ),(summer,NN),(beverage,NN),(than,IN),(wine,NN),(was,VBD),(surprised,VBN),(like,IN),(the,DT),(Pinot,NNP),(Grigio,NNP),(much,RB),(did,VBD),(and,CC),(may,MD),(have,VB),(purchased,VBN),(bottle,NN),(but,CC),(was,VBD),(not,RB),(available,JJ),(that,IN),(evening,NN),(The,DT),(Miscela,NNP),(Italian,NNP),(blend,VB),(was,VBD),(miss,VB),(for,IN),(-LRB-,-LRB-),(too,RB),(acidic,JJ),(for,IN),(taste,NN),(-RRB-,-RRB-),(but,CC),(the,DT),(Malbec,NNP),(was,VBD),(better,JJR),(For,IN),(after,IN),(dinner,NN),(wines,NNS),(the,DT),(Grande,NNP),(Finale,NNP),(was,VBD),(over-the-top,JJ),(sweet,JJ),(would,MD),(probably,RB),(not,RB),(drink,VB),(more,JJR),(than,IN),(tasting,NN),(The,DT),(Porto,NNP),(Cocoa,NNP),(however,RB),(was,VBD),(fantastic,JJ),(generally,RB),(stay,VB),(away,RB),(from,IN),(Port,NNP),(because,IN),(dislike,NN),(the,DT),(brandy,NN),(burn,VBP),(But,CC),(one,CD),(whiff,NN),(this,DT),(and,CC),(was,VBD),(hooked,VBN),(before,IN),(tasted,VBN),(While,IN),(not,RB),(like,IN),(terribly,RB),(sweet,JJ),(you,PRP),(definitely,RB),(get,VBP),(the,DT),(essence,NN),(chocolate,NN),(bought,VBD),(bottle,NN),(take,VB),(home,NN),(fact,NN),(but,CC),(only,RB),(saw,VBD),(one,CD),(wee,NN),(little,JJ),(glass,NN),(husband,NN),(apparently,RB),(mistook,VBD),(for,IN),(Yoo-hoo,NN),(and,CC),(drank,VBD),(the,DT),(rest,NN),(Great,JJ),(place,NN),(begin,VB),(your,PRP$),(evening,NN),(And,CC),(because,IN),(many,JJ),(these,DT),(young,JJ),(wines,NNS),(are,VBP),(sweeter,JJR),(even,RB),(non-wine-drinking,JJ),(husband,NN),(enjoyed,VBN)})

@alienrobotwizard

Owner

@rjurney Would you mind squashing these commits so I can look at a single diff?

rjurney commented Dec 29, 2013

Contributor Author

Yeah, I can do that. I think you can also do that in the interface?

On Sunday, December 29, 2013, Jacob wrote:

> @rjurney Would you mind squashing these commits so I can look at a single diff?


Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

@alienrobotwizard

Owner

Sorry for taking so long to get to this. Overall it looks good. Can we put the UDFs that rely strictly on the Stanford NLP package in their own namespace? varaha.text is getting a little crowded.
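
For illustration, here is a rough sketch of how a calling script might look if the Stanford-backed UDFs moved into their own package. The package name `varaha.text.stanford` is an assumption made up for this example; nothing in this thread settles what the namespace would be called. The class names and paths are taken from the scripts earlier in the conversation.

```pig
register ../../lib/stanford-postagger-withModel.jar
register ../../target/varaha-1.0-SNAPSHOT.jar

-- ASSUMPTION: 'varaha.text.stanford' is a hypothetical package name, not one
-- agreed on in this PR. DEFINE gives each UDF a short alias, so only these
-- two lines need to change if the classes are moved again.
DEFINE Tokenize varaha.text.stanford.StanfordTokenize();
DEFINE POSTag   varaha.text.stanford.StanfordPOSTagger();

reviews = LOAD 'data/ten.avro' USING AvroStorage();
tagged  = FOREACH reviews GENERATE business_id, POSTag(Tokenize(text)) AS tagged;
DUMP tagged;
```

Aliasing through `DEFINE` is one way to keep scripts readable with a longer package path; fully qualified names in the `FOREACH` would work as well.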

rjurney commented Jan 15, 2014

Contributor Author

Yeah, I'll do that.

On Tuesday, January 14, 2014, Jacob wrote:

> Sorry for taking so long to get to this. Overall it looks good. Can we put the udfs that rely strictly on the stanford nlp package in their own namespace? varaha.text is getting a little crowded.

