Added Stanford tokenizer, sentence splitter, and part of speech tagging to varaha.text #4
rjurney wants to merge 18 commits into
Conversation
@rjurney Would you mind squashing these commits so I can look at a single diff?
Yeah, I can do that. I think you can also do that in the interface? On Sunday, December 29, 2013, Jacob wrote:
Sorry for taking so long to get to this. Overall it looks good. Can we put the UDFs that rely strictly on the Stanford NLP package in their own namespace? varaha.text is getting a little crowded.
Yeah, I'll do that. On Tuesday, January 14, 2014, Jacob wrote:
register ../../lib/stanford-postagger-withModel.jar;
register ../../target/varaha-1.0-SNAPSHOT.jar;
reviews = LOAD 'data/ten.avro' USING AvroStorage();
foo = FOREACH reviews GENERATE business_id, varaha.text.StanfordTokenize(text) AS tagged;
DUMP foo;
(41J1FgfIsmsLRCZ3QILG6w,{(truly),(impressive),(facility),(came),(for),(two),(books),(not),(knowing),(this),(location),(-LRB-),(normally),(Appaloosa),(-RRB-),(The),(staff),(was),(very),(helpful),(and),(found),(what),(wanted),(very),(quickly),(was),(there),(minutes),(tops),(would),(highly),(recommend),(this),(Library),(anyone),(interested),('ll),(coming),(back),(very),(soon),(for),(next),(batch)})
(4YX4ZtUqs6xtcc4AdjbpeQ,{(Other),(circle),(are),(much),(cleaner),(than),(this),(one),(The),(best),(thing),(about),(this),(store),(the),(Employees),(are),(friendly),(and),(nice),('ve),(been),(this),(location),(the),(morning),(and),(the),(evening),(and),(there),(must),(point),(where),(the),(shift),(changes),(and),(they),(stop),(cleaning),(the),(bathrooms),(and),(emptying),(the),(trash),(the),(morning),(everything),(clean),(the),(time),(evening),(rolls),(around),(there),(are),(odd),(smells),(all),(over),(the),(store),(shame),(since),(larger),(newer),(looking),(store),(that),(n't),(cleaner),('ll),(back),(hopes),(they),(clean),(little),(more)})
(5kRug3bEienrpovtPRVVwg,{(Went),(with),(husband),(Richardson),(Rokerij),(for),(the),(first),(time),(raved),(about),(this),(place),(went),(Wednesday),(night),(with),(reservation),(The),(wait),(was),(about),(hour),(Luckily),(there),(were),(bar),(seats),(that),(became),(available),(took),(them),(ordered),(the),(cheese),(flatbread),(appetizer),(and),(was),(delicious),(had),(large),(salad),(for),(dinner),(which),(was),(perfect),(was),(not),(very),(hungry),(husband),(had),(the),(chicken),(enchiladas),(that),(tasted),(and),(were),(very),(good),(The),(food),(cooked),(order),(did),(take),(while),(get),(our),(meal),(but),(was),(worth),(the),(wait),(and),(service),(was),(excellent),(While),(waiting),(chatted),(with),(several),(people),(the),(bar),(and),(one),(couple),(offered),(taste),(their),(appetizer),(returned),(the),(favor),(when),(flatbread),(came),(One),(more),(thing),(not),(leave),(without),(getting),(the),(decadent),(truffle),(dessert),(Heavenly),(but),(not),(over),(done),(any),(way),(All),(all),(great),(experience),(recommend),(reservations)})
reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, FLATTEN(varaha.text.SentenceTokenize(text)) AS tokenized_sentences;
bar = FOREACH bar GENERATE business_id, varaha.text.StanfordPOSTagger(tokenized_sentences) AS tagged;
DUMP bar;
(6VRbbNQe5ouWmwsMebUMkg,{(My,PRP$),(friend,NN),(added,VBD),(some,DT),(sugar,NN),(to,TO),(it,PRP),(and,CC),(it,PRP),(turned,VBD),(okay/good,NN),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Entrees,NNS),(average,VBP),(about,IN),($,$),(10,CD),(-,:),($,$),(13,CD),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Naan,NN),(ranges,NNS),(from,IN),(about,IN),($,$),(1.50,CD),(-,:),($,$),(3,CD),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Appetizers,NNS),(during,IN),(happy,JJ),(hour,NN),(range,NN),(from,IN),($,$),(3,CD),(-,:),($,$),(8,CD),(+,CC),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Add,VB),(in,IN),(alcohol,NN),(and,CC),(you,PRP),('re,VBP),(looking,VBG),(at,IN),(a,DT),(not,RB),(inexpensive,JJ),(meal,NN),(but,CC),(definitely,RB),(good,JJ),(quality,NN),(.,.)})
(6oRAC4uyJCsJl1X0WZpVSA,{(love,VB),(the,DT),(gyro,NN),(plate,NN),(.,.)})
(6oRAC4uyJCsJl1X0WZpVSA,{(Rice,NNP),(is,VBZ),(so,RB),(good,JJ),(and,CC),(I,PRP),(also,RB),(dig,VBP),(their,PRP$),(candy,NN),(selection,NN),(:,:),(-RRB-,-RRB-)})
reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, varaha.text.StanfordPOSTagger(varaha.text.StanfordTokenize(text)) AS tokens;
DUMP bar;
(-UnYs8XvV1M983xZoREdng,{(have,VB),(say,VB),(loved,NN),(Vino,NNP),(First,NNP),(off,RB),(very,RB),(unpretentious,JJ),(not,RB),(very,RB),(knowledgeable,JJ),(about,IN),(wine,NN),(tend,VBP),(shy,JJ),(away,RB),(from,IN),(places,NNS),(that,WDT),(have,VBP),(attitude,NN),(also,RB),(had,VBD),(one,CD),(the,DT),(1000,CD),(outstanding,JJ),(Groupons,NNS),(about,IN),(expire,VBP),(And,CC),(spite,NN),(the,DT),(fact,NN),(that,IN),(just,RB),(about,IN),(everyone,NN),(coming,VBG),(that,IN),(evening,NN),(had,VBD),(Groupon,NNP),(the,DT),(staff,NN),(was,VBD),(fantastic,JJ),(they,PRP),(not,RB),(have,VBP),(kitchen,NN),(all,DT),(appetizers,NNS),(are,VBP),(cold,JJ),(but,CC),(had,VBD),(nice,JJ),(cheese,NN),(plate,NN),(which,WDT),(included,VBD),(cheeses,NNS),(olives,NNS),(nuts,NNS),(grapes,NNS),(and,CC),(dried,VBD),(fruit,NN),(only,RB),(complaint,NN),(was,VBD),(that,IN),(the,DT),(lahvosh-like,JJ),(crackers,NNS),(were,VBD),(really,RB),(oily,JJ),(and,CC),(not,RB),(good,JJ),(all,DT),(Lose,VB),(those,DT),(and,CC),(would,MD),(have,VB),(been,VBN),(much,RB),(better,RBR),(for,IN),(the,DT),(wine,NN),(was,VBD),(actually,RB),(better,JJR),(than,IN),(expected,VBN),(Although,IN),(n't,RB),(generally,RB),(care,VB),(for,IN),(really,RB),(sweet,JJ),(wines,NNS),(both,CC),(the,DT),(Summer,NN),(Rain,NN),(and,CC),(Peachy,JJ),(Keen,JJ),(were,VBD),(really,RB),(enjoyable,JJ),(just,RB),(think,VB),(them,PRP),(more,RBR),(crisp,JJ),(summer,NN),(beverage,NN),(than,IN),(wine,NN),(was,VBD),(surprised,VBN),(like,IN),(the,DT),(Pinot,NNP),(Grigio,NNP),(much,RB),(did,VBD),(and,CC),(may,MD),(have,VB),(purchased,VBN),(bottle,NN),(but,CC),(was,VBD),(not,RB),(available,JJ),(that,IN),(evening,NN),(The,DT),(Miscela,NNP),(Italian,NNP),(blend,VB),(was,VBD),(miss,VB),(for,IN),(-LRB-,-LRB-),(too,RB),(acidic,JJ),(for,IN),(taste,NN),(-RRB-,-RRB-),(but,CC),(the,DT),(Malbec,NNP),(was,VBD),(better,JJR),(For,IN),(after,IN),(dinner,NN),(wines,NNS),(the,DT),(Grande,NNP),(Finale,NNP),(was,VBD),(over-the-top,JJ),(sweet,JJ),(would,MD),(probably,RB),(not,RB),(drink,VB),(more,JJR),(than,IN),(tasting,NN),(The,DT),(Porto,NNP),(Cocoa,NNP),(however,RB),(was,VBD),(fantastic,JJ),(generally,RB),(stay,VB),(away,RB),(from,IN),(Port,NNP),(because,IN),(dislike,NN),(the,DT),(brandy,NN),(burn,VBP),(But,CC),(one,CD),(whiff,NN),(this,DT),(and,CC),(was,VBD),(hooked,VBN),(before,IN),(tasted,VBN),(While,IN),(not,RB),(like,IN),(terribly,RB),(sweet,JJ),(you,PRP),(definitely,RB),(get,VBP),(the,DT),(essence,NN),(chocolate,NN),(bought,VBD),(bottle,NN),(take,VB),(home,NN),(fact,NN),(but,CC),(only,RB),(saw,VBD),(one,CD),(wee,NN),(little,JJ),(glass,NN),(husband,NN),(apparently,RB),(mistook,VBD),(for,IN),(Yoo-hoo,NN),(and,CC),(drank,VBD),(the,DT),(rest,NN),(Great,JJ),(place,NN),(begin,VB),(your,PRP$),(evening,NN),(And,CC),(because,IN),(many,JJ),(these,DT),(young,JJ),(wines,NNS),(are,VBP),(sweeter,JJR),(even,RB),(non-wine-drinking,JJ),(husband,NN),(enjoyed,VBN)})
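The scripts above call each UDF by its fully qualified name, so a script would need edits in several places if the Stanford-based UDFs move to their own namespace as requested in the review. A minimal sketch of one way to limit that, using Pig's DEFINE to alias each UDF once near the top of the script. It uses only the class names that appear in this PR; the alias names and the relation name `tagged` are illustrative choices, not something the PR defines.

```pig
register ../../lib/stanford-postagger-withModel.jar;
register ../../target/varaha-1.0-SNAPSHOT.jar;

-- Alias each UDF once. If the classes later move to a different package,
-- only these two DEFINE lines would need to change.
DEFINE StanfordTokenize  varaha.text.StanfordTokenize();
DEFINE StanfordPOSTagger varaha.text.StanfordPOSTagger();

reviews = LOAD 'data/ten.avro' USING AvroStorage();

-- Same pipeline as the last script above: tokenize, then tag each token.
tagged = FOREACH reviews GENERATE business_id, StanfordPOSTagger(StanfordTokenize(text)) AS tokens;
DUMP tagged;
```

The rest of the script then refers only to the short aliases, which also keeps long FOREACH lines easier to read.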