Ideally, an NLP pipeline in Rust could look something like,
preprocessor = DefaultPreprocessor::new()
tokenizer = RegexpTokenizer::new(r"\b\w\w+\b")
stemmer = SnowballStemmer::new("en")
analyzer = NgramAnalyzer(range=(1, 1))
pipe = collection
    .map(preprocessor)
    .map(tokenizer)
    .map(|tokens| tokens.map(stemmer))
    .map(analyzer)
where collection is an iterator over documents.
There are several challenges with it though,
- `RegexpTokenizer` takes a reference to the document and returns an `Iterable` of `&str` with the same lifetime as the input document, but the borrow checker doesn't appear to be happy when it is used in the pipeline. This may be related to using closures (cf. next point) though.
- `collection.map(tokenizer)` doesn't work, nor does `collection.map(tokenizer.tokenize)` (i.e. using a method) for some reason. We can use `collection.map(|document| tokenizer.tokenize(&document))`, but then the lifetime is not properly handled between input and output (described in the previous point).

More investigation would be necessary, and both points are likely related.
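As a starting point for that investigation, here is a minimal sketch of one arrangement that does satisfy the borrow checker. It is not the crate's actual API: `WhitespaceTokenizer`, `stem` and `run_pipeline` are hypothetical stand-ins that use only the standard library. The assumption is that the collection yields references into storage that outlives the pipeline, so the `&str` tokens have something to borrow from. On the second point, a plain struct does not implement the `Fn*` traits on stable Rust, and `tokenizer.tokenize` without a call is not a valid expression, which would explain why only the closure form is accepted.

```rust
use std::str::SplitWhitespace;

/// Hypothetical stand-in for `RegexpTokenizer`; it only splits on whitespace.
struct WhitespaceTokenizer;

impl WhitespaceTokenizer {
    /// The returned tokens borrow from `text` (lifetime `'a`), not from `self`.
    fn tokenize<'a>(&self, text: &'a str) -> SplitWhitespace<'a> {
        text.split_whitespace()
    }
}

/// Hypothetical stand-in for a stemmer: it simply lowercases the token.
fn stem(token: &str) -> String {
    token.to_lowercase()
}

/// Runs tokenization and "stemming" over documents owned by the caller.
fn run_pipeline(documents: &[String]) -> Vec<Vec<String>> {
    let tokenizer = WhitespaceTokenizer;

    // This variant is rejected by the compiler: each owned `String` is dropped
    // at the end of the closure while the returned tokens still borrow from it.
    // documents.to_vec().into_iter().map(|document| tokenizer.tokenize(&document));

    documents
        .iter() // yields `&String`, borrowed from `documents`
        .map(|document| tokenizer.tokenize(document)) // tokens borrow from `documents`
        .map(|tokens| tokens.map(stem).collect::<Vec<String>>())
        .collect()
}

fn main() {
    let documents = vec![
        String::from("The quick brown Fox"),
        String::from("Jumps over"),
    ];
    let stemmed = run_pipeline(&documents);
    println!("{:?}", stemmed);
}
```

The design choice here is that the documents are owned outside the pipeline and only borrowed inside it. A collection that yields owned `String`s would presumably need either tokens that own their data or a step that keeps each document alive alongside its tokens.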