NLP pipeline design

Ideally, an NLP pipeline in Rust could look something like this:

```rust
// Desired API, not working code: structs are passed to `map` as if callable.
let preprocessor = DefaultPreprocessor::new();
let tokenizer = RegexpTokenizer::new(r"\b\w\w+\b");
let stemmer = SnowballStemmer::new("en");
let analyzer = NgramAnalyzer { range: (1, 1) }; // n-gram range (min, max)

let pipe = collection
    .map(preprocessor)
    .map(tokenizer)
    .map(|tokens| tokens.map(stemmer))
    .map(analyzer);
```
where `collection` is an iterator over documents.


There are several challenges with this, though:
 - [ ] It is better to avoid allocating a new string for each token at every pre-processing step and to use slices of the original document instead; performance depends strongly on this. The current implementation of e.g. `RegexpTokenizer` takes a reference to the document and returns an iterator of `&str` with the same lifetime as the input document, but the borrow checker does not accept it when it is used in the pipeline. This may be related to the use of closures (see the next point).
 - [ ] Because structs are not callable, `collection.map(tokenizer)` doesn't work, nor does `collection.map(tokenizer.tokenize)` (i.e. passing a method), since Rust does not allow a method to be taken as a value through an instance. We can use `collection.map(|document| tokenizer.tokenize(&document))`, but then the lifetime relationship between input and output is not handled properly (as described in the previous point).


More investigation would be necessary, and both points are likely related.
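
One possible explanation, offered as a sketch rather than a confirmed diagnosis: if `collection` yields owned `String`s, then in `|document| tokenizer.tokenize(&document)` the tokens borrow from a value that is dropped when the closure returns, which the borrow checker has to reject. Iterating over borrowed documents avoids this, because each token slice can then live as long as the document collection itself. The example below uses only the standard library; `WhitespaceTokenizer`, `SuffixStemmer` and `run_pipeline` are hypothetical stand-ins for illustration, not the crate's actual types.

```rust
use std::str::SplitWhitespace;

/// Hypothetical stand-in for `RegexpTokenizer`: splits on whitespace.
struct WhitespaceTokenizer;

impl WhitespaceTokenizer {
    /// The returned tokens borrow from `doc` (lifetime `'a`), not from `self`.
    fn tokenize<'a>(&self, doc: &'a str) -> SplitWhitespace<'a> {
        doc.split_whitespace()
    }
}

/// Hypothetical stand-in for a stemmer: strips a single trailing 's'.
struct SuffixStemmer;

impl SuffixStemmer {
    /// Returns a sub-slice of `token`, so no new string is allocated.
    fn stem<'a>(&self, token: &'a str) -> &'a str {
        token.strip_suffix('s').unwrap_or(token)
    }
}

/// Runs tokenization and stemming over borrowed documents.
/// Every `&str` in the result points into `docs`.
fn run_pipeline(docs: &[String]) -> Vec<Vec<&str>> {
    let tokenizer = WhitespaceTokenizer;
    let stemmer = SuffixStemmer;

    let result = docs
        .iter() // yields `&String`, so documents are borrowed, not moved
        .map(|doc| tokenizer.tokenize(doc))
        .map(|tokens| tokens.map(|t| stemmer.stem(t)).collect())
        .collect();
    result
}
```

Calling `run_pipeline` on a `Vec<String>` holding `"cats chase dogs"` returns `[["cat", "chase", "dog"]]` without copying any token. The structs are still wrapped in closures, so this addresses only the lifetime point, not the wish to pass a struct directly to `map`.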

