Conversation
marblestation left a comment
I missed the documentation/evaluation that justifies all the decision made in this PR (e.g., spacy small vs medium vs large language model, spacy vs other frameworks, why store data without its context?, is the parsing as good as we expect?, etc). If it already exists and I have forgotten, please, point me where I can find it.
We are also missing some functionalities such as dividing the full text into sentences or certain normalization (e.g. accents, UTF-8 characters). We are missing the spell checkers analysis. The unit test are less complete than what they could be.
Finally, the fact that we have some articles that are not in English, while we use an English language model, signals that it would be good to try to automatically detect the language to either use a different model or to skip the use of a language model.
We'll need to accomplish all this work to be more confident on the outcome from the full text pipeline. Let's talk about the parts that are not clear or that I may have misunderstood!
```python
def compress_file(path, temp_name, file_name):
    """
    Converts the temporary file to a compressed file.
    """
    try:
        shutil.copy(temp_name, file_name)
        shutil.make_archive(base_name=file_name, format='gztar', root_dir=path, base_dir=os.path.split(file_name))
    except Exception as err:
        logger.error('Unexpected error from shutil while compressing file: {0}'.format(err))

    try:
        os.remove(temp_name)
        os.remove(file_name)
    except Exception as err:
        logger.error(
            'Unexpected error from os removing a file: {0}'.format(err))

    logger.debug(
        'Succeeded to compress: {0} to {1}'.format(temp_name, file_name)
    )
```
Making a compressed tar for a single file seems like overkill when we can just use a compressed file (no need for a tar), for which Python has good support too: https://docs.python.org/3/library/gzip.html
I can replace shutil.make_archive with:

```python
with open(file_name, 'rb') as f_in:
    with gzip.open(file_name + '.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
```
Note that I am seeing a difference in the file size. I am looking into this for our meeting.
Update: the file size looks good, e.g. a 56K .gz file for a fulltext.txt file of size 56K.
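For reference, a self-contained round trip of the gzip approach discussed above (the temp paths and sample content are illustrative, not taken from the pipeline):

```python
import gzip
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'fulltext.txt')
with open(src, 'w') as f:
    f.write('word ' * 10000)  # repetitive text compresses very well

# same pattern as the snippet above: plain file in, gzip file out
with open(src, 'rb') as f_in:
    with gzip.open(src + '.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

original = os.path.getsize(src)
compressed = os.path.getsize(src + '.gz')
print(original, compressed)
```

For highly repetitive input the .gz file is dramatically smaller; for already-dense text (as in the 56K example above) the savings can be minimal.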
```python
with patch('adsft.writer.write_file', return_value=None) as task_write_text:
    with patch('adsft.checker.load_meta_file', return_value=None) as load_meta:
        with patch('adsft.reader.read_content', return_value=msg) as read_content:
```
This is starting to become too much indentation; the with statements can be merged into one: https://stackoverflow.com/a/1073814
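A minimal runnable illustration of the merged form (stdlib targets are used here purely to show the syntax; in the test above the targets would be the adsft.* functions, and on Python 2 the standalone mock package provides the same API as unittest.mock):

```python
import os.path
from unittest.mock import patch

# the three nested with blocks collapse into a single statement
with patch('os.path.isfile', return_value=True) as isfile_mock, \
     patch('os.path.isdir', return_value=False) as isdir_mock:
    result_file = os.path.isfile('/no/such/file')  # mocked: returns True
    result_dir = os.path.isdir('/no/such/dir')     # mocked: returns False

print(isfile_mock.called, isdir_mock.called)  # True True
```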
```python
self.assertTrue(load_meta.called)
self.assertTrue(read_content.called)
self.assertTrue(task_write_text.called)
actual = task_write_text.call_args[0]
self.assertEqual(os.path.dirname(msg['meta_path'])+'/nlp.json', actual[0])
```
No assert is made here on the actual result of the NLP task, such as whether we got the right PoS, entities, etc. We could download the small English language model for this; contrary to the situation we had with the facilities NER, where we had our own model, the language model we are using here is public and could easily be downloaded for the unit tests.
The problem here is that I'm mocking spaCy for the facilities NER, so I can't mock one model and test the results of another. I can instead add a separate file for testing spaCy's English model.
After our meeting yesterday, I implemented your recommendation to check whether the directory exists before calling spacy.load(). I realized I had tried this before but ran into the same problem: 'en_core_web_sm' gets installed in different locations. This time I was able to resolve it using pkgutil. The unit tests will not pass now unless 'en_core_web_sm' is downloaded, because the model is used as a callable, so that is one problem for someone developing locally and trying to run the unit tests. In light of this, I think it would be a good idea to remove the installation of the model in eb-deploy via python -m spacy download en_core_web_sm and move it to the requirements.txt file in this PR by adding https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz. I found that this is also recommended in spaCy's documentation here.
```python
doc = en_model(r['fulltext'])
out = doc.to_json()
out['ents'] = [{'text': ent.text, 'label': ent.label_} for ent in doc.ents]
out['noun-phrases'] = [chunk.text for chunk in doc.noun_chunks]
out['tokens'] = []
for tok in doc:
    if not tok.is_punct and tok.is_alpha and not tok.is_stop and len(tok.text) > 2:
        out['tokens'].append({'token': tok.lower_,
                              'pos': tok.pos_,
                              'tag': tok.tag_,
                              'dep': tok.dep_,
                              'lemma': tok.lemma_})
```
Several comments here:
- This function is extracting named entities and noun phrases without their context; what is the argument for doing it this way? Isn't this a limitation if we want to implement features in the future where we need to know when a certain named entity/noun phrase was used? This requires further justification.
- The full text is not divided into sentences, which could be useful for future features.
- Tokens are stored individually, so we won't be able to easily connect them back to the original sentence and original text.
- The lemma is stored, but we wanted the stem too.
- Tokens are stored with some normalization (e.g., lower case, no numbers, no stop words, no punctuation), but we have not done anything about accents and non-alphanumeric characters (e.g., UTF-8 has different ways of representing what appear to be the same characters).
- What is the impact of discarding tokens of 2 characters or fewer? Common things such as "AB stars" would be affected.
- This is not using any spell checker, and an evaluation of that possibility was in the initial requirements.

A new requirement that fits very naturally into this work would be the ability to detect the language of the article. That would justify using en_core_web_* instead of another language model (e.g., Spanish), and, as a side effect, we could eventually allow our users to search for papers in a specific language.
Finally, where can I find results of this kind of analysis showing that it performs as one would expect? Have you tried it on a sample and documented the results somewhere that we can check?
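One library-free way to address the context concern above is to store character offsets alongside each token, so any token can always be mapped back to the original text (spaCy's doc.to_json() records start/end offsets in a similar way); a minimal sketch with made-up sample text:

```python
import re

text = "AB stars are peculiar. They show strong magnetic fields."

# record each token together with its character span in the source text
tokens = [{'token': m.group(0).lower(), 'start': m.start(), 'end': m.end()}
          for m in re.finditer(r"[A-Za-z]+", text)]

# the span always recovers the original surface form, context included
print(tokens[0])                                   # {'token': 'ab', 'start': 0, 'end': 2}
print(text[tokens[0]['start']:tokens[0]['end']])   # AB
```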
- Yes, we discussed in a previous meeting how the text should be stored: the options were to store the locations that map back into the fulltext file, or the actual entities. You said I could decide, and I chose to store the actual text, knowing there is a chance the fulltext and the nlp json file can get out of sync given the added config variables (RUN_NLP_AFTER_EXTRACTION = False). Not having the context is a weakness of this output style.
- It does have sentence tokenization, but in this case only the locations of the sentences are stored.
- See the first bullet point.
- Normally in natural language processing you would choose either the lemma or the stem; this is something I was told in my NLP course. Stemming produces more errors, so I chose lemmatization.
- You're right, I missed removing accents. The condition for tok.is_alpha should be removing all non-alphanumeric characters as instructed, but we can talk more about modifications to the preprocessing steps in today's meeting.
- "AB stars" would not be considered a token; "AB" and "stars" would be the tokens produced, and "AB" would be removed/ignored due to that condition. Keep in mind the noun-phrases do not have this same condition applied.
- In our meeting, we discussed having the spell checker be a separate investigation that could then be applied if it proves to be useful. You said it did not have to be part of this PR, but it is part of the project as a whole: "Only evaluate arXiv, if no time to incorporate into pipeline we can keep out of scope."
- As discussed in person, let's consider saving the spaCy object to keep the context.
- Understood, it is in the json that comes from the spaCy object.
- Ok.
- I understand lemmatization is generally preferred; my thought is that stemming is cheaper, so if we can store both we will be able to validate in the future whether stemming would be enough for our purposes or we really do need lemmas.
- Let's improve the normalization then, homogenizing accents and special characters.
- My whole point is that we miss "AB" as a token, so I am looking for the justification of that condition (i.e., in what scenarios it is more beneficial to have it than not to have it).
- It is not a priority to incorporate it into the pipeline (we agreed that's out of scope), but the analysis needs to be done, with a conclusion recommending whether or not to include it in the future.

Update: do not forget about the language detection comment also.
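Homogenizing accents and the multiple UTF-8 representations of the same character can be sketched with the stdlib unicodedata module (NFKD decomposition, then dropping the combining marks); this is an illustration of the idea, not the pipeline's implementation:

```python
import unicodedata

def strip_accents(text):
    # decompose each character (NFKD), then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# composed and decomposed spellings of the same word normalize identically
print(strip_accents('naïve café'))   # naive cafe
print(strip_accents('cafe\u0301'))   # cafe ('e' + combining acute accent)
```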
Regarding storing the spaCy object (I am assuming throughout that we will stick to spaCy; that is up to your analysis), you mentioned by private communication that it would need 4.5MB instead of 59KB. To fit within our current space capacity, you could reduce the first approximation of the vocabulary construction to articles that match this query: ((bibstem:(ApJ) collection:astronomy) AND year:2015-2020). This represents 16,136 entries, which translates to about 66GB of stored data.
I'll add stemming to the preprocessing script that now runs outside of the pipeline. All tokens are now decoded to take care of accents/special characters. I am planning to complete the spellcheck analysis by the deadline; does it need to be completed before this PR is closed?
I made a colab here that I was going to use to show the differences in preprocessing. The reason I removed tokens of 2 characters or fewer is that it cleaned up tokens I didn't want in my output. I was taking a more aggressive approach, knowing the corpus vocabulary/dictionary needed to be kept at a reasonable size, and that is the only specific future task I know of to base these preprocessing decisions on. Preprocessing steps are supposed to change according to the task, so from my perspective this should be more of a discussion, because you may have future tasks in mind that I don't. For example, I didn't remove the short tokens when analyzing the XML parsers because I wanted to see the differences between the parsers, even if minimal.
I already wrote some code to try out the language detection, but the issue I mentioned with celery also occurs with all of the non-English language models. I see that you have commented on some possible solutions, but as of right now I am blocked from implementing this additional feature. I was planning to create a dictionary of the models and use the output of langdetect as the key; see all the non-English models below:
downloading the models:
https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz
https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.2.5/fr_core_news_sm-2.2.5.tar.gz
https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.2.5/es_core_news_sm-2.2.5.tar.gz
https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-2.2.5/pt_core_news_sm-2.2.5.tar.gz
https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-2.2.5/it_core_news_sm-2.2.5.tar.gz
https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.2.5/nl_core_news_sm-2.2.5.tar.gz
https://github.com/explosion/spacy-models/releases/download/el_core_news_sm-2.2.5/el_core_news_sm-2.2.5.tar.gz
https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-2.2.5/nb_core_news_sm-2.2.5.tar.gz
https://github.com/explosion/spacy-models/releases/download/lt_core_news_sm-2.2.5/lt_core_news_sm-2.2.5.tar.gz
config:

```python
DE_MODEL = 'de_core_news_sm'
FR_MODEL = 'fr_core_news_sm'
ES_MODEL = 'es_core_news_sm'
PT_MODEL = 'pt_core_news_sm'
IT_MODEL = 'it_core_news_sm'
NL_MODEL = 'nl_core_news_sm'
EL_MODEL = 'el_core_news_sm'
NB_MODEL = 'nb_core_news_sm'
LT_MODEL = 'lt_core_news_sm'
```
loading the models:

```python
de_model = spacy.load(app.conf['DE_MODEL'])
fr_model = spacy.load(app.conf['FR_MODEL'])
es_model = spacy.load(app.conf['ES_MODEL'])
pt_model = spacy.load(app.conf['PT_MODEL'])
it_model = spacy.load(app.conf['IT_MODEL'])
nl_model = spacy.load(app.conf['NL_MODEL'])
el_model = spacy.load(app.conf['EL_MODEL'])
nb_model = spacy.load(app.conf['NB_MODEL'])
lt_model = spacy.load(app.conf['LT_MODEL'])
```
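The dictionary dispatch described above could look like the sketch below; pick_model is a hypothetical helper, and in the pipeline the dict values would be the loaded spaCy pipelines while the key would come from langdetect.detect on the fulltext:

```python
def pick_model(lang_code, models, default):
    """Return the pipeline for a detected language code, falling back to English."""
    return models.get(lang_code, default)

# plain strings stand in for the loaded spaCy pipelines in this sketch
models = {'de': 'de_model', 'fr': 'fr_model', 'es': 'es_model'}
print(pick_model('fr', models, 'en_model'))  # fr_model
print(pick_model('ja', models, 'en_model'))  # en_model (no Japanese model, fall back)
```

The fallback also covers langdetect failures on very short or noisy text, which simply route the article to the English model.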
```python
model2 = spacy.load(app.conf['NER_FACILITY_MODEL_FT'])

logger.debug("Loading spacy model for implementing nlp techniques...")
en_model = spacy.load(app.conf['NLP_MODEL'])
```
This is using spaCy. Other than re-using what was developed for the facilities named entity recognition, why is it better than NLTK? Have you evaluated other options (e.g., stanza)? Have you documented this decision somewhere so that we can go through it?
The instructions given for this project were to "Use spaCy (preferred) and/or nltk", and NLTK shows lower performance for the requested entities, PoS, etc., so I went with spaCy. I'll add a link showing that when I get a chance.
I had passively looked at one other option (Spark-NLP), but from the instructions and our meeting, I didn't think I was supposed to spend time looking for new frameworks. Also, stanza requires Python 3.6 or later, so it is not an option until we upgrade from Python 2.
We also have transversal instructions, such as documenting the analysis that justifies the changes and decisions made in any of the tasks. This PR relies on spaCy, which was the preferred option at the moment of planning this project, but you have shown in the NER project that its tokenizer is worse than others, and that makes me question whether this is a good choice. I am looking for the analysis you have made to justify this choice (i.e., this package and the algorithms it uses).
Regarding the Python 3.6 requirement, we can call external Python 3 code from Python 2 workers as a temporary workaround until we have migrated all the code to Python 3. That is not a show stopper if we see that, for instance, stanza is superior.
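The temporary workaround could be a subprocess bridge: the Python 2 worker serializes its payload to JSON and shells out to a Python 3 interpreter, reading JSON back on stdout. A minimal sketch, assuming python3 is on the PATH; the inline -c script stands in for a real stanza runner, and it is shown with Python 3's subprocess API for brevity (on the Python 2 side one would use Popen.communicate instead of the input keyword):

```python
import json
import subprocess

payload = json.dumps({'fulltext': 'sample text'})
out = subprocess.check_output(
    ['python3', '-c',
     'import sys, json; '
     'd = json.load(sys.stdin); '
     'print(json.dumps({"n_chars": len(d["fulltext"])}))'],
    input=payload.encode())
print(json.loads(out))  # {'n_chars': 11}
```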
```python
NER_FACILITY_MODEL_FT = '/app/ner_models/ner_facility_ft/ner_model_facility/'
RUN_NER_FACILITIES_AFTER_EXTRACTION = False

NLP_MODEL = 'en_core_web_sm'
```
Where have you documented the evaluation of en_core_web_sm vs en_core_web_md or en_core_web_lg? They are quite different in size, and I expect this to have some impact; we need to understand what we are losing or gaining by using the smallest one. See: https://spacy.io/models/en
I did not compare the different models because en_core_web_sm is the only one of the three that doesn't kill the celery workers. It also shows high performance measures.
Update: I put a video of the celery issue with the medium and large models in the team drive. I confirmed that this occurs with v2.2.0 and v2.2.5 of the models. Updating to the newest release of spaCy (v2.2.4, 3/12/20) also does not fix the issue. It looks like the CPU is getting maxed out.
It does not seem to be a CPU problem. I opened an ipython terminal on that server and loaded en_core_web_md; it worked and used about 1GB of RAM. The machine has 62GB, but most of it seems to be used by another pipeline. Your workers get killed because all of them load the model, although only the NLP ones need it, so you could look for a workaround to make only the NLP worker load the model and reduce those workers to just one. We could also move this or other pipelines to machines with more free RAM.
Independently of this deployment difficulty, we need to be reassured that using this language model instead of the medium or large version is justified. That is why I am looking for the analysis and comparison, which does not have to be done within a production worker.
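A lazy-loading wrapper is one way to make only the NLP worker pay the RAM cost; a minimal sketch with an injected loader (in the pipeline the loader would be something like lambda: spacy.load(app.conf['NLP_MODEL']), so workers that never run the NLP task never load the model):

```python
def lazy(loader):
    """Defer a heavy load until first use, then cache the result."""
    cache = {}
    def get():
        if 'model' not in cache:
            cache['model'] = loader()
        return cache['model']
    return get

# a counting stub stands in for spacy.load in this sketch
calls = []
get_model = lazy(lambda: calls.append(1) or 'model-object')

print(calls)                      # [] -- nothing loaded at import time
print(get_model(), get_model())   # model-object model-object
print(calls)                      # [1] -- the loader ran exactly once
```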