Processing all data, not just a subset, in step 2 (full-texts --> JSON data structure) #27

I'm running through the pipeline to see if it is all possible locally (see issue #23) and I think there is a problem with step 2 (cc @npch), as follows:

When using the [EuPMCCodeReferences notebook](https://github.com/softwaresaved/code-cite/blob/master/notebooks/EuPMCCodeReferences.ipynb) to process the full-texts from getpapers text mining and extract URLs into a JSON data structure (with paper DOIs, etc.), the notebook stops at ln[5] with the following error:

```
KeyError                                  Traceback (most recent call last)
<ipython-input-21-d81bdb32fc4e> in <module>()
      1 # Process the papers and extract all the references to GitHub and Zenodo urls
----> 2 papers_info = process_eupmc.process_papers(paper_ids, data_dir)

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_papers(list_of_pmcids, data_dir)
     97
     98     for pmcid in list_of_pmcids:
---> 99         papers.append(process_paper(pmcid, data_dir))
    100
    101     return papers

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_paper(pmcid, data_dir)
     66             paper_json = json.load(f)
     67             # Get the DOI
---> 68             doi = get_doi(paper_json)
     69             pub_date = get_pub_date(paper_json)
     70     except IOError:

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in get_doi(paper_json)
     29
     30 def get_doi(paper_json):
---> 31     paper_doi = paper_json['doi'][0]
     32     return paper_doi
     33

KeyError: 'doi'
```

I think this happens because the process_eupmc.py script cannot complete when a folder contains no XML or JSON file, or when a JSON file has no DOI. I have not fully tested the second case; it is what I can glean from simple checks. The notebook works when I run it on a small subset of the data and remove the problematic directories.

I propose amending the process_eupmc.py script so that it handles cases where the getpapers result does not contain the expected info. Instead of stopping at these points, the script would continue and ignore that data.
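A minimal sketch of that skip-and-continue behaviour, assuming the failure always surfaces as a `KeyError` like the one in the traceback. The `process_paper` below is a stand-in stub that reads from an in-memory dict rather than from `data_dir`, the `skipped` list is a hypothetical addition for reporting, and the DOIs are made up; none of this is the real `process_eupmc.py` code.

```python
def get_doi(paper_json):
    # Same lookup as in the traceback: raises KeyError if 'doi' is absent.
    return paper_json['doi'][0]


def process_paper(pmcid, records):
    # Stand-in for the real function, which loads JSON from data_dir.
    paper_json = records[pmcid]
    return {'pmcid': pmcid, 'doi': get_doi(paper_json)}


def process_papers(list_of_pmcids, records):
    papers = []
    skipped = []  # hypothetical: remember what was ignored and why
    for pmcid in list_of_pmcids:
        try:
            papers.append(process_paper(pmcid, records))
        except KeyError as err:
            # Record the problem and move on to the next paper.
            skipped.append((pmcid, 'missing key {}'.format(err)))
    return papers, skipped


# Made-up records: the second one has no DOI.
records = {
    'PMC1': {'doi': ['10.1000/example.1']},
    'PMC2': {},
    'PMC3': {'doi': ['10.1000/example.3']},
}
papers, skipped = process_papers(['PMC1', 'PMC2', 'PMC3'], records)
print(len(papers), 'processed;', len(skipped), 'skipped')
```

Keeping a list of skipped IDs means the ignored papers can still be inspected afterwards, rather than disappearing silently.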

I don't yet know how to do this. I'll try to give it a crack, but feel free to jump in if anyone feels up to it.

- [ ] set a default? 
- [ ] try except?
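
The two options could look roughly like this for `get_doi`. This is only a sketch: the function names `get_doi_default` and `get_doi_try` are hypothetical, and it assumes the rest of the script can cope with a DOI of `None`, which I have not checked.

```python
def get_doi_default(paper_json, default=None):
    # Option 1: set a default. .get() returns an empty list when
    # 'doi' is absent, so we fall back to the default value.
    dois = paper_json.get('doi', [])
    return dois[0] if dois else default


def get_doi_try(paper_json):
    # Option 2: try/except. Keep the original lookup and catch the
    # cases where the key is missing or the list is empty.
    try:
        return paper_json['doi'][0]
    except (KeyError, IndexError):
        return None


print(get_doi_default({}))
print(get_doi_try({'doi': ['10.1000/example.1']}))  # made-up DOI
```

Either version lets `process_paper` finish for a paper without a DOI; the caller then decides whether to keep or drop records whose DOI is `None`.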
