Processing all data, not just a subset, in step 2 (full-texts --> JSON data structure) #27

@npscience

I'm running through the pipeline to see whether it can all be run locally (see issue #23), and I think there is a problem with step 2 (cc @npch), as follows:

When I use the EuPMCCodeReferences notebook to process the full texts from the getpapers text-mining step and extract URLs into a JSON data structure (with paper DOIs, etc.), the notebook fails at cell In [5] with a KeyError:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-d81bdb32fc4e> in <module>()
      1 # Process the papers and extract all the references to GitHub and Zenodo urls
----> 2 papers_info = process_eupmc.process_papers(paper_ids, data_dir)

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_papers(list_of_pmcids, data_dir)
     97
     98     for pmcid in list_of_pmcids:
---> 99         papers.append(process_paper(pmcid, data_dir))
    100
    101     return papers

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_paper(pmcid, data_dir)
     66             paper_json = json.load(f)
     67             # Get the DOI
---> 68             doi = get_doi(paper_json)
     69             pub_date = get_pub_date(paper_json)
     70     except IOError:

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in get_doi(paper_json)
     29
     30 def get_doi(paper_json):
---> 31     paper_doi = paper_json['doi'][0]
     32     return paper_doi
     33

KeyError: 'doi'

I think this happens because the process_eupmc.py script cannot complete when it meets a folder without XML or JSON, or a JSON file without a DOI. I have not fully tested the second case; it is what I can glean from simple checks. The notebook works when I run it on a small subset of the data with the problematic directories removed.
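To check which directories are affected before running the notebook, something like the sketch below could help. It is only a rough diagnostic, not part of the pipeline: `find_problem_dirs` is a hypothetical helper, and the file name `eupmc_result.json` is an assumption about how getpapers lays out its per-paper output, so adjust it if the local data differs.

```python
import json
import os
import tempfile


def find_problem_dirs(data_dir, json_name='eupmc_result.json'):
    """Map each problematic paper directory to a short reason.

    Hypothetical helper; json_name is an assumed getpapers file name.
    """
    problems = {}
    for entry in sorted(os.listdir(data_dir)):
        paper_dir = os.path.join(data_dir, entry)
        if not os.path.isdir(paper_dir):
            continue  # ignore stray files next to the paper folders
        json_path = os.path.join(paper_dir, json_name)
        if not os.path.isfile(json_path):
            problems[entry] = 'missing JSON'
            continue
        with open(json_path) as f:
            paper_json = json.load(f)
        # A missing key and an empty list are both treated as "no DOI"
        if not paper_json.get('doi'):
            problems[entry] = 'missing DOI'
    return problems
```

Calling `find_problem_dirs(data_dir)` with the same `data_dir` the notebook uses should list the folders that would need to be skipped or removed.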

I propose amending the process_eupmc.py script to handle cases where the getpapers result does not contain the expected information, so that instead of stopping at these points the script continues and ignores that data.

I don't yet know how to do this. I'll try to give it a crack, but feel free to jump in if anyone feels up to it.

  • set a default?
  • try except?
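A minimal sketch of both options, assuming the function shapes shown in the traceback above. `get_doi` returns a default instead of raising, and `process_papers` wraps each paper in a try/except so one bad record does not stop the run. The extra `process_paper` argument, the `skipped` list, and the demo records are illustrative assumptions, not the current code in process_eupmc.py.

```python
def get_doi(paper_json, default=None):
    """Option 1: return a default when the 'doi' field is missing or empty."""
    doi_values = paper_json.get('doi')
    if not doi_values:
        return default
    return doi_values[0]


def process_papers(list_of_pmcids, data_dir, process_paper):
    """Option 2: skip papers that raise, and remember which ones were skipped.

    process_paper is passed in here only to keep the sketch self-contained.
    """
    papers = []
    skipped = []
    for pmcid in list_of_pmcids:
        try:
            papers.append(process_paper(pmcid, data_dir))
        except (KeyError, IndexError, IOError) as err:
            # Record the failure instead of stopping the whole run
            skipped.append((pmcid, type(err).__name__))
    return papers, skipped


# Demo with a stand-in for process_paper and made-up records
def _demo_process_paper(pmcid, data_dir):
    records = {'PMC1': {'doi': ['10.1000/example']}, 'PMC2': {}}
    return {'pmcid': pmcid, 'doi': records[pmcid]['doi'][0]}


papers, skipped = process_papers(['PMC1', 'PMC2'], 'unused', _demo_process_paper)
print(papers)
print(skipped)
```

The two options behave differently: the default keeps papers without a DOI in the output (with `None` as the DOI), while the try/except drops them entirely. Returning the skipped IDs makes it possible to report how much data was ignored.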
