This is similar to cryptodb, but it covers conferences and journals in cryptology and security more broadly, including venues outside IACR such as USENIX Security. Ultimately we would like to find (name, orcid, affiliation) for all authors of security and cryptology conferences, along with their publications.
All of the data that is collected can have problems with it, including:

- Names have collisions, and some people use multiple names on their publications (e.g., changes through marriage or variant spellings). DBLP appends a numeric suffix like 0001 when there is a collision, so we keep that key. DBLP also uses other keys, and the data appears to use different keys in different places; the documentation for the DBLP XML schema is somewhat vague about this.
- A person will likely have multiple affiliations during their publishing career, and these should show up on the publication rather than on the author record. Schloss Dagstuhl has a project to build a better model for affiliations. Ideally the author record should carry the current affiliation, and individual papers should carry the author affiliations associated with them. Affiliations should also be identified by ROR IDs, in much the same way that authors are identified by ORCID.
- The DBLP XML data set has ORCID in at least two different places, perhaps reflecting the fact that DBLP picks it up from multiple sources.
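To make the name-collision convention concrete, here is a minimal sketch of splitting a DBLP display name into the human-readable part and the 0001-style disambiguation suffix. The helper name is hypothetical (nothing like it is claimed to exist in this repository); it only assumes DBLP's documented convention of a trailing four-digit suffix.

```python
import re

def split_dblp_name(name):
    """Split a DBLP author name into (display name, numeric suffix).

    DBLP disambiguates colliding names by appending a four-digit
    suffix, e.g. "Wei Wang 0001"; non-colliding names have no suffix.
    """
    m = re.match(r"^(.*?)\s+(\d{4})$", name)
    if m:
        return m.group(1), m.group(2)
    return name, None
```

Keeping the suffix as part of the key (rather than discarding it) is what lets two different people named "Wei Wang" remain distinct records.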
It isn't clear where to draw the line on which conferences and journals to include; there are a LOT of third-tier conferences and journals. See the code in `sax_parser.py`, where others are listed in comments. See also Google Scholar.
This repository includes a Python parser for DBLP. Quite a few DBLP parsers are available, but this one uses minimal RAM by parsing the file incrementally with the pulldom package.
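The pattern pulldom enables can be sketched as follows. This is not the repository's actual parser, just a minimal illustration of incremental parsing: only the current record is expanded into a DOM node, so memory use stays flat no matter how large the input is. The sample XML and tag names here are stand-ins (the real dblp.xml has several record types, not just `article`).

```python
from xml.dom.pulldom import parseString, START_ELEMENT

# Tiny stand-in for dblp.xml; the real file is gigabytes, which is
# why streaming one record at a time matters.
SAMPLE = """<dblp>
  <article key="journals/joc/Example01">
    <author>Alice Example</author>
    <title>A Sample Paper</title>
  </article>
</dblp>"""

def stream_records(xml_text):
    """Yield one expanded DOM node per top-level publication record.

    expandNode() materializes only the subtree of the current record;
    everything else is processed as a stream of events.
    """
    doc = parseString(xml_text)
    for event, node in doc:
        if event == START_ELEMENT and node.tagName == "article":
            doc.expandNode(node)
            yield node

for rec in stream_records(SAMPLE):
    authors = [a.firstChild.data for a in rec.getElementsByTagName("author")]
    print(rec.getAttribute("key"), authors)
```

For the real file one would use `xml.dom.pulldom.parse()` on a file object and match all of DBLP's record types (`article`, `inproceedings`, `www`, etc.).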
The code consists of several parts:

- `sax_parser.py`, which downloads and parses the DBLP data to produce a JSON file of articles.
- `create.sql`, which creates the database. Note that we store ORCIDs but we do not cross-reference cryptodb.
- `insertdb.py`, which inserts the data into the database.
The algorithm for extracting data requires several passes because the dblp.xml file has an inconvenient schema. In the first pass we extract all publications that belong to the venues we want and write them to a file called articles.json. These contain some ORCIDs but no affiliations. In the same pass we also parse all of the 'www' records, which carry person information: they may contain an ORCID and an affiliation, so we extract those. Unfortunately we don't yet know which authors we need, so we save every author that has either an ORCID or an affiliation. Those are written to a file called authors.json.
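The first-pass filter on 'www' records can be sketched as a single predicate. The field names here are assumptions about the intermediate JSON layout, not the repository's actual schema:

```python
def keep_author(record):
    """First-pass filter for person records parsed from 'www' entries.

    At this point we don't yet know which authors appear in the selected
    venues, so we keep any author that carries an ORCID or an affiliation
    and drop everyone else; the survivors go into authors.json.
    """
    return bool(record.get("orcid") or record.get("affiliation"))
```

This over-collects (most saved authors will never be matched against a selected venue), but it avoids a whole extra pass over the multi-gigabyte XML.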
In the second pass we read articles.json and build a lookup table of all the authors it references. At the same time we replace each article's authors field with just an array of author keys, and write the result out as articles3.json. We then stream through authors.json and add affiliation and ORCID information to the lookup table. When that finishes, we write the list of authors out to a file called authors3.json.
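The second pass described above can be sketched roughly as follows. The record shapes (`key`, `name`, `orcid`, `affiliation` fields) are assumptions for illustration; the real code streams the JSON files rather than holding lists in memory:

```python
def second_pass(articles, author_records):
    """Sketch of the second pass.

    Build a lookup table keyed by author key from the articles we kept,
    rewrite each article's authors field to an array of keys (the
    articles3.json shape), then merge ORCID/affiliation data from the
    streamed author records into the lookup table (the authors3.json
    shape). Author records for people we never saw are ignored.
    """
    lookup = {}
    for art in articles:
        keys = []
        for author in art["authors"]:
            lookup.setdefault(author["key"], {"name": author["name"]})
            keys.append(author["key"])
        art["authors"] = keys
    for rec in author_records:
        entry = lookup.get(rec["key"])
        if entry is not None:
            for field in ("orcid", "affiliation"):
                if field in rec:
                    entry[field] = rec[field]
    return articles, lookup
```

After this merge, every author in the lookup table has whatever ORCID and affiliation data DBLP provides, and the articles reference authors only by key, which is exactly what the database insert needs.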
After this we have everything we need for authors of papers, so we can insert them into the database.