This is similar to cryptodb, but it covers conferences and journals in cryptology and security more broadly, including venues outside IACR such as USENIX Security. Ultimately we would like to find (name, orcid, affiliation) for all authors of security and cryptology conferences, along with their publications.
All of the data that is collected can have problems with it, including:

- Names have collisions, and some people use multiple names on their publications (e.g., changes through marriage or variant spellings). DBLP appends a numeric suffix like 0001 when there is a collision, so we keep that key. DBLP also uses other keys, and the data appears to use different keys in different places; the documentation for the DBLP XML schema is somewhat vague about this.
- A person will likely have multiple affiliations during their publishing career, and these should show up on the publication rather than on the author record. Schloss Dagstuhl has a project to build a better model for affiliations. Ideally the author record should carry the current affiliation, and individual papers should carry the author affiliations associated with them. Affiliations should also be identified by ROR IDs, in much the same way that authors are identified by ORCID.
- The DBLP XML data set has ORCID in at least two different places, perhaps reflecting the fact that DBLP picks it up from multiple sources.
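To make the name-collision convention concrete, here is a minimal sketch of splitting a DBLP display name into the human-readable part and the 0001-style disambiguation suffix. The helper name is hypothetical (nothing like it is claimed to exist in this repository); it only assumes DBLP's documented convention of a trailing four-digit suffix.

```python
import re

def split_dblp_name(name):
    """Split a DBLP author name into (display name, numeric suffix).

    DBLP disambiguates colliding names by appending a four-digit
    suffix, e.g. "Wei Wang 0001"; non-colliding names have no suffix.
    """
    m = re.match(r"^(.*?)\s+(\d{4})$", name)
    if m:
        return m.group(1), m.group(2)
    return name, None
```

Keeping the suffix as part of the key (rather than discarding it) is what lets two different people named "Wei Wang" remain distinct records.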
It isn't clear where to draw the line on which conferences and journals to include; there are a LOT of third-tier conferences and journals. See the code in `sax_parser.py`, where others are listed in comments. See also Google Scholar.
This repository includes a Python parser for DBLP. Quite a few DBLP parsers are available, but this one uses minimal RAM by parsing the file incrementally with the pulldom package.
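The pattern pulldom enables can be sketched as follows. This is not the repository's actual parser, just a minimal illustration of incremental parsing: only the current record is expanded into a DOM node, so memory use stays flat no matter how large the input is. The sample XML and tag names here are stand-ins (the real dblp.xml has several record types, not just `article`).

```python
from xml.dom.pulldom import parseString, START_ELEMENT

# Tiny stand-in for dblp.xml; the real file is gigabytes, which is
# why streaming one record at a time matters.
SAMPLE = """<dblp>
  <article key="journals/joc/Example01">
    <author>Alice Example</author>
    <title>A Sample Paper</title>
  </article>
</dblp>"""

def stream_records(xml_text):
    """Yield one expanded DOM node per top-level publication record.

    expandNode() materializes only the subtree of the current record;
    everything else is processed as a stream of events.
    """
    doc = parseString(xml_text)
    for event, node in doc:
        if event == START_ELEMENT and node.tagName == "article":
            doc.expandNode(node)
            yield node

for rec in stream_records(SAMPLE):
    authors = [a.firstChild.data for a in rec.getElementsByTagName("author")]
    print(rec.getAttribute("key"), authors)
```

For the real file one would use `xml.dom.pulldom.parse()` on a file object and match all of DBLP's record types (`article`, `inproceedings`, `www`, etc.).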
The code consists of several parts:

- `sax_parser.py`, which downloads and parses the DBLP data to produce a JSON file of articles.
- `create.sql`, which creates the database. Note that we store ORCIDs but we do not cross-reference cryptodb.
- `insertdb.py`, which inserts the data into the database.
The algorithm for extracting data requires several passes because the dblp.xml file has an inconvenient schema. In the first pass we extract all publications that belong to the venues we want and write them to a file called articles.json. These contain some ORCIDs but no affiliations. In the same pass we also parse all of the 'www' records, which carry person information: they may contain an ORCID and an affiliation, so we extract those. Unfortunately we don't yet know which authors we need, so we save every author that has either an ORCID or an affiliation. Those are written to a file called authors.json.
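The first-pass filter on 'www' records can be sketched as a single predicate. The field names here are assumptions about the intermediate JSON layout, not the repository's actual schema:

```python
def keep_author(record):
    """First-pass filter for person records parsed from 'www' entries.

    At this point we don't yet know which authors appear in the selected
    venues, so we keep any author that carries an ORCID or an affiliation
    and drop everyone else; the survivors go into authors.json.
    """
    return bool(record.get("orcid") or record.get("affiliation"))
```

This over-collects (most saved authors will never be matched against a selected venue), but it avoids a whole extra pass over the multi-gigabyte XML.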
In the second pass we read articles.json and build a lookup table of all the authors it references. At the same time we replace each article's authors field with just an array of author keys, and write the result out as articles3.json. We then stream through authors.json and add affiliation and ORCID information to the lookup table. When that finishes, we write the list of authors out to a file called authors3.json.
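The second pass described above can be sketched roughly as follows. The record shapes (`key`, `name`, `orcid`, `affiliation` fields) are assumptions for illustration; the real code streams the JSON files rather than holding lists in memory:

```python
def second_pass(articles, author_records):
    """Sketch of the second pass.

    Build a lookup table keyed by author key from the articles we kept,
    rewrite each article's authors field to an array of keys (the
    articles3.json shape), then merge ORCID/affiliation data from the
    streamed author records into the lookup table (the authors3.json
    shape). Author records for people we never saw are ignored.
    """
    lookup = {}
    for art in articles:
        keys = []
        for author in art["authors"]:
            lookup.setdefault(author["key"], {"name": author["name"]})
            keys.append(author["key"])
        art["authors"] = keys
    for rec in author_records:
        entry = lookup.get(rec["key"])
        if entry is not None:
            for field in ("orcid", "affiliation"):
                if field in rec:
                    entry[field] = rec[field]
    return articles, lookup
```

After this merge, every author in the lookup table has whatever ORCID and affiliation data DBLP provides, and the articles reference authors only by key, which is exactly what the database insert needs.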
After this we have everything we need for authors of papers, so we can insert them into the database.