
More efficient data processing workflow for external dataset #2

@mseok

Description

Hello authors. Thanks for the nice work.

Recently, I've been trying to use graphinvent for benchmark experiments. In my experiments, I use the COCONUT dataset; our curated version contains roughly 700k molecules. However, since each molecule in COCONUT has a large number of atoms, the current data processing strategy, which does not use multiprocessing, takes too much time (roughly 8-9 days). The dataset statistics are as follows:

            "atom_types"     : ['B', 'C', 'N', 'O', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'I'],
            "formal_charge"  : [0],
            "max_n_nodes"    : 99,
  • train.smi: 526,853
  • valid.smi: 32,927
  • test.smi: 98,785

In light of this, I have two questions:

  1. Is there any plan to support multiprocessing, or is there any way to make data processing more efficient with the current code?
  2. For now, I'm planning to split the train, valid, and test .smi files into fixed-size chunks, process each chunk separately, and then collect all the processed data into a single HDF5 file. Will this approach produce the same result as processing all the data at once?
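The chunked approach in question 2 can be sketched as follows. This is a minimal, hypothetical illustration, not GraphINVENT's actual preprocessing code: `featurize` is a stand-in for the real per-molecule graph featurization, and the chunk size and worker count are arbitrary. Since each molecule is featurized independently, processing chunks in parallel and concatenating the results in order should match single-pass processing, as long as the per-molecule output does not depend on dataset-wide state.

```python
# Sketch: split a list of SMILES into fixed-size chunks and featurize
# them in parallel, preserving the original molecule order.
from multiprocessing import Pool


def featurize(smiles: str) -> dict:
    # Placeholder for real per-molecule preprocessing (hypothetical).
    return {"smiles": smiles, "n_chars": len(smiles)}


def chunks(items, size):
    # Yield successive fixed-size slices of the input list.
    for i in range(0, len(items), size):
        yield items[i:i + size]


def process_in_parallel(smiles_list, chunk_size=10_000, n_workers=4):
    # Pool.map preserves input order within each chunk, and chunks are
    # consumed in order, so the concatenated result keeps dataset order.
    results = []
    with Pool(n_workers) as pool:
        for piece in chunks(smiles_list, chunk_size):
            results.extend(pool.map(featurize, piece))
    return results
```

The per-chunk results could then be written into a single HDF5 file (e.g. with h5py), appending each chunk's arrays in order.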
