Hello authors. Thanks for the nice work.
Recently, I've been trying to use GraphINVENT for benchmark experiments with the COCONUT dataset. Our curated dataset contains ~700k molecules. However, since each molecule in COCONUT has a large number of atoms, the current data processing strategy without multi-processing takes too long (roughly 8-9 days). The dataset statistics are as follows:
"atom_types" : ['B', 'C', 'N', 'O', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'I'],
"formal_charge" : [0],
"max_n_nodes" : 99,
train.smi: 526,853
valid.smi: 32,927
test.smi: 98,785
In this light, I have two questions:
- Is there any plan to support multi-processing, or is there any other way to make data processing more efficient with the current code?
- For now, I'm planning to split the train/valid/test .smi files into fixed-size chunks, process each chunk separately, and then merge all processed data into a single HDF5 file. Will this approach produce the same result as processing all the data at once?
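In case it helps the discussion, here is a minimal sketch of the kind of per-molecule parallelism I have in mind for the first question. Note that `process_smiles` is a hypothetical placeholder, not GraphINVENT's actual preprocessing function; in the real code it would build the decoding-route tensors for one molecule.

```python
from multiprocessing import Pool

def process_smiles(smiles: str) -> int:
    """Hypothetical stand-in for the per-molecule preprocessing step.
    In GraphINVENT this would construct the graph/decoding tensors for
    one molecule; here it just returns the SMILES string length."""
    return len(smiles)

def preprocess_parallel(smiles_list, n_workers=4, chunksize=1000):
    """Map the per-molecule step over the dataset with a worker pool.
    pool.map preserves input order, so the merged output is
    deterministic regardless of which worker handled which chunk."""
    with Pool(processes=n_workers) as pool:
        return pool.map(process_smiles, smiles_list, chunksize=chunksize)
```

Regarding the second question, my understanding (please correct me if wrong) is that the chunked results should be mergeable by concatenating same-named datasets along the first axis, provided every chunk is preprocessed with the identical config (same `atom_types`, `formal_charge`, and `max_n_nodes`), so that the padded tensor shapes match across chunks.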