
More efficient data processing workflow for external dataset #2

@mseok

Description

Hello authors. Thanks for the nice work.

Recently, I've been trying to use graphinvent for benchmark experiments. In my experiments, I use the COCONUT dataset; our curated version contains roughly 700k molecules. However, since each molecule in COCONUT has a large number of atoms, the current data processing strategy, which does not use multiprocessing, takes too much time (roughly 8-9 days). The dataset statistics are as follows:

            "atom_types"     : ['B', 'C', 'N', 'O', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'I'],
            "formal_charge"  : [0],
            "max_n_nodes"    : 99,
  • train.smi: 526,853
  • valid.smi: 32,927
  • test.smi: 98,785

In light of this, I have two questions:

  1. Is there any plan to support multiprocessing, or is there any way to make data processing more efficient with the current code?
  2. For now, I'm planning to split the train, valid, and test .smi files into fixed-size chunks, process each chunk separately, and then collect all the processed data into a single HDF5 file. Will this approach produce the same result as processing all the data at once?
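The chunked approach in question 2 can be sketched as follows. This is a minimal, hypothetical illustration, not GraphINVENT's actual preprocessing code: `featurize` is a stand-in for the real per-molecule graph featurization, and the chunk size and worker count are arbitrary. Since each molecule is featurized independently, processing chunks in parallel and concatenating the results in order should match single-pass processing, as long as the per-molecule output does not depend on dataset-wide state.

```python
# Sketch: split a list of SMILES into fixed-size chunks and featurize
# them in parallel, preserving the original molecule order.
from multiprocessing import Pool


def featurize(smiles: str) -> dict:
    # Placeholder for real per-molecule preprocessing (hypothetical).
    return {"smiles": smiles, "n_chars": len(smiles)}


def chunks(items, size):
    # Yield successive fixed-size slices of the input list.
    for i in range(0, len(items), size):
        yield items[i:i + size]


def process_in_parallel(smiles_list, chunk_size=10_000, n_workers=4):
    # Pool.map preserves input order within each chunk, and chunks are
    # consumed in order, so the concatenated result keeps dataset order.
    results = []
    with Pool(n_workers) as pool:
        for piece in chunks(smiles_list, chunk_size):
            results.extend(pool.map(featurize, piece))
    return results
```

The per-chunk results could then be written into a single HDF5 file (e.g. with h5py), appending each chunk's arrays in order.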
