We should follow a process similar to the BigScience workshop's dataset processing. Their pipeline includes many tools ready for us to reuse: exact-match and near-duplicate deduplication, filtering of low-information examples, removal of potentially hateful documents, and removal of PII.
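As a minimal sketch of the exact-match deduplication step (not the BigScience implementation itself — their tooling is more sophisticated), one common approach is to hash a normalized form of each document and keep only the first occurrence of each hash:

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())

def exact_dedup(docs: list[str]) -> list[str]:
    # Keep the first occurrence of each normalized document.
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello  World", "hello world", "Different text"]
print(exact_dedup(docs))  # only two unique documents remain
```

Near-duplicate detection typically goes further, e.g. MinHash/LSH over shingles, which catches documents that differ by only a few tokens rather than requiring byte-identical content after normalization.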
They have all their tools available and discussions of them here: https://github.com/bigscience-workshop/data_tooling
Here is an initial set of tasks to perform: