This issue focuses on collecting ideas and formalizing the postprocessing steps and formatting of data instances for datasets in different categories, e.g., forums, articles, books, etc.
Initial draft of postprocessing:
- Exact duplicate removal
- Near-duplicate removal
- Removal of specific HTML tags
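
To make the discussion concrete, here is a minimal sketch of the three steps above using only the Python standard library. Everything in it is an assumption for illustration, not a decided design: the function names, the whitespace/case normalization, word 3-gram shingles with a Jaccard threshold for near duplicates, and the reading of "specific HTML tags" as "drop the listed tags together with their contents, and strip all other markup while keeping the text". The pairwise near-duplicate check is quadratic, so a real pipeline would likely need something like MinHash/LSH instead.

```python
import hashlib
import re
from html.parser import HTMLParser


def normalize(text):
    """Collapse whitespace and lowercase (an assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_exact_duplicates(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, n=3):
    """Drop a document if it is too similar to one already kept (O(n^2) sketch)."""
    kept = []
    kept_shingles = []
    for doc in docs:
        current = shingles(doc, n)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


class _TagStripper(HTMLParser):
    """Collects text, skipping everything nested inside the tags in drop_tags."""

    def __init__(self, drop_tags):
        super().__init__(convert_charrefs=True)
        self.drop_tags = set(drop_tags)
        self.depth = 0  # > 0 while inside a dropped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.drop_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_tags(html, drop_tags=("script", "style")):
    """Strip all markup; also drop the contents of drop_tags.

    drop_tags should only name elements that have a closing tag.
    """
    parser = _TagStripper(drop_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

A possible order would be `strip_tags` first, then `drop_exact_duplicates`, then `drop_near_duplicates`, so that duplicates differing only in markup are caught; whether that order is right for every dataset category is open for discussion.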
Questions for formatting:
- How to format forums?
- How to format general website articles?
- How to format books?