This repository was archived by the owner on Sep 27, 2022. It is now read-only.

Split plaintext by sections and paragraphs #40

@appledora


In GitLab by @geohci on Aug 25, 2022, 19:48

Splitting on sections is easy, but we'll also want to identify all the HTML elements that indicate a new paragraph (new line) so that we can return a more structured plaintext result. This will include the <p> tags but also list items and likely other types of HTML nodes that start a new line. This will provide better support for people who, e.g., only want the first paragraph of the article or want to break it into chunks for input into language models.
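
A rough sketch of what this could look like, using only Python's standard-library html.parser rather than the project's own parsing code. The names PlaintextSplitter and split_plaintext, the PARAGRAPH_TAGS set, and the choice of <h2> as the section boundary are all assumptions for illustration; the real list of paragraph-breaking elements still needs to be worked out.

```python
from html.parser import HTMLParser

# Assumption: tags that start a new paragraph. The issue only names <p> and
# list items; the others are guesses at "other types" of line-breaking nodes.
PARAGRAPH_TAGS = {"p", "li", "dd", "dt", "blockquote", "pre"}
# Assumption: a second-level heading opens a new section.
SECTION_TAGS = {"h2"}


class PlaintextSplitter(HTMLParser):
    """Collect plaintext grouped as sections -> paragraphs (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # The lead section has no heading.
        self.sections = [{"heading": None, "paragraphs": []}]
        self._buffer = []
        self._in_heading = False

    def _flush(self):
        # Join buffered text and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if not text:
            return
        if self._in_heading:
            self.sections.append({"heading": text, "paragraphs": []})
        else:
            self.sections[-1]["paragraphs"].append(text)

    def handle_starttag(self, tag, attrs):
        if tag in PARAGRAPH_TAGS or tag in SECTION_TAGS:
            self._flush()  # close whatever text came before this block
            self._in_heading = tag in SECTION_TAGS

    def handle_endtag(self, tag):
        if tag in PARAGRAPH_TAGS or tag in SECTION_TAGS:
            self._flush()
            self._in_heading = False

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text outside a block tag


def split_plaintext(html):
    """Return a list of {"heading", "paragraphs"} dicts for an HTML string."""
    parser = PlaintextSplitter()
    parser.feed(html)
    parser.close()
    return parser.sections


if __name__ == "__main__":
    example = (
        "<p>Lead <b>paragraph</b>.</p><h2>History</h2><p>First.</p>"
        "<ul><li>Item one</li><li>Item two</li></ul>"
    )
    sections = split_plaintext(example)
    print(sections[0]["paragraphs"][0])  # first paragraph of the lead section
```

With a structure like this, "only the first paragraph" is sections[0]["paragraphs"][0], and chunking for a language model can iterate over sections or paragraphs. Inline tags such as <b> or <a> do not break paragraphs here, which is the main design choice to check against real article HTML.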
