Skip to content

Parse markdown found within HTML blocks#128

Closed
ron-at-swgy wants to merge 1 commit into
kristapsdz:masterfrom
ron-at-swgy:master
Closed

Parse markdown found within HTML blocks#128
ron-at-swgy wants to merge 1 commit into
kristapsdz:masterfrom
ron-at-swgy:master

Conversation

@ron-at-swgy
Copy link
Copy Markdown

@ron-at-swgy ron-at-swgy commented Nov 22, 2023

This changeset updates the parsing behavior to pass along the content between opening and closing tags of a LOWDOWN_BLOCKHTML segment for additional markdown parsing. The strict tag matching behavior has been removed, as it lacked the context needed to operate correctly when there may be nested tags (such as nested divs).

Consider a content segment such as:

<div class="container">
<div class="container column-1">

# Header for column 1

And some text
</div>
<div class="container column-2">

# Header for column 2

With more text
</div>
</div>

With these changes, an outer LOWDOWN_BLOCKHTML node will be the parent of all the content within the "container" div. The closing tag for each block is included as the final child as a LOWDOWN_RAW_HTML node.

This changeset updates the parsing behavior to pass along the content
between opening and closing tags of a LOWDOWN_BLOCKHTML segment for
additional markdown parsing. The strict tag matching behavior has been
removed, as it lacked the context needed to operate correctly when there
may be nested tags (such as nested divs).

Consider a content segment such as:

```
<div class="container">
<div class="container column-1">

And some text
</div>
<div class="container column-2">

With more text
</div>
</div>
```
With these changes, an outer LOWDOWN_BLOCKHTML node will be the parent
of all the content within the "container" div. The closing tag for each
block is included as the final child as a LOWDOWN_RAW_HTML node.
Comment thread document.c
return skip;
}

/*
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition for html_find_block was moved up to allow visibility in html_find_end.

Comment thread document.c
* Returns the length on match, 0 otherwise.
*/
static size_t
html_find_end_strict(const char *tag, size_t tag_len,
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed html_find_end_strict as the way it was coded did not allow enough context to know about nested HTML blocks that may have the same tag type (such as nested divs).

Comment thread html.c

result = hbuf_putc(ob, '\n');

if (!hbuf_putb(ob, content))
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Print out the inner content. Without this change, any child nodes of LOWDOWN_BLOCKHTML are passed over.

Copy link
Copy Markdown
Owner

@kristapsdz kristapsdz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this! I'll look at it over the next week or so. As this parsing is standard behaviour for pandoc, it should probably be enabled by default. Moving forward:

  • First, make sure that this flies with all regression tests (make regress).
    • This will probably require adding an option to disable the behaviour, as it conflicts with the original Markdown format, which some tests depend upon. I can help with this.
    • If regression tests for standard invocation fail by depending on interior bits not being parsed, these tests should be fixed or removed.
  • Second, add regression tests specifically for this behaviour.
  • Third, run the behaviour exhaustively through AFL to find any corner bugs. I can do this with access to bigger machines.

@ron-at-swgy
Copy link
Copy Markdown
Author

Thank you for your feedback. The Inline_HTML_Advanced regression test case breaks with my changes. I'll work towards solving that and any other regression failures.

@ron-at-swgy
Copy link
Copy Markdown
Author

I have not given up on this PR. I apologize for the lengthy delay, but still hope to get the behavior working without breaking existing functionality.

@ron-at-swgy ron-at-swgy closed this by deleting the head repository Jun 18, 2024
@bkmgit
Copy link
Copy Markdown

bkmgit commented Mar 28, 2026

@ron-at-swgy @kristapsdz Is this still of interest? Would like to complete it?

@ron-at-swgy
Copy link
Copy Markdown
Author

I’ve been working with a fork. I wasn’t able to get the tests updated etc. feel free to take over the work.

@bkmgit
Copy link
Copy Markdown

bkmgit commented Mar 28, 2026

Ok #182

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants