Add img to replaced tags which get preserved in HTML from slicing.#26
Open
andreip wants to merge 1 commit intocloseio:masterfrom
Open
Add img to replaced tags which get preserved in HTML from slicing.#26andreip wants to merge 1 commit intocloseio:masterfrom
andreip wants to merge 1 commit intocloseio:masterfrom
Conversation
|
Can this please be merged? |
|
I tested this PR in my project and it works nicely. Would be great if this can be merged. @wojcikstefan |
afzalIbnSH
pushed a commit
to freightwalla/quotequail
that referenced
this pull request
Sep 16, 2021
Originally solved by @andreip in closeio#26 In his words: "Couldn't think of a different approach, since an img isn't really a block, so it'll never have a text within it, so no point in generating a different html in get_line_info functions. Instead, what was missing was it being treated as a special case: don't want to slice a line from the HTML by just looking at the plain text lines, since that could slice an img, need to also look at the start/end refs for replaced tags. See more about a replaced element (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more things to the list? e.g. video, embed etc. ; not sure about iframe and how that'd be treated in lxml parsing though, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it? Full list would be a total of 9 replaced elements (or 10 if we also count input; although I'm not sure of all examples where that'd generate sth even if it apparently has no text in it)."
afzalIbnSH
added a commit
to freightwalla/quotequail
that referenced
this pull request
Sep 16, 2021
Image in beginning of reply is incorrectly ignored. Fix. Originally reported in closeio#22 and solved by @andreip in closeio#26 In his words: "Couldn't think of a different approach, since an img isn't really a block, so it'll never have a text within it, so no point in generating a different html in get_line_info functions. Instead, what was missing was it being treated as a special case: don't want to slice a line from the HTML by just looking at the plain text lines, since that could slice an img, need to also look at the start/end refs for replaced tags. See more about a replaced element (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more things to the list? e.g. video, embed etc. ; not sure about iframe and how that'd be treated in lxml parsing though, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it? Full list would be a total of 9 replaced elements (or 10 if we also count input; although I'm not sure of all examples where that'd generate sth even if it apparently has no text in it)."
afzalIbnSH
added a commit
to freightwalla/quotequail
that referenced
this pull request
Sep 16, 2021
Image in beginning of reply is incorrectly ignored. Fix. This is a copy of [PR](closeio#26) by @andreip In his words: > Couldn't think of a different approach, since an img isn't really a block, so it'll never have a text within it, so no point in generating a different html in get_line_info functions. Instead, what was missing was it being treated as a special case: don't want to slice a line from the HTML by just looking at the plain text lines, since that could slice an img, need to also look at the start/end refs for replaced tags. > > See more about a replaced element (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more things to the list? e.g. video, embed etc. ; not sure about iframe and how that'd be treated in lxml parsing though, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it? > > Full list would be a total of 9 replaced elements (or 10 if we also count input; although I'm not sure of all examples where that'd generate sth even if it apparently has no text in it).
thomasst
reviewed
Oct 15, 2021
Member
thomasst
left a comment
There was a problem hiding this comment.
This looks like a good start, but is still incomplete and not catching all of the cases:
For example, the following test string
html = "Test 2.<br><img src=\"top-image\"><br>On Jun 05, 2018, at 09:56 AM, John Doe <john@example.com> wrote:<br><blockquote><img src=\"https://example.com\" class=\"fr-fic fr-dib\"><br>Some text 1.<br><br>Bart</blockquote><img src=\"bottom-image\">"
should be parsed into:
{'date': 'Jun 05, 2018, at 09:56 AM',
'from': 'John Doe <john@example.com>',
'html': '<div><img src="https://example.com" class="fr-fic fr-dib"><br>Some '
'text 1.<br><br>Bart</div>',
'html_bottom': '<img src="bottom-image">',
'html_top': 'Test 2.<br><img src="top-image">',
'type': 'reply'}
In this PR, it ignores the top-image (rather than having it in html_top) and includes the bottom-image as part of the html (rather than html_bottom).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fixes #22
Couldn't think of a different approach, since an
imgisn't really a block, so it'll never have a text within it, so no point in generating a different html inget_line_infofunctions. Instead, what was missing was it being treated as a special case: don't want to slice a line from the HTML by just looking at the plain textlines, since that could slice animg, need to also look at the start/end refs for replaced tags.See more about a
replacedelement (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more things to the list? e.g.video,embedetc. ; not sure aboutiframeand how that'd be treated in lxml parsing though, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it?Full list would be a total of 9 replaced elements (or 10 if we also count
input; although I'm not sure of all examples where that'd generate sth even if it apparently has no text in it).