Skip to content

HTML with leading comments still detected as application/xml (re #102) #149

@markedmondson

Description

@markedmondson

Following up on #102, which was closed last week — the reported behaviour still reproduces on 1.2.1. PR #130 tightened HTML magic to position 0 only, which doesn't help when there's a leading HTML comment before the opening tag.

Repro

require "marcel"
require "stringio"

html = StringIO.new(<<~HTML)
  <!--/* Throwaway comment but it has to be over 128 characters to fail AND have a lowercase HTML tag, we can pad this one out a bit to get it longer */-->
  <html>
    <head></head>
    <body><h1>Magic!</h1></body>
  </html>
HTML

Marcel::VERSION                                       # => "1.2.1"
Marcel::MimeType.for(html, name: "index.html")        # => "application/xml"
Marcel::MimeType.for(html)                            # => "application/xml"

Expected: text/html in both cases.

Suggested fix

Marcel already supports regex magic (see the <(html|head|body|title|div) entry in tables.rb), so allowing leading whitespace and any number of HTML comments before the opening tag should be a small addition to definitions.rb:

Marcel::MimeType.extend "text/html",
  extensions: %w( html htm ),
  magic: [
    # existing position-0 entries...
    [0, /\A(?:\xEF\xBB\xBF)?\s*(?:<!--.*?-->\s*)*(?:<!DOCTYPE\s+html\b|<\s*html\b)/mi]
  ]

That covers:

  • optional UTF-8 BOM
  • arbitrary leading whitespace
  • any number of HTML comments before the tag
  • both <!DOCTYPE html ...> and <html ...> (case-insensitive)

We've been carrying a local monkey patch doing roughly this plus a filename-extension fallback. Happy to open a PR if there's appetite.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions