Skip to content

UTF-8 Arbitrarily converted to UTF-16LE or UTF-16BE #41

@MSeal

Description

@MSeal

I was patching a fork of the roo gem to fix several xls parsing error and I ran into this weird state, where the characters are encoded in a different format than expected when loading files generated by this gem. I traced it down to write_string and found this nugget:

    # Handle utf8 strings
    if is_utf8?(str)
      str_utf16le = utf8_to_16le(str)
      return write_utf16le_string(row, col, str_utf16le, args[3])
    end

This causes some cells to be arbitrarily encoded in UTF-16LE, which gets loaded as UTF-8 by the spreadsheet gem. I noticed the same conversion, but to UTF-16BE in chart axis saving. --don't mix the streams--

Then I went to read the write_utf16le_string method to see if it was doing something odd. I did not expect it to convert the string again, but here it calls utf16be_to_16le on an already encoded utf-16le string.

  def write_utf16le_string(*args)
    # Check for a cell reference in A1 notation and substitute row and column
    args = row_col_notation(args)

    return -1 if (args.size < 3)                  # Check the number of args

    row, col, str, format = args

    # Change from UTF16 big-endian to little endian
    str = utf16be_to_16le(str)

    write_utf16be_string(row, col, str, format)
  end

Which in turn doesn't even use encoding but instead unpacks and repacks the string.

  def utf16be_to_16le(utf16be)
    utf16be.unpack('n*').pack('v*')
  end

And then the write_utf16be_string at the end of write_utf16le_string itself calls utf16be_to_16le to convert it back to 16le. At this point I starting losing track of be's and le's in trying to figure out what the actual encoding becomes.

def utf16be_to_16le(...)
    ...
    # Change from UTF16 big-endian to little endian
    str = utf16be_to_16le(str)

WHY does it do this??? Is it a mistake, or something to appease an older version of excel? There's 3 levels of arbitrary re-encoding for utf-8 without clarifying comments and it clearly produces an output that other readers can't correctly interpret and which differs from excel 2010's xls outputs.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions