I was patching a fork of the roo gem to fix several xls parsing error and I ran into this weird state, where the characters are encoded in a different format than expected when loading files generated by this gem. I traced it down to write_string and found this nugget:
# Handle utf8 strings
if is_utf8?(str)
str_utf16le = utf8_to_16le(str)
return write_utf16le_string(row, col, str_utf16le, args[3])
end
This causes some cells to be arbitrarily encoded in UTF-16LE, which gets loaded as UTF-8 by the spreadsheet gem. I noticed the same conversion, but to UTF-16BE in chart axis saving. --don't mix the streams--
Then I went to read the write_utf16le_string method to see if it was doing something odd. I did not expect it to convert the string again, but here it calls utf16be_to_16le on an already encoded utf-16le string.
def write_utf16le_string(*args)
# Check for a cell reference in A1 notation and substitute row and column
args = row_col_notation(args)
return -1 if (args.size < 3) # Check the number of args
row, col, str, format = args
# Change from UTF16 big-endian to little endian
str = utf16be_to_16le(str)
write_utf16be_string(row, col, str, format)
end
Which in turn doesn't even use encoding but instead unpacks and repacks the string.
def utf16be_to_16le(utf16be)
utf16be.unpack('n*').pack('v*')
end
And then the write_utf16be_string at the end of write_utf16le_string itself calls utf16be_to_16le to convert it back to 16le. At this point I starting losing track of be's and le's in trying to figure out what the actual encoding becomes.
def utf16be_to_16le(...)
...
# Change from UTF16 big-endian to little endian
str = utf16be_to_16le(str)
WHY does it do this??? Is it a mistake, or something to appease an older version of excel? There's 3 levels of arbitrary re-encoding for utf-8 without clarifying comments and it clearly produces an output that other readers can't correctly interpret and which differs from excel 2010's xls outputs.
I was patching a fork of the roo gem to fix several xls parsing error and I ran into this weird state, where the characters are encoded in a different format than expected when loading files generated by this gem. I traced it down to
write_stringand found this nugget:This causes some cells to be arbitrarily encoded in UTF-16LE, which gets loaded as UTF-8 by the spreadsheet gem. I noticed the same conversion, but to UTF-16BE in chart axis saving. --don't mix the streams--
Then I went to read the
write_utf16le_stringmethod to see if it was doing something odd. I did not expect it to convert the string again, but here it callsutf16be_to_16leon an already encoded utf-16le string.Which in turn doesn't even use encoding but instead unpacks and repacks the string.
And then the
write_utf16be_stringat the end ofwrite_utf16le_stringitself callsutf16be_to_16leto convert it back to 16le. At this point I starting losing track of be's and le's in trying to figure out what the actual encoding becomes.WHY does it do this??? Is it a mistake, or something to appease an older version of excel? There's 3 levels of arbitrary re-encoding for utf-8 without clarifying comments and it clearly produces an output that other readers can't correctly interpret and which differs from excel 2010's xls outputs.