This repository was archived by the owner on Mar 7, 2019. It is now read-only.

Encoding issue running on javasphinx + py3 + windows 10 #63

@Poddster

Hi,

Related to issue #56 and issue #37

On Windows 10, javasphinx-apidoc fails when run on Python 3.6.4 but works when run on Python 2.7. Both use javasphinx==0.9.15.

It looks like the script, or possibly the Python stdlib, expects the files it reads to be encoded in cp1252, but the files are actually UTF-8. This fails on any byte that has no mapping in cp1252.

For example, when reading the character 🐍 (U+1F40D, encoded in UTF-8 as b'\xF0\x9F\x90\x8D') the script throws an exception: it treats the four bytes as four separate characters, and byte 0x90 has no mapping in cp1252.
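The failure can be reproduced without javasphinx at all. This is a minimal sketch that decodes the same four bytes with both codecs; `try_decode` is a helper name made up for illustration.

```python
def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# U+1F40D encoded as UTF-8: b'\xf0\x9f\x90\x8d'
SNAKE_UTF8 = u"\U0001F40D".encode("utf-8")

# Decoding with the correct codec succeeds.
text, err = try_decode(SNAKE_UTF8, "utf-8")
print(ascii(text), err)

# Decoding with cp1252 stops at the first byte that has no mapping.
text, err = try_decode(SNAKE_UTF8, "cp1252")
print(text, hex(err.object[err.start]), err.start)
```

`ascii()` is used for printing so the example itself does not trip over a console that cannot encode the emoji.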

The stack trace shown is:

  File "C:\dev\env\python\Python36\Scripts\javasphinx-apidoc-script.py", line 11, in <module>
    load_entry_point('javasphinx==0.9.15', 'console_scripts', 'javasphinx-apidoc')()
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 347, in main
    opts.member_headers, opts.parser_lib)
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 228, in generate_documents
    this_file_documents = generate_from_source_file(doc_compiler, source_file, cache_dir)
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 191, in generate_from_source_file
    source = f.read()
  File "c:\dev\env\python\python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 24: character maps to <undefined>
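The cp1252.py frame in the trace is consistent with the file being opened in text mode without an explicit encoding. On Python 3, open() then falls back to locale.getpreferredencoding(False), which is typically cp1252 on a Western European Windows install. A quick way to check what a given machine would use (the function name here is only illustrative):

```python
import locale


def default_text_encoding():
    # This is the encoding open() uses in text mode on Python 3
    # when no encoding= argument is given.
    return locale.getpreferredencoding(False)


print(default_text_encoding())
```

The value depends on the machine's locale settings, so the same script can behave differently on Windows and on a UTF-8 Linux box.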

Whilst it works in py2, I feel this is purely by accident, due to Python 2's very "liberal" string decoding policies and the fact that it's a UTF-8 file. If my file were encoded in something else, e.g. EUCJIS/SJIS, then the tool would fail. The official javadoc tool has an encoding option.
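To illustrate why guessing the encoding is fragile, here is a small sketch using Shift_JIS bytes. It does not involve javasphinx; `decodes_as` is a made-up helper, and the sample text is just the word for "Japanese language".

```python
JAPANESE = u"\u65e5\u672c\u8a9e"
sjis_bytes = JAPANESE.encode("shift_jis")


def decodes_as(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for that codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Only the encoding the file was actually written in round-trips.
print(decodes_as(sjis_bytes, "shift_jis") == JAPANESE)

# Assuming UTF-8 rejects these bytes outright.
print(decodes_as(sjis_bytes, "utf-8") is None)

# Assuming cp1252 does not recover the original text either; a single-byte
# codec may even decode without error into the wrong characters.
print(decodes_as(sjis_bytes, "cp1252") != JAPANESE)
```

Only the caller knows which encoding the source tree uses, which is the argument for making it an explicit option.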

It would be good if javasphinx-apidoc could take an --encoding parameter and ensure that all files are read/decoded in that format.
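Something along these lines is what I have in mind. This is only a sketch, not javasphinx's actual code: the parser, the `read_source` helper, the `input_path` argument and the utf-8 default are all assumptions on my part. io.open is used because it accepts encoding= on both Python 2.7 and Python 3.

```python
import argparse
import io
import os
import tempfile


def build_parser():
    # Hypothetical parser; javasphinx's real option handling may differ.
    parser = argparse.ArgumentParser(prog="apidoc-sketch")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding used to decode Java source files")
    parser.add_argument("input_path")
    return parser


def read_source(path, encoding):
    # Decode with the requested encoding instead of the locale default.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def demo():
    # Write a UTF-8 file containing U+1F40D, then read it back explicitly.
    fd, path = tempfile.mkstemp(suffix=".java")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(u"// \U0001F40D\n".encode("utf-8"))
        opts = build_parser().parse_args(["--encoding", "utf-8", path])
        return read_source(opts.input_path, opts.encoding)
    finally:
        os.remove(path)


print(ascii(demo()))
```

With the encoding passed explicitly, the read no longer depends on the Windows code page, and the same option would cover EUCJIS/SJIS sources.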

Full Example

This was done using PowerShell ISE to "ensure" that the Unicode characters were printed correctly, but the same failure happens in cmd.exe, Git Bash, etc.

PS C:\dev\work\Mobile-SDK-Android\docs> Get-Content .\java\utf8.java -Encoding UTF8
package java;

/**
 * 🐍 U+1F40D -> \xF0\x9F\x90\x8D
 * 👐 U+1F450 -> \xF0\x9F\x91\x90
 */
public class EncodingProblems {
    public static void main(String[] args) {
        System.out.println("Hello!");
    }
}

PS C:\dev\work\Mobile-SDK-Android\docs> C:\dev\env\python\Python36\Scripts\javasphinx-apidoc.exe --output-dir=tmp/ java/
C:\dev\env\python\Python36\Scripts\javasphinx-apidoc.exe : Traceback (most recent call last):
At line:1 char:1
+ C:\dev\env\python\Python36\Scripts\javasphinx-apidoc.exe --output-dir ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
  File "C:\dev\env\python\Python36\Scripts\javasphinx-apidoc-script.py", line 11, in <module>
    load_entry_point('javasphinx==0.9.15', 'console_scripts', 'javasphinx-apidoc')()
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 347, in main
    opts.member_headers, opts.parser_lib)
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 228, in generate_documents
    this_file_documents = generate_from_source_file(doc_compiler, source_file, cache_dir)
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 191, in generate_from_source_file
    source = f.read()
  File "c:\dev\env\python\python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 24: character maps to <undefined>

PS C:\dev\work\Mobile-SDK-Android\docs> C:\dev\env\python\Python27\Scripts\javasphinx-apidoc.exe --output-dir=tmp/ java/
