PHI-base CSV releases should use UTF-8 encoding #13

@jseager7

Opening the PHI-base 4.12 CSV file as UTF-8 in Python raises a UnicodeDecodeError because the file is not valid UTF-8.

I'm not completely sure which encoding the files use, but reading them as cp1252 doesn't raise any errors (cp1252 is Python's name for Windows-1252, a legacy default for many Windows components).
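To make the failure easier to pin down, here is a minimal sketch that reports where UTF-8 decoding first breaks. The helper name `first_invalid_utf8` and the sample bytes are made up for illustration; they are not taken from the PHI-base files.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Hypothetical sample: in Windows-1252, "é" is the single byte 0xE9,
# which is not a complete UTF-8 sequence when followed by a newline.
sample = "Pathogen strain,caf\u00e9\n".encode("cp1252")

print(first_invalid_utf8(sample))    # offset and value of the offending byte
print(sample.decode("cp1252"))       # the same bytes decode without error as cp1252
```

For a real release, the bytes could come from something like `Path("phi-base_v4-12_2021-09-02.csv").read_bytes()`. Note that a clean cp1252 decode only shows the bytes are compatible with Windows-1252, not that this is the encoding the file was written in.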

Windows-1252 isn't appropriate for PHI-base now (if it ever was) because some columns (e.g. 'Pathogen strain' and 'Host strain') contain characters that Windows-1252 cannot represent, such as the delta symbol (Δ). These symbols end up replaced with question marks. Here's an example from the PHI-base 4.12 CSV:

Record ID                        Record 11248
Pathogen strain    CA14 (?ku70 ?pyrG::AfpyrG)
Name: 11247, dtype: object
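One way such question marks can arise is an encoder that substitutes "?" for characters it cannot represent. The sketch below reproduces that effect in Python. It assumes the original strain name used Δ where the excerpt shows "?", and it is not a claim about how the release files were actually produced.

```python
# Assumed original value: each "?" in the excerpt is taken to have been a delta.
original = "CA14 (Δku70 ΔpyrG::AfpyrG)"

# A strict encode fails, because Δ (U+0394) has no Windows-1252 code point.
try:
    original.encode("cp1252")
except UnicodeEncodeError as err:
    print(f"cannot encode {original[err.start]!r} at index {err.start}")

# With errors="replace", each unencodable character becomes "?" and is lost.
lossy = original.encode("cp1252", errors="replace").decode("cp1252")
print(lossy)
```

Once the substitution has happened, the file contains a literal "?" byte, so no choice of decoder can bring the Δ back.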

@martin2urban What program did you use to generate these CSV files? If I remember correctly, Microsoft Excel doesn't default to UTF-8 when saving as CSV and has to be manually configured to save in UTF-8 encoding.

Here's the list of files that fail to load as UTF-8:

  • phi-base_v4-01_2016-05-01.csv
  • phi-base_v4-03_2017-05-01.csv
  • phi-base_v4-05_2018-05-15.csv
  • phi-base_v4-11_2021-05-05.csv
  • phi-base_v4-12_2021-09-02.csv

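We should really convert these files to UTF-8 by regenerating them from the original datasets (if possible). Simply re-encoding the existing CSV files wouldn't be enough, because characters that have already been replaced with question marks can't be recovered from the files themselves.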
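If the original data can be loaded into Python, writing it back out as UTF-8 might look like the sketch below. The helper `write_utf8_csv`, the output file name, and the sample rows are hypothetical; the Δ characters are an assumption about the original value.

```python
import csv
import tempfile
from pathlib import Path


def write_utf8_csv(path, rows):
    """Write rows to a CSV file, stating the encoding explicitly."""
    # newline="" is what the csv module documentation recommends for
    # files handed to csv.writer.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)


# Hypothetical rows standing in for data regenerated from the source dataset.
rows = [
    ["Record ID", "Pathogen strain"],
    ["Record 11248", "CA14 (Δku70 ΔpyrG::AfpyrG)"],
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "example.csv"
    write_utf8_csv(out, rows)
    round_trip = out.read_text(encoding="utf-8")

print(round_trip)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding, which can differ between systems.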


For completeness, here's the list of valid files: those that are either UTF-8 encoded, or contain no characters outside of the ASCII character set:

  • phi-base_v4-00_2015-09-09.csv
  • phi-base_v4-02_2016-10-03.csv
  • phi-base_v4-04_2017-11-10.csv
  • phi-base_v4-06_2018-12-05.csv
  • phi-base_v4-07_2019-05-27.csv
  • phi-base_v4-08_2019-09-16.csv
  • phi-base_v4-09_2020-05-25.csv
  • phi-base_v4-10_2020-11-02.csv
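The two lists above can be reproduced with a check along these lines. It is a sketch, not the script originally used: `classify_encoding` and `classify_directory` are hypothetical names, and the glob pattern is inferred from the file names listed in this issue.

```python
from pathlib import Path


def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'not utf-8'."""
    if data.isascii():
        # Pure ASCII is also valid UTF-8, so these files load without error.
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


def classify_directory(directory, pattern="phi-base_v*.csv"):
    """Map each matching file name in a directory to its classification."""
    return {
        path.name: classify_encoding(path.read_bytes())
        for path in sorted(Path(directory).glob(pattern))
    }


# In-memory examples of the three cases.
print(classify_encoding(b"Record ID,Gene"))        # ASCII only
print(classify_encoding("Δku70".encode("utf-8")))  # valid UTF-8
print(classify_encoding(b"caf\xe9"))               # Windows-1252 byte, invalid UTF-8
```

Calling `classify_directory(".")` in the folder containing the releases would give one classification per file.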

Labels: bug