Trying to open the PHI-base 4.12 CSV file as UTF-8 (in Python) throws an error because the file is not valid UTF-8.
I'm not completely sure what encoding the files use, but using cp1252 encoding doesn't throw any errors (that's the Windows-1252 encoding, a legacy default for many Windows components).
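For anyone who wants to reproduce this, here is a minimal sketch of how the failure can be located. The helper name `first_invalid_utf8_offset` and the sample bytes are hypothetical and not taken from the PHI-base files; the sample simply stores "é" as the single Windows-1252 byte 0xE9.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


# Hypothetical sample: "é" stored as the single Windows-1252 byte 0xE9.
sample = b"strain,caf\xe9\n"

offset = first_invalid_utf8_offset(sample)
print(offset)                   # position of the 0xE9 byte
print(sample.decode("cp1252"))  # decodes without error as Windows-1252
```

Note that a clean cp1252 decode is only weak evidence: nearly every byte value is defined in Windows-1252, so most byte sequences decode without error whether or not that was the encoding actually used.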
Windows-1252 isn't appropriate for PHI-base now (if it ever was) because some columns (e.g. 'Pathogen strain' and 'Host strain') contain characters that Windows-1252 cannot represent, such as the delta symbol (Δ). These characters end up replaced with question marks. Here's an example from the PHI-base 4.12 CSV:
Record ID          Record 11248
Pathogen strain    CA14 (?ku70 ?pyrG::AfpyrG)
Name: 11247, dtype: object
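A short sketch of how this kind of loss can happen. It assumes the original strain name contained Δ where the CSV now shows `?`, and that the exporting program substituted unencodable characters rather than failing; neither is confirmed for the actual PHI-base export.

```python
# Assumed original value: Δ in place of each "?" seen in the CSV.
original = "CA14 (Δku70 ΔpyrG::AfpyrG)"

# Δ (U+0394) has no Windows-1252 code point, so a strict encode fails.
try:
    original.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With errors="replace", each unencodable character becomes "?".
lossy = original.encode("cp1252", errors="replace")
round_tripped = lossy.decode("cp1252")

print(strict_ok)      # False
print(round_tripped)  # CA14 (?ku70 ?pyrG::AfpyrG)
```

Once the substitution has happened, the bytes on disk are ordinary question marks, so no choice of decoding can bring the Δ back.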
@martin2urban What program did you use to generate these CSV files? If I remember correctly, Microsoft Excel doesn't default to UTF-8 when saving as CSV and has to be manually configured to save in UTF-8 encoding.
Here's the list of files that fail to load as UTF-8:
- phi-base_v4-01_2016-05-01.csv
- phi-base_v4-03_2017-05-01.csv
- phi-base_v4-05_2018-05-15.csv
- phi-base_v4-11_2021-05-05.csv
- phi-base_v4-12_2021-09-02.csv
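A list like the one above can be produced with a check along these lines. This is a sketch only: the function name `files_failing_utf8` is hypothetical, and it assumes the CSV files sit together in one local directory.

```python
from pathlib import Path


def files_failing_utf8(directory):
    """Return the names of *.csv files in `directory` that are not valid UTF-8."""
    failing = []
    for path in sorted(Path(directory).glob("*.csv")):
        try:
            # Decode the raw bytes strictly; any invalid sequence raises.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            failing.append(path.name)
    return failing
```

Calling `files_failing_utf8(".")` from the directory holding the downloads would return the failing file names in sorted order.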
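We should really convert these files to UTF-8 by regenerating them from the original datasets (if possible), since re-encoding the existing files can't restore characters that have already been replaced with question marks.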
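If regenerating turns out not to be possible, a fallback would be to re-encode the existing files. The sketch below assumes the source files really are Windows-1252 (which is unconfirmed); `reencode_to_utf8` and its parameters are hypothetical names.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Decode `src` using `source_encoding` and write it to `dst` as UTF-8."""
    # Work on raw bytes so line endings pass through unchanged.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

This would make the files loadable as UTF-8, but any `?` already written in place of a symbol such as Δ stays a `?`.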
For completeness, here's the list of valid files: those that are either UTF-8 encoded, or contain no characters outside of the ASCII character set:
- phi-base_v4-00_2015-09-09.csv
- phi-base_v4-02_2016-10-03.csv
- phi-base_v4-04_2017-11-10.csv
- phi-base_v4-06_2018-12-05.csv
- phi-base_v4-07_2019-05-27.csv
- phi-base_v4-08_2019-09-16.csv
- phi-base_v4-09_2020-05-25.csv
- phi-base_v4-10_2020-11-02.csv
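To tell the two kinds of valid file apart (pure ASCII versus UTF-8 with non-ASCII characters), something like the following sketch could be used. The function name `classify_encoding` and its return labels are hypothetical, and `bytes.isascii()` requires Python 3.7 or later.

```python
def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'not utf-8'."""
    if data.isascii():
        # Pure ASCII is valid under UTF-8 and Windows-1252 alike.
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"
```

An ASCII-only file says nothing about how the exporting program was configured, so a release that passes this check could still have been saved with a legacy encoding.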