PHI-base CSV releases should use UTF-8 encoding

Trying to open the PHI-base 4.12 CSV file as UTF-8 (in Python) raises a `UnicodeDecodeError` because the file is not valid UTF-8.

I'm not completely sure what encoding the files use, but decoding them as `cp1252` doesn't raise any errors (that's the [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) encoding, a legacy default for many Windows components).
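For anyone who wants to reproduce this, here is a minimal sketch of the check I mean. The helper name `read_text_with_fallback` is hypothetical, and the list of candidate encodings is just my guess at what's worth trying:

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes the file.

    A successful cp1252 decode does not prove the file really is cp1252;
    it only means none of its bytes are undefined in that code page.
    """
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode {path}")
```

Calling `read_text_with_fallback("phi-base_v4-12_2021-09-02.csv")` should report `cp1252` for the affected release, assuming the file is in the working directory.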

Windows-1252 isn't appropriate for PHI-base now (if it ever was) because some columns (e.g. 'Pathogen strain' and 'Host strain') contain characters that Windows-1252 cannot represent, such as the Greek capital delta (Δ). These characters end up replaced with question marks, so the original information is lost. Here's an example from the PHI-base 4.12 CSV:

```
Record ID                        Record 11248
Pathogen strain    CA14 (?ku70 ?pyrG::AfpyrG)
Name: 11247, dtype: object
```

@martin2urban What program did you use to generate these CSV files? If I remember correctly, Microsoft Excel doesn't default to UTF-8 when saving as CSV and has to be manually configured to save in UTF-8 encoding.
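The question marks in the example above are consistent with a lossy export to Windows-1252. Here is a small sketch of that effect in Python. Note that the `strain` value is my assumption about what the original text was, inferred from where the question marks sit; I haven't checked it against the source data:

```python
# Assumed original value, inferred from the "?" positions in the CSV.
strain = "CA14 (Δku70 ΔpyrG::AfpyrG)"

# A strict encode fails because Δ has no code point in Windows-1252.
try:
    strain.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# errors="replace" substitutes "?" for every character that cannot be encoded.
lossy = strain.encode("cp1252", errors="replace").decode("cp1252")

print(encodable)
print(lossy)
```

Once the substitution has happened, there is no way to tell from the file alone which character each `?` used to be.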

Here's the list of files that fail to load as UTF-8:

* phi-base_v4-01_2016-05-01.csv
* phi-base_v4-03_2017-05-01.csv
* phi-base_v4-05_2018-05-15.csv
* phi-base_v4-11_2021-05-05.csv
* phi-base_v4-12_2021-09-02.csv
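A list like the one above can be produced with a check along these lines. This is a sketch, not necessarily the exact script I ran; the function names are hypothetical and it assumes all the release CSVs sit in one directory:

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_csvs(directory):
    """Return the sorted names of *.csv files in `directory` that are not valid UTF-8."""
    return sorted(
        p.name for p in Path(directory).glob("*.csv") if not is_valid_utf8(p)
    )
```

Pure ASCII files pass this check too, since ASCII is a subset of UTF-8.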

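We should really convert these files to UTF-8 by regenerating them from the original datasets (if possible). Re-encoding the existing CSV files isn't enough, because any characters that were already replaced with question marks can't be recovered from them.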
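If the regeneration ends up going through Python, writing UTF-8 is mostly a matter of passing the encoding explicitly. Here is a minimal sketch using the standard `csv` module; the function name `write_utf8_csv` and the shape of `header` and `rows` are assumptions, since I don't know what form the original datasets are in:

```python
import csv


def write_utf8_csv(path, header, rows):
    """Write a header and rows to `path` as a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

Setting `encoding` explicitly matters because the default used by `open` depends on the platform and locale, which may be how a legacy encoding slipped in to begin with.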

- - -

For completeness, here's the list of valid files: those that are either UTF-8 encoded, or contain no characters outside of the ASCII character set:

* phi-base_v4-00_2015-09-09.csv
* phi-base_v4-02_2016-10-03.csv
* phi-base_v4-04_2017-11-10.csv
* phi-base_v4-06_2018-12-05.csv
* phi-base_v4-07_2019-05-27.csv
* phi-base_v4-08_2019-09-16.csv
* phi-base_v4-09_2020-05-25.csv
* phi-base_v4-10_2020-11-02.csv

