Develop multi ling 03 10 2019 no mb4#930
Conversation
… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)
…ch/webwork2 into locbug Conflicts: courses.dist/modelCourse/course.conf
… suggested by goehle
added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…
…lop_uft8_ver2 # Conflicts: # lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm # lib/WeBWorK/Utils.pm
also have a "home directory" printout for debugging.
…pty_passwords_during_checkPassword PR 911+910+904 hotfixes to mgage new_develop_candidate_01_01_2019
…base use the utf8mb4 charset. The new file will be added to the running Docker MariabDB 10.1 container using changes to be made to docker-compose.yml in a following commit.
mysql_enable_utf8 ==> mysql_enable_utf8mb4 "SET NAMES 'utf8'" ==> SET NAMES 'utf8mb4'" modify the length of several VARCHAR columns so key length < 1000 as utf8mb4 reserves 4 bytes per character. These changes sufficed to allow (in a Docker test system): creating and using the admin course in a new MariaDB 10.0 database which was set to use utf8mb4 as the default charset, running OPL_update, creating a test course and creating a homework set with the question setMAAtutorial/hello.pg in it, creating a user whose name needs UTF8 characters, and submitting a UTF8 encoded string as an answer to setMAAtutorial/hello.pg and storing the UTF8 answer in the past_answer table. Note: The Docker test was done after the docker-compose.yml was modified, which is in the following commit.
database to use utf8mb4. This needs to be done BEFORE the Docker database storage volume is created. A special version called docker-compose.yml.specialUTF8MB4-db-storage can be used to replace docker-compose.yml, and will change the name of the database container, the database named storage volume, and make related settings. Replace docker-compose.yml with the file will force Docker to run the containers in a manner which will not intefere with the non-utf8mb4 database used by the prior configuration. That approach would allow one to switch between the use of Docker containers using a utf8mb4 database and using an old style database, just by using the "correct" settings in docker-compose.yml for each "case".
To allow the 1000 byte limit for keys, the OPL_local_statistics was
changed to use ENGINE=MyISAM. That change allowed reducing the
source_file column from varchar(255) to varchar(245), while the
767 bytes key limit of ENGINE=InnoDB as was used by this table and
by OPL_global_statistics.sql (at present) would have limited this to
about 191 characters.
The OPL_problem_user table had 3 keys changed from using 100 characters
each to using only 80 characters each. The same changes were already
made to lib/WeBWorK/DB/Record/PastAnswer.pm for the same reason. Only
the first 245 characters of source_file are used as a key. Finally,
the CHARACTER SET of OPL_problem_user was explicitly changed from ascii
to utf8mb4.
The saved version of OPL_global_statistics.sql needs to be hand modified
as follows, until the GitHub master branch has these changes made. Only
after those changes can the file be loaded into a utf8mb4 database.
utf8 => utf8mb4
255 => 245
ENGINE=InnoDB => ENGINE=MyISAM
=================================
10c10
< /*!40101 SET NAMES utf8mb4 */;
---
> /*!40101 SET NAMES utf8 */;
24c24
< /*!40101 SET character_set_client = utf8mb4 */;
---
> /*!40101 SET character_set_client = utf8 */;
26c26
< `source_file` varchar(245) NOT NULL,
---
> `source_file` varchar(255) NOT NULL,
31c31
< ) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4;
---
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Note: When I tried to include XML::Simple near the start of the first "cpanm install" line, there was an error: Building and testing XMLRPC-Lite-0.717 ... ! Installing XMLRPC::Lite failed. See /root/.cpanm/work/1551887935.125/build.log for details. Retry with --force to force install it. so it was put into a second "cpanm install" line.
…2019_utf8mb4 Tani develop candidate 02 24 2019 utf8mb4
…le. (Not currently possible in read_whole_file inside the Safe compartment.
…3_10_2019 # Conflicts: # lib/Apache/WeBWorK.pm
Change FileManager to be compatible with UTF-8
| @@ -926,7 +926,16 @@ sub Upload { | |||
| if ($type eq 'Text') { | |||
| $upload->dispose; | |||
| $data =~ s/\r\n?/\n/g; | |||
There was a problem hiding this comment.
@mgage I think that the substitution $data =~ s/\r\n?/\n/g; which is run before the data is converted from a string of bytes which may be "raw" UTF8 to a string of characters may be what is sometimes corrupting UTF8 files uploaded in "Text" format.
There was a problem hiding this comment.
It's actually been working for me. I'm pretty sure however that $type is never equal to 'Text' in my examples. The files uploaded have been identified in binary when the mode is automatic. (you reported failures with the automatic mode, but I didn't see that. ) The isText subroutine has been experimented with for other reasons (in general the current hacks for testing for text need to be rethought with the advent of ift8mb4)
sub isText {
my $string = shift;
# return $string !~ m/[^\s\x20-\x7E]{4}/;
return utf8::is_utf8($string);
# return $string !~ m/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]{2}/;
}
Is what I have in my current copy of FileManager.pm
If I force uploading in text mode I got errors but forcing the print command to use :utf8 fixed the problem. (open(UPLOAD,">:encoding(UTF-8)",$file)).
There was a problem hiding this comment.
Regarding FileManager uploads mangling UTF8 files:
-
I apologize - I tested again and "automatic" is not causing trouble. I probably confused the file sizes. The bigger file size is the damaged file and the smaller file was the good file. I think I may have accidentally assumed the opposite.
-
I did some testing on a very small file I called
small_utf8.pgwhich has just the special characters from the file I was testing with separated by a blank line. The file contents were:
é
¿
á
ó
á
ñ
When uploaded in the 2 different modes - the resulting file had different sizes:
-rw-r--r-- 1 tani tani 23 Mar 14 22:39 small_utf8_uploaded_binary.pg
-rw-r--r-- 1 tani tani 35 Mar 14 22:39 small_utf8_uploaded_text.pg
The original file had 23 bytes and was the same as the result of the binary upload.
Using Unix od we can see the data in each file.
> od -a -t x1 small_utf8_uploaded_binary.pg
0000000 C ) nl nl B ? nl nl C ! nl nl C 3 nl nl
c3 a9 0a 0a c2 bf 0a 0a c3 a1 0a 0a c3 b3 0a 0a
0000020 C ! nl nl C 1 nl
c3 a1 0a 0a c3 b1 0a
0000027
> od -a -t x1 small_utf8_uploaded_text.pg
0000000 C etx B ) nl nl C stx B ? nl nl C etx B !
c3 83 c2 a9 0a 0a c3 82 c2 bf 0a 0a c3 83 c2 a1
0000020 nl nl C etx B 3 nl nl C etx B ! nl nl C etx
0a 0a c3 83 c2 b3 0a 0a c3 83 c2 a1 0a 0a c3 83
0000040 B 1 nl
c2 b1 0a
0000043
Notice how the six 2-byte special characters were each replaced by a 4-byte sequence. That added 12 extra bytes to the files, and when I try to edit the mangled file I see:
é
¿
á
ó
á
ñ
which corresponds to the sort of manged text seen in the mangled version of the original pg file.
- I think the issue is that the uploaded data is a sequence of bytes (octets), so it should not be saved using either
>:utf8as inif (open(UPLOAD,">:utf8",$file)) {print UPLOAD $data; close(UPLOAD)}
- or
>:encoding(UTF-8)as inif (open(UPLOAD,">:encoding(UTF-8)",$file)) {print UPLOAD $data; close(UPLOAD)}
- as those are meant for when the Perl string is already a sequence of Unicode characters which need to be converted to a sequence of UTF-8 octets.
- Instead it should just be written out to a file "as is" using the old fashioned:
if (open(UPLOAD,">$file")) {print UPLOAD $data; close(UPLOAD)}
- which should not modify the sequence of bytes.
- For me, that change leads to text mode giving an unmangled file.
- The question remains whether the substitution to change DOS/Mac line-endings to Unix line-endings can mess up multi-byte characters.
- According to https://en.wikipedia.org/wiki/UTF-8 multi-byte UTF-8 characters have the high-order bit set in all the bytes, so bytes with the high-order bit off must be an ASCII character in the regular location.
- In particular, the octet
\r= CR (Carriage return, 0x0D, 13 in decimal) cannot appear in a valid UTF-8 file with any meaning other than carriage return. - Thus, I conclude that the line-end substitution cannot cause any harm.
- Sorry, I did not do my homework before raising the suspicion that the line-end substitution might be related to the problem.
There was a problem hiding this comment.
I'm repeating much of what I just posted at #915 as it bears on this issue also.
I did some more reading/review about Perl and Unicode.
In short:
- Any "text" which arrives from outside Perl should be "decoded" (from whatever transfer encoding it was sent in) and thereby converted to a sequence of Perl characters (a "text" string) when a Perl "text string" is needed.
- Any "text" being sent out of Perl should be "encoded" from a sequence of Perl characters (a "text" string) to whatever transfer encoding is expected on the other side of the line.
When FileManager.pm is handling a file upload, the upload data is just a sequence of bytes from the file being uploaded. Those bytes should essentially just be written to the file being created. When the data is either in an 8-bit encoding or in UTF-8 encoding, the line-end substitution code is safe (as explained in the prior comment), but no encoding conversion code should be applied.
Bottom line - for this situation we want: if (open(UPLOAD,">$file")) {print UPLOAD $data; close(UPLOAD)} without any encoding layer involved.
References:
|
Maybe we can add a setting in defaults.conf which controls whether the (problematic) 3-byte max mysql Then anyone who needs to support a (old) version of mySQL which does not support |
| # We set the bimode for print to utf8 because some language options | ||
| # use utf8 characters | ||
| binmode(STDOUT, ":utf8"); | ||
| binmode(STDOUT, "encoding(UTF-8)"); |
There was a problem hiding this comment.
Should this be ":encoding(UTF-8)" with a colon?
There was a problem hiding this comment.
Yes. Believe there is supposed to be a colon in front of encoding.
I agree about creating a switch. The places I've identified where utf8mb4 is specifically mentioned and causes trouble for my old mysql installation (pre 5.3) which can't handle mb4 are:
- set up utf8mb4 switch???
- control line 95,99 in OPL-update
- LocationAddresses.pm ine 31
- SQL.pm line 272
The use of encoding(UTF-8) above didn't cause problems.
There was a problem hiding this comment.
Also:
- lib/WeBWorK/Utils/ListingDB.pm line 519 (
SET NAMESfor DB connection). - bin/update-OPL-statistics - line 85 (
CHARACTER SET utf8mb4vs. ascii)
I think all the changes can be found in the following 2 commits from my tree.
taniwallach@b840e28
taniwallach@472e837
from March 6 2019 commits to branch: https://github.com/taniwallach/webwork2/commits/tani_develop_candidate_02_24_2019_utf8mb4
Note the column sizes in the various database tables which exceed the index limit for utf8mb4 encoding can be made smaller already, which may save trouble for relatively new tables when conversion needs to be done.
|
I'm closing this since nearly everything in this pull request is also in #932 and can be made compatible with early versions of mysql by setting $ENABLE_UTF8MB4=0 in site.conf. The one set of files missing is the themes/math4-ar directory. Can someone remind me what this directory contains? Is it still needed? |
It was an early attempt (by @heiderich) to allow using Arabic and the right-to-left text direction in a course, but required a special set of templates. The main change in math4-ar was setting This was replaced in 5758400 (and e4b6358) by a modification of the main math4 files to use a new method to add a That approach is currently supporting Hebrew courses, should support Arabic also with |
This replaces develop_candidate_02_24_2019 PR #927 . This has additional tweaks to
encode(UTF-8) and :utf8 layers for input/output. It is set up so that while it uses utf8 it
does not the force the usage of utf8mb4 so it can work with mysql older than 5.3 with
legacy courses. Once we require utf8mb4 the code will no longer be compatible with
older versions of mysql/mariadb.
The reading of the PG version has been tweaked so that it is read directly from the file pg/VERSION.
The file webwork2/PG_VERSION is no longer used. I think this will improve compatibility with docker on some machines where linking PG_VERSION to pg/VERSION caused troubles.