Develop multi ling 03 10 2019 no mb4 by mgage · Pull Request #930 · openwebwork/webwork2

mgage · 2019-03-12T15:01:49Z

This replaces develop_candidate_02_24_2019 PR #927 . This has additional tweaks to
encode(UTF-8) and :utf8 layers for input/output. It is set up so that while it uses utf8 it
does not the force the usage of utf8mb4 so it can work with mysql older than 5.3 with
legacy courses. Once we require utf8mb4 the code will no longer be compatible with
older versions of mysql/mariadb.

The reading of the PG version has been tweaked so that it is read directly from the file pg/VERSION.
The file webwork2/PG_VERSION is no longer used. I think this will improve compatibility with docker on some machines where linking PG_VERSION to pg/VERSION caused troubles.

… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)

…bled.)

…ch/webwork2 into locbug Conflicts: courses.dist/modelCourse/course.conf

… suggested by goehle

added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…

…om/heiderich/webwork2 into locbug

…o locbug

…lop_uft8_ver2 # Conflicts: # lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm # lib/WeBWorK/Utils.pm

also have a "home directory" printout for debugging.

…pty_passwords_during_checkPassword PR 911+910+904 hotfixes to mgage new_develop_candidate_01_01_2019

…base use the utf8mb4 charset. The new file will be added to the running Docker MariabDB 10.1 container using changes to be made to docker-compose.yml in a following commit.

mysql_enable_utf8 ==> mysql_enable_utf8mb4 "SET NAMES 'utf8'" ==> SET NAMES 'utf8mb4'" modify the length of several VARCHAR columns so key length < 1000 as utf8mb4 reserves 4 bytes per character. These changes sufficed to allow (in a Docker test system): creating and using the admin course in a new MariaDB 10.0 database which was set to use utf8mb4 as the default charset, running OPL_update, creating a test course and creating a homework set with the question setMAAtutorial/hello.pg in it, creating a user whose name needs UTF8 characters, and submitting a UTF8 encoded string as an answer to setMAAtutorial/hello.pg and storing the UTF8 answer in the past_answer table. Note: The Docker test was done after the docker-compose.yml was modified, which is in the following commit.

database to use utf8mb4. This needs to be done BEFORE the Docker database storage volume is created. A special version called docker-compose.yml.specialUTF8MB4-db-storage can be used to replace docker-compose.yml, and will change the name of the database container, the database named storage volume, and make related settings. Replace docker-compose.yml with the file will force Docker to run the containers in a manner which will not intefere with the non-utf8mb4 database used by the prior configuration. That approach would allow one to switch between the use of Docker containers using a utf8mb4 database and using an old style database, just by using the "correct" settings in docker-compose.yml for each "case".

To allow the 1000 byte limit for keys, the OPL_local_statistics was changed to use ENGINE=MyISAM. That change allowed reducing the source_file column from varchar(255) to varchar(245), while the 767 bytes key limit of ENGINE=InnoDB as was used by this table and by OPL_global_statistics.sql (at present) would have limited this to about 191 characters. The OPL_problem_user table had 3 keys changed from using 100 characters each to using only 80 characters each. The same changes were already made to lib/WeBWorK/DB/Record/PastAnswer.pm for the same reason. Only the first 245 characters of source_file are used as a key. Finally, the CHARACTER SET of OPL_problem_user was explicitly changed from ascii to utf8mb4. The saved version of OPL_global_statistics.sql needs to be hand modified as follows, until the GitHub master branch has these changes made. Only after those changes can the file be loaded into a utf8mb4 database. utf8 => utf8mb4 255 => 245 ENGINE=InnoDB => ENGINE=MyISAM ================================= 10c10 < /*!40101 SET NAMES utf8mb4 */; --- > /*!40101 SET NAMES utf8 */; 24c24 < /*!40101 SET character_set_client = utf8mb4 */; --- > /*!40101 SET character_set_client = utf8 */; 26c26 < `source_file` varchar(245) NOT NULL, --- > `source_file` varchar(255) NOT NULL, 31c31 < ) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4; --- > ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Note: When I tried to include XML::Simple near the start of the first "cpanm install" line, there was an error: Building and testing XMLRPC-Lite-0.717 ... ! Installing XMLRPC::Lite failed. See /root/.cpanm/work/1551887935.125/build.log for details. Retry with --force to force install it. so it was put into a second "cpanm install" line.

…2019_utf8mb4 Tani develop candidate 02 24 2019 utf8mb4

…le. (Not currently possible in read_whole_file inside the Safe compartment.

…3_10_2019 # Conflicts: # lib/Apache/WeBWorK.pm

Change FileManager to be compatible with UTF-8

taniwallach · 2019-03-12T18:50:10Z

@@ -926,7 +926,16 @@ sub Upload {
 	if ($type eq 'Text') {
 		$upload->dispose;
 		$data =~ s/\r\n?/\n/g;


@mgage I think that the substitution $data =~ s/\r\n?/\n/g; which is run before the data is converted from a string of bytes which may be "raw" UTF8 to a string of characters may be what is sometimes corrupting UTF8 files uploaded in "Text" format.

It's actually been working for me. I'm pretty sure however that $type is never equal to 'Text' in my examples. The files uploaded have been identified in binary when the mode is automatic. (you reported failures with the automatic mode, but I didn't see that. ) The isText subroutine has been experimented with for other reasons (in general the current hacks for testing for text need to be rethought with the advent of ift8mb4)

sub isText {
my $string = shift;

# return $string !~ m/[^\s\x20-\x7E]{4}/; return utf8::is_utf8($string); # return $string !~ m/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]{2}/;

}

Is what I have in my current copy of FileManager.pm

If I force uploading in text mode I got errors but forcing the print command to use :utf8 fixed the problem. (open(UPLOAD,">:encoding(UTF-8)",$file)).

Regarding FileManager uploads mangling UTF8 files:

I apologize - I tested again and "automatic" is not causing trouble. I probably confused the file sizes. The bigger file size is the damaged file and the smaller file was the good file. I think I may have accidentally assumed the opposite.

I did some testing on a very small file I called small_utf8.pg which has just the special characters from the file I was testing with separated by a blank line. The file contents were:

é ¿ á ó á ñ

When uploaded in the 2 different modes - the resulting file had different sizes:

-rw-r--r-- 1 tani tani 23 Mar 14 22:39 small_utf8_uploaded_binary.pg -rw-r--r-- 1 tani tani 35 Mar 14 22:39 small_utf8_uploaded_text.pg

The original file had 23 bytes and was the same as the result of the binary upload.

Using Unix od we can see the data in each file.

> od -a -t x1 small_utf8_uploaded_binary.pg 0000000 C ) nl nl B ? nl nl C ! nl nl C 3 nl nl c3 a9 0a 0a c2 bf 0a 0a c3 a1 0a 0a c3 b3 0a 0a 0000020 C ! nl nl C 1 nl c3 a1 0a 0a c3 b1 0a 0000027 > od -a -t x1 small_utf8_uploaded_text.pg 0000000 C etx B ) nl nl C stx B ? nl nl C etx B ! c3 83 c2 a9 0a 0a c3 82 c2 bf 0a 0a c3 83 c2 a1 0000020 nl nl C etx B 3 nl nl C etx B ! nl nl C etx 0a 0a c3 83 c2 b3 0a 0a c3 83 c2 a1 0a 0a c3 83 0000040 B 1 nl c2 b1 0a 0000043

Notice how the six 2-byte special characters were each replaced by a 4-byte sequence. That added 12 extra bytes to the files, and when I try to edit the mangled file I see:

Ã© Â¿ Ã¡ Ã³ Ã¡ Ã±

which corresponds to the sort of manged text seen in the mangled version of the original pg file.

I think the issue is that the uploaded data is a sequence of bytes (octets), so it should not be saved using either

>:utf8 as in

if (open(UPLOAD,">:utf8",$file)) {print UPLOAD $data; close(UPLOAD)}

or >:encoding(UTF-8) as in

if (open(UPLOAD,">:encoding(UTF-8)",$file)) {print UPLOAD $data; close(UPLOAD)}

as those are meant for when the Perl string is already a sequence of Unicode characters which need to be converted to a sequence of UTF-8 octets.

Instead it should just be written out to a file "as is" using the old fashioned:

if (open(UPLOAD,">$file")) {print UPLOAD $data; close(UPLOAD)}

which should not modify the sequence of bytes.

For me, that change leads to text mode giving an unmangled file.

The question remains whether the substitution to change DOS/Mac line-endings to Unix line-endings can mess up multi-byte characters.

According to https://en.wikipedia.org/wiki/UTF-8 multi-byte UTF-8 characters have the high-order bit set in all the bytes, so bytes with the high-order bit off must be an ASCII character in the regular location.

In particular, the octet \r = CR (Carriage return, 0x0D, 13 in decimal) cannot appear in a valid UTF-8 file with any meaning other than carriage return.

Thus, I conclude that the line-end substitution cannot cause any harm.

Sorry, I did not do my homework before raising the suspicion that the line-end substitution might be related to the problem.

I'm repeating much of what I just posted at #915 as it bears on this issue also.

I did some more reading/review about Perl and Unicode.

In short:

Any "text" which arrives from outside Perl should be "decoded" (from whatever transfer encoding it was sent in) and thereby converted to a sequence of Perl characters (a "text" string) when a Perl "text string" is needed.

Any "text" being sent out of Perl should be "encoded" from a sequence of Perl characters (a "text" string) to whatever transfer encoding is expected on the other side of the line.

When FileManager.pm is handling a file upload, the upload data is just a sequence of bytes from the file being uploaded. Those bytes should essentially just be written to the file being created. When the data is either in an 8-bit encoding or in UTF-8 encoding, the line-end substitution code is safe (as explained in the prior comment), but no encoding conversion code should be applied.

Bottom line - for this situation we want: if (open(UPLOAD,">$file")) {print UPLOAD $data; close(UPLOAD)} without any encoding layer involved.

References:

https://mojolicious.org/perldoc/perlunitut

https://mojolicious.org/perldoc/perlunifaq

https://mojolicious.org/perldoc/perluniintro

taniwallach · 2019-03-12T19:57:33Z

@mgage

Maybe we can add a setting in defaults.conf which controls whether the (problematic) 3-byte max mysql utf8 charset is used instead of the full 4-byte max utf8mb4. (This difference exists only at the SQL level: in terms of the charset mySQL is storing, and in terms of the charset for the communications between WW and the SQL server).

Then anyone who needs to support a (old) version of mySQL which does not support utf8mb4 can flip the setting, while modern installations can just use utf8mb4 and avoid any possible issues with 4-byte characters?

taniwallach · 2019-03-12T20:02:18Z

 	# We set the bimode for print to utf8 because some language options
 	# use utf8 characters
-	binmode(STDOUT, ":utf8");
+	binmode(STDOUT, "encoding(UTF-8)");


Should this be ":encoding(UTF-8)" with a colon?

Yes. Believe there is supposed to be a colon in front of encoding.

I agree about creating a switch. The places I've identified where utf8mb4 is specifically mentioned and causes trouble for my old mysql installation (pre 5.3) which can't handle mb4 are:

set up utf8mb4 switch???

control line 95,99 in OPL-update

LocationAddresses.pm ine 31

SQL.pm line 272

The use of encoding(UTF-8) above didn't cause problems.

Also:

lib/WeBWorK/Utils/ListingDB.pm line 519 (SET NAMES for DB connection).

bin/update-OPL-statistics - line 85 (CHARACTER SET utf8mb4 vs. ascii)

I think all the changes can be found in the following 2 commits from my tree.
taniwallach@b840e28
taniwallach@472e837
from March 6 2019 commits to branch: https://github.com/taniwallach/webwork2/commits/tani_develop_candidate_02_24_2019_utf8mb4

Note the column sizes in the various database tables which exceed the index limit for utf8mb4 encoding can be made smaller already, which may save trouble for relatively new tables when conversion needs to be done.

…19_no_mb4

mgage · 2019-04-07T14:20:42Z

I'm closing this since nearly everything in this pull request is also in #932 and can be made compatible with early versions of mysql by setting $ENABLE_UTF8MB4=0 in site.conf. The one set of files missing is the themes/math4-ar directory. Can someone remind me what this directory contains? Is it still needed?

taniwallach · 2019-04-07T14:40:47Z

The one set of files missing is the themes/math4-ar directory. Can someone remind me what this directory contains? Is it still needed?

It was an early attempt (by @heiderich) to allow using Arabic and the right-to-left text direction in a course, but required a special set of templates. The main change in math4-ar was setting lang="ar" dir="rtl" in the main HTML tag of 4 files.

This was replaced in 5758400 (and e4b6358) by a modification of the main math4 files to use a new method to add a lang and when needed a dir tag as needed based on the language set for the course: 

That approach is currently supporting Hebrew courses, should support Arabic also with dir=rtl set, and other course languages without triggering right-to-left text direction. Adding support to turn on RTL mode for other right to left languages can be done in the future by adding suitable cases to the output_course_lang_and_dir method in lib/WeBWorK/ContentGenerator.pm.

goehle and others added 30 commits June 20, 2016 14:13

Tweaking how UTF8 encoding is done.

c7830d1

Adding more UTF8 support.

7f6a18a

Cleanup

5c20361

Added support for utf8 on xml

a54c760

Encoded results from freeze so they don't become long utf8 characters…

5c44802

… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)

Fix some broken css

fa1f954

small change.

ad37eb1

Added hardcopy support (assumin gyou have the fonts installed and ena…

0073969

…bled.)

Tracking down more open commands.

45f8bc1

corrected file encodings

7998565

Merge branch 'convert_to_utf8_encoding' of https://github.com/heideri…

eeb63a3

…ch/webwork2 into locbug Conflicts: courses.dist/modelCourse/course.conf

added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…

e8b9947

… suggested by goehle

Merge pull request #14 from heiderich/Encode-error

ceb43be

added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…

Merge branch 'add_maketext_calls_to_achievements' of https://github.c…

e1b5553

…om/heiderich/webwork2 into locbug

Merge branch 'locbug' of https://github.com/goehle/webwork2 into locbug

70ae45a

Merge branch 'develop' of https://github.com/openwebwork/webwork2 int…

80b663e

…o locbug

Merge branch 'develop' of https://github.com/openwebwork/webwork2 int…

f2b1564

…o locbug

Freeze/thaw to base64 because mysql fields are varchar

b0bfe73

update localization files.

200c5b8

Support for transition to freeze_base64

ea7147f

Polishing error handling

3d3795c

Misspelled method.

eaf0f2f

Merge branch 'locbug' of https://github.com/goehle/webwork2 into locbug

239790f

Used utf8::valid when I should have used utf8::is_utf8

1b083a0

Whoops.

6e2fbfd

Wrong maketext

e85eb00

Merge branch 'locbug' of https://github.com/goehle/webwork2 into deve…

054666f

…lop_uft8_ver2 # Conflicts: # lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm # lib/WeBWorK/Utils.pm

local experimental changes

7541f5d

add utf-8 support to OPL-update (needs testing)

20c3237

add use utf8; to WWSafe.pm

4bee27e

mgage and others added 16 commits January 9, 2019 07:09

Include Encode as a module which is shared with the Safe compartment

3f83a1b

Add link to WeBWorK/webwork2 library

2eb2a2d

also have a "home directory" printout for debugging.

Merge pull request #23 from taniwallach/tani_forbid_whitespace_and_em…

01cba1e

…pty_passwords_during_checkPassword PR 911+910+904 hotfixes to mgage new_develop_candidate_01_01_2019

Reconcile copyright dates

23ddd22

Add a special config file for use in Docker, to have the MariaDB data…

d804537

…base use the utf8mb4 charset. The new file will be added to the running Docker MariabDB 10.1 container using changes to be made to docker-compose.yml in a following commit.

Merge pull request #24 from taniwallach/tani_develop_candidate_02_24_…

d271ec5

…2019_utf8mb4 Tani develop candidate 02 24 2019 utf8mb4

Merge branch 'develop' into develop_candidate_02_24_2019

fef2d56

Moved input and output encodings to :encoding(UTF-8) whereever possib…

362794f

…le. (Not currently possible in read_whole_file inside the Safe compartment.

update comments

a131163

Merge branch 'develop_candidate_02_24_2019' into develop_multi_ling_0…

612e3b3

…3_10_2019 # Conflicts: # lib/Apache/WeBWorK.pm

Change the method for determining PG version.

3f4777f

Change FileManager to be compatible with UTF-8

clean up.

4476890

mgage mentioned this pull request Mar 12, 2019

Develop candidate 02 24 2019 #927

Closed

taniwallach reviewed Mar 12, 2019

View reviewed changes

set OPL scripts to work with latin1 for this branch

984e061

This was referenced Mar 14, 2019

Tani multi library v01 - improved beta version #876

Closed

need to manage PG_VERSION better #929

Closed

Develop multi ling 03 14 2019 mb4 #932

Closed

Bring Dockerfile in line with version in branch develop_multi......20…

1202849

…19_no_mb4

mgage closed this Apr 7, 2019

mgage deleted the develop_multi_ling_03_10_2019_no_mb4 branch April 15, 2019 00:48

taniwallach mentioned this pull request Aug 9, 2019

Investigate WWSafe.pm vs. Perl's standard Safe.pm - UTF-8 motivation #982

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Develop multi ling 03 10 2019 no mb4#930

Develop multi ling 03 10 2019 no mb4#930
mgage wants to merge 106 commits into
openwebwork:developfrom
mgage:develop_multi_ling_03_10_2019_no_mb4

mgage commented Mar 12, 2019

Uh oh!

taniwallach Mar 12, 2019

Uh oh!

mgage Mar 12, 2019

Uh oh!

taniwallach Mar 14, 2019

Uh oh!

taniwallach Mar 14, 2019

Uh oh!

taniwallach commented Mar 12, 2019

Uh oh!

taniwallach Mar 12, 2019

Uh oh!

mgage Mar 12, 2019 •

edited

Loading

Uh oh!

taniwallach Mar 15, 2019

Uh oh!

mgage commented Apr 7, 2019

Uh oh!

taniwallach commented Apr 7, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

mgage commented Mar 12, 2019

Uh oh!

taniwallach Mar 12, 2019

Choose a reason for hiding this comment

Uh oh!

mgage Mar 12, 2019

Choose a reason for hiding this comment

Uh oh!

taniwallach Mar 14, 2019

Choose a reason for hiding this comment

Uh oh!

taniwallach Mar 14, 2019

Choose a reason for hiding this comment

Uh oh!

taniwallach commented Mar 12, 2019

Uh oh!

taniwallach Mar 12, 2019

Choose a reason for hiding this comment

Uh oh!

mgage Mar 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taniwallach Mar 15, 2019

Choose a reason for hiding this comment

Uh oh!

mgage commented Apr 7, 2019

Uh oh!

taniwallach commented Apr 7, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mgage Mar 12, 2019 •

edited

Loading