Skip to content

Develop multi ling 03 10 2019 no mb4#930

Closed
mgage wants to merge 106 commits into
openwebwork:developfrom
mgage:develop_multi_ling_03_10_2019_no_mb4
Closed

Develop multi ling 03 10 2019 no mb4#930
mgage wants to merge 106 commits into
openwebwork:developfrom
mgage:develop_multi_ling_03_10_2019_no_mb4

Conversation

@mgage
Copy link
Copy Markdown
Member

@mgage mgage commented Mar 12, 2019

This replaces develop_candidate_02_24_2019 PR #927 . This has additional tweaks to
encode(UTF-8) and :utf8 layers for input/output. It is set up so that while it uses utf8 it
does not the force the usage of utf8mb4 so it can work with mysql older than 5.3 with
legacy courses. Once we require utf8mb4 the code will no longer be compatible with
older versions of mysql/mariadb.

The reading of the PG version has been tweaked so that it is read directly from the file pg/VERSION.
The file webwork2/PG_VERSION is no longer used. I think this will improve compatibility with docker on some machines where linking PG_VERSION to pg/VERSION caused troubles.

goehle and others added 30 commits June 20, 2016 14:13
… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)
…ch/webwork2 into locbug

Conflicts:
	courses.dist/modelCourse/course.conf
added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…
…lop_uft8_ver2

# Conflicts:
#	lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm
#	lib/WeBWorK/Utils.pm
mgage and others added 16 commits January 9, 2019 07:09
also have a "home directory" printout  for debugging.
…pty_passwords_during_checkPassword

PR 911+910+904 hotfixes to mgage new_develop_candidate_01_01_2019
…base

use the utf8mb4 charset. The new file will be added to the running Docker
MariabDB 10.1 container using changes to be made to docker-compose.yml in
a following commit.
	mysql_enable_utf8    ==>    mysql_enable_utf8mb4
	"SET NAMES 'utf8'"   ==>    SET NAMES 'utf8mb4'"
	modify the length of several VARCHAR columns so key length < 1000
	   as utf8mb4 reserves 4 bytes per character.
These changes sufficed to allow (in a Docker test system):
	creating and using the admin course in a new MariaDB 10.0 database
		which was set to use utf8mb4 as the default charset,
	running OPL_update,
	creating a test course and creating a homework set with the question
		setMAAtutorial/hello.pg in it,
	creating a user whose name needs UTF8 characters, and
	submitting a UTF8 encoded string as an answer to
		setMAAtutorial/hello.pg and storing the UTF8 answer in the
		past_answer table.
Note: The Docker test was done after the docker-compose.yml was modified,
which is in the following commit.
database to use utf8mb4. This needs to be done BEFORE
the Docker database storage volume is created.

A special version called
	docker-compose.yml.specialUTF8MB4-db-storage
can be used to replace docker-compose.yml, and will
change the name of the database container, the database
named storage volume, and make related settings. Replace
docker-compose.yml with the file will force Docker to
run the containers in a manner which will not intefere with
the non-utf8mb4 database used by the prior configuration.
That approach would allow one to switch between the use of
Docker containers using a utf8mb4 database and using an
old style database, just by using the "correct" settings
in docker-compose.yml for each "case".
To allow the 1000 byte limit for keys, the OPL_local_statistics was
changed to use ENGINE=MyISAM. That change allowed reducing the
source_file column from varchar(255) to varchar(245), while the
767 bytes key limit of ENGINE=InnoDB as was used by this table and
by OPL_global_statistics.sql (at present) would have limited this to
about 191 characters.

The OPL_problem_user table had 3 keys changed from using 100 characters
each to using only 80 characters each. The same changes were already
made to lib/WeBWorK/DB/Record/PastAnswer.pm for the same reason. Only
the first 245 characters of source_file are used as a key. Finally,
the CHARACTER SET of OPL_problem_user was explicitly changed from ascii
to utf8mb4.

The saved version of OPL_global_statistics.sql needs to be hand modified
as follows, until the GitHub master branch has these changes made. Only
after those changes can the file be loaded into a utf8mb4 database.
        utf8    => utf8mb4
        255     => 245
  ENGINE=InnoDB => ENGINE=MyISAM
=================================

10c10
< /*!40101 SET NAMES utf8mb4 */;
---
> /*!40101 SET NAMES utf8 */;
24c24
< /*!40101 SET character_set_client = utf8mb4 */;
---
> /*!40101 SET character_set_client = utf8 */;
26c26
<   `source_file` varchar(245) NOT NULL,
---
>   `source_file` varchar(255) NOT NULL,
31c31
< ) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4;
---
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Note: When I tried to include XML::Simple near the start of the first
"cpanm install" line, there was an error:
   Building and testing XMLRPC-Lite-0.717 ... ! Installing XMLRPC::Lite failed.
   See /root/.cpanm/work/1551887935.125/build.log for details.
   Retry with --force to force install it.
so it was put into a second "cpanm install" line.
…2019_utf8mb4

Tani develop candidate 02 24 2019 utf8mb4
…le. (Not currently possible in read_whole_file inside the Safe compartment.
…3_10_2019

# Conflicts:
#	lib/Apache/WeBWorK.pm
Change FileManager to be compatible with UTF-8
@@ -926,7 +926,16 @@ sub Upload {
if ($type eq 'Text') {
$upload->dispose;
$data =~ s/\r\n?/\n/g;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgage I think that the substitution $data =~ s/\r\n?/\n/g; which is run before the data is converted from a string of bytes which may be "raw" UTF8 to a string of characters may be what is sometimes corrupting UTF8 files uploaded in "Text" format.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually been working for me. I'm pretty sure however that $type is never equal to 'Text' in my examples. The files uploaded have been identified in binary when the mode is automatic. (you reported failures with the automatic mode, but I didn't see that. ) The isText subroutine has been experimented with for other reasons (in general the current hacks for testing for text need to be rethought with the advent of ift8mb4)

sub isText {
my $string = shift;

#	return $string !~ m/[^\s\x20-\x7E]{4}/;
return utf8::is_utf8($string);
# return $string !~ m/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]{2}/;

}

Is what I have in my current copy of FileManager.pm

If I force uploading in text mode I got errors but forcing the print command to use :utf8 fixed the problem. (open(UPLOAD,">:encoding(UTF-8)",$file)).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding FileManager uploads mangling UTF8 files:

  1. I apologize - I tested again and "automatic" is not causing trouble. I probably confused the file sizes. The bigger file size is the damaged file and the smaller file was the good file. I think I may have accidentally assumed the opposite.

  2. I did some testing on a very small file I called small_utf8.pg which has just the special characters from the file I was testing with separated by a blank line. The file contents were:

é

¿

á

ó

á

ñ

When uploaded in the 2 different modes - the resulting file had different sizes:

-rw-r--r-- 1 tani tani 23 Mar 14 22:39 small_utf8_uploaded_binary.pg
-rw-r--r-- 1 tani tani 35 Mar 14 22:39 small_utf8_uploaded_text.pg

The original file had 23 bytes and was the same as the result of the binary upload.

Using Unix od we can see the data in each file.

> od -a -t x1 small_utf8_uploaded_binary.pg

0000000   C   )  nl  nl   B   ?  nl  nl   C   !  nl  nl   C   3  nl  nl
         c3  a9  0a  0a  c2  bf  0a  0a  c3  a1  0a  0a  c3  b3  0a  0a
0000020   C   !  nl  nl   C   1  nl
         c3  a1  0a  0a  c3  b1  0a
0000027

> od -a -t x1 small_utf8_uploaded_text.pg 

0000000   C etx   B   )  nl  nl   C stx   B   ?  nl  nl   C etx   B   !
         c3  83  c2  a9  0a  0a  c3  82  c2  bf  0a  0a  c3  83  c2  a1
0000020  nl  nl   C etx   B   3  nl  nl   C etx   B   !  nl  nl   C etx
         0a  0a  c3  83  c2  b3  0a  0a  c3  83  c2  a1  0a  0a  c3  83
0000040   B   1  nl
         c2  b1  0a
0000043

Notice how the six 2-byte special characters were each replaced by a 4-byte sequence. That added 12 extra bytes to the files, and when I try to edit the mangled file I see:

é

¿

á

ó

á

ñ

which corresponds to the sort of manged text seen in the mangled version of the original pg file.

  1. I think the issue is that the uploaded data is a sequence of bytes (octets), so it should not be saved using either
  • >:utf8 as in
    • if (open(UPLOAD,">:utf8",$file)) {print UPLOAD $data; close(UPLOAD)}
  • or >:encoding(UTF-8) as in
    • if (open(UPLOAD,">:encoding(UTF-8)",$file)) {print UPLOAD $data; close(UPLOAD)}
  • as those are meant for when the Perl string is already a sequence of Unicode characters which need to be converted to a sequence of UTF-8 octets.
  • Instead it should just be written out to a file "as is" using the old fashioned:
    • if (open(UPLOAD,">$file")) {print UPLOAD $data; close(UPLOAD)}
  • which should not modify the sequence of bytes.
  • For me, that change leads to text mode giving an unmangled file.
  1. The question remains whether the substitution to change DOS/Mac line-endings to Unix line-endings can mess up multi-byte characters.
  • According to https://en.wikipedia.org/wiki/UTF-8 multi-byte UTF-8 characters have the high-order bit set in all the bytes, so bytes with the high-order bit off must be an ASCII character in the regular location.
  • In particular, the octet \r = CR (Carriage return, 0x0D, 13 in decimal) cannot appear in a valid UTF-8 file with any meaning other than carriage return.
  • Thus, I conclude that the line-end substitution cannot cause any harm.
  • Sorry, I did not do my homework before raising the suspicion that the line-end substitution might be related to the problem.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm repeating much of what I just posted at #915 as it bears on this issue also.

I did some more reading/review about Perl and Unicode.

In short:

  • Any "text" which arrives from outside Perl should be "decoded" (from whatever transfer encoding it was sent in) and thereby converted to a sequence of Perl characters (a "text" string) when a Perl "text string" is needed.
  • Any "text" being sent out of Perl should be "encoded" from a sequence of Perl characters (a "text" string) to whatever transfer encoding is expected on the other side of the line.

When FileManager.pm is handling a file upload, the upload data is just a sequence of bytes from the file being uploaded. Those bytes should essentially just be written to the file being created. When the data is either in an 8-bit encoding or in UTF-8 encoding, the line-end substitution code is safe (as explained in the prior comment), but no encoding conversion code should be applied.

Bottom line - for this situation we want: if (open(UPLOAD,">$file")) {print UPLOAD $data; close(UPLOAD)} without any encoding layer involved.

References:

@taniwallach
Copy link
Copy Markdown
Member

@mgage

Maybe we can add a setting in defaults.conf which controls whether the (problematic) 3-byte max mysql utf8 charset is used instead of the full 4-byte max utf8mb4. (This difference exists only at the SQL level: in terms of the charset mySQL is storing, and in terms of the charset for the communications between WW and the SQL server).

Then anyone who needs to support a (old) version of mySQL which does not support utf8mb4 can flip the setting, while modern installations can just use utf8mb4 and avoid any possible issues with 4-byte characters?

Comment thread lib/Apache/WeBWorK.pm
# We set the bimode for print to utf8 because some language options
# use utf8 characters
binmode(STDOUT, ":utf8");
binmode(STDOUT, "encoding(UTF-8)");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be ":encoding(UTF-8)" with a colon?

Copy link
Copy Markdown
Member Author

@mgage mgage Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Believe there is supposed to be a colon in front of encoding.

I agree about creating a switch. The places I've identified where utf8mb4 is specifically mentioned and causes trouble for my old mysql installation (pre 5.3) which can't handle mb4 are:

  • set up utf8mb4 switch???
  • control line 95,99 in OPL-update
  • LocationAddresses.pm ine 31
  • SQL.pm line 272

The use of encoding(UTF-8) above didn't cause problems.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also:

  • lib/WeBWorK/Utils/ListingDB.pm line 519 (SET NAMES for DB connection).
  • bin/update-OPL-statistics - line 85 (CHARACTER SET utf8mb4 vs. ascii)

I think all the changes can be found in the following 2 commits from my tree.
taniwallach@b840e28
taniwallach@472e837
from March 6 2019 commits to branch: https://github.com/taniwallach/webwork2/commits/tani_develop_candidate_02_24_2019_utf8mb4

Note the column sizes in the various database tables which exceed the index limit for utf8mb4 encoding can be made smaller already, which may save trouble for relatively new tables when conversion needs to be done.

@mgage
Copy link
Copy Markdown
Member Author

mgage commented Apr 7, 2019

I'm closing this since nearly everything in this pull request is also in #932 and can be made compatible with early versions of mysql by setting $ENABLE_UTF8MB4=0 in site.conf. The one set of files missing is the themes/math4-ar directory. Can someone remind me what this directory contains? Is it still needed?

@mgage mgage closed this Apr 7, 2019
@taniwallach
Copy link
Copy Markdown
Member

The one set of files missing is the themes/math4-ar directory. Can someone remind me what this directory contains? Is it still needed?

It was an early attempt (by @heiderich) to allow using Arabic and the right-to-left text direction in a course, but required a special set of templates. The main change in math4-ar was setting lang="ar" dir="rtl" in the main HTML tag of 4 files.

This was replaced in 5758400 (and e4b6358) by a modification of the main math4 files to use a new method to add a lang and when needed a dir tag as needed based on the language set for the course: <!--#output_course_lang_and_dir-->

That approach is currently supporting Hebrew courses, should support Arabic also with dir=rtl set, and other course languages without triggering right-to-left text direction. Adding support to turn on RTL mode for other right to left languages can be done in the future by adding suitable cases to the output_course_lang_and_dir method in lib/WeBWorK/ContentGenerator.pm.

@mgage mgage deleted the develop_multi_ling_03_10_2019_no_mb4 branch April 15, 2019 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants