New develop candidate 01 01 2019 by mgage · Pull Request #908 · openwebwork/webwork2

mgage · 2019-01-02T03:49:12Z

This includes some fixes which have been sitting on my laptop for a while. I think this is an improved version of develop_candidate and I'm withdrawing that pull request.

… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)

…bled.)

…ch/webwork2 into locbug Conflicts: courses.dist/modelCourse/course.conf

… suggested by goehle

added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…

…om/heiderich/webwork2 into locbug

…o locbug

…lop_uft8_ver2 # Conflicts: # lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm # lib/WeBWorK/Utils.pm

as an undef value of $self->{password} is needed for the authenticate() function to detect session-timeout. 2. Prevent the localized "inactivity timeout" error message from being overridden by an "authentication failed" error message when it should not be overriden.

taniwallach · 2019-01-09T19:48:29Z

Hi Tani, Without having researched as much as you have I also think we should go to using the full utf8mb. What I have been doing is to change as much of the perl code to allow utf8mb. As far as I can tell in perl utf8 and utf8mb are synonymous. I think there is a reasonable chance that develop would work well with mysql once the limits on certain field sizes have been adjusted. I’m worried about what it would take to convert data (e.g. problems) that had been stored in utf8 or smaller format to behave properly when the format is changed to utf8mb4.

As far as I understand it the difference between utf8mb4 and utf8 is only in mySQL and MariaDB, as mySQL decided to create a charset in mySQL called utf8 which is restricted to Unicode characters which need 3 or fewer bytes in the "real" UTF-8 encoding. Thus there are no Perl side issues except for those triggered by trying to push disallowed UTF-8 characters into mySQL tables which do not permit them (where utf8mb4 would allow them) or expecting such characters to come back out of the database which could not handle them.

Thus changing from mySQL utf8 to utf8mb4 cannot lose data. The challenge is the change from the current latin1 encoding to utf8mb4.

Apparently using utf8 in mySQL can also trigger problems with encrypted data pushed into VARCHAR columns, as certain byte sequences are disallowed. (But that is bad practice as binary data should not be stored in VARCHAR columns.)

The recommendation seems to be to open connections with SET NAMES utf8mb4 COLLATE utf8mb4_unicode_520_ci instead of SET NAMES utf8 and to explicitly set the charset and collation when a database is first created.

References:

References on the conversion process (various options):

https://codex.wordpress.org/Converting_Database_Character_Sets
https://dba.stackexchange.com/questions/123572/convert-mysql-database-from-latin1-to-utf8mb4-and-take-care-of-german-umlauts
A walkthrough via a SQL dump / modify / restore: https://gist.github.com/mdemblani/34d30015f24e1927736f189b88be143a
Another walkthrough: https://www.percona.com/blog/2018/04/10/migrating-database-charsets-to-utf8mb4/ which has some comments on how to use ALTER TABLE if necessary.
That one referred to an explanation of why it uses VARBINARY as an intermediate stage in ALTER TABLES in https://www.tusacentral.net/joomla/index.php/mysql-blogs/162-understand-what-happens-in-mysql-when-using-utf-string-with-latin1-encoding.html
https://nixmash.com/post/i-never-did-like-latin-from-mysql-latin-1-to-utf8mb4
https://medium.com/@manish_demblani/breaking-out-from-the-mysql-character-set-hell-24c6a306e1e5

taniwallach · 2019-01-09T20:40:20Z

It might be worth seeing what happens if we convert to innodb format. Moodle converted some time ago and the prevailing wisdom was that that was the way to go. We did some experiments on this early on (when the push to innodb was just getting started) and found that using innodb slowed things down — in particular the library script OPL-update was very slow. I think it’s worth doing that experiment again before we go much further with the utf8/utf8mb4 database question.

@mgage - about InnoDB vs myISAM - this is somewhat off topic from the encoding/internationalization issue, but this thread is a good a place as anywhere else for now... The only real reason it matters here is the different limits on key sizes and how that effects the field sizes or a setting to use only a suitable portion of a field in the key.

I agree, this is definitely something to consider and test. The big challenge is figuring out how to do reasonable performance comparisons (and collect suitable metrics) under some sort of real load.

Are there any people with DB performance monitoring / tuning experience in the active WW community?

There is a series of articles about monitoring mySQL performance monitoring at https://www.datadoghq.com/blog/monitoring-mysql-performance-metrics/

Some possible free+open monitoring tools:

I found an interesting article on myISAM vs InnoDB at https://www.liquidweb.com/kb/mysql-performance-myisam-vs-innodb/ and the short summary seems to be:

MyISAM is better for large tables which are read-heavy and write-light.
InnoDB is better for tables which are write-heavy as it uses "row level" rather than "table level" locking.
Several other sites I looked at had pretty similar conclusions.

I suspect that bench-marking during OPL table creation is not that helpful, as OPL-update is an infrequent activity (even if annoying slow) while the OPL tables are ready very often when the library browser is used with no changes being made. Thus the read speeds during production for these tables are more important that the creation time. This hints that the OPL tables should probably stay as MyISAM, as it is reported to be faster at selects.

On the other hand, some of the course data tables (at least those which store answers and scores) are tables which do a lot of writing, so may be better in InnoDB.

Most likely, the best approach would to consider each type of table and how it is used during production and select the correct engine type for that table. Quite a number of web posts discuss the options and advantages of using some tables of each type in a single database.

One such thread on using both table types in one database is: https://dba.stackexchange.com/questions/385/is-it-common-practice-to-mix-innodb-and-myisam-tables-on-same-server

…pty_passwords_during_checkPassword PR 911+910+904 hotfixes to mgage new_develop_candidate_01_01_2019

mgage · 2019-01-23T00:33:43Z

We are getting "The file does not appear to be a text file" when trying to
read pg files using FileManager on demo.webwork.rochester.edu which is using
new develop candidate 01 01 2019 (#908)

The code being triggered seems to be line 511 in FileManager
and isText() is defined at line 1275 by:

##################################################
#
# Check if a string is plain text
# (i.e., doesn't contain four non-regular
# characters in a row.)
#
##################################################
#
# Check if a string is plain text
#
sub isText {
	my $string = shift;
	#	return $string !~ m/[^\s\x20-\x7E]{4}/;
	return utf8::is_utf8($string);
	# return $string !~ m/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]{2}/;
}

The read file branch has been changed substantially in this PR but it hasn't caused trouble anywhere else. isText is only used in this file. Avoiding the line return utf8::is_utf8($string) removes the error.
I haven't yet investigated further about the behavior of utf8::is_utf8. I'll do that on my own
development machine.

The symptoms: On demo.webwork.rochester.edu/webwork2/2019_JMM_Baltimore (and any other course including UR101 which you can access with profa/profa ) go to the file manager and try to read or view any .pg file. You will get the "is not a text file" error.

dpvc · 2019-01-23T15:43:42Z

I don't think that utf8::is_utf8($string); is the right thing to use, here. The isText() function is supposed to tell if the contents of a string is text rather than raw binary data, but that is not what utf8::is_utf8() does. That function is only about how the string is stored internally, not about the actual contents of the string, so is not the appropriate test for this.

Of course, there is no actual way to tell if a string is text versus binary, and so isText() was just a heuristic to try to work it out. Of course, it derives from the olden days, when text files were ASCII files, and so would not contain many non-ASCII characters, so the test was to look for four non-ASCII characters in a row as an indication of a binary file.

Now, with Unicode files, that is no longer a sufficient test, since they can contain lots of non-ASCII characters. But whether the string is stored internally in UTF8 encoding or not has nothing to do with its contents, so is definitely the wrong test.

Perhaps a better test would be to look for a high percentage of spaces and newlines (or carriage returns)? Text files like .pg files and such should have a higher concentration of those than a binary file. That might do the trick.

The main reason for the test is to tell if you need to convert line-endings from Mac or DOS line-endings (\r or \r\n) to unix line-endings (\n). Perhaps that is no longer really necessary (though I suspect it is). The other reason is to prevent viewing/editing non-text files, but I suppose that is not all that critical. So it might be that you can do without the check entirely.

mgage · 2019-01-23T16:26:55Z

Thanks Davide. I thought it might be something like that. I'm finally on the plane ready to fly back to Rochester. I got snowed out on Sunday and this was the first flight.

dpvc · 2019-01-23T16:30:04Z

Wow, sorry abut your long delay. Hope your trip is uneventful from here on out.

Alex-Jordan · 2019-01-23T17:36:24Z

I've run into WW telling me the answer log (or was it the login log?) is not a text file, and refusing to show it to me. If a new test is devised, these should be looked at along with .pg.

Speaking of that, why not pass judgment by file extension? Whitelist .pg, .pl, .txt, .html, .log, .csv, ...

taniwallach · 2019-01-23T18:46:13Z

I think Alex is right - the first round of decisions should based on file extensions. That would be fast and accurate for most files in a WW course. File extensions are already how lib/WeBWorK/ContentGenerator/Instructor/FileManager.pm is deciding what are image files (~line 469 : $file =~ m/\.(gif|jpg|png)/i). (A config option could be included to allow editing the recognized extensions for text and image files.)

As a fallback, we could use the output from the Unix file utility to decide what other file types to treat as text, if there is any real need for it. The file utility tends to do a very good jobs (in my experience) of classifying files. Some output classifications from a quick run on some misc WW related directories:

ASCII text
ISO-8859 text
UTF-8 Unicode text
PNG image data ...
GIF image data ...
JPEG image data ...
SVG Scalable Vector Graphics image
Perl script text executable
zlib compressed data
PDF document ...
Zip archive data
compiled Java class data ...
TrueType font data

dpvc · 2019-01-23T18:54:09Z

we could use the output from the Unix file utility to decide

Note that the main usage for isText() is for knowing whether to convert line breaks when a file is uploaded to the server. That is happening before the file is created on the unix system, so you can't run file on it yet, unless you first save it verbatim as a temporary file, then test it, then delete the temporary file and save the modified one (if it needs to be modified).

For the other situations, yes, the would work, but you now have to decode the file output. I think file extensions are probably sufficient, though hardly perfect.

taniwallach · 2019-01-23T19:31:57Z

Note that the main usage for isText() is for knowing whether to convert line breaks when a file is uploaded to the server. That is happening before the file is created on the unix system, so you can't run file on it yet, unless you first save it verbatim as a temporary file, then test it, then delete the temporary file and save the modified one (if it needs to be modified).

Thanks - sorry, your comment that this is the most important need for isText() did not catch my attention. I also think it likely that things will work better if text files on the server do have Unix line-endings. file also (at least the current version I use on Debian) reports when there is a text file with non-Unix line endings, and it seems pretty easy to determine from the output that a file is a "text" file with with CRLF line terminators (dos) or with with CR line terminator (mac) . (See below.)

For the upload context, I am very skeptical that a simple test to properly determine if a file is a text file can be written once UTF-8 is allowed which is less complicated than storing a temp file and checking it with the unix file command. Why recreate the wheel... why not just upload the raw file as is, test file file type, and when necessary (detected as a text file with non-Unix line endings) either trigger an automatic conversion (maybe using the utilities dos2unix and/or mac2unix to convert the file in place) or ask the user for an OK to run such a conversion. We could save on checking with the user for the "known" text file extensions.

Once that is possible, it should not be difficult to also add a simple feature to allow converting files in place between the different line-endings as needed by the user. (Maybe a user wants to download a PG file with DOS line endings, and has fewer tools to do this on the local PC side of things?)

Sample file output reporting non-Unix line-endings:

tani@lxtani:/tmp$ file course.conf 
course.conf: ASCII text

tani@lxtani:/tmp$ unix2dos course.conf
unix2dos: converting file course.conf to DOS format...

tani@lxtani:/tmp$ file course.conf 
course.conf: ASCII text, with CRLF line terminators

tani@lxtani:/tmp$ dos2unix course.conf ; unix2mac course.conf
dos2unix: converting file course.conf to Unix format...
unix2mac: converting file course.conf to Mac format...

tani@lxtani:/tmp$ file course.conf 
course.conf: ASCII text, with CR line terminators

tani@lxtani:/tmp$ file HebrewSample1.pg
HebrewSample1.pg: UTF-8 Unicode text

tani@lxtani:/tmp$ unix2dos HebrewSample1.pg
unix2dos: converting file HebrewSample1.pg to DOS format...

tani@lxtani:/tmp$ file HebrewSample1.pg
HebrewSample1.pg: UTF-8 Unicode text, with CRLF line terminators

tani@lxtani:/tmp$ dos2unix HebrewSample1.pg ; unix2mac HebrewSample1.pg
dos2unix: converting file HebrewSample1.pg to Unix format...
unix2mac: converting file HebrewSample1.pg to Mac format...

tani@lxtani:/tmp$ file HebrewSample1.pg
HebrewSample1.pg: UTF-8 Unicode text, with CR line terminators

taniwallach · 2019-01-23T19:36:42Z

If the file suffix + file test approach seems reasonable, I can try in the coming week to write the code, do some rudimentary tests, and make a PR into
https://github.com/mgage/webwork2/tree/new_develop_candidate_01_01_2019

However, if someone has a better idea - we can try it. @dpvc - do you want to try out the white-space/line-end statistics first?

mgage · 2019-01-28T03:08:43Z

This PR is almost entirely (but not completely) about utf8 and multilingual issues. The changes that are not essentially about multilingual issues are pretty simple and I think I'll leave them. They should not be too confusing to the reviewers.

mgage · 2019-02-25T02:02:24Z

This PR has been replaced by PR #927 where the conflicts have already been resolved.
I'll close this commit, but it should be remembered for the discussion items it contains. We
may need them as we vet the changes for the localization commits.

goehle and others added 30 commits June 20, 2016 14:13

Tweaking how UTF8 encoding is done.

c7830d1

Adding more UTF8 support.

7f6a18a

Cleanup

5c20361

Added support for utf8 on xml

a54c760

Encoded results from freeze so they don't become long utf8 characters…

5c44802

… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)

Fix some broken css

fa1f954

small change.

ad37eb1

Added hardcopy support (assumin gyou have the fonts installed and ena…

0073969

…bled.)

Tracking down more open commands.

45f8bc1

corrected file encodings

7998565

Merge branch 'convert_to_utf8_encoding' of https://github.com/heideri…

eeb63a3

…ch/webwork2 into locbug Conflicts: courses.dist/modelCourse/course.conf

added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…

e8b9947

… suggested by goehle

Merge pull request #14 from heiderich/Encode-error

ceb43be

added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…

Merge branch 'add_maketext_calls_to_achievements' of https://github.c…

e1b5553

…om/heiderich/webwork2 into locbug

Merge branch 'locbug' of https://github.com/goehle/webwork2 into locbug

70ae45a

Merge branch 'develop' of https://github.com/openwebwork/webwork2 int…

80b663e

…o locbug

Merge branch 'develop' of https://github.com/openwebwork/webwork2 int…

f2b1564

…o locbug

Freeze/thaw to base64 because mysql fields are varchar

b0bfe73

update localization files.

200c5b8

Support for transition to freeze_base64

ea7147f

Polishing error handling

3d3795c

Misspelled method.

eaf0f2f

Merge branch 'locbug' of https://github.com/goehle/webwork2 into locbug

239790f

Used utf8::valid when I should have used utf8::is_utf8

1b083a0

Whoops.

6e2fbfd

Wrong maketext

e85eb00

Merge branch 'locbug' of https://github.com/goehle/webwork2 into deve…

054666f

…lop_uft8_ver2 # Conflicts: # lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm # lib/WeBWorK/Utils.pm

local experimental changes

7541f5d

add utf-8 support to OPL-update (needs testing)

20c3237

add use utf8; to WWSafe.pm

4bee27e

Merge pull request #23 from taniwallach/tani_forbid_whitespace_and_em…

01cba1e

…pty_passwords_during_checkPassword PR 911+910+904 hotfixes to mgage new_develop_candidate_01_01_2019

taniwallach mentioned this pull request Jan 14, 2019

Collect internationalization/localization changes from forked repositories in particular from the CCDMD forks #914

Closed

mgage requested review from Alex-Jordan, dlglin, dpvc, drdrew42, dsteinmo, heiderich, jwj61, nmoller, sean-fitzpatrick and taniwallach January 23, 2019 01:03

mgage mentioned this pull request Feb 25, 2019

Develop candidate 02 24 2019 #927

Closed

mgage closed this Feb 25, 2019

mgage deleted the new_develop_candidate_01_01_2019 branch March 11, 2019 19:50

taniwallach mentioned this pull request Aug 9, 2019

Investigate WWSafe.pm vs. Perl's standard Safe.pm - UTF-8 motivation #982

Open

Uh oh!

Conversation

mgage commented Jan 2, 2019

Uh oh!

taniwallach commented Jan 9, 2019

Uh oh!

taniwallach commented Jan 9, 2019

Uh oh!

mgage commented Jan 23, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpvc commented Jan 23, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mgage commented Jan 23, 2019 via email • edited by dpvc Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpvc commented Jan 23, 2019

Uh oh!

Alex-Jordan commented Jan 23, 2019

Uh oh!

taniwallach commented Jan 23, 2019

Uh oh!

dpvc commented Jan 23, 2019

Uh oh!

taniwallach commented Jan 23, 2019

Uh oh!

taniwallach commented Jan 23, 2019

Uh oh!

mgage commented Jan 28, 2019

Uh oh!

mgage commented Feb 25, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

mgage commented Jan 23, 2019 •

edited

Loading

dpvc commented Jan 23, 2019 •

edited

Loading

mgage commented Jan 23, 2019 via email •

edited by dpvc

Loading