Skip to content

New develop candidate 01 01 2019#908

Closed
mgage wants to merge 96 commits into
openwebwork:developfrom
mgage:new_develop_candidate_01_01_2019
Closed

New develop candidate 01 01 2019#908
mgage wants to merge 96 commits into
openwebwork:developfrom
mgage:new_develop_candidate_01_01_2019

Conversation

@mgage
Copy link
Copy Markdown
Member

@mgage mgage commented Jan 2, 2019

This includes some fixes which have been sitting on my laptop for a while. I think this is an improved version of develop_candidate and I'm withdrawing that pull request.

goehle and others added 30 commits June 20, 2016 14:13
… and mess up the database. Note: We do not need decode from thaw because as sequences of bytes nothing changes. (I think.)
…ch/webwork2 into locbug

Conflicts:
	courses.dist/modelCourse/course.conf
added [qw(Encode::Encoding)] to ${pg}{modules}) in defaults.config as…
…lop_uft8_ver2

# Conflicts:
#	lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm
#	lib/WeBWorK/Utils.pm
as an undef value of $self->{password} is needed for the
authenticate() function to detect session-timeout.

2. Prevent the localized "inactivity timeout" error message from
being overridden by an "authentication failed" error message
when it should not be overriden.
@taniwallach
Copy link
Copy Markdown
Member

Hi Tani, Without having researched as much as you have I also think we should go to using the full utf8mb. What I have been doing is to change as much of the perl code to allow utf8mb. As far as I can tell in perl utf8 and utf8mb are synonymous. I think there is a reasonable chance that develop would work well with mysql once the limits on certain field sizes have been adjusted. I’m worried about what it would take to convert data (e.g. problems) that had been stored in utf8 or smaller format to behave properly when the format is changed to utf8mb4.

As far as I understand it the difference between utf8mb4 and utf8 is only in mySQL and MariaDB, as mySQL decided to create a charset in mySQL called utf8 which is restricted to Unicode characters which need 3 or fewer bytes in the "real" UTF-8 encoding. Thus there are no Perl side issues except for those triggered by trying to push disallowed UTF-8 characters into mySQL tables which do not permit them (where utf8mb4 would allow them) or expecting such characters to come back out of the database which could not handle them.

Thus changing from mySQL utf8 to utf8mb4 cannot lose data. The challenge is the change from the current latin1 encoding to utf8mb4.

Apparently using utf8 in mySQL can also trigger problems with encrypted data pushed into VARCHAR columns, as certain byte sequences are disallowed. (But that is bad practice as binary data should not be stored in VARCHAR columns.)

The recommendation seems to be to open connections with SET NAMES utf8mb4 COLLATE utf8mb4_unicode_520_ci instead of SET NAMES utf8 and to explicitly set the charset and collation when a database is first created.

References:

References on the conversion process (various options):

@taniwallach
Copy link
Copy Markdown
Member

It might be worth seeing what happens if we convert to innodb format. Moodle converted some time ago and the prevailing wisdom was that that was the way to go. We did some experiments on this early on (when the push to innodb was just getting started) and found that using innodb slowed things down — in particular the library script OPL-update was very slow. I think it’s worth doing that experiment again before we go much further with the utf8/utf8mb4 database question.

@mgage - about InnoDB vs myISAM - this is somewhat off topic from the encoding/internationalization issue, but this thread is a good a place as anywhere else for now... The only real reason it matters here is the different limits on key sizes and how that effects the field sizes or a setting to use only a suitable portion of a field in the key.

I agree, this is definitely something to consider and test. The big challenge is figuring out how to do reasonable performance comparisons (and collect suitable metrics) under some sort of real load.

Are there any people with DB performance monitoring / tuning experience in the active WW community?

There is a series of articles about monitoring mySQL performance monitoring at https://www.datadoghq.com/blog/monitoring-mysql-performance-metrics/

Some possible free+open monitoring tools:

I found an interesting article on myISAM vs InnoDB at https://www.liquidweb.com/kb/mysql-performance-myisam-vs-innodb/ and the short summary seems to be:

  • MyISAM is better for large tables which are read-heavy and write-light.
  • InnoDB is better for tables which are write-heavy as it uses "row level" rather than "table level" locking.
    Several other sites I looked at had pretty similar conclusions.

I suspect that bench-marking during OPL table creation is not that helpful, as OPL-update is an infrequent activity (even if annoying slow) while the OPL tables are ready very often when the library browser is used with no changes being made. Thus the read speeds during production for these tables are more important that the creation time. This hints that the OPL tables should probably stay as MyISAM, as it is reported to be faster at selects.

On the other hand, some of the course data tables (at least those which store answers and scores) are tables which do a lot of writing, so may be better in InnoDB.

Most likely, the best approach would to consider each type of table and how it is used during production and select the correct engine type for that table. Quite a number of web posts discuss the options and advantages of using some tables of each type in a single database.

One such thread on using both table types in one database is: https://dba.stackexchange.com/questions/385/is-it-common-practice-to-mix-innodb-and-myisam-tables-on-same-server

…pty_passwords_during_checkPassword

PR 911+910+904 hotfixes to mgage new_develop_candidate_01_01_2019
@mgage
Copy link
Copy Markdown
Member Author

mgage commented Jan 23, 2019

We are getting "The file does not appear to be a text file" when trying to
read pg files using FileManager on demo.webwork.rochester.edu which is using
new develop candidate 01 01 2019 (#908)

The code being triggered seems to be line 511 in FileManager
and isText() is defined at line 1275 by:

##################################################
#
# Check if a string is plain text
# (i.e., doesn't contain four non-regular
# characters in a row.)
#
##################################################
#
# Check if a string is plain text
#
sub isText {
	my $string = shift;
	#	return $string !~ m/[^\s\x20-\x7E]{4}/;
	return utf8::is_utf8($string);
	# return $string !~ m/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]{2}/;
}

The read file branch has been changed substantially in this PR but it hasn't caused trouble anywhere else. isText is only used in this file. Avoiding the line return utf8::is_utf8($string) removes the error.
I haven't yet investigated further about the behavior of utf8::is_utf8. I'll do that on my own
development machine.

The symptoms: On demo.webwork.rochester.edu/webwork2/2019_JMM_Baltimore (and any other course including UR101 which you can access with profa/profa ) go to the file manager and try to read or view any .pg file. You will get the "is not a text file" error.

@dpvc
Copy link
Copy Markdown
Member

dpvc commented Jan 23, 2019

I don't think that utf8::is_utf8($string); is the right thing to use, here. The isText() function is supposed to tell if the contents of a string is text rather than raw binary data, but that is not what utf8::is_utf8() does. That function is only about how the string is stored internally, not about the actual contents of the string, so is not the appropriate test for this.

Of course, there is no actual way to tell if a string is text versus binary, and so isText() was just a heuristic to try to work it out. Of course, it derives from the olden days, when text files were ASCII files, and so would not contain many non-ASCII characters, so the test was to look for four non-ASCII characters in a row as an indication of a binary file.

Now, with Unicode files, that is no longer a sufficient test, since they can contain lots of non-ASCII characters. But whether the string is stored internally in UTF8 encoding or not has nothing to do with its contents, so is definitely the wrong test.

Perhaps a better test would be to look for a high percentage of spaces and newlines (or carriage returns)? Text files like .pg files and such should have a higher concentration of those than a binary file. That might do the trick.

The main reason for the test is to tell if you need to convert line-endings from Mac or DOS line-endings (\r or \r\n) to unix line-endings (\n). Perhaps that is no longer really necessary (though I suspect it is). The other reason is to prevent viewing/editing non-text files, but I suppose that is not all that critical. So it might be that you can do without the check entirely.

@mgage
Copy link
Copy Markdown
Member Author

mgage commented Jan 23, 2019 via email

@dpvc
Copy link
Copy Markdown
Member

dpvc commented Jan 23, 2019

Wow, sorry abut your long delay. Hope your trip is uneventful from here on out.

@Alex-Jordan
Copy link
Copy Markdown
Contributor

I've run into WW telling me the answer log (or was it the login log?) is not a text file, and refusing to show it to me. If a new test is devised, these should be looked at along with .pg.

Speaking of that, why not pass judgment by file extension? Whitelist .pg, .pl, .txt, .html, .log, .csv, ...

@taniwallach
Copy link
Copy Markdown
Member

I think Alex is right - the first round of decisions should based on file extensions. That would be fast and accurate for most files in a WW course. File extensions are already how lib/WeBWorK/ContentGenerator/Instructor/FileManager.pm is deciding what are image files (~line 469 : $file =~ m/\.(gif|jpg|png)/i). (A config option could be included to allow editing the recognized extensions for text and image files.)

As a fallback, we could use the output from the Unix file utility to decide what other file types to treat as text, if there is any real need for it. The file utility tends to do a very good jobs (in my experience) of classifying files. Some output classifications from a quick run on some misc WW related directories:

  • ASCII text
  • ISO-8859 text
  • UTF-8 Unicode text
  • PNG image data ...
  • GIF image data ...
  • JPEG image data ...
  • SVG Scalable Vector Graphics image
  • Perl script text executable
  • zlib compressed data
  • PDF document ...
  • Zip archive data
  • compiled Java class data ...
  • TrueType font data

@dpvc
Copy link
Copy Markdown
Member

dpvc commented Jan 23, 2019

we could use the output from the Unix file utility to decide

Note that the main usage for isText() is for knowing whether to convert line breaks when a file is uploaded to the server. That is happening before the file is created on the unix system, so you can't run file on it yet, unless you first save it verbatim as a temporary file, then test it, then delete the temporary file and save the modified one (if it needs to be modified).

For the other situations, yes, the would work, but you now have to decode the file output. I think file extensions are probably sufficient, though hardly perfect.

@taniwallach
Copy link
Copy Markdown
Member

Note that the main usage for isText() is for knowing whether to convert line breaks when a file is uploaded to the server. That is happening before the file is created on the unix system, so you can't run file on it yet, unless you first save it verbatim as a temporary file, then test it, then delete the temporary file and save the modified one (if it needs to be modified).

Thanks - sorry, your comment that this is the most important need for isText() did not catch my attention. I also think it likely that things will work better if text files on the server do have Unix line-endings. file also (at least the current version I use on Debian) reports when there is a text file with non-Unix line endings, and it seems pretty easy to determine from the output that a file is a "text" file with with CRLF line terminators (dos) or with with CR line terminator (mac) . (See below.)

For the upload context, I am very skeptical that a simple test to properly determine if a file is a text file can be written once UTF-8 is allowed which is less complicated than storing a temp file and checking it with the unix file command. Why recreate the wheel... why not just upload the raw file as is, test file file type, and when necessary (detected as a text file with non-Unix line endings) either trigger an automatic conversion (maybe using the utilities dos2unix and/or mac2unix to convert the file in place) or ask the user for an OK to run such a conversion. We could save on checking with the user for the "known" text file extensions.

Once that is possible, it should not be difficult to also add a simple feature to allow converting files in place between the different line-endings as needed by the user. (Maybe a user wants to download a PG file with DOS line endings, and has fewer tools to do this on the local PC side of things?)

Sample file output reporting non-Unix line-endings:

tani@lxtani:/tmp$ file course.conf 
course.conf: ASCII text

tani@lxtani:/tmp$ unix2dos course.conf
unix2dos: converting file course.conf to DOS format...

tani@lxtani:/tmp$ file course.conf 
course.conf: ASCII text, with CRLF line terminators

tani@lxtani:/tmp$ dos2unix course.conf ; unix2mac course.conf
dos2unix: converting file course.conf to Unix format...
unix2mac: converting file course.conf to Mac format...

tani@lxtani:/tmp$ file course.conf 
course.conf: ASCII text, with CR line terminators

tani@lxtani:/tmp$ file HebrewSample1.pg
HebrewSample1.pg: UTF-8 Unicode text

tani@lxtani:/tmp$ unix2dos HebrewSample1.pg
unix2dos: converting file HebrewSample1.pg to DOS format...

tani@lxtani:/tmp$ file HebrewSample1.pg
HebrewSample1.pg: UTF-8 Unicode text, with CRLF line terminators

tani@lxtani:/tmp$ dos2unix HebrewSample1.pg ; unix2mac HebrewSample1.pg
dos2unix: converting file HebrewSample1.pg to Unix format...
unix2mac: converting file HebrewSample1.pg to Mac format...

tani@lxtani:/tmp$ file HebrewSample1.pg
HebrewSample1.pg: UTF-8 Unicode text, with CR line terminators

@taniwallach
Copy link
Copy Markdown
Member

If the file suffix + file test approach seems reasonable, I can try in the coming week to write the code, do some rudimentary tests, and make a PR into
https://github.com/mgage/webwork2/tree/new_develop_candidate_01_01_2019

However, if someone has a better idea - we can try it. @dpvc - do you want to try out the white-space/line-end statistics first?

@mgage
Copy link
Copy Markdown
Member Author

mgage commented Jan 28, 2019

This PR is almost entirely (but not completely) about utf8 and multilingual issues. The changes that are not essentially about multilingual issues are pretty simple and I think I'll leave them. They should not be too confusing to the reviewers.

@mgage
Copy link
Copy Markdown
Member Author

mgage commented Feb 25, 2019

This PR has been replaced by PR #927 where the conflicts have already been resolved.
I'll close this commit, but it should be remembered for the discussion items it contains. We
may need them as we vet the changes for the localization commits.

@mgage mgage closed this Feb 25, 2019
@mgage mgage deleted the new_develop_candidate_01_01_2019 branch March 11, 2019 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants