@kalaspuffar
Collaborator

Hi @bertfrees and @PaulRambags

I got curious about the hyphenation topic and thought there should be a way to make a simple, readable code for this specific topic. So I spent the weekend creating this SimpleHyphenator.

Moreover, I created this video explaining the concept. (https://youtu.be/XxV6bhCZHn0)

We generally believe this is a better solution that we could improve on going forward, but we don't want it to impact performance. I've run some benchmarks.

The first test (the millisecond timings) processes one book six times in a row, and I then calculated the mean.

After that, you can see the result of 3 consecutive runs of the regression tests on my personal rig (A bit quicker than usual)

New Version 1.0.8-SNAPSHOT

15803ms
11320ms
12433ms
13291ms
13185ms
11563ms
Mean: 12932ms

BUILD SUCCESSFUL in 49m 12s
BUILD SUCCESSFUL in 49m 15s
BUILD SUCCESSFUL in 49m 20s

Old Version 1.0.7

11168ms
16631ms
13262ms
14179ms
10486ms
12641ms
Mean: 13061ms

BUILD SUCCESSFUL in 49m 24s
BUILD SUCCESSFUL in 50m 01s
BUILD SUCCESSFUL in 49m 49s

My conclusion is that there is no significant difference between the algorithms. Perhaps the old version could be slower or less predictable in the time a book takes to process. But overall there is not any big difference.
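A minimal sketch of the arithmetic behind that conclusion, using only the twelve timings listed above. The class and method names are made up for illustration, and the sample standard deviation is just one possible way to put a number on "less predictable"; it is not something the benchmark itself reported.

```java
import java.util.stream.LongStream;

class BenchmarkSummary {
    // Timings in milliseconds, copied from the two lists above.
    static final long[] NEW_MS = {15803, 11320, 12433, 13291, 13185, 11563};
    static final long[] OLD_MS = {11168, 16631, 13262, 14179, 10486, 12641};

    /** Arithmetic mean of the samples. */
    static double mean(long[] samples) {
        return LongStream.of(samples).average().orElse(Double.NaN);
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(long[] samples) {
        double m = mean(samples);
        double sumOfSquares = 0;
        for (long s : samples) {
            sumOfSquares += (s - m) * (s - m);
        }
        return Math.sqrt(sumOfSquares / (samples.length - 1));
    }

    public static void main(String[] args) {
        System.out.printf("new: mean %.0f ms, spread %.0f ms%n", mean(NEW_MS), sampleStdDev(NEW_MS));
        System.out.printf("old: mean %.0f ms, spread %.0f ms%n", mean(OLD_MS), sampleStdDev(OLD_MS));
    }
}
```

With six samples per version, the gap between the two means is much smaller than the spread within either list, which is consistent with "no significant difference".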

Best regards
Daniel

@kalaspuffar
Collaborator Author

I forgot to mention that I had converted the Swedish, American English, and British English dictionaries. If this solution is approved, I can also convert the other languages so we can migrate to a single solution for all hyphenation, where we can improve the hyphenation code for all languages natively in this repository.

@bertfrees
Contributor

bertfrees commented Feb 28, 2023

Nice :-)

It's nice to have an alternative to https://github.com/daisy/jhyphen. If it behaves the same, performs kind of the same and supports non-standard hyphenation, I could use it instead of jhyphen.

I think it would be good to have a test to prove that it behaves the same as Hyphen. (You mentioned Hunspell in your video, so therefore I assume this is a replacement of jhyphen, not texhyphj.) A way to test this is to hyphenate a big corpus of words using the two implementations that are supposed to be equivalent. By the way, in your existing tests, why do you compare the new implementation with the texhyphj implementation? Also you compare the performance with that of the texhyphj implementation. It would be interesting to compare it with the jhyphen implementation too.

Comment on lines +93 to +104
        line.toUpperCase().contains("COMPOUNDLEFTHYPHENMIN") ||
        line.toUpperCase().contains("COMPOUNDRIGHTHYPHENMIN") ||
        line.toUpperCase().contains("ISO8859-1") ||
        line.toUpperCase().contains("UTF-8")
    ) {
        continue;
    }
    if (line.toUpperCase().startsWith("LEFTHYPHENMIN")) {
        leftHyphenMin = Integer.parseInt(line.split(" ")[1]);
        continue;
    }
    if (line.toUpperCase().startsWith("RIGHTHYPHENMIN")) {
Contributor

These words can only appear at the start of the file, so some optimization is possible here.
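A minimal sketch of one way to act on that remark: only test for the keywords while still in the header, and stop as soon as the first other line is seen. It assumes, as the comment above says, that these keywords occur only at the start of the file. The class name, the default values, and the handling of RIGHTHYPHENMIN (the excerpt is cut off before its body) are hypothetical and not taken from the pull request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HeaderFirstParser {
    // Hypothetical defaults; the excerpt does not show the real ones.
    int leftHyphenMin = 2;
    int rightHyphenMin = 2;
    final List<String> patterns = new ArrayList<>();

    void parse(List<String> lines) {
        boolean inHeader = true;
        for (String line : lines) {
            if (inHeader) {
                // Uppercase once per header line instead of once per check.
                String upper = line.toUpperCase(Locale.ROOT);
                if (upper.contains("COMPOUNDLEFTHYPHENMIN")
                        || upper.contains("COMPOUNDRIGHTHYPHENMIN")
                        || upper.contains("ISO8859-1")
                        || upper.contains("UTF-8")) {
                    continue;
                }
                if (upper.startsWith("LEFTHYPHENMIN")) {
                    leftHyphenMin = Integer.parseInt(line.split(" ")[1]);
                    continue;
                }
                if (upper.startsWith("RIGHTHYPHENMIN")) {
                    // Assumed to mirror the LEFTHYPHENMIN branch.
                    rightHyphenMin = Integer.parseInt(line.split(" ")[1]);
                    continue;
                }
                // First line that is not a header keyword: stop checking.
                inHeader = false;
            }
            patterns.add(line);
        }
    }

    public static void main(String[] args) {
        HeaderFirstParser parser = new HeaderFirstParser();
        parser.parse(List.of("UTF-8", "LEFTHYPHENMIN 1", "RIGHTHYPHENMIN 3", ".a1b", "eg1l"));
        System.out.println(parser.leftHyphenMin + " " + parser.rightHyphenMin + " " + parser.patterns);
    }
}
```

After the header, each pattern line costs a single boolean check rather than several `toUpperCase()` calls.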

Contributor

I guess more optimization is possible; I haven't really checked the whole code. This is just one thing I noticed. Of course, this is just the compilation of the table; optimizing the inner loops of the hyphenation itself is what really matters.

By the way, in your video you relativize the importance of performance with respect to readability, which I appreciate, but of course we also have to acknowledge that the importance of performance increases as the documents we are dealing with grow. So it might be worth checking if there are optimizations possible without compromising too much on readability.

@kalaspuffar
Collaborator Author

Hi @bertfrees

I'll answer you in a new comment, as my response is more general than talking about a specific part of the code.

First off, I added a couple of test cases for jHyphen, but they are ignored at the moment. For some reason, they don't yield the same result as jTexHyphen or SimpleHyphenator. I'm not sure why, but there are differences regardless of which dictionary I use.

I've tried with the dictionary shipped with my Linux installation and also with the one I generated from the old tex format, and both generate differences between what we create in jHyphen and the other two implementations.

I also did some performance benchmarking, but it isn't really fair, as the build failed.

11081ms
11804ms
11568ms
12342ms
11412ms
12651ms
Mean: 11809ms

BUILD FAILED in 1h 4m 10s

As you mentioned in your comment, this functionality could be improved when it comes to performance, and I really hope we will. But my main goal was to make it readable so we could make a joint effort to improve performance later. Currently, it performs as well as the old code so we could only improve :)

With the small change I did in the last commit, we are getting a small performance improvement:

12854ms
11486ms
11156ms
13511ms
10315ms
11903ms
Mean: 11870ms

BUILD SUCCESSFUL in 49m 50s

The full regression test isn't much faster because of our hyphenation cache: unless a long-running process is working with dotify material, the cache has to be rebuilt on every run, which entails some upfront work.
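A minimal sketch of why a per-word cache hides most of the difference between hyphenation algorithms. This is not the dotify cache; the class name, the `UnaryOperator` stand-in for a hyphenator, and the miss counter are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.UnaryOperator;

class CachingHyphenator {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final UnaryOperator<String> delegate; // stand-in for any hyphenator
    final AtomicInteger misses = new AtomicInteger();

    CachingHyphenator(UnaryOperator<String> delegate) {
        this.delegate = delegate;
    }

    /** Runs the delegate only the first time a word is seen. */
    String hyphenate(String word) {
        return cache.computeIfAbsent(word, w -> {
            misses.incrementAndGet();
            return delegate.apply(w);
        });
    }

    public static void main(String[] args) {
        // Toy delegate that only knows one break point.
        CachingHyphenator hyphenator =
                new CachingHyphenator(w -> w.replace("regler", "reg-ler"));
        hyphenator.hyphenate("avstavningsregler");
        hyphenator.hyphenate("avstavningsregler");
        System.out.println("delegate calls: " + hyphenator.misses.get());
    }
}
```

In a fresh process the map starts empty, so every distinct word still pays the full cost once; the algorithm's speed matters mostly for that first pass.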

The reason I compared this Hyphenator with LatexHyphenator in my original post is that we use that for most languages and then fall back to jHyphen if we have any issues with a specific language. For example, the Russian language crashed on read with LatexHyphenator.

I'm pretty confident that we can convert the other languages to use the new Hyphenator, and we can handle crash issues much easier when we have all the code in the same place.

Other benefits I can see of not using the older Hyphenators are that we are cross-platform because we have the dictionary files inside of the jar file, and we don't require an external library that might only be available on one platform (jHyphen might not work on Windows or Mac).

Best regards
Daniel

@bertfrees
Contributor

bertfrees commented Mar 1, 2023

I agree there are benefits to using a Java implementation over a C implementation.

I've tried with the dictionary shipped with my Linux installation and also with the one I generated from the old tex format, and both generate differences between what we create in jHyphen and the other two implementations.

When you compare the results of different implementations, are the transformations based on the same original TeX patterns (pre-processed with substrings.pl for the jhyphen and SimpleHyphenator implementations)? Also how sure are you that the jtexhyph and SimpleHyphenator yield the same results? Is that based on the small data set that you use in the unit tests?

@kalaspuffar
Collaborator Author

Hi @bertfrees

The jHyphen results are from converting with substrings.pl, and I also tried with my conversion script. It worked for English. The file converted with substrings.pl doesn't handle the exceptions well, but my solution will solve that.

The failures are more on the Swedish dictionary, which doesn't even handle the first word in the list.

I ran jtexhyphen and SimpleHyphenator against the test cases in the repository, and I've also run the full regression tests on both of them without any changes. The baseline of the regression tests was created with jtexhyphen, but testing against SimpleHyphenator does not yield any changes.

Some of the later unit test cases were added to test specific problems I found running the regression test.

The best way to test against a larger test set is just to swap the order of the implementations in org.daisy.dotify.api.hyphenator.HyphenatorFactoryService and run them against new material. Currently, we only have English and Swedish dictionaries for all Hyphenators. But if you want, I could create Dutch or other ones if you want to test them against your local material.

Or you can use the Java class in the repository linked in the video to create different dictionaries. The code is not that extensive: it just removes the wrappers and replaces every exception word with a pattern for that specific word, masked with 8's and 9's.
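A minimal sketch of what masking an exception word with 8's and 9's could look like, assuming the usual pattern semantics where an odd level allows a break, an even level forbids it, and the highest level wins. The class name, method name, and example word are hypothetical; the exact output format of the conversion tool is not shown in this thread.

```java
class ExceptionToPattern {
    /**
     * Turns a hyphenated exception such as "ta-ble" into a whole-word pattern:
     * 9 at every listed break, 8 in every other gap, dots at both ends.
     */
    static String toPattern(String hyphenated) {
        StringBuilder out = new StringBuilder(".");
        boolean breakHere = false;
        boolean first = true;
        for (char c : hyphenated.toCharArray()) {
            if (c == '-') {
                breakHere = true; // remember the break for the next letter
                continue;
            }
            if (!first) {
                out.append(breakHere ? '9' : '8');
            }
            out.append(c);
            first = false;
            breakHere = false;
        }
        return out.append('.').toString();
    }

    public static void main(String[] args) {
        System.out.println(toPattern("ta-ble")); // prints .t8a9b8l8e.
    }
}
```

Because 8 and 9 are higher than the levels normally used in patterns, such a rule overrides whatever the ordinary patterns say about that one word.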

Best regards
Daniel

@bertfrees
Contributor

if you want, I could create Dutch or other ones if you want to test them against your local material

For DAISY Pipeline, I would just like to keep using the exact same dictionaries that I am currently using, only with a different library. DAISY Pipeline includes the dictionaries taken from LibreOffice, except for the Norwegian and German ones which are special ones maintained by NLB and SBS.

Swapping only the library should normally work just fine, because jhyphen and SimpleHyphenator are supposed to support the exact same dictionary file format.

But I guess before testing with a more extended set of test data we need to find out what is going wrong with the Swedish test. Are you really sure you are using the same .dic file? AFAICS JHyphenatorFactory uses the one in /usr/share/hyphen and SimpleHyphenatorFactory the one in src/org/daisy/dotify/hyphenator/impl/resource-files/dict.

@kalaspuffar
Collaborator Author

Hi @bertfrees

I think the Hunspell hyphen library and the dictionaries seem to be just fine. If I run it locally using the example executable, it will do the right thing.

$ ./example ../myhyphen/hyph_sv_SE.dic test.txt 
av=stav=nings=reg=ler

The jHyphenator seems to have a problem with hyphenation at the end or beginning of the word. I can't see that it adds any dots anywhere.

And I can't send them in either because the library has this pattern:

Matcher matcher = Pattern.compile("['\\p{L}]+").matcher(text);

To my knowledge, that pattern isn't good either because it will not take all Unicode characters available. So there are a bunch of issues with this library.
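A small check of what that pattern does to a dot-delimited word, using only `java.util.regex`. The class and method names are made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotStrippingDemo {
    /** Returns the first run of letters and apostrophes, as the pattern above would. */
    static String firstMatch(String text) {
        Matcher matcher = Pattern.compile("['\\p{L}]+").matcher(text);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(".avstavningsregler.")); // prints avstavningsregler
    }
}
```

The dot is neither an apostrophe nor a letter, so any dots passed in are dropped before the word reaches the native call.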

The rules in the dictionary that define this hyphenation point for the word avstavningsregler are:

2g3le2r.
eg2ler
sre2g1l
eg1l
eg1le
g1le
reg1l

Using the rules above, if it has dots to work with, the best rule is 2g3le2r., where the odd number 3 sits between g and l. Sadly, it doesn't find this rule, probably because the pattern matching above strips the dots and they are not added back. So the best rule it goes by is eg2ler, and it will not hyphenate the word correctly.
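A minimal sketch of that reasoning, applying only the seven rules listed above (a real dictionary holds many more) with the usual rule that the highest level wins in each gap and an odd level allows a break. The class and method names are hypothetical, and this is not the SimpleHyphenator code.

```java
import java.util.List;

class DotPaddingDemo {
    // The seven rules quoted above, nothing else.
    static final List<String> RULES = List.of(
            "2g3le2r.", "eg2ler", "sre2g1l", "eg1l", "eg1le", "g1le", "reg1l");

    /** levels[i] is the winning level for the gap before text.charAt(i). */
    static int[] levels(String text) {
        int[] levels = new int[text.length() + 1];
        for (String rule : RULES) {
            String letters = rule.replaceAll("[0-9]", "");
            int[] ruleLevels = new int[letters.length() + 1];
            int pos = 0;
            for (char c : rule.toCharArray()) {
                if (Character.isDigit(c)) {
                    ruleLevels[pos] = c - '0'; // level for the gap before the next letter
                } else {
                    pos++;
                }
            }
            // Apply the rule at every place its letters occur in the text.
            for (int at = text.indexOf(letters); at >= 0; at = text.indexOf(letters, at + 1)) {
                for (int i = 0; i < ruleLevels.length; i++) {
                    levels[at + i] = Math.max(levels[at + i], ruleLevels[i]);
                }
            }
        }
        return levels;
    }

    /** Winning level for the gap between "reg" and "ler". */
    static int levelAtRegLer(String text) {
        return levels(text)[text.indexOf("regler") + 3];
    }

    public static void main(String[] args) {
        System.out.println(levelAtRegLer("avstavningsregler"));   // prints 2 (even, no break)
        System.out.println(levelAtRegLer(".avstavningsregler.")); // prints 3 (odd, break allowed)
    }
}
```

Without the dots, 2g3le2r. can never match because its letters end in a dot, so eg2ler wins the gap with an even level.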

I've done a bunch of lookups verifying the other two libraries' rules when it comes to the unit cases we have in the repository. With code that is easier to understand and verify, I can, with high certainty, say that it will find all patterns and words correctly.

Not sure what gain we get by debugging and fixing jHyphen? This was just a library I added in 2020 to handle the issue that we were missing a couple of dictionaries in TeX format; Norwegian was one of them, I believe.

Best regards
Daniel

@kalaspuffar
Collaborator Author

Hi @bertfrees

With the last commit, I've rewritten the glue code in jHyphen with the code from SimpleHyphenator so we can use the Hunspell hyphenator for performance testing.

13796ms
11067ms
11827ms
12228ms
11986ms
12143ms

Mean: 12174ms

BUILD SUCCESSFUL in 50m 42s
BUILD SUCCESSFUL in 50m 49s
BUILD SUCCESSFUL in 50m 46s

These results show that it's a tad bit slower, but with the cache, it's not impacting that much. You see it more in the individual file results than in the regression test results.

I could run the regression test without a cache for each algorithm to get an even better picture, but it will not be the same code as we later want to run.

Best regards
Daniel

@kalaspuffar
Collaborator Author

Hi @bertfrees

To run the tests, the dictionaries first have to be converted with my tool, which fixes the hyphenation exceptions. If we want to run Hunspell, we also need to run the substrings.pl script, which takes some patterns and combines them to make more complex rules. And I guess that Hunspell is looking for these to improve performance.

SimpleHyphenator works with both the Hunspell variant and the one without these complex rules.

Best regards
Daniel

@bertfrees
Contributor

What do you mean by "I can't see that it adds any dots anywhere", and why should it?

What was the content of your test.txt?

@bertfrees
Contributor

And I guess that Hunspell is looking for these to improve performance.

Indeed, this is why Hunspell can perform better, because of the pre-processing.

@kalaspuffar
Collaborator Author

Hi @bertfrees

I mean that the library needs the dots to delimit the word's start and end when we pass the word to it.

$ cat test.txt 
avstavningsregler

So currently, I add the dots after lowercasing the word in the hyphenate_inner method.

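    private byte[] hyphenate_inner(String word) {
        // Lowercase the word and wrap it in dots so that rules anchored to the
        // start or end of a word (such as "2g3le2r.") can match.
        word = word.toLowerCase();
        word = "." + word + ".";
        // getBytes returns an array of exactly the encoded length, whereas the
        // backing array of Charset.encode(...) may carry unused trailing bytes.
        byte[] wordBytes = word.getBytes(StandardCharsets.UTF_8);
        int wordSize = wordBytes.length;
        // Grow the shared output buffer when the word does not fit.
        if (wordSize > wordHyphens.capacity()) {
            wordHyphens = ByteBuffer.allocate(wordSize * 2);
        }
        // Out-parameters (rep, pos, cut) filled in by the native call.
        PointerByReference repPointer = new PointerByReference(Pointer.NULL);
        PointerByReference posPointer = new PointerByReference(Pointer.NULL);
        PointerByReference cutPointer = new PointerByReference(Pointer.NULL);
        Hyphen.getLibrary().hnj_hyphen_hyphenate2(dictionary, wordBytes, wordSize, wordHyphens, null,
                repPointer, posPointer, cutPointer);
        return wordHyphens.array();
    }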

Best regards
Daniel

@bertfrees
Contributor

Not sure what gain we get by debugging and fixing jHyphen?

One thing is that jhyphen supports non-standard hyphenation.

@kalaspuffar
Collaborator Author

One thing is that jhyphen supports non-standard hyphenation.

Could this be a good resource to understand the concept or can you recommend another paper?
https://tug.org/tugboat/tb27-1/tb86nemeth.pdf

Best regards
Daniel

@bertfrees
Contributor

Not sure what gain we get by debugging and fixing jHyphen?

But I'm not sure that it needs to be debugged actually. As far as I know, it is not necessary to mark the beginning and end of words. I never heard of this requirement and never noticed anything wrong. I'm really surprised that the example tool does it correctly (I haven't been able to check myself yet).

@bertfrees
Contributor

Could this be a good resource to understand the concept or can you recommend another paper?

Yes, that's the only resource I know but I think it's a good way to understand the concept.

@bertfrees
Contributor

I've reproduced the issue with jhyphen and will look into it.
