-
Notifications
You must be signed in to change notification settings - Fork 2
SimpleHyphenator #53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
SimpleHyphenator #53
Conversation
|
I forgot to mention that I had converted the Swedish, American English, and Great Brittain English dictionaries. If this solution is approved, I can also convert the other languages so we can migrate to a single solution for all hyphenation, where we can improve the hyphenation code for all languages natively in this repository. |
|
Nice :-) It's nice to have an alternative to https://github.com/daisy/jhyphen. If it behaves the same, performs kind of the same and supports non-standard hyphenation, I could use it instead of jhyphen. I think it would be good to have a test to prove that it behaves the same as Hyphen. (You mentioned Hunspell in your video, so therefore I assume this is a replacement of jhyphen, not texhyphj.) A way to test this is to hyphenate a big corpus of words using the two implementations that are supposed to be equivalent. By the way, in your existing tests, why do you compare the new implementation with the texhyphj implementation? Also you compare the performance with that of the texhyphj implementation. It would be interesting to compare it with the jhyphen implementation too. |
| line.toUpperCase().contains("COMPOUNDLEFTHYPHENMIN") || | ||
| line.toUpperCase().contains("COMPOUNDRIGHTHYPHENMIN") || | ||
| line.toUpperCase().contains("ISO8859-1") || | ||
| line.toUpperCase().contains("UTF-8") | ||
| ) { | ||
| continue; | ||
| } | ||
| if (line.toUpperCase().startsWith("LEFTHYPHENMIN")) { | ||
| leftHyphenMin = Integer.parseInt(line.split(" ")[1]); | ||
| continue; | ||
| } | ||
| if (line.toUpperCase().startsWith("RIGHTHYPHENMIN")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These words can only appear at the start of the file, so some optimization is possible here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess more optimization is possible, I haven't really checked the whole code. This is just one thing I noticed. Of course this is just the compilation of the table, optimizing the inner loops of the hyphenation itself is was really matters.
By the way, in your video you relativize the importance of performance with respect to readability, which I appreciate, but of course we also have to acknowledge that the importance of performance increases as the documents we are dealing with grow. So it might be worth checking if there are optimizations possible without compromising too much on readability.
|
Hi @bertfrees I'll answer you in a new comment, as my response is more general than talking about a specific part of the code. First of I added a couple of test cases for jHyphen, but they are ignored at the moment. For some reason, they don't yield the same result as jTexHyphen or SimpleHyphenator. Not sure why, but there are differences independent of which dictionary I use. I've tried with the dictionary shipped with my Linux installation and also with the one I generated from the old I also did some performance benchmarking, but it isn't really fair, as the build failed. As you mentioned in your comment, this functionality could be improved when it comes to performance, and I really hope we will. But my main goal was to make it readable so we could make a joint effort to improve performance later. Currently, it performs as well as the old code so we could only improve :) With the small change I did in the last commit, we are getting a small performance improvement: The reason the full regression test isn't much faster is because of our hyphenation cache, so if we don't have a long-running process working with dotify material, we need to create that each time which entails some upfront work. The reason I compared this Hyphenator with LatexHyphenator in my original post is that we use that for most languages and then fall back to jHyphen if we have any issues with a specific language. For example, the Russian language crashed on read with LatexHyphenator. I'm pretty confident that we can convert the other languages to use the new Hyphenator, and we can handle crash issues much easier when we have all the code in the same place. Other benefits I can see of not using the older Hyphenators are that we are cross-platform because we have the dictionary files inside of the jar file, and we don't require an external library that might only be available on one platform (jHyphen might not work on Windows or Mac). Best regards |
|
I agree there are benefits to using a Java implementation over a C implementation.
When you compare the results of different implementations, are the transformations based on the same original TeX patterns (pre-processed with substrings.pl for the jhyphen and SimpleHyphenator implementations)? Also how sure are you that the jtexhyph and SimpleHyphenator yield the same results? Is that based on the small data set that you use in the unit tests? |
|
Hi @bertfrees The jHyphen results are from converting with The failures are more on the Swedish dictionary, which doesn't even handle the first word in the list. The test ran on jtexhyphen and SimpleHyphenator, both the test cases in the repository and I've also run the full regression tests on both of them without any changes. The baseline of the regression tests was created with jtexhyphen but testing against SimpleHyphenator does not yield any changes. Some of the later unit test cases were added to test specific problems I found running the regression test. The best way to test against a larger test set is just to swap the order of the implementations in Or you can use the java class in the repository linked in the video to create different dictionaries. The code is not that extensive, just remove the wrappers and replace all the exception words with specific words with masks of 8's and 9's. Best regards |
For DAISY Pipeline, I would just like to keep using the exact same dictionaries that I am currently using, only with a different library. DAISY Pipeline includes the dictionaries taken from LibreOffice, except for the Norwegian and German ones which are special ones maintained by NLB and SBS. Swapping only the library should normally work just fine, because jhyphen and SimpleHyphenator are supposed to support the exact same dictionary file format. But I guess before testing with a more extended set of test data we need to find out what is going wrong with the Swedish test. Are you really sure you are using the same .dic file? AFAICS |
|
Hi @bertfrees I think the Hunspell hyphen library and the dictionaries seem to be just fine. If I run it locally using the example executable, it will do the right thing. The jHyphenator seems to have a problem with hyphenation at the end or beginning of the word. I can't see that it adds any dots anywhere. And I can't send them in either because the library has this pattern: To my knowledge, that pattern isn't good either because it will not take all Unicode characters available. So there is a bunch of issues with this library. The rules for the word Using the available rules above, if it has dots to work with, the best rule is I've done a bunch of lookups verifying the other two libraries' rules when it comes to the unit cases we have in the repository. With code that is easier to understand and verify, I can, with high certainty, say that it will find all patterns and words correctly. Not sure what gain we get by debugging and fixing jHyphen? This was just a library I added in 2020 to handle the issue that we were missing a couple of dictionaries in TeX format, Norwegian was one of them, I believe. Best regards |
|
Hi @bertfrees With the last commit, I've rewritten the glue code in jHyphen with the code from SimpleHyphenator so we can use the Hunspell hyphenator for performance testing. These results show that it's a tad bit slower, but with the cache, it's not impacting that much. You see it more in the individual file results than in the regression test results. I could run the regression test without a cache for each algorithm to get an even better picture, but it will not be the same code as we later want to run. Best regards |
|
Hi @bertfrees A requirement to run the tests is first converting them with my tool that fixes the hyphenation exceptions. And if we want to run Hunspell, then we need to run the SimpleHyphenator works with the Hunspell variant and the one without these complex rules. Best regards |
|
What do you mean by "I can't see that it adds any dots anywhere", and why should it? What was the content of your test.txt? |
Indeed, this is why Hunspell can perform better, because of the pre-processing. |
|
Hi @bertfrees I mean that the library needs the dots to delimit the words starting point and ending point when talking to the library. So currently, I add the dots after lowercasing the word in the hyphenation_inner method. Best regards |
One thing is that jhyphen supports non-standard hyphenation. |
Could this be a good resource to understand the concept or can you recommend another paper? Best regards |
But I'm not sure that it needs to be debugged actually. As far as I know it is not needed to mark the beginning and end of words. I never heard of this requirement and never noticed anything wrong. I'm really surprised that the |
Yes, that's the only resource I know but I think it's a good way to understand the concept. |
|
I've reproduced the issue with jhyphen and will look into it. |
Hi @bertfrees and @PaulRambags
I got curious about the hyphenation topic and thought there should be a way to make a simple, readable code for this specific topic. So I spent the weekend creating this SimpleHyphenator.
Moreover, I created this video explaining the concept. (https://youtu.be/XxV6bhCZHn0)
We generally believe this is a better solution that we could improve on going forward, but we don't want it to impact performance. I've run some benchmarks.
The first-millisecond test is running one book over and over six times. And then, I calculated the mean.
After that, you can see the result of 3 consecutive runs of the regression tests on my personal rig (A bit quicker than usual)
New Version 1.0.8-SNAPSHOT
Old Version 1.0.7
My conclusion is that there is no significant difference between the algorithms. Perhaps the old version could be slower or less predictable in the time a book takes to process. But overall there is not any big difference.
Best regards
Daniel