Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spellcheck all enabled scripts and improve URL detection #613

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

codokie
Copy link
Contributor

@codokie codokie commented Apr 1, 2024

I made some changes to the spellcheck so that it checks enabled locales other than the active one if they are of different scripts. If they are of different script, there is no concern that the spellcheck would validate typos in written other locales.
If no locale could be determined, then the text consists of non-letters only (numbers, special characters, etc.) so it should not be spell checked.

The result is saved in the cache for quick retrieval since the spellcheck service may be called multiple times for the same word.
This change should also make it seem like the spellchecker can "remember which language a word was typed" (#93) even though the TextInfo has no attribute for the locale.

If the experimental setting "URL detection" is enabled, URLs are checked using a regex pattern Matcher to prevent them from being flagged as typos.
For some reason though, URLs containing '?', '!', or ';' are still truncated before they are sent to the spellchecker, so it marks the text after any of those symbols as a typo if its not a valid dictionary word.

Copy link
Contributor Author

@codokie codokie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1c5d955: Nevermind.. if mLocale is null then it is assigned a value. My bad.

@Helium314
Copy link
Owner

Are the changes to URL detection and spell check dependent on each other?
If not, I would much prefer to have the changes split up into separate PRs.

@codokie
Copy link
Contributor Author

codokie commented Apr 3, 2024

@Helium314 the URL detection changes are minimal and were made solely for the spellchecker so that it won't mark URI as typos. There might be a branch conflict if I'll separate them.

@Helium314
Copy link
Owner

Why actually check all enabled scripts? I remember somewhere you wrote that checking the whole text when adding a new word is already inefficient. Checking it in all scripts seems even less efficient than that.
When it comes to performance, you should also be very careful with regexes.

Btw it might be good to define what is the wanted behavior, because from your description I don't think I understand it.

the URL detection changes are minimal and were made solely for the spellchecker

I did not check the code yet, but it looks like your are changing RichInputConnection and InputLogic which are not related to the spell checker.

@codokie
Copy link
Contributor Author

codokie commented Apr 3, 2024

I know this approach is not exactly efficient, but spellchecking is not inherently a light task, and those who enable it should know it requires more processing power. That's why the regex is checked only for those who enable the experimental URL detection in settings. To make it efficient might require a real overhaul and I admit that I am not knowledgeable enough for that task.

The check for determining the locale from the script of the letters is O(n^2) at the worst case if it's a number or similar.
But assuming most people are only bilingual, this should not be too bad probably, and for actual words the script will be found within 2 iterations.

The wanted behavior is as follows: If I type in English and then switch to any subtype of a different script like Greek (for example), the words written in English should be checked in the English dictionary and not the Greek one.

It's just that the current implementation flags way too many legit words, and it's not that efficient either. It even checks numbers against the dictionary which is really a waste.

@codokie
Copy link
Contributor Author

codokie commented Apr 25, 2024

@Helium314 Are there any plans to include this in the upcoming version?
The current state of the spell checker is almost unusable for those who write in multiple scripts.
I think it has something to do with changes made in Android 14. Even numbers are spellchecked which is really weird..
I can revert the suggestion length limit but the URL detection is vital to prevent flagging URLs as typos.

@codokie
Copy link
Contributor Author

codokie commented Apr 26, 2024

I made another change to the spellchecker, it should now use the cache to avoid checking the dictionary when the word has been checked before.
This should also make it seem like it can "remember which language a word was typed" even though the TextInfo has no attribute for the locale.
@RHJihan could you please check whether this solves your old issue with the spellchecker #93?

@RHJihan
Copy link
Contributor

RHJihan commented Apr 28, 2024

@RHJihan could you please check whether this solves your old issue with the spellchecker #93?

Yes, it solves the problem.

@Uranusek
Copy link

Could you make changes to spellcheck so that it doesn't check single letters and names after @?

For example, when replying to someone on Discord or GitHub, their nick will be underlined and this is a bit irritating. Single letters are also underlined for some reason. Even capital letters of those that are already in the dictionary, e.g. O or Z.

@codokie
Copy link
Contributor Author

codokie commented May 23, 2024

@Uranusek I agree the spellchecker should not underline words starting with @
However I have a feeling there may be more special use cases that it does not cover, if you think of any other sensible rule to add, please let me know as that could really help make this feature useful & accurate. Thanks

@codokie
Copy link
Contributor Author

codokie commented May 23, 2024

Single letters are also underlined for some reason. Even capital letters of those that are already in the dictionary, e.g. O or Z.

With the current changes in this PR, words with only 1 letter will not be reported as typos, so this issue should be fixed.

@Uranusek
Copy link

Oh, that's great, thanks! I think spellchecker shouldn't underline email addresses either.

@codokie
Copy link
Contributor Author

codokie commented May 26, 2024

spellchecker shouldn't underline email addresses

That should also not happen anymore when this PR is merged, but only if you enable the "Advanced URL detection" (disabled by default) setting.
I don't know how bad is the performance impact of regex pattern matching on low-end phones so it may be best to associate this feature with an experimental setting.

An alternative solution which should barely impact performance is to whitelist all words containing an at sign '@', but it would mean that this, for example, would not be considered a typo: mail@example,,,org

@Helium314
Copy link
Owner

I finally had a look at this PR, and was reminded why I had postponed this several times. Sorry for this story again, but it really bothers me.
(you even wrote "URLs, numbers and special characters are checked against the dictionary as well. Unsurprisingly, they are then flagged as typos (red underlined). But that's for another issue..")

This should be at least 4 PRs:

  • check in all enabled scripts
  • improve URL detection
  • change cache usage (avoid underlining previously enabled languages)
  • at least one more PR for the other changes (like treatment of short words, or separating words on dash, which btw I didn't find mentioned anywhere, it just appears in the code)

Advantages:

  • less time I need to spend for review, because I only need to understand a single change instead of deciphering which change belongs where
  • faster review, because I can read a short PR in much less time than a long one (time I need to understand grows more than linear with length)
  • increased probablility I'm actually looking at PRs instead of thinking "not another one of those things, will check later"
  • I can relatively simply revert an individual change if it turns out to be breaking stuff / introducing bad behavior

Now for the actual code stuff:
I think it has potential for issues (inconsistencies with cache and when things in other languages are underlined or not)
but overall it's an improvement.

Some comments / questions in the code section.

I don't know how bad is the performance impact of regex pattern matching on low-end phones

Performance is fine.

}
}

// Finds the enabled locale with a script matching one of the letters in the given text.
public Locale findLikelyLocaleOfText(final String text) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is somewhat misleading, better rename to reflect what it actualy does: find a locale with matching script

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.. maybe it should be moved out of the AndroidWordLevelSpellCheckerSession class too because it is not specifically related to the spellchecker except for the local variable localesToCheck (which can be passed as an argument)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not necessary to move it, but I definitely agree with your reasoning. Possibly could be moved to ScriptUtils or maybe LocaleUtils.

@@ -47,6 +48,8 @@ public abstract class AndroidWordLevelSpellCheckerSession extends Session {

public final static String[] EMPTY_STRING_ARRAY = new String[0];

public final static int FLAG_UNCHECKABLE = 0;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes into SuggestionsInfo as suggestionsAttributes, but in the documentation I can't find anything referring to uncheckable. 0 just seems to be when none of the RESULT_ATTR_[...] applies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I may have overlooked that, although a 0 flag is already used elsewhere in the code:
https://github.com/Helium314/HeliBoard/blame/136b45880e6c3aadb6d9e1b68f90851b1fbb94e7/app/src/main/java/helium314/keyboard/latin/spellcheck/AndroidSpellCheckerSession.java#L72
so it might need to be changed as well.
if we don't know what is the fallback behavior in the absence of flag then it would probably be wrong to name it like that with a constant

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above says "Neither RESULT_ATTR_IN_THE_DICTIONARY nor RESULT_ATTR_LOOKS_LIKE_TYPO", and this also fits with documentation, which implies that 0 means none of the flags.
There is no information regarding 0 meaning "uncheckable".

Copy link
Contributor Author

@codokie codokie Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you had spent more time to understand how the code works, you could have inferred that cached suggestions with FLAG_UNCHECKABLE just return AndroidSpellCheckerService.getNotInDictEmptySuggestions(false), so this a flag of 0 is not returned to/handled by the native code/another class.
Essentially this constant could be named however we want as long as its purpose is clear.
The only change I would make is visibility-related: turn it to a private constant.

@@ -171,58 +205,31 @@ public void onClose() {
}

private static final int CHECKABILITY_CHECKABLE = 0;
private static final int CHECKABILITY_TOO_MANY_NON_LETTERS = 1;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to be notified if I wrote "My phone number is0123456789" <- missed a space after "is"

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem I see here is that this might not be wanted by some people (it might have been introduced for a reason), and with the current style of PR I can't just simply revert this change if it turns out clearly unpopular.

@codokie
Copy link
Contributor Author

codokie commented Jul 24, 2024

@Helium314 thanks for the review, you've raised some good points.

Admittedly I'm somewhat lazy and don't like fixing merge conflicts so that's why it's all wound up in a single PR.
Even if it's bad practice, I don't think it's uncommon to do a overhaul of a single feature (in this case, the spellchecker) in one PR. Especially if only about 100 lines of code were modified.

As you know from my other PRs I kind of lost interest with the project and have less time currently so I don't know when/if to rework this PR. Sorry about that..

If you would still like to merge this PR (you could also split it as you had suggested), I'll do my best to answer (hopefully not too many) questions here (within reasonable time).

@Helium314
Copy link
Owner

Even if it's bad practice, I don't think it's uncommon to do a overhaul of a single feature (in this case, the spellchecker) in one PR.

Such an overhaul can be done, but needs to be in a better discussed and documented way. Things like removing CHECKABILITY_TOO_MANY_NON_LETTERS is not in the linked issue, and not in the description of your PR. I wouldn't be surprised if this change is disliked by a bunch of users.
Reviewing and understanding changes to code I have barely read once (which is the case for the spell checker) is quite time consuming, and even more so when you find undocumented extras and you have to decipher whether they are actually undocumented changes, or just related to something you don't yet understand. With my current understanding of the spell checker, I never would have agreed to a general overhaul in a single PR. And even with more knowledge, such things should be first discussed with other community members (personally I don't really care because I don't use the spell checker).

If you would still like to merge this PR

I will merge it with some smaller changes as mentioned above.

@codokie
Copy link
Contributor Author

codokie commented Aug 26, 2024

Things like removing CHECKABILITY_TOO_MANY_NON_LETTERS

What I suggested is also the behavior of some spellcheckers like the one present in Microsoft Word.
No one seems to be upset that currently the spellchecker labels numbers as typos, so I highly doubt you will see someone complain if a mix of letters and numbers will be labeled as a typo.

The reason why your fork of OpenBoard interested me is because you added some neat features to it. It seems that (perhaps due to time constraints) you are no longer focused on renovating the keyboard anymore, only to maintain it. That's understandable, but you did not convince me that my ideas are not desirable by others.

I had spent some considerable time debugging the spellchecker and I think I have more than a basic understanding of how it works. I assure you that the person who originally added this functionality did not get it completely right (not trying to claim that I know much better.)

I won't be participating in this PR anymore because I do not think that your review is open-minded and you will likely heavily change things you do not really understand or able to test.
So feel free to alter my undecipherable code as you see fit, or to close the PR if you feel offended (not exactly my intention)

@BlackyHawky
Copy link
Contributor

(My comment below is off-topic so I'm hiding it by default.)

Read my comment

It's so sad to read this comment, even though I can totally understand it. ☹️

@Helium314
I hope you'll find a compromise so that @codokie can work on this application again.
His ideas need to be exploited, and creating a developer group may be a good idea so that you can freely discuss future improvements.
I think 2, 3 or 4 people maximum to manage the application might be a good idea.

I hope this will be taken into consideration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants