Improve handling HTML special cases #312

jelmervdl · 2022-01-26T21:16:15Z

Draft for now to just capture ongoing improvements

Change log

If markup appears on any of the tokens that make up a word, spread that markup to the rest of the word, even if the alignment scores for that token are lower than those of other tokens. Fixes the link "disappearing" from "Caesar" when translating the following with the es->en model: `Cayo o Gayo Trebacio Testa (m. ca. 4) nacido probablemente en <a href="/wiki/Velia" title="Velia">Velia</a>, <a href="/wiki/Basilicata" title="Basilicata">Lucania</a>, fue uno de los abogados más prominentes de su época. Amigo de <a href="/wiki/Cicer%C3%B3n" title="Cicerón">Cicerón</a>, fue recomendado por este a <a href="/wiki/Julio_C%C3%A9sar" title="Julio César">César</a>, quien lo acogió en la <a href="/wiki/Galia" title="Galia">Galia</a> como consejero legal y jefe de la oficina de comunicaciones. Apoyó a César en la <a href="/wiki/Segunda_guerra_civil_de_la_Rep%C3%BAblica_romana" title="Segunda guerra civil de la República romana">guerra civil</a> y tras su asesinato pasó al bando de Octavio (futuro <a href="/wiki/Augusto" title="Augusto">Augusto</a>), quien tuvo en gran estima a Trebacio hasta que este falleció en el 4 d. C. Escribió varias obras relativas al <a href="/wiki/Derecho_romano" title="Derecho romano">Derecho romano</a>, pero ninguna se ha conservado.`
Ignore certain tags, like <code> and <samp>. See: Pass through for certain HTML elements #313
Never treat  as word-breaking elements should not introduce spaces around them #339
Allow for  (I haven't seen this go wrong but in theory it could have gone wrong.)
That code that makes sure that each tag is returned in the output at least once? Yeah that was a bit too trigger happy. It now more correctly guarantees once and only once.
Use the isContinuation logic to detect when to insert a space after an open or close tag. This will add a space in underline, but not in end of the sentence. This change makes especially Wikipedia look more natural.
Ignore tags like <noscript> at the parser level because Firefox does so as well. Since Firefox can't guarantee that <noscript> tags have valid HTML inside them, this looks like the safest bet for now.
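
The first change-log item (spreading markup across all tokens of a word) can be sketched roughly as follows. This is a minimal illustration and not the code in html.cpp: the `Token` struct, its `word` index and the function name `spreadMarkupWithinWords` are all hypothetical, and it assumes tokens of the same word are adjacent.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a subword token: its text, the index of the word
// it belongs to, and the tag names attached to it after alignment.
struct Token {
  std::string text;
  std::size_t word;
  std::set<std::string> tags;
};

// Give every token of a word the union of the tags found on any of its tokens.
// Assumes tokens that belong to the same word are adjacent in the vector.
void spreadMarkupWithinWords(std::vector<Token> &tokens) {
  std::size_t begin = 0;
  while (begin < tokens.size()) {
    // Collect the tags of the run of tokens that share one word index.
    std::size_t end = begin;
    std::set<std::string> merged;
    while (end < tokens.size() && tokens[end].word == tokens[begin].word) {
      merged.insert(tokens[end].tags.begin(), tokens[end].tags.end());
      ++end;
    }
    // Write the merged set back to every token of that word.
    for (std::size_t i = begin; i < end; ++i) tokens[i].tags = merged;
    begin = end;
  }
}
```

With tokens such as "Ca" (tagged with the link) and "esar" (untagged) in the same word, both end up carrying the link, regardless of which token had the better alignment score.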

When a word near the end of a translated sentence aligns with one at the beginning, it pushes `prevIt` back to the beginning. Then the next translated token will insert all straggler void elements between `prevIt` and `it`. Instead of using `prevIt` to track where we were with inserting stragglers, we keep our own iterator that never moves backwards.
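
A rough sketch of that idea, under assumptions: the data model below (void elements stored per source token, one aligned source index per target token) and the name `placeStragglers` are invented for illustration and do not mirror the actual types in html.cpp. The point is only that the cursor advances monotonically, so a backwards alignment cannot cause void elements to be emitted a second time.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: voidTags[i] holds the void elements (e.g. "<br>") that
// precede source token i; alignment[t] is the source token aligned to target
// token t. Returns, per target token, the void elements emitted before it.
std::vector<std::vector<std::string>> placeStragglers(
    const std::vector<std::vector<std::string>> &voidTags,
    const std::vector<std::size_t> &alignment) {
  std::vector<std::vector<std::string>> out(alignment.size());
  std::size_t cursor = 0;  // first source position whose void tags are not yet emitted
  for (std::size_t t = 0; t < alignment.size(); ++t) {
    // Emit everything up to and including the aligned source token.
    std::size_t until = std::min(alignment[t] + 1, voidTags.size());
    for (; cursor < until; ++cursor)
      out[t].insert(out[t].end(), voidTags[cursor].begin(), voidTags[cursor].end());
    // If alignment[t] lies before the cursor, the loop body never runs and
    // the cursor stays where it is: it never moves backwards.
  }
  return out;
}
```

Void elements that sit after the last aligned position are not handled in this sketch; a real implementation would still have to flush them at the end of the sentence.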

These are all elements that Firefox treats as opaque in their HTML5 parser. As a consequence, when you'd request `noscriptElement.innerHTML` you'd get the raw text content of the thing, as opposed to a serialized tree. So invalid HTML? Just passed on as is! Well, we're going to do the same then. Besides, if noscript then also probably no extension.
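
To illustrate what "opaque" means here, the sketch below takes the text that follows an opening tag and returns everything up to the matching closing tag as raw, unparsed content. It is an assumption-laden illustration, not the `consumeIgnoredTag` implementation: the function names are hypothetical, and the tag-name comparison is ASCII case-insensitive.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// ASCII case-insensitive comparison of two views.
bool equalsIgnoreCase(std::string_view a, std::string_view b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(a[i])) !=
        std::tolower(static_cast<unsigned char>(b[i])))
      return false;
  }
  return true;
}

// Given the text that follows an opening tag such as <noscript>, return the
// raw content up to (not including) the matching "</name". Nothing inside is
// parsed. Without a closing tag, the rest of the input is treated as content.
std::string_view rawContentUntilClose(std::string_view rest, std::string_view name) {
  for (std::size_t i = 0; i + 2 + name.size() <= rest.size(); ++i) {
    if (rest[i] != '<' || rest[i + 1] != '/') continue;
    if (!equalsIgnoreCase(rest.substr(i + 2, name.size()), name)) continue;
    std::size_t after = i + 2 + name.size();
    // The name must end here, so "</noscripts>" does not close <noscript>.
    if (after == rest.size() || rest[after] == '>' ||
        std::isspace(static_cast<unsigned char>(rest[after])))
      return rest.substr(0, i);
  }
  return rest;
}
```

Because the content is never tokenized, invalid markup inside the element (an unclosed `<b>`, for instance) is passed through byte for byte.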

This tag is a bit difficult. No HTML is allowed inside of it (e.g. similar to `<textarea>`) but we do want to capture its text content as text (decoding entities etc.) so we can translate it. So for now I'll just trust that nobody is insane enough to use HTML inside the title tag. And if they do, we'll be as insane back and try to maintain that (very much not allowed) structure.

jelmervdl · 2022-02-16T13:26:47Z

I think this pull request has grown enough to start trying to clean it up and get it merged.

Preferably, #353 can be landed before this gets merged. I will add some measures that show that this set of changes improves the HTML output quality. The description above mostly covers it: edge cases and refactoring to keep growing code complexity manageable.

I've tried to make HTML.cpp as readable as possible by leaving comments with my intentions throughout, but clang-tidy still gives me very, very low grades for certain functions (especially those with switch statements). Please let me know if any bit of code's intended use isn't clear, or if you have suggestions on how I can split up certain parts to be more comprehensible.

jerinphilip · 2022-02-16T14:04:11Z

> Preferably, #353 can be landed, before this gets merged. I will add some measures that show that this set of changes improves the HTML output quality.

Let's take this PR forward before #353. #353 looks like it will take longer - it's still exploratory. Let's discuss internally how to craft and position the numbers.

jelmervdl · 2022-02-21T10:30:12Z

I'll try to add a more elaborate test to bergamot-translator-tests to test some of the added edge cases as well.

jerinphilip

Leaving a few comments that are local to the source in a first pass. I expect to take one more pass once my understanding of the HTML pipeline improves from exchanges here.

I will try to do some black-box testing via the extension (as opposed to the previous Python setup; thanks @jelmervdl for the dogfooding capabilities, and for the library improvements that came out of experimenting with the extension).

Overall positive about this PR, so hoping to expedite the merge. Most of the queries and suggestions below may be undertaken in a follow-up PR (save a suspected infinite loop).

src/translator/xh_scanner.cpp

src/translator/html.cpp

src/translator/html.h

src/translator/html.cpp

jerinphilip

Thanks for the extra tests. Let's get this in, and pursue things below in later PRs.

As feedback that needn't necessarily be covered in this PR: someone trying to understand the HTML part will appreciate a bigger blob of documentation near the HTML class or something describing the components (parsing, what form of AST/DOM equivalent, target HTML gen/restoration). At least I would.

Minor: I do not fully understand what straggler means (tried googling) in this context. I see most of the Taint (which I had trouble comprehending before) has been replaced with TagStack.

Long term I think it will be a good idea to abstract different readers and writers, inserting something pandoc-like in between, thus providing a framework to translate formats (docx, markdown, LaTeX etc). Right now we're specializing for HTML.

jelmervdl added 14 commits January 26, 2022 17:49

Aggressively try to retain markup on words if it appears on one of its source tokens

40eabc1

I do need those continuation delimiters for that, even though I really don't like them since they're so character set focussed!

Outdated todo

723e725

🎉

Be explicit about where the two different string_view types are used

9600c70

Make HTML tags case insensitive

3d6673c

Tag case is retained in the output though. Well, for the opening tag at least. Closing tag always matches opening tag.

Treat special

5634c40

Fixes #339

Add support for ignoring tags

e516dbd

Fixes #313

Merge branch 'main' into html-improvements

19acb54

Add test for regression in ignored element code path

46159ba

std::bad_alloc :( Also expand tests to make sure we're recording the full ignored tag contents.

Fix bad_alloc in consumeIgnoredTag

af39c75

Trouble was that `Scanner::scanEntity()` returns a value() that does not point to inside the HTML input stream (but to a *decoded* entity instead). So we need another API, `Scanner::start()`, to figure out where a token starts in HTML.
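
The problem described in this commit can be shown with a toy scanner. The class below is hypothetical and much simpler than the scanner in xh_scanner.cpp (it only knows plain text and `&amp;`), but it shows why a separate `start()` accessor is needed: for an entity, `value()` refers to a decoded buffer owned by the scanner, so its pointer says nothing about the position in the HTML input.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical miniature scanner, not the real xh_scanner API. It reads either
// a run of plain text or the entity "&amp;" from the input.
class MiniScanner {
 public:
  explicit MiniScanner(std::string_view input) : input_(input) {}

  // Scans the next token; returns false at the end of the input.
  bool next() {
    if (pos_ >= input_.size()) return false;
    start_ = pos_;
    if (input_.substr(pos_, 5) == "&amp;") {
      decoded_ = "&";    // the value lives in our own buffer, not in input_
      value_ = decoded_;
      pos_ += 5;
    } else {
      std::size_t end = input_.find('&', pos_ + 1);
      if (end == std::string_view::npos) end = input_.size();
      value_ = input_.substr(pos_, end - pos_);  // the value points into input_
      pos_ = end;
    }
    return true;
  }

  std::string_view value() const { return value_; }

  // Offset in the raw input where the current token begins. Always valid,
  // even when value() refers to decoded text.
  std::size_t start() const { return start_; }

 private:
  std::string_view input_;
  std::size_t pos_ = 0;
  std::size_t start_ = 0;
  std::string decoded_;
  std::string_view value_;
};
```

Code that records the raw contents of an ignored element can then slice the input between two `start()` offsets instead of doing pointer arithmetic on `value()`.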

Use isContinuation function to check whether we need to insert a space after a tag

32f403a

Main reason for using this instead of `std::isspace` is to prevent a space being inserted between the tag and the full stop in `This is a test.`. Because that has been bothering me a lot.
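
The decision this commit describes might look roughly like the sketch below. The real `isContinuation` is not shown in this conversation, so everything here is an assumption: the function names are hypothetical and the delimiter set is a guess at the kind of characters that should stay attached to the preceding word.

```cpp
#include <cctype>
#include <string_view>

// Hypothetical approximation of a continuation check: text that starts with
// closing punctuation attaches to the word before it.
bool attachesToPreviousWord(std::string_view text) {
  static constexpr std::string_view kDelimiters = ".,;:!?)]}";
  return !text.empty() && kDelimiters.find(text.front()) != std::string_view::npos;
}

// Decide whether a space is needed between the text before a tag and the
// text after it. A plain whitespace test alone would insert a space before
// a full stop; the continuation check prevents that.
bool needsSpaceAfterTag(std::string_view before, std::string_view after) {
  if (before.empty() || after.empty()) return false;
  auto isSpace = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  // Whitespace is already present on one side: nothing to insert.
  if (isSpace(before.back()) || isSpace(after.front())) return false;
  return !attachesToPreviousWord(after);
}
```

A fixed ASCII delimiter set is exactly the kind of character-set-specific logic the first commit message complains about; scripts with other punctuation would need their own entries.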

Merge branch 'main' into html-improvements

afc75f0

# Conflicts: # src/translator/html.cpp

jelmervdl mentioned this pull request Feb 15, 2022

Continuous checks and evaluation of HTML translation feature #331

Open

This was linked to issues Feb 15, 2022

Pass through for certain HTML elements #313

Closed

 elements should not introduce spaces around them #339

Closed

jelmervdl added 4 commits February 16, 2022 13:12

Follow clang-tidy advice

dda9860

Fix missing \n\n?

d7e1c07

I don't know what happened here.

Add more comments and less creative variable names

203ba0a

Hopefully this will make the overall code more readable given you're familiar with the concept it tries to implement…

Too many negations and my head just negates itself

a1ee8e9

jelmervdl marked this pull request as ready for review February 21, 2022 10:29

jelmervdl requested a review from jerinphilip February 21, 2022 10:30

Update bergamot-translator-tests

ac83e50

jelmervdl mentioned this pull request Feb 21, 2022

Expand HTML tests browsermt/bergamot-translator-tests#58

Merged

jerinphilip reviewed Feb 21, 2022

Update tests

6a7bd21

jelmervdl added 7 commits February 21, 2022 16:27

Replace snake_case and magic numbers

c90d00f

Use std::max_element instead of own implementation

ad612e4

Rename isSubset to extends (and flip argument order for readability)

f451983

Move apply(AnnotatedText const&, Fun) to AnnotatedText itself.

c891eda

Try to reduce the number of nested conditions in consumeIgnoredTag a bit

54be426

Update tests for Ubuntu 18.04/avx2

279462c

Revert int64_t to size_t (and mute tidy complaining about it)

346821b

jerinphilip previously approved these changes Feb 22, 2022

jerinphilip and others added 3 commits February 22, 2022 18:37

Merge branch 'main' into html-improvements

8cc695b

Bit more high level documentation on how HTML class works.

48cfc00

Remark about 'taint'

a81dfdf

jelmervdl dismissed jerinphilip’s stale review via a81dfdf February 22, 2022 19:51

jelmervdl added 2 commits February 22, 2022 21:11

Fix the constructor situation

bbfa4e3

Add accidentally removed private methods back to header

ea10e91

jerinphilip changed the title ~~HTML processing improvements~~ Improve handling HTML special cases Feb 22, 2022

jerinphilip merged commit 1f98f97 into main Feb 22, 2022

jerinphilip deleted the html-improvements branch February 22, 2022 20:25

jerinphilip mentioned this pull request Mar 15, 2022

JS: Using supervised QE models for available language pairs #378

Merged
