Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling HTML special cases #312

Merged
merged 32 commits into from
Feb 22, 2022
Merged

Improve handling HTML special cases #312

merged 32 commits into from
Feb 22, 2022

Conversation

jelmervdl
Copy link
Member

@jelmervdl jelmervdl commented Jan 26, 2022

Draft for now to just capture ongoing improvements

Change log

  • If markup appears in any of the tokens that make up a word, spread that markup to the rest of the word. Even if alignment scores for that token are lower than other tokens. Fixes the link "disappearing" from "Ceasar" in <p><b>Cayo</b> o <b>Gayo Trebacio Testa</b> (m. ca. 4) nacido probablemente en <a href="/wiki/Velia" title="Velia">Velia</a>, <a href="/wiki/Basilicata" title="Basilicata">Lucania</a>, fue uno de los abogados más prominentes de su época. Amigo de <a href="/wiki/Cicer%C3%B3n" title="Cicerón">Cicerón</a>, fue recomendado por este a <a href="/wiki/Julio_C%C3%A9sar" title="Julio César">César</a>, quien lo acogió en la <a href="/wiki/Galia" title="Galia">Galia</a> como consejero legal y jefe de la oficina de comunicaciones. Apoyó a César en la <a href="/wiki/Segunda_guerra_civil_de_la_Rep%C3%BAblica_romana" title="Segunda guerra civil de la República romana">guerra civil</a> y tras su asesinato pasó al bando de Octavio (futuro <a href="/wiki/Augusto" title="Augusto">Augusto</a>), quien tuvo en gran estima a Trebacio hasta que este falleció en el 4&nbsp;d.&nbsp;C. Escribió varias obras relativas al <a href="/wiki/Derecho_romano" title="Derecho romano">Derecho romano</a>, pero ninguna se ha conservado.</p> (with es->en model)
  • Ignore certain tags, like <code> and <samp> Pass through for certain HTML elements #313
  • Never treat <wbr> as word-breaking <wbr> elements should not introduce spaces around them #339
  • Allow for <P></p> (I haven't seen this go wrong but in theory it could have gone wrong.)
  • That code that makes sure that each tag is returned in the output at least once? Yeah that was a bit too trigger happy. It now more correctly guarantees once and only once.
  • Use the isContinuation logic to detect when to insert a space after an open or close tag. This will add a space in un<u>der</u>line, but not in end of the <b>sentence</b>. This change makes especially Wikipedia look more natural.
  • Ignore tags like <noscript> at the parser level because Firefox does so as well. Since Firefox can't guarantee that <noscript> tags have valid HTML inside them, this looks like the safest bet for now.

…s source tokens

I do need those continuation delimiters for that, even though I really don't like them since they're so character set focussed!
Tag case is retained in the output though. Well, for the opening tag at least. Closing tag always matches opening tag.
std::bad_alloc :( Also expand tests to make sure we're recording the full ignored tag contents.
Trouble was that `Scanner::scanEntity()` returns a value() that does not point to inside the HTML input stream (but to a *decoded* entity instead). So we need another API, `Scanner::start()`, to figure out where a token starts in HTML.
When a word near the of a translated sentence aligns with one at the beginning, it pushes prevIt back to the beginning. Then the next translated token will insert all straggler void elements between prevIt and it. Instead of using prevIt to track where we were with inserting stragglers, we keep our own iterator that never moves backwards.
…e after a tag

Main reason for using this instead of `std::isspace` is to prevent a space being inserted between the tag and the full stop in `This is a <b>test</b>.`. Because that has been bothering me a lot.
# Conflicts:
#	src/translator/html.cpp
These are all elements that Firefox treats as opaque in their HTML5 parser. As a consequence, when you'd request `noscriptElement.innerHTML` you'd get the raw text content of the thing, as opposed to a serialized tree. So invalid HTML? Just passed on as is! Well, we're going to do the same then. Besides, if noscript then also probably no extension.
This tag is a bit difficult. No HTML is allowed inside of it (e.g. similar to `<textarea>`) but we do want to capture it's text content as text (decoding entities etc.) so we can translate it. So for now I'll just trust that nobody is insane enough to use HTML inside the title tag. And if they do, we'll be as insane back and try to maintain that (very much not allowed) structure.
I don't know what happened here.
Hopefully this will make the overall code more readable given you're familiar with the concept it tries to implement…
@jelmervdl
Copy link
Member Author

I think this pull request has grown enough to start trying to clean it up and get it merged.

Preferably, #353 can be landed, before this gets merged. I will add some measures that show that this set of changes improves the HTML output quality. The description above mostly cover it: edge cases and refactoring to keep growing code complexity manageable.

I've tried to make HTML.cpp as readable as possible by leaving comments with my intentions throughout, but clang-tidy still gives be very, very low grades for certain functions (especially those with switch statements.) Please let me know if any bit of code's intended use isn't clear, or if you have suggestions on how I can split up certain parts to be more comprehensible.

@jerinphilip
Copy link
Contributor

Preferably, #353 can be landed, before this gets merged. I will add some measures that show that this set of changes improves the HTML output quality.

Let's take this PR forward before #353. #353 looks like it will take longer - it's still exploratory. Let's discuss internally how to craft and position the numbers.

@jelmervdl jelmervdl marked this pull request as ready for review February 21, 2022 10:29
@jelmervdl
Copy link
Member Author

I'll try to add a more elaborate test to bergamot-translator-tests to test some of the added edge cases as well.

Copy link
Contributor

@jerinphilip jerinphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a few comments that are local to the source around in a first pass. I expect to take one more pass once my understanding of the HTML pipeline improves from exchanges here.

I will try to do some black-box testing via the extension (as opposed to the previous python, thanks @jelmervdl for the dogfooding capabilities and improvements consequent of experimenting with the extension to the library).

Overall positive about this PR, so hoping to expedite merge. Most of the queries and suggestions below may be undertaken follow-up PR (save a suspected infinite loop).

src/translator/xh_scanner.cpp Outdated Show resolved Hide resolved
src/translator/xh_scanner.cpp Outdated Show resolved Hide resolved
src/translator/xh_scanner.cpp Outdated Show resolved Hide resolved
src/translator/html.cpp Outdated Show resolved Hide resolved
src/translator/html.cpp Outdated Show resolved Hide resolved
src/translator/html.cpp Outdated Show resolved Hide resolved
src/translator/html.h Show resolved Hide resolved
src/translator/html.cpp Show resolved Hide resolved
src/translator/html.cpp Outdated Show resolved Hide resolved
src/translator/html.cpp Outdated Show resolved Hide resolved
jerinphilip
jerinphilip previously approved these changes Feb 22, 2022
Copy link
Contributor

@jerinphilip jerinphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra tests. Let's get this in, and pursue things below in later PRs.

As feedback, that needn't necessarily be covered in this PR - someone trying to understand the HTML part will appreciate a bigger blob of documentation near the HTML class or something describing the components (parsing, what form of AST/DOM equivalent, target HTML gen/restoration) - at least I would.

Minor: I do not fully understand what stragger means (tried googling) in this context. I see most of the Taint (which I had troubled comprehending before) has been replaced with TagStack.

Long term I think it will be a good idea to abstract different readers writers inserting something pandoc like in between, thus providing a framework to translate formats (docx, markdown, LaTeX etc). Right now we're specializing for HTML.

@jerinphilip jerinphilip changed the title HTML processing improvements Improve handling HTML special cases Feb 22, 2022
@jerinphilip jerinphilip merged commit 1f98f97 into main Feb 22, 2022
@jerinphilip jerinphilip deleted the html-improvements branch February 22, 2022 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

<wbr> elements should not introduce spaces around them Pass through for certain HTML elements
2 participants