Continuous checks and evaluation of HTML translation feature #331

jerinphilip opened this issue Feb 5, 2022 · 2 comments
Labels: mod: ci (Things related to CI code in this repository), mod: html (Issues related to handling HTML)


@jerinphilip
Contributor

Translating HTML to provide value to the user is doable and done quite well here. Translating all sorts of HTML with error correction is a decent research problem. This issue collects what is already scattered across several internal messaging channels into a public document for future reference and visibility.

Problem

There are at least two aspects to this issue:

  1. Is our mechanism able to handle all sorts of HTML thrown at it without crashing? If it does crash, do we have a means of communicating the failure to other consumers (looking at you, WebAssembly) so they can handle it gracefully?
  2. Are the rules we encode here the best fit for the noise and corruption in real-world HTML? Treating HTML elements as word-breaking works in some cases but fails miserably in others (illustrated below). Currently, we are engineering a rule-based system for error correction, assuming malformed HTML (Treat most HTML elements as word-breaking #286 (comment)). While I still doubt whether bergamot-translator should have taken this up, the HTML feature appears to have reached a satisfactory state.

However, we remain without consensus on whether what we are doing is better than the existing setup, or whether one HTML assumption is better than another, beyond the developer's instincts built on experience.
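To make the word-breaking trade-off concrete, here is a minimal, self-contained sketch (not the bergamot-translator implementation; `strip_tags` is a hypothetical helper) showing how the two assumptions behave on inline versus block elements:

```python
# Hypothetical helper, for illustration only: strip tags and either treat each
# element boundary as a word break (insert a space) or not.
import re

def strip_tags(html: str, word_breaking: bool = True) -> str:
    sep = " " if word_breaking else ""
    text = re.sub(r"<[^>]+>", sep, html)
    return re.sub(r"\s+", " ", text).strip()

# Inline markup inside a word: word-breaking splits the token incorrectly.
print(strip_tags("un<b>believable</b>", word_breaking=True))    # "un believable"
print(strip_tags("un<b>believable</b>", word_breaking=False))   # "unbelievable"

# Adjacent block elements: *not* word-breaking glues separate words together.
print(strip_tags("<p>Hello</p><p>world</p>", word_breaking=True))   # "Hello world"
print(strip_tags("<p>Hello</p><p>world</p>", word_breaking=False))  # "Helloworld"
```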

A skeleton solution

The infrastructure to answer these questions could work as follows: obtain a representative sample of noisy real-world HTML, have experts correct it to create an evaluation dataset, then define a few metrics we consider valuable and continuously track a scalar score constructed by aggregating those metrics over the evaluation dataset.
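As a rough illustration of the aggregation idea (the metric name, weights, and the tag-preservation measure below are placeholders, not an agreed-upon set), assuming expert-corrected (reference, output) pairs are available:

```python
import re

def tag_preservation(reference_html: str, output_html: str) -> float:
    """Placeholder metric: fraction of tags in the reference that also appear in the output."""
    ref_tags = re.findall(r"</?[a-zA-Z][a-zA-Z0-9]*", reference_html)
    out_tags = re.findall(r"</?[a-zA-Z][a-zA-Z0-9]*", output_html)
    return sum(1 for t in ref_tags if t in out_tags) / max(len(ref_tags), 1)

def aggregate(per_metric_scores: dict, weights: dict) -> float:
    """Weighted average over the chosen metrics, yielding the single scalar to track."""
    return sum(weights[m] * s for m, s in per_metric_scores.items()) / sum(weights.values())

# One (reference, output) pair; a real run would average over the whole evaluation dataset.
scores = {"tag_preservation": tag_preservation("<p><b>Hi</b></p>", "<p><b>Hallo</b></p>")}
print(aggregate(scores, {"tag_preservation": 1.0}))  # 1.0
```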

An existing implementation

https://github.com/jerinphilip/tagtransfer is an exploratory undertaking towards the above problem, in pursuit of setting up the infrastructure. It is in Python, which, unlike WebAssembly, is rich in HTML parsing, validation, and debugging tools. From it we can expand to:

  1. Crawl many web pages and ensure the HTML translation mechanism doesn't crash. I don't believe we can handle all invalid user input, but if we get through something like 95% of exemplary web pages without a crash, we can either shift the blame to bad developers or have them correct their HTML. There is a manual, google-translate-website-like mechanism in place, but making it automated is straightforward (see the sketch after this list).
  2. Use an already existing XML dataset and its evaluation data to provide a straightforward array of metrics for now. In the future we can enhance this by allowing force-decoding and restricting the scope to evaluating the HTML algorithm alone.
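A rough sketch of item 1, assuming a crawl list `PAGES` and a `translate_html` wrapper around the Python bindings; neither name comes from this issue or the tagtransfer repository:

```python
import urllib.request

PAGES = [
    "https://example.com/",
    # ... a larger, representative list of crawled pages would go here
]

def translate_html(html: str) -> str:
    """Hypothetical wrapper around the bergamot-translator bindings."""
    raise NotImplementedError

crashed = []
for url in PAGES:
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    try:
        translate_html(html)
    except Exception as exc:  # anything the translator raises counts as a crash
        crashed.append((url, repr(exc)))

total = len(PAGES)
print(f"{total - len(crashed)}/{total} pages translated without crashing")
for url, error in crashed:
    print(f"  {url}: {error}")
```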

Alternate ideas, improvements and suggestions are welcome and much appreciated.

jerinphilip added the mod: ci and mod: html labels on Feb 5, 2022
@jelmervdl
Member

I started implementing some tests in Python, particularly for the things I'm focussing on with parsing & restoring HTML: https://colab.research.google.com/drive/1asuIT1OffBxKz-88pQrGDBxgmYWVvF6J?usp=sharing
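For reference, a minimal sketch of the kind of check involved (the measures in the linked notebook may differ), comparing the tag sequence of the input against the translated output with the standard library's HTML parser:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects start/end tags in document order, ignoring text and attributes."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))

def tag_sequence(html: str):
    collector = TagCollector()
    collector.feed(html)
    return collector.tags

def test_tags_preserved():
    source = "<p>Hello <b>world</b>!</p>"
    translated = "<p>Hallo <b>wereld</b>!</p>"  # would come from the translator
    assert tag_sequence(source) == tag_sequence(translated)
```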

@jelmervdl
Member

jelmervdl commented Feb 15, 2022

@jerinphilip I'd like to add this to CI somehow, not as a pass-or-fail test but as a "hey, this gets a score of N" type of job. It would help with comparing #312 (and future ones like it) to main. Could you help with adding this to CI?

Edit: to clarify, I'm thinking of something a bit more like 2: we have a standard set of pages that we write measures for (e.g. the ones in my colab example), and then report scores for each of those measures per push/pull request.
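A sketch of what such a non-failing job could run (the page set, measure names, and output format below are placeholders, not an agreed interface):

```python
import json

def run_measures(page_html: str) -> dict:
    """Placeholder: would translate the page and apply the measures from the notebook."""
    return {"tags_preserved": 1.0, "whitespace_preserved": 0.97}

# A standard set of pages, checked into the repository alongside the measures.
pages = {"simple.html": "<p>Hello <b>world</b></p>"}

report = {name: run_measures(html) for name, html in pages.items()}
print(json.dumps(report, indent=2))  # CI could post this per push/pull request
```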
