Improve handling HTML special cases #312

Merged on Feb 22, 2022 (32 commits)
Changes from 19 commits
32 commits
40eabc1
Aggressively try to retain markup on words if it appears on one of it…
jelmervdl Jan 26, 2022
723e725
Outdated todo
jelmervdl Jan 26, 2022
9600c70
Be explicit about where the two different string_view types are used
jelmervdl Feb 8, 2022
3d6673c
Make HTML tags case insensitive
jelmervdl Feb 8, 2022
5634c40
Treat <wbr> special
jelmervdl Feb 8, 2022
e516dbd
Add support for ignoring tags
jelmervdl Feb 8, 2022
19acb54
Merge branch 'main' into html-improvements
jelmervdl Feb 8, 2022
46159ba
Add test for regression in ignored element code path
jelmervdl Feb 9, 2022
af39c75
Fix bad_alloc in consumeIgnoredTag
jelmervdl Feb 9, 2022
f595c51
Prevent straggler void elements to show up twice
jelmervdl Feb 11, 2022
32f403a
Use isContinuation function to check whether we need to insert a spac…
jelmervdl Feb 11, 2022
afc75f0
Merge branch 'main' into html-improvements
jelmervdl Feb 14, 2022
72e54f8
Treat more elements as opaque when parsing
jelmervdl Feb 14, 2022
ea244d2
Do not skip `<title>` for now
jelmervdl Feb 14, 2022
dda9860
Follow clang-tidy advice
jelmervdl Feb 16, 2022
d7e1c07
Fix missing \n\n?
jelmervdl Feb 16, 2022
203ba0a
Add more comments and less creative variable names
jelmervdl Feb 16, 2022
a1ee8e9
Too many negations and my head just negates itself
jelmervdl Feb 16, 2022
ac83e50
Update bergamot-translator-tests
jelmervdl Feb 21, 2022
6a7bd21
Update tests
jelmervdl Feb 21, 2022
c90d00f
Replace snake_case and magic numbers
jelmervdl Feb 21, 2022
ad612e4
Use std::max_element instead of own implementation
jelmervdl Feb 21, 2022
f451983
Rename isSubset to extends (and flip argument order for readability)
jelmervdl Feb 21, 2022
c891eda
Move apply(AnnotatedText const&, Fun) to AnnotatedText itself.
jelmervdl Feb 21, 2022
54be426
Try to reduce the number of nested conditions in consumeIgnoredTag a bit
jelmervdl Feb 21, 2022
279462c
Update tests for Ubuntu 18.04/avx2
jelmervdl Feb 21, 2022
346821b
Revert int64_t to size_t (and mute tidy complaining about it)
jelmervdl Feb 21, 2022
8cc695b
Merge branch 'main' into html-improvements
jerinphilip Feb 22, 2022
48cfc00
Bit more high level documentation on how HTML class works.
jelmervdl Feb 22, 2022
a81dfdf
Remark about 'taint'
jelmervdl Feb 22, 2022
bbfa4e3
Fix the constructor situation
jelmervdl Feb 22, 2022
ea10e91
Add accidentally removed private methods back to header
jelmervdl Feb 22, 2022
76 changes: 76 additions & 0 deletions src/tests/units/html_tests.cpp
@@ -172,6 +172,16 @@ TEST_CASE("Do not abort if the input is just empty element") {
  CHECK(response.target.text == "<p></p>");
}

TEST_CASE("Tag names are case insensitive") {
  // Tests <P> vs </p> and <BR> should be recognized as a void tag <br>.
  // <B> should be recognized as inline.
  std::string test_str("<P><B>Spa</B>ce<BR>please?</p>");

  std::string input(test_str);
  HTML html(std::move(input), true);
  CHECK(input == "Spa ce\n\nplease?");
}
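
The test above depends on tag names being matched without regard to letter case. As a rough illustration only, and not the code from this PR, such a comparison could be sketched as follows. `equalsCaseInsensitive` and `isVoidTag` are hypothetical names, and the list of void tags is a small illustrative subset rather than the set the `HTML` class uses.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string_view>

// Hypothetical helper (not taken from this PR): compares two ASCII tag names
// without regard to case, so "BR" and "br" count as the same tag.
inline bool equalsCaseInsensitive(std::string_view lhs, std::string_view rhs) {
  return lhs.size() == rhs.size() &&
         std::equal(lhs.begin(), lhs.end(), rhs.begin(), [](unsigned char a, unsigned char b) {
           return std::tolower(a) == std::tolower(b);
         });
}

// Hypothetical helper: true if `name` is one of a few void elements.
// The list here is an assumption made for the example.
inline bool isVoidTag(std::string_view name) {
  constexpr std::string_view kVoidTags[] = {"br", "hr", "img", "wbr"};
  return std::any_of(std::begin(kVoidTags), std::end(kVoidTags),
                     [&](std::string_view tag) { return equalsCaseInsensitive(name, tag); });
}
```

With these helpers, `isVoidTag("BR")` and `isVoidTag("br")` give the same answer, which is the property the test case checks at the level of the whole parser.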

TEST_CASE("Test case html entities") {
  // These are all entities I would expect in innerHTML, since all other entities
  // can be encoded as UTF-8 so there's no need to encode them through &...; when
@@ -618,6 +628,72 @@ TEST_CASE("Test comment") {
  CHECK(response.target.text == test_str);
}

TEST_CASE("Test <wbr> element") {
  std::string test_str("hel<wbr>lo");

  std::string input(test_str);
  HTML html(std::move(input), true);
  CHECK(input == "hello");
}

TEST_CASE("Test <wbr> element (case-insensitive)") {
  std::string test_str("hel<WBR>lo");

  std::string input(test_str);
  HTML html(std::move(input), true);
  CHECK(input == "hello");
}
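
Both `<wbr>` tests expect the tag to disappear without any whitespace taking its place, so that the word stays in one piece. A minimal sketch of that behaviour, assuming only the exact attribute-free form `<wbr>` needs handling, might look like this. `stripWbr` is a hypothetical name and this is not how the `HTML` class is implemented.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: removes every "<wbr>" tag (in any letter case) and
// inserts nothing in its place. Only the attribute-free form is recognised.
inline std::string stripWbr(std::string_view html) {
  constexpr std::string_view kTag = "<wbr>";
  std::string out;
  out.reserve(html.size());
  std::size_t pos = 0;
  while (pos < html.size()) {
    // Compare the next kTag.size() characters against "<wbr>", ignoring case.
    bool matches = html.size() - pos >= kTag.size();
    for (std::size_t i = 0; matches && i < kTag.size(); ++i) {
      matches = std::tolower(static_cast<unsigned char>(html[pos + i])) == kTag[i];
    }
    if (matches) {
      pos += kTag.size();  // Drop the tag; no whitespace is added.
    } else {
      out.push_back(html[pos++]);
    }
  }
  return out;
}
```

Other tags pass through this sketch untouched, so it only demonstrates the "no space inserted" rule and nothing else about the parser.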

TEST_CASE("Test ignored element (nested)") {
  std::string test_str("foo <var><var>nested</var></var> bar");
  std::string expected_str("foo <var><var>nested</var></var>bar");

  std::string input(test_str);
  HTML html(std::move(input), true);
  CHECK(input == "foo  bar");

  Response response;
  std::string sentence_str("foo  bar");
  std::vector<string_view> sentence{
      string_view(sentence_str.data() + 0, 3),  // foo
      string_view(sentence_str.data() + 3, 1),  // _
      string_view(sentence_str.data() + 4, 4),  // _bar
      string_view(sentence_str.data() + 8, 0),  // ""
  };
  response.source.appendSentence("", sentence.begin(), sentence.end());
  response.target.appendSentence("", sentence.begin(), sentence.end());
  response.alignments = {identity_matrix<float>(4)};

  html.restore(response);
  CHECK(response.source.text == expected_str);
  CHECK(response.target.text == expected_str);
}
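
The nested case matters because an ignored element has to be consumed up to its matching closing tag, not the first closing tag encountered. One way to picture this is a depth counter, sketched below under the assumption that only lowercase, attribute-free `<var>` tags occur. `stripVarElements` is a hypothetical name and the sketch is not the PR's `consumeIgnoredTag`.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: drops everything from a <var> up to its matching
// </var>, counting depth so that nested <var> elements are consumed as one unit.
inline std::string stripVarElements(std::string_view html) {
  constexpr std::string_view kOpen = "<var>";
  constexpr std::string_view kClose = "</var>";
  std::string out;
  std::size_t depth = 0;  // Number of currently open <var> elements.
  std::size_t pos = 0;
  while (pos < html.size()) {
    std::string_view rest = html.substr(pos);
    if (rest.substr(0, kOpen.size()) == kOpen) {
      ++depth;
      pos += kOpen.size();
    } else if (depth > 0 && rest.substr(0, kClose.size()) == kClose) {
      --depth;
      pos += kClose.size();
    } else {
      if (depth == 0) out.push_back(html[pos]);  // Keep text outside ignored elements.
      ++pos;
    }
  }
  return out;
}
```

Under this sketch the spaces on either side of the removed element are both kept, so the remaining text contains two consecutive spaces between `foo` and `bar`.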

TEST_CASE("Test ignored element (with entity)") {
  std::string test_str("foo <var>&amp;</var> bar");
  std::string expected_str("foo <var>&amp;</var>bar");

  std::string input(test_str);
  HTML html(std::move(input), true);
  CHECK(input == "foo  bar");

  Response response;
  std::string sentence_str("foo  bar");
  std::vector<string_view> sentence{
      string_view(sentence_str.data() + 0, 3),  // foo
      string_view(sentence_str.data() + 3, 1),  // _
      string_view(sentence_str.data() + 4, 4),  // _bar
      string_view(sentence_str.data() + 8, 0),  // ""
  };
  response.source.appendSentence("", sentence.begin(), sentence.end());
  response.target.appendSentence("", sentence.begin(), sentence.end());
  response.alignments = {identity_matrix<float>(4)};

  html.restore(response);
  CHECK(response.source.text == expected_str);
  CHECK(response.target.text == expected_str);
}
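
The hand-written token views in these tests are built from raw offsets into `sentence_str`, so the offsets and lengths have to add up to exactly the length of the string. A small sanity check for that, sketched here with `std::string_view` (the tests themselves use the project's own `string_view` type), could look as follows. `tokensTile` is a hypothetical helper and not part of the test suite.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical check: true if `tokens` are back-to-back views into `text`
// that start at its first byte and end exactly at its last byte.
inline bool tokensTile(std::string_view text, const std::vector<std::string_view> &tokens) {
  std::size_t offset = 0;
  for (std::string_view token : tokens) {
    if (token.data() != text.data() + offset) return false;  // Gap or overlap.
    if (token.size() > text.size() - offset) return false;   // Runs past the end.
    offset += token.size();
  }
  return offset == text.size();
}
```

For an eight-character string, views at offsets 0, 3, 4 and 8 with lengths 3, 1, 4 and 0 pass this check, while the same offsets against a seven-character string would not.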

TEST_CASE("End-to-end translation", "[!mayfail]") {
  std::string input("<p>I <b>like</b> to <u>drive</u> this car.</p>");
  HTML html(std::move(input), true);