Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminal double quote is transformed to open double quote character #12

Closed
4 tasks done
rgov opened this issue Aug 23, 2024 · 5 comments
Closed
4 tasks done

Terminal double quote is transformed to open double quote character #12

rgov opened this issue Aug 23, 2024 · 5 comments
Labels
💪 phase/solved Post is done 🦋 type/enhancement This is great to have

Comments

@rgov
Copy link
Contributor

rgov commented Aug 23, 2024

Initial checklist

Affected packages and versions

6.1.0

Steps to reproduce

I am transforming text that contains a parenthetical quote written in quotation marks, like: ("This is a test."). The output contains two opening double quotes. The second quote should be a closing double quote.

As a more minimal input ("A.") reproduces the issue. Interestingly, removing the period causes it to output correctly.

import {retext} from 'retext'
import retextSmartypants from 'retext-smartypants'

const file = await retext()
  .use(retextSmartypants)
  .process('("A.")')

console.log(
  Buffer.from(
    (new TextEncoder()).encode(file.value)
  ).toString('hex')
)

I've encoded the resulting output as hex-encoded-UTF-8 to minimize the chance of macOS helpfully correcting anything for me.

The above code outputs 28e2809c412ee2809c29, showing that both the quotation marks were transformed to the same sequence, e2809c, an opening double quote.

SmartyPants 1.5.1 transforms the text correctly:

$ echo '("A.")' | perl SmartyPants.pl -1
(“A.”)

Affected runtime and version

[email protected]

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Aug 23, 2024
@wooorm
Copy link
Member

wooorm commented Aug 23, 2024

Heya! Definitely possible. It‘s been ages since I looked into this algorithm.

Here’s the code:

function quotesDefault(state, node, index, parent) {

The SmartyPants pl code seems to be around L442 there.

As you have that running, perhaps you are able to add some print statements there, to debug which steps are doing the transform, and then see what’s missing in the code here?

@rgov
Copy link
Contributor Author

rgov commented Aug 23, 2024

Thanks. Normally, if you write say ("This is a test"), the closing quote is converted with this case:

} else if (
previous &&
previous.type !== 'WhiteSpaceNode' &&
previous.type !== 'SymbolNode' &&
previous.type !== 'PunctuationNode'
) {
// Closing quotes.
node.value = state.close[quoteIndex]

But, in my example ("This is a test."), the conditions aren't met and it falls through to the default case (opening quote). The condition isn't met because the previous node is a PunctuationNode.

{
  type: 'PunctuationNode',
  value: '.',
  position: {
    start: { line: 1, column: 17, offset: 16 },
    end: { line: 1, column: 18, offset: 17 }
  }
}

The part of the original Perl that transforms that character is here:

    my $close_class = qr![^\ \t\r\n\[\{\(\-]!;

    # Double closing quotes:
    s {
        ($close_class)?
        "
        (?(1)|(?=\s))   # If $1 captured, then do nothing;
                           # if not, then make sure the next char is whitespace.
    } {$1”}xg;

The logic between the two is not equivalent.

  1. The Perl does not consider only punctuation or symbol characters — $close_class is any character excluding whitespace, hyphens, and opening brackets ([{.
  2. The Perl does not put any constraints on the previous character, or even if such a character exists.
  3. But if there is a previous character, and it does not belong to $close_class, then the next character must be whitespace.

Note there is a second place where the Perl code outputs a double closing quote, which is:

    # Special case if the very first character is a quote
    # followed by punctuation at a non-word-break. Close the quotes by brute force:
    s/^"(?=$punct_class\B)/”/;

In retext-smartypants that's here, but again the translation differs. I don't think nextNext.type !== 'WordNode' is semantically equivalent to \B.

if (
next &&
nextNext &&
(next.type === 'PunctuationNode' || next.type === 'SymbolNode') &&
nextNext.type !== 'WordNode'
) {
// Special case if the very first character is a quote followed by
// punctuation at a non-word-break. Close the quotes by brute force.
node.value = state.close[quoteIndex]

I think anywhere the Perl code uses $punct_class you're clear to use node.type === 'PunctuationNode' || node.type === 'SymbolNode' but where the Perl code uses other classes it's not equivalent.

@wooorm
Copy link
Member

wooorm commented Aug 23, 2024

nice research!

want to work on a PR, test out whether that change is enough?

@rgov
Copy link
Contributor Author

rgov commented Aug 23, 2024

I interpret the meaning of \B as basically "the following character is the same type as this one" where "type" means "word" or "non-word":

  if (
    next &&
    nextNext &&
    (next.type === 'PunctuationNode' || next.type === 'SymbolNode') &&
    ((next.type === 'WordNode' && nextNext.type === 'WordNode') ||
     (next.type !== 'WordNode' && nextNext.type !=='WordNode'))
  )

But there's a weird hitch where non-word at the end of the string matches ("%" =~ /.\B/), because we pretend there's a non-word at the end of the string:

\b and \B assume there's a non-word character before the beginning and after the end of the source string; so \b will match at the beginning (or end) of the source string if the source string begins (or ends) with a word character. Otherwise, \B will match.

Translating that, I think it would be:

  if (
    next &&
    (next.type === 'PunctuationNode' || next.type === 'SymbolNode') &&
    (!nextNext || nextNext.type !== 'WordNode')
  )

Here are some test cases:

Test case Explanation SmartyPants 1.5.1 retext-smartypants 6.1.0 Suggested fix
"~a Quote, punctuation, word break “~a “~a “~a
"~* Quote, punctuation, non-word-break, different symbols ”~* ”~* ”~*
"~~a Quote, punctuation, non-word-break, same symbol ”~~a “~~a “~~a
"~ Quote, punctuation, end-of-string ”~ “~ ”~

The third test case still fails because repeated occurrences of the same symbol are combined into a single node. How do you recommend we get the desired behavior? Something with next.value.length?

@wooorm wooorm closed this as completed in b4629a7 Aug 26, 2024
@wooorm wooorm added 🦋 type/enhancement This is great to have 💪 phase/solved Post is done labels Aug 26, 2024
@github-actions github-actions bot removed the 🤞 phase/open Post is being triaged manually label Aug 26, 2024
@wooorm
Copy link
Member

wooorm commented Aug 26, 2024

thanks, released! https://github.com/retextjs/retext-smartypants/releases/tag/6.1.1

I don’t really care for the "~~a difference. SmartyPants is buggy. This uses ASTs and can be smarter. If there was a good real case, then maybe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💪 phase/solved Post is done 🦋 type/enhancement This is great to have
Development

No branches or pull requests

2 participants