Links and hashtags seem to change after translation #93

Open
reddere opened this issue Aug 15, 2023 · 10 comments

Comments

@reddere

reddere commented Aug 15, 2023

When using GoogleTranslate(), the upper- and lower-case letters in links get changed seemingly at random. How can I fix this?

@Animenosekai
Owner

Do you have an example to reproduce ?

@reddere
Author

reddere commented Aug 24, 2023

Do you have an example to reproduce ?

Absolutely @Animenosekai! Here is a text I got from a tweet. Notice how the letters in both hashtags and in the tweet link are altered. In the second hashtag, a letter even gets added out of nowhere.

from translatepy.translators.google import GoogleTranslate 

text = 'Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n#Fortnite #FortniteLastResort https://t.co/m1cE9sSrNb'

translator = GoogleTranslate()

italian_text = translator.translate(text, 'Italian')

print(italian_text)

Result:
Kado Thorne è un vampiro e ha viaggiato nel tempo dal 2020 quando apparve nell'oro della pelle.\n\n#FORTNITE #FORTNITLelasTResort https://t.co/M1ce9SSRNB

Even though the normal text got translated fine, the hashtags and the link were altered:

  • Hashtag n.1 went from #Fortnite to #FORTNITE (case changes)
  • Hashtag n.2 went from #FortniteLastResort to #FORTNITLelasTResort (case changes, a misplaced letter E, and "Last" somehow got distorted into "Lelas", which doesn't mean anything in Italian)
  • Link went from https://t.co/m1cE9sSrNb to https://t.co/M1ce9SSRNB. This alteration breaks the link entirely.

Any ideas on how to fix this?

@Animenosekai
Owner

Parsing with a regex, maybe?

@reddere
Author

reddere commented Aug 27, 2023

What do you mean? Are there params I can pass to the GoogleTranslate() instance that let me hide parts of the passed text using a regex?

@Animenosekai
Owner

What do you mean? Are there params I can pass to the GoogleTranslate() instance that let me hide parts of the passed text using a regex?

Nope, not for now, but should I?

Here is the major problem that comes with this, and with HTML translation, though:

#71 (comment)

TLDR: It might work for Latin-based languages, but different languages have different structures, and the order of words might need to change from one language to another. (This is also one of the reasons why, when we translate, we don't translate each word individually and then put the pieces back together.)

@reddere
Author

reddere commented Aug 28, 2023

Yeah, I mean implementing what I described would actually make it way better. The issue you mentioned is somewhat related, and it's easily fixable by just adding a space after dots or commas in the final result where one is missing. But implementing a regex, or any other way to hide certain parts of the text, would be awesome, since those parts frequently get altered.

@Animenosekai
Owner

Animenosekai commented Aug 28, 2023

Yes, this issue might be easier to handle than normal translations, as links don't exactly mean anything and don't need to be translated.

But here is the problem:

First, it is not possible to translate the parts separately, because that might not give the best translation (words carry different meanings as part of a whole than they do individually). Also, as said before, there is no telling whether the position of the link should change, so we can't just pin the position of the link and put it back there after the translation:

(French) Je voudrais changer le lien https://google.com parce qu'il me semble y avoir trouvé une erreur
(Japanese) https://google.comのリンクに問題があると思うから変えたいです

Notice how the position of the link changes.

Now, if we let the translator translate everything and it ends up having issues with the links, we might want to find the link in the translated text and replace it with the previous one.

Something like this would be imaginable:

def link_correction(translated_text: str, links: list[str]) -> str:
    """A simple link correction function to keep the same links as before translation"""
    for link in links:
        # try to find the link in the translated text, ignoring case
        index = translated_text.lower().find(link.lower())
        if index == -1:
            continue  # not found (altered beyond a case change): leave the text as is
        # replace the altered link with the one from before translation
        translated_text = translated_text[:index] + link + translated_text[index + len(link):]
    return translated_text

Note
This is an oversimplification of what could be done

Now, as you mentioned previously:

Link went from https://t.co/m1cE9sSrNb to https://t.co/M1ce9SSRNB. This alteration breaks the link entirely.

So if two links are identical once lowercased, they might both be replaced by the same link.
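One way to soften that collision, as a rough sketch only: scan the translated text for URL-like tokens and hand each one the first not-yet-used original link that matches it case-insensitively. The names `restore_links` and `URL_PATTERN` are hypothetical (nothing like this exists in translatepy), and the URL pattern is deliberately naive.

```python
import re

# Naive URL pattern (an assumption for illustration; it will also swallow
# trailing punctuation, so real input needs a more careful pattern)
URL_PATTERN = re.compile(r"https?://\S+")


def restore_links(translated_text: str, links: list[str]) -> str:
    """Put the original links back, using each original at most once."""
    remaining = list(links)  # originals that have not been put back yet

    def put_back(match: re.Match) -> str:
        found = match.group(0)
        for i, original in enumerate(remaining):
            if original.lower() == found.lower():
                return remaining.pop(i)  # consume this original
        return found  # no original matches: leave the token untouched

    return URL_PATTERN.sub(put_back, translated_text)
```

Two links that only differ by case are then restored in their original order of appearance, which is still a guess if the translation reordered them.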


Now, what should I do?

  • Should I implement something that takes a regex, splits the original text with it, translates each part individually and puts the pieces back together at the end? That leaves the matched parts untouched, but it comes with the first issue mentioned above.
  • Should I implement the oversimplified algorithm written above?
  • Should I also implement adding back the spaces after dots? This would only work for languages that put spaces after dots (Latin-based ones, for example) and might break the others.
  • Also, what if, for some reason, the user wants to translate the links?

Note
Even though I'm only talking about links here, the same thing applies to hashtags, except that hashtags are even harder to correct after the translation, as they might carry some meaning and might need to be translated.
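For reference, the first option in the list above could look roughly like this. It is a minimal sketch, not a proposed API: `translate_around` is a hypothetical name, and `translate` is any callable taking and returning a string, standing in for the real translator call.

```python
import re
from typing import Callable


def translate_around(text: str, pattern: str, translate: Callable[[str], str]) -> str:
    """Translate everything except the parts of `text` matching `pattern`.

    `pattern` must not contain capturing groups of its own.
    """
    # Wrapping the pattern in one capturing group makes re.split keep the
    # matches: even indices are ordinary text, odd indices are protected parts.
    pieces = re.split(f"({pattern})", text)
    result = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1 or not piece.strip():
            result.append(piece)  # protected part or pure whitespace: keep as is
        else:
            result.append(translate(piece))  # ordinary text: translate it
    return "".join(result)


# Stand-in "translator" so the example runs offline
print(translate_around("hola #Fortnite https://t.co/m1cE9sSrNb",
                       r"https?://\S+|#\w+", str.upper))
```

Each piece is translated without its neighbours, so the context and word-order problem described above still applies.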

@Animenosekai changed the title from "Alterating links" to "Links and hashtags seem to change after translation" on Aug 29, 2023
@Animenosekai pinned this issue on Aug 29, 2023
@ZhymabekRoman
Contributor

@reddere, use GoogleTranslateV2 and wrap all your "static" links/hashtags in a dedicated span tag:

<span class="notranslate">TAGS OR LINKS THERE</span>

For more information visit: https://cloud.google.com/translate/troubleshooting

In [5]: from translatepy.translators.google import GoogleTranslateV2

In [6]: dl = GoogleTranslateV2()

In [9]: dl.translate('Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>', 'it')
Out[9]: TranslationResult(service=Translator(Google), source='Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>', source_lang=Language(Spanish), dest_lang=Language(Italian), translation='Kado Thorne è un vampiro e ha viaggiato indietro nel tempo a partire dall\'anno 2020 quando gli è stata presentata la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>')
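The wrapping does not have to be done by hand. Below is a small sketch, assuming a naive pattern for hashtags and links, that adds the spans before translation and strips them afterwards; `protect`, `unprotect`, `PROTECTED` and `SPAN` are hypothetical helper names, not part of translatepy.

```python
import re

# Naive pattern for links and hashtags (an assumption; adjust it to your input)
PROTECTED = re.compile(r"https?://\S+|#\w+")
# Matches the span added by protect() and captures its content
SPAN = re.compile(r'<span class="notranslate">(.*?)</span>')


def protect(text: str) -> str:
    """Wrap every link and hashtag in a notranslate span."""
    return PROTECTED.sub(
        lambda m: f'<span class="notranslate">{m.group(0)}</span>', text
    )


def unprotect(text: str) -> str:
    """Remove the notranslate spans, keeping their content."""
    return SPAN.sub(r"\1", text)


wrapped = protect("#Fortnite #FortniteLastResort https://t.co/m1cE9sSrNb")
print(wrapped)             # pass this to the translator instead of the raw text
print(unprotect(wrapped))  # apply this to the translated text
```

This assumes the translator returns the span markup unchanged, as in the output above; if it rewrites the tags, `unprotect` will not find them.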

@reddere
Author

reddere commented Sep 11, 2023

Thank you so much @ZhymabekRoman @Animenosekai. I haven't tested the workaround yet; I kept my old GoogleTranslator until just 2 days ago, when I tried the ReversoTranslator, which seems to me to work even better than GoogleTranslator. In Italian it works decently, both lexically and in its choice of words.

Somehow, though, I found an issue with that one as well: it throws an error when words like única are in the source text. I think it's better to open a separate issue for that one: #96

@Animenosekai
Owner

Animenosekai commented Sep 11, 2023

I was talking with Venom on Discord about possible workarounds and about supporting notranslate, or other HTML-parsing ways of leaving certain parts of a given input untranslated. I might consider this soon.


3 participants