Not accurate source language autodetection #74

Open
joeperpetua opened this issue Jan 22, 2023 · 14 comments

Comments

@joeperpetua

Hi!
First of all wanted to say that I love the project, have been using it for a while now.

I came across some bizarre behavior that maybe you could check or maybe explain to me (I tried checking the source code for the functions but did not see anything relevant that could be causing this).

In this case, it seems that the source language autodetection is a bit off when giving it short and single words. I reproduced it with Spanish, but I don't know if it does happen in other languages too.
In this case, if you give the words "casa" or "hola" for example, it will detect the source language as English instead of Spanish.

For example using the base translator:

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import translatepy
>>> translatepy.Translator().language("casa")
LanguageResult(service=Google, source=casa, result=eng)

Then I tried using the translators explicitly, in this case Reverso and Google, then using the base translator again, and it worked correctly (I guess because of the cache, but I may be wrong):

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import translatepy
>>> translatepy.translators.reverso.ReversoTranslate().language("casa")
LanguageResult(service=Reverso, source=casa, result=spa)
>>> translatepy.translators.google.GoogleTranslate().language("casa")
LanguageResult(service=Google, source=casa, result=spa)
>>> translatepy.Translate().language("casa")
LanguageResult(service=Google, source=casa, result=spa)

But interestingly enough, then, in the same session, using the base translator with the method translate(), the detection was off again:

>>> translatepy.Translate().translate("casa", "en")
TranslationResult(service=Google, source=casa, source_language=eng, destination_language=eng, result=casa)

Any idea why this could be happening? I guess the workaround for now would be to run the GoogleTranslate().language() method first, and then pass its result to the Translate().translate() method to get accurate results, like so:

>>> lang = translatepy.translators.google.GoogleTranslate().language("casa")
>>> translatepy.Translate().translate("casa", "en", lang.result)
TranslationResult(service=Google, source=casa, source_language=spa, destination_language=eng, result=house)
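
The workaround above boils down to "detect with one service, then pass the source language explicitly". Here is a minimal offline sketch of that pattern. `translate_with_explicit_source`, `fake_detect` and `fake_translate` are hypothetical names made up for illustration and are not part of translatepy; in practice the two callables would wrap the real `language()` and `translate()` calls shown above.

```python
from typing import Callable


def translate_with_explicit_source(
    text: str,
    destination: str,
    detect: Callable[[str], str],
    translate: Callable[[str, str, str], str],
) -> str:
    """Detect the source language with one service, then hand it to another."""
    source_language = detect(text)
    return translate(text, destination, source_language)


# Offline stand-ins for the two services (hypothetical, for illustration only).
def fake_detect(text: str) -> str:
    return "spa"


def fake_translate(text: str, destination: str, source: str) -> str:
    known = {("casa", "spa", "eng"): "house"}
    # Unknown combinations fall back to the unchanged input.
    return known.get((text, source, destination), text)


result = translate_with_explicit_source("casa", "eng", fake_detect, fake_translate)
print(result)
```

Keeping detection and translation as separate callables means the detecting service can be swapped without touching the translation step.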

Anyway, wanted to ask about this and see if there is any reasoning behind it.
Sorry for the long message and thanks in advance!

@ZhymabekRoman
Contributor

Thanks for reporting this! This is strange: in my case even the GoogleTranslate class doesn't recognize the language correctly. The problem seems to be on Google's server side.

translate git:(main) ipython3
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import translatepy

In [2]:  translatepy.translators.google.GoogleTranslateV1().language("casa")
Out[2]: LanguageResult(service=Google, source=casa, result=eng)
translate git:(main) ipython3
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import translatepy

In [2]:  translatepy.translators.google.GoogleTranslateV2().language("casa")
Out[2]: LanguageResult(service=Google, source=casa, result=eng)
translate git:(main) ipython3
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import translatepy

In [2]:  translatepy.translators.google.GoogleTranslate().language("casa")
Out[2]: LanguageResult(service=Google, source=casa, result=eng)

@joeperpetua
Author

Thanks for the response!
I experimented a little more, and it does seem that Google Translate is the issue.
Also, it seems that the first response influences the subsequent results. For example, I used GoogleTranslate() first and got result=eng. But when I then used Reverso, the result was the same as the one from Google:

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import translatepy
>>> translatepy.translators.google.GoogleTranslate().language("casa")
LanguageResult(service=Google, source=casa, result=eng)
>>> translatepy.translators.reverso.ReversoTranslate().language("casa")
LanguageResult(service=Reverso, source=casa, result=eng)

But, if you use Reverso first, then the result will be correct when using Google Translate:

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import translatepy
>>> translatepy.translators.reverso.ReversoTranslate().language("casa")
LanguageResult(service=Reverso, source=casa, result=spa)
>>> translatepy.translators.google.GoogleTranslate().language("casa")
LanguageResult(service=Google, source=casa, result=spa)

Could this be related to the cache mechanism?

@Animenosekai
Owner

> (I guess because of the cache, but I may be wrong)

Yes, I would guess the same !

> But interestingly enough, then, in the same session, using the base translator with the method translate(), the detection was off again

This is normal, because some translators, such as Google Translate, already return the source language from their translation endpoint, while others need to call the language endpoint first.

So, even if you called the language endpoint first with Google Translate, the source language would be the one returned by the translation endpoint.
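
To make the difference concrete, here is a toy sketch of the two kinds of services. Both classes and their hard-coded answers are invented for illustration; they are not translatepy classes and make no network calls.

```python
class DetectsWhileTranslating:
    """Toy service whose translation endpoint reports the source language itself."""

    def language(self, text: str) -> str:
        # Stand-in for the dedicated language endpoint.
        return "spa"

    def translate(self, text: str, destination: str) -> dict:
        # Stand-in for the translation endpoint: it runs its own detection,
        # so an earlier language() call has no influence on this answer.
        return {"source_language": "eng", "result": text}


class NeedsLanguageCall:
    """Toy service whose translation endpoint needs the source language up front."""

    def language(self, text: str) -> str:
        return "spa"

    def translate(self, text: str, destination: str) -> dict:
        source = self.language(text)  # detection goes through language()
        is_known = (text, source, destination) == ("casa", "spa", "eng")
        return {"source_language": source, "result": "house" if is_known else text}


first = DetectsWhileTranslating()
first.language("casa")  # returns "spa", but translate() below never looks at it
print(first.translate("casa", "eng"))

second = NeedsLanguageCall()
print(second.translate("casa", "eng"))
```

With the first kind of service, the only way to force a source language is to pass it explicitly to the translation call.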

The weirdest thing is that Google Translate returned Spanish though.

Looking at the official website, we see that indeed the detected language is English

[Screenshot of the Google Translate website]

@Animenosekai
Owner

> Also, it seems that the first response will influence the subsequent results

Now this is weird, because it shouldn't lol

This is the part where the GET cache is returned

_cache_key = str(url) + str(kwargs)
if _cache_key in self.GETCACHE and time() - self.GETCACHE[_cache_key]["timestamp"] < self.cache_duration:
    return self.GETCACHE[_cache_key]["response"]
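
For context, a self-contained sketch of how a time-bounded GET cache along these lines could behave. Only `GETCACHE`, `cache_duration` and the `_cache_key` expression come from the excerpt above; the `TimedGetCache` class, the injectable `clock`, and `fake_fetch` are assumptions made so the example runs offline.

```python
from time import time


class TimedGetCache:
    """Sketch of a GET cache whose entries expire after cache_duration seconds."""

    def __init__(self, cache_duration: float = 2.0, clock=time):
        self.cache_duration = cache_duration
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self.GETCACHE = {}

    def get(self, url, fetch, **kwargs):
        _cache_key = str(url) + str(kwargs)
        entry = self.GETCACHE.get(_cache_key)
        if entry is not None and self.clock() - entry["timestamp"] < self.cache_duration:
            return entry["response"]  # still fresh: no request is made
        response = fetch(url, **kwargs)
        self.GETCACHE[_cache_key] = {"timestamp": self.clock(), "response": response}
        return response


now = [0.0]  # fake clock value, in seconds
calls = []   # records every simulated request


def fake_fetch(url, **kwargs):
    calls.append(url)
    return f"response #{len(calls)}"


cache = TimedGetCache(cache_duration=2.0, clock=lambda: now[0])
first = cache.get("https://example.com/detect", fake_fetch, q="casa")
second = cache.get("https://example.com/detect", fake_fetch, q="casa")  # served from cache
now[0] = 5.0  # move past the cache duration
third = cache.get("https://example.com/detect", fake_fetch, q="casa")   # fetched again
print(first, second, third, len(calls))
```

Note that the key is built only from the URL and the keyword arguments, so two requests that differ in anything else would share an entry.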

For the translator cache, here is the part where it gets the cache

if _cache_key in self._languages_cache:
    # Taking the values from the cache
    language = self._languages_cache[_cache_key]

But that's weird because we clearly see that you are creating two different instances of the Translator class

>>> translatepy.translators.reverso.ReversoTranslate().language("casa")
LanguageResult(service=Reverso, source=casa, result=spa)
>>> translatepy.translators.google.GoogleTranslate().language("casa")
LanguageResult(service=Google, source=casa, result=spa)

@joeperpetua
Author

Well, just found a very interesting behavior (or bug) from Google Translate.
It seems that it will detect a different language depending on the language of your Google account. For example:
Google account (GA) in English | detects English:
[screenshot]
Google account in Spanish | detects Spanish:
[screenshot]
Google account in French and German | detects Portuguese:
[screenshot]
[screenshot]
From this, I guess the best approach would be to just clean the cache on the production server and then use Reverso to get the language and pass it explicitly.

@Animenosekai
Owner

> Well, just found a very interesting behavior (or bug) from Google Translate.
> It seems that it will detect a different language depending on the language of your Google account. For example:

Wow now that's interesting...

I guess it might be a feature to better guess the expected result.

@Animenosekai
Owner

Animenosekai commented Jan 23, 2023

But then it might change the result based on the service URL used 🤔

@Animenosekai
Owner

Just confirmed it:

>>> from translatepy.translators.google import GoogleTranslate
>>> g = GoogleTranslate(service_url="translate.google.es")
>>> g.language("casa")
LanguageResult(service=Google, source=casa, result=spa)
>>> g = GoogleTranslate(service_url="translate.google.fr")
>>> g.language("casa")
LanguageResult(service=Google, source=casa, result=spa)
>>> g.clean_cache()
>>> g.language("casa")
LanguageResult(service=Google, source=casa, result=por)

And yes something is happening with the caches

@joeperpetua
Author

joeperpetua commented Jan 23, 2023

> But that's weird because we clearly see that you are creating two different instances of the Translator class

Well, that is interesting indeed, I would have totally blamed it on the cache to be honest lol

> I guess it might be a feature to guess better the expected result.
> But then it might change the result based on the service URL used 🤔

Yeah, but I think it kinda makes sense for words that are the same in different languages. For example, casa is the same in Spanish, Portuguese and Italian, so if your GA is set to Italian, the detection will go with Italian:
[screenshot]

@joeperpetua
Author

> And yes something is happening with the caches

Well, that is something lol. I tried checking the source code before, but my Python skills are not that sharp 😅 Maybe you have a better eye to catch what's going on lol

@ZhymabekRoman
Contributor

ZhymabekRoman commented Jan 24, 2023

> And yes something is happening with the caches

It's not a bug, it's a feature. When I designed the V2 translatepy architecture, I made a single cache instance available to all BaseTranslator class instances. In practice, it doesn't seem to be a good idea. If required, I can make a PR to fix this and integrate the new LRU cache logic (#58).

class BaseTranslator(ABC):
    """
    Base abstract class for a translate service
    """
    _translations_cache = LRUDictCache()
    _transliterations_cache = LRUDictCache()
    _languages_cache = LRUDictCache()
    _spellchecks_cache = LRUDictCache()
    _examples_cache = LRUDictCache()
    _dictionaries_cache = LRUDictCache()
    _text_to_speeches_cache = LRUDictCache(8)

The caches are initialized as class attributes, not instance attributes, so every instance of every subclass shares them. More info: https://stackoverflow.com/a/207128/13452914
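
A minimal, self-contained sketch of the class-attribute effect, using plain dicts instead of `LRUDictCache`. All class names below are toy stand-ins, not translatepy classes, and the cache is keyed on the text alone as a simplifying assumption.

```python
class SharedCacheTranslator:
    # Class attribute: a single dict shared by every instance and every subclass.
    _languages_cache = {}

    def __init__(self, detected: str):
        self.detected = detected  # what this toy "service" would answer

    def language(self, text: str) -> str:
        if text in self._languages_cache:
            return self._languages_cache[text]
        self._languages_cache[text] = self.detected
        return self.detected


class ToyGoogle(SharedCacheTranslator):
    pass


class ToyReverso(SharedCacheTranslator):
    pass


class PerInstanceCacheTranslator(SharedCacheTranslator):
    def __init__(self, detected: str):
        super().__init__(detected)
        # Instance attribute: shadows the class-level dict for this object only.
        self._languages_cache = {}


shared_first = ToyGoogle("eng").language("casa")
shared_second = ToyReverso("spa").language("casa")  # reuses the entry stored above

own_first = PerInstanceCacheTranslator("eng").language("casa")
own_second = PerInstanceCacheTranslator("spa").language("casa")
print(shared_first, shared_second, own_first, own_second)
```

In the shared case the second "service" repeats whatever the first one stored, which matches the order-dependent results reported earlier in this thread. Creating the dict in `__init__` gives each instance its own cache.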

@Animenosekai
Owner

> When I designed the V2 translatepy architecture, I made a single cache instance available to all BaseTranslator class instances.

Yes, I think this should be changed because people using translators separately expect different results from each instance.

Moreover, if they want a shared cache, they might just use the Translate class.
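
As a rough sketch of that split, here each service keeps its own cache while an aggregate object holds one cache of its own. `ToyService` and `ToyAggregate` are hypothetical and say nothing about how the real Translate class is implemented.

```python
class ToyService:
    """Toy translator with its own, per-instance language cache."""

    def __init__(self, name: str, detected: str):
        self.name = name
        self.detected = detected
        self._languages_cache = {}

    def language(self, text: str) -> str:
        if text not in self._languages_cache:
            self._languages_cache[text] = self.detected
        return self._languages_cache[text]


class ToyAggregate:
    """Toy aggregate: asks its first service and keeps one cache of its own."""

    def __init__(self, services):
        self.services = list(services)
        self._languages_cache = {}

    def language(self, text: str) -> str:
        if text not in self._languages_cache:
            # First service wins in this sketch; failure handling is left out.
            self._languages_cache[text] = self.services[0].language(text)
        return self._languages_cache[text]


google = ToyService("google", "eng")
reverso = ToyService("reverso", "spa")
aggregate = ToyAggregate([reverso, google])

print(google.language("casa"), reverso.language("casa"), aggregate.language("casa"))
```

Used separately, the two services keep their own answers; only callers that go through the aggregate see a shared result.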

Also yea you can PR the new LRU logic anytime you want !

@joeperpetua
Author

Thank you all guys for the help 🙌🙌

@ZhymabekRoman
Contributor

ZhymabekRoman commented Jan 24, 2023

New PR done: #76

translate git:(main) ipython
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from translatepy.translators.google import GoogleTranslate

In [2]: g = GoogleTranslate(service_url="translate.google.es")

In [3]: g.language("casa")
Out[3]: LanguageResult(service=Google, source=casa, result=spa)

In [4]: g = GoogleTranslate(service_url="translate.google.fr")

In [5]: g.language("casa")
Out[5]: LanguageResult(service=Google, source=casa, result=por)
