Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract_number regression - should return the last number in the utterance, is returning first #152

Open
JarbasAl opened this issue Nov 4, 2020 · 18 comments
Labels
bug Something isn't working en relates to english language wontfix This will not be worked on

Comments

@JarbasAl
Copy link
Collaborator

JarbasAl commented Nov 4, 2020

Previously extract_number always returned the last number in the utterance

the rationale was that this kind of query would just work

"i want option 1, no i mean option 3"

@ChristopherRogers1991 rewrote that function in MycroftAI/mycroft-core#1977

the old version (updated for ease of testing in current LF) looked like this https://paste.ofcode.org/U5zbcVNcuUEyGVhXkYB83A

but that rewrite made it extract the first number instead of last, there isnt a single test for it in the codebase, this is a breaking change and change of paradigm, been in LF since 0.1.0

i ran into this while writing tests for unrelated functionality and think extract_number really needs to return the last number again

@JarbasAl JarbasAl added bug Something isn't working en relates to english language labels Nov 4, 2020
@ChanceNCounter
Copy link
Contributor

It's tricky. I agree with Jarbas 100% that it should return the last number, and I agree as to why. I'm also unafraid of breaking changes in a module that's very clearly early-versioned (and alpha.)

But I am afraid of breaking half a dozen Mycroft skills whose authors discovered and worked around this behavior.

I like DeprecationWarnings, and I dislike withholding progress/correctness for the sake of backward compatibility, but this bug has been there for a long time. What's the threshold for something like this, do we think in terms of adoption or in terms of years?

@JarbasAl
Copy link
Collaborator Author

JarbasAl commented Nov 4, 2020

i dont think any skill depends on this, if anything some skill that ocasionally breaks because of this will magically fix itself

either way mycroft-core can pin versions to 0.3.1 to protect skills

i cant think of a single reason this behaviour would be wanted much less depended on, i'm very confident there are no skills out there depending or expecting it

relevant discussion https://chat.mycroft.ai/community/pl/znrindyphidm9e58tgpjsftzsy

@ChanceNCounter
Copy link
Contributor

Well, putting it back has my approval, but I do think we should run a test branch against downstream for a week or two before we merge the change, just to see how many third-party skills we can break.

Not that it was definitive proof the last time we did that, but we did find a bunch of gotchas.

@krisgesling
Copy link
Contributor

I haven't come up with any real competing scenarios yet. So I think it's unlikely to break Skills.

To be safe we can get it into a mycroft-core release cycle early so it's getting tested in the wild on dev.

@ChristopherRogers1991
Copy link

Hey, just a few notes that will hopefully make a compelling counter argument (also on mobile, so hopefully this formats ok):

When I worked with the function, it did not consistently return the last number. It sometimes gave the first, and sometimes gave the last (if I recall correctly, fractional numbers tended to flip which you got).

The documentation did not (and to my knowledge, still does not) state which you would get. I would argue this makes the behavior 'undefined' and not bound by an API contract.

These two things have probably contributed to no one flagging this as a problem since that change went in 20 months ago.

Selecting the last when someone makes a mistake feels like an edge case, and kind of an odd one to target, especially now that this library has been extracted, and is no longer Mycroft specific. People very well may be using this library for things other than Mycroft skills.

Regardless of what it did when it was part of Mycroft, and whatever the initial intention, since the initial release of this library, the function has returned the first value.

Consider other functions like str.find, str.index, and re.search, all of which work left-to-right. This is standard convention, and not specific to Python. To take the last would subvert developer expectations.

I think a better solution would be to make an extract_last_number, akin to str.rfind and the like, and have developers use that, if they want the last number. Adding that and supporting both should be trivial, given the way things are currently written. It should even be easy to take an index (positive or negative), and return the corresponding value. (If this approach is taken, the docs should be updated, and this should define an API contract.)

I don't know how much my opinion counts here, but this is a hill I would die on. I feel pretty strongly that most developers would expect the first number to be returned.

@ChanceNCounter
Copy link
Contributor

@ChristopherRogers1991 I'm driving by, so haven't had a lot of time to consider your input, but I will - wanna make sure you know your opinion matters a lot. One of the weird things about FOSS is that you often get to find out how good a programmer you used to be, years after the fact. You were a pretty damn good one. I frequently hate your guts when I git blame, but just as frequently stand in awe.

So no need to genuflect =P

@JarbasAl
Copy link
Collaborator Author

JarbasAl commented Nov 5, 2020

the API contract is a good point, just want to add that while this was part of core while it was not documented it was debated in chat, if it didn't always return last number then that was a previous bug

historically we have merged half finished stuff and documentation has been lacking to put it nicely....

that said, i have no problem at all with "breaking contract" in any package which is clearly alpha and that didn't make it into v 1.0 yet, specially in a case like this where anything depending on that would have been depending on undocumented things. We shouldn't support bad decisions out of obligation, if anything we bump versions. Everyone should be pinning their versions either way

For me the question boils down to what is the correct thing to do, and your argument about standard convention is compelling

thank you for showing up! I tagged you hoping you would shime in, so your opinion indeed matters :)

@ChanceNCounter
Copy link
Contributor

On reflection, I think left-to-right as the conventional thing makes sense. I've also just checked extract_datetime, and it goes right-to-left. That's an internal inconsistency to think about. That function's scheduled for surgery either way, just putting it out there.

I think @ChristopherRogers1991's suggestion that we pick some other way to go "backward" is right. And, as was mentioned somewhere, whichever value this thing doesn't return is a 1-index from extract_numbers() ([1] or [-1]), so it's not hard to achieve as things stand. It might still be good to provide a direct way to get the last value, though, just because it'll surely run faster than the big kahuna.

@JarbasAl
Copy link
Collaborator Author

JarbasAl commented Nov 5, 2020

in the end this package is still voice-first, it is supposed to parse natural language. A convention is just that, and with this currently being more or less undefined behavior we can choose whatever makes more sense.

In that regard i disagree with @ChristopherRogers1991 in that we are targeting an edge case, we are targeting the main use case, which is a voice interaction, this has been extracted from core but is first-most meant to be used in there. Datetime follows the same rationale

Whatever we decide it should be consistent across the package, and in that regard changing datetime is a breaking change.... good news thats its gonna break soon anyways with #74 , so we should decide this and prepare for a 0.4 release

Keeping the bigger picture in mind, who are we targeting? Mycroft, and Mycroft is for everyone, and average people who can barely code will make skills, and these tools are meant mainly for those people. If the "coders convention" makes life harder for that people, while not really making life harder for devs (its just a quirk of the lib and you should be reading the docs anyway) i say we should focus on helping skill makers. Blindly adhering to a standard just because "other packages do it" should not be our main motivation

I am not convinced for either option, the presented arguments are solid for both viewpoints. But i am leaning towards extracting last number

EDIT: whatever we decide @krisgesling should make it clear in docs, if we want to hand-hold skill developers the extract_last needs to be the recommended way in documentation

@ChristopherRogers1991
Copy link

I don't think targeting the bottom is a good approach - we should be aiming for the best developers. If the primary goal of the library is still to support Mycroft - good developers making good skills brings people to the platform. New developers making "hello world" skills won't do us a lot of good. Additionally, good developers will find places the library (and platform, in general) could be improved, and contribute back. The more they contribute, the more the newbies have to leverage to make their skills.

I'd also argue that behaving inconsistently relative to other libraries/conventions will be harder for both inexperienced and experienced developers, and we'll risk losing both. Good developers will get frustrated if things aren't consistent, and APIs aren't stable (see this article for one example of this happening to another project - I happen to have this one top of mind because it was recent, but it's not uncommon), and new developers will give up when things "don't make sense" and work the way other libraries do.

Digging in a little further on the 'this makes things harder for newbies' point, I think we need to think wholistically. In other words, when considering what will be "easiest" for new developers, it's not enough to only consider this library, and this use case. When things don't work as expected, an experience dev will look at the docs and/or source, to figure things out. But new developers often aren't capable of doing that (consider all the basic questions that come through chat as people get started - many of which have documented answers). If they have any experience at all, they've probably worked with things are are left-to-right by convention. Adhering to that will make them feel at home; deviating will trip them up.

@JarbasAl
Copy link
Collaborator Author

JarbasAl commented Nov 5, 2020

To be clear my aim isn't targeting the bottom, just not forget that the bottom is there, the important thing is to make sure skill makers don't trip up on this, that can be achieved in different ways like documentation and solid examples/good practices in the official skills.

A library can be opinionated, this would be an example of it, but we need good reasons for that. What you describe should be the baseline and deviating from it absolutely needs to bring an advantage, this case doesn't seem to bring that much of an advantage so you have convinced me

In the end this also comes from needing to define what the function is supposed to do, if it is only supposed to extract a number like the name says you are absolutely right, if we want to be "smart" about it and try to extract the most relevant number then the function should be named accordingly

@JarbasAl
Copy link
Collaborator Author

JarbasAl commented Nov 5, 2020

@krisgesling you have the final word, but i think this can be closed as it seems we reached a consensus

We should however ensure extract_datetime and any future extract_XXX follow the same principle

@JarbasAl JarbasAl added the wontfix This will not be worked on label Nov 5, 2020
@ChanceNCounter
Copy link
Contributor

fwiw, re: new and/or less skilled developers, the flip side of parsing utterances the way we think users expect is that inconsistency with other libs. Inconsistency between libs is also a killer for noobs.

Remember how much datetime confused you when you had to from datetime import datetime? I sure hated it.

@JarbasAl
Copy link
Collaborator Author

JarbasAl commented Nov 5, 2020

we could just add some util methods, extract_first_number and extract_last_number

extract_first_number would only be syntactic sugar for convenience but would be an alias for extract_number, extract_last_number would be the recommended method for skills (in official docs and in official skills that people will use as templates)

@ChanceNCounter
Copy link
Contributor

💯 i like it. the sugar is unambiguous, the existing function signature returns the same thing it's been returning for ages, and both results are easy to find.

@JarbasAl
Copy link
Collaborator Author

JarbasAl commented Nov 5, 2020

whoever picks up on this needs to write a test suite, i have identified tests that are returning the last number, most are returning the first

a big TODO was left in the unittests with the identified test cases, please see https://github.com/MycroftAI/lingua-franca/pull/146/files#diff-6f34c464b101d5d6fad9a3e6373e21ff64cbc709890a98617af28f4da44b574fR74

@ChristopherRogers1991
Copy link

Hmm... I thought I had just wrapped extract_numbers_en, or similar, and just grabbed the first out of there (been too long since I had my hands in the code). In any case, that would be my recommendation - just convert everything, and then wrap that and return the value based on the index. As long as we correctly convert all the numbers, that approach guarantees we return the correct value - no possibility of random bugs flipping which value you get.

@JarbasAl
Copy link
Collaborator Author

JarbasAl commented Nov 5, 2020

actually extract_numbers wraps extract_number, ie, extract_number returns the left-most sequentially

this seems to me like a missing break statement somewhere maybe, or maybe something implemented assuming we were under the "return rightmost number" paradigm. I don't want to clutter the other PR with yet another bug unrelated bugfix, i only set out to fix #143

a follow up PR should test this extensively and add tests for this left-first behaviour. I think i will uncover more issues related to these failing tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working en relates to english language wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

4 participants