Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite Expression class? #527

Open
veni-vidi-vici-dormivi opened this issue Sep 19, 2024 · 6 comments
Open

Rewrite Expression class? #527

veni-vidi-vici-dormivi opened this issue Sep 19, 2024 · 6 comments

Comments

@veni-vidi-vici-dormivi
Copy link
Collaborator

I was thinking why we do not actually pass functions as arguments instead of a strings to the Expression class. I was thinking something like this:

import scipy
import numpy as np
from typing import Callable, Union, List, Tuple
import inspect
import scipy.stats

class Expression:
    def __init__(self, 
                 distribution: Union[scipy.stats.rv_continuous, scipy.stats.rv_discrete], 
                 parameters: List[Tuple[str, Callable]]):
        
        self.distribution = distribution
        self.parameters = parameters

    def sample(self, **kwargs):
        distribution_args = {}
        
        for param_name, func in self.parameters:
            # Get the function's signature to see what arguments it requires
            func_signature = inspect.signature(func)
            func_args = {k: v for k, v in kwargs.items() if k in func_signature.parameters}
            
            # Call the function with the matched arguments
            distribution_args[param_name] = func(**func_args)

        # Generate a sample using the distribution and the computed parameter values
        return self.distribution.rvs(**distribution_args)

# Example functions for loc and scale (and potentially other parameters)
def loc_func(c1: float, c2: float, x: float | np.ndarray):
    return c1 + c2 * x

def shape_func(c3: float):
    return c3

# Create parameter list as an iterable (list of tuples) for loc, scale, and others
param_list = [
    ('loc', loc_func),
    ('scale', shape_func)  # Example additional parameter
]

# Initialize Expression with scipy.stats.norm or any other distribution
myexpression = Expression(scipy.stats.norm, param_list)
x = np.array([0.5, 0.6, 0.7, 0.8, 0.9])

# Sample using the expression with arguments for loc, scale, and shape functions
sample = myexpression.sample(c1=1, c2=2, c3=1, x=x)
print(sample)

And then feeding the functions to the optimization routines like we would after parsing? This is not incredibly thought through yet but it might be a way around the parsing.

@mathause
Copy link
Member

mathause commented Sep 19, 2024

Thanks for getting this discussion started. For me it would be important to understand the requirements better & I try to do this by going over the code / writing tests. Some random thoughts and questions:

  • I had ideas in a similar direction & can imagine that passing functions could be faster and clearer. But we don't know the structure of the function.
  • We could get rid of the eval and exec calls which would make me happy 😄
  • Do we need to know the structure of the functions?
  • Can it be that we want to have the same coeff for two params (e.g. loc=c1, scale=c1)?
  • How can we differentiate coeffs from covariates?
    • Pass coeffs by position and covariates as kwargs?
  • Should coeffs be an array? That would align with the structure of the minimize function.
    def loc_func(c, x):
        return c[0] + c[1] * x
    
    def shape_func(c):
        return c[2]
  • Are discrete distributions actually an option?
  • Should we create a class for each distribution we support?
    • I guess there are 'only' about 5-15 that are actually relevant.
    • We could 'manually' pass the bounds
      # prepary basic boundaries on parameters: incomplete, did not find a way to
      # evaluate automatically the limits on shape parameters
    • Maybe this would help with setting the initial values? At least for some of the distributions.
  • Same comment for the covariates-parameter functions.

@veni-vidi-vici-dormivi
Copy link
Collaborator Author

A problem with going away from the expression as a string could be that it becomes harder to save the expression. I think it is important to save the expression in some way, only saving the parameters won't get you anywhere if you lost the expression, but we could handle that by still setting a name and asking the user to put all relevant information there. Then we could save the name as an attribute with the parameters.

Should coeffs be an array?

I like that idea, then we could get rid of the signature call and saving could also be more flexible.

Can it be that we want to have the same coeff for two params

That is possible for the structure I gave above, I'm just not sure how the optimization would handle that...

@yquilcaille
Copy link
Collaborator

I see the appeal of this solution, but I see there a major problem. In summary, it clashes with the objective of Expression, and future applications will be hindered by the need to provide all functions along.

The whole objective of the class Expression was to finally be capable of doing all distributions and all expressions on parameters. To have full flexibility.

With this alternative, the user will have to define its functions and pass them as arguments. We have to make sure that the same functions are used for training AND emulation. Thus pass the functions to the users. Thus have a script with all potential functions. Which is precisely what i wanted to avoid, being constrained to predefined functions. In my experience, there will always be new ideas, new expressions. So, every single time, we would have to add new functions to the script with all predefined functions. I sincerely fear that it will be messy to have them all.

So, we have to choose, do we pass directly the functions and lose the full flexibility, or do we stick to the strings, with the fix on the parsing? My personal opinion is for the full flexibility, because of how much time i've spent in this code, because of future developments & applications. But again, I see the appeal of your solution.

One extra point: the object returned by the alternative class is the .rvs object. It should be a scipy,.stats distribution, so that we can call anything. I've already to call logpdf, logpmf, cdf, sf, isf & support, and for plots, even more.

On the other points raised by @mathause:

  • exec & eval: yes, required with the original Expression, but unless i'm mistaken, they are used correctly here. So even if these functions are dangerous, it should be fine. Hopefully?
  • what do you mean by structure of the function?
  • Same coefficients in multiple parameters? Yes. I suspect that Lorenzo will need that for extreme precipitations given the expressions used in Extreme Event Attribution.
  • Yes, discrete distributions are already used for the number of days with extreme fire weather and for the length of the fire season.
  • Creating class for each distribution? I'm strongly against that. Again, the very purpose of this class is full flexibility. It will create codes with lots of redundancies, too long for nothing. And every new distribution will require additional developments.
  • Setting bounds on parameters, yes, but already done as much as i think we can. There are already some defaults for the scale. Nothing on the location, because not bounded. The shape is less evident, because it depends very much on each distribution. Then setting bounds using the data, not as trivial, because the mean & spread are expressions of all parameters that vary from an expression to another. This is why the first guess is so tricky to get. Then, setting bounds on coefficients isnt straightforward, i investigated that for global optimizations, I couldnt find a proper solution.

Of course, it requires some understanding of the string as input, with the parsing failing with extra commas (but potentially easy fix). It also requires

@mathause
Copy link
Member

Thanks for your reply - it's important to understand the requirements and design decisions.

  1. distribution

    The idea is to create one parent class and have each distribution subclass it, so we only have to change small stuff for each distribution - similar to what I do in https://github.com/mathause/dist_cov/blob/main/dist_cov/distributions.py (but that would have to be independent of the covariate structure). Of course doing that for all the distributions in scipy would still be a chore...

  2. covariate structure

    I agree that one big advantage of the "string" based approach is that it can easily be serialized. This is less elegant for a function-based approach (but can still be done). However, passing strings also has limitations, e.g. currently only math and numpy methods are availabe, and this is more difficult to extend than e.g. a missing distribution. Of course here we cannot be exhaustive.


  • exec & eval: yes, required with the original Expression, but unless i'm mistaken, they are used correctly here. So even if these functions are dangerous, it should be fine. Hopefully?

Yes I think the usage is fine - but I still don't like it - in principle (sorry) and because it makes the code more difficult to understand (and is also a bit slower). You do a parsing step which might avoid arbitrary code execution (but that is a difficult problem to solve in general).

  • what do you mean by structure of the function?

If we need to know that we e.g. call c1 + c2 * Tg or if we only need to know the result.

  • Creating class for each distribution? I'm strongly against that. Again, the very purpose of this class is full flexibility. It will create codes with lots of redundancies, too long for nothing. And every new distribution will require additional developments.

See my comment above.

  • Setting bounds on parameters...

This is something we could do for the shape if we "implement" each distribution separately but does not work otherwise.

@yquilcaille
Copy link
Collaborator

I understand the principle of writing a subclass for each distribution... But this is precusely what i want to avoid. Yes, it would take quite some time indeed, and it will slow down future developments. That is why the object returned by Expression is directly a scipy.stats distribution, because everything is already in there.

Regarding the covariates, I also understand your point... but there are limits with my version (constrained to math/numpy only), and with your version (constrained to prepared functions). And you want to avoid developing our own syntax... but using prepared functions is also developing our own syntax. Thus, same problem.

As I said, I have already implemented bounds on parameters.

To summarize:

  • approach class with string: own syntax, very flexible, but eval/exec is risky
  • approach class with subclasses & functions: own syntax, only prepared options (development!), no risks from eval/exec.

Can we not find a middle ground? I understand that you dislike the eval/exec, but I feel like we would lose a lot of flexibility and time, now and in the future.

@mathause
Copy link
Member

mathause commented Oct 7, 2024

@yquilcaille we are now working through your code and getting familiar with it and understanding what it needs to be able to do.


To answer one of questions: yes it is required to understand the structure of the expression. we need to know which coefficient belongs to which param (e.g. that c1 belongs to loc). See fg_ind_loc

self.fg_ind_loc = np.array(
[
self.expr_fit.coefficients_list.index(c)
for c in self.expr_fit.coefficients_dict["loc"]
]
)

x[self.fg_ind_loc] = x_loc

(as a side note fg_ind_loc should be a property of Expression and not of dist_cov)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants