feat: add experiment analysis functionalities #185

Draft. Wants to merge 49 commits into main.

Conversation

@ludovico-lanni (Collaborator) commented on Jul 21, 2024:

import pandas as pd
import numpy as np

# Constants
NUM_ORDERS = 10000
NUM_CUSTOMERS = 3000
EXPERIMENT_GROUPS = ['control', 'treatment_1', 'treatment_2']
GROUP_SIZE = NUM_CUSTOMERS // len(EXPERIMENT_GROUPS)

# Seed for reproducibility
np.random.seed(42)

# Generate customer_ids
customer_ids = np.arange(1, NUM_CUSTOMERS + 1)

# Shuffle and split customer_ids into experiment groups
np.random.shuffle(customer_ids)
experiment_group = np.repeat(EXPERIMENT_GROUPS, GROUP_SIZE)
experiment_group = np.concatenate((experiment_group, np.random.choice(EXPERIMENT_GROUPS, NUM_CUSTOMERS - len(experiment_group))))

# Assign customers to groups
customer_group_mapping = dict(zip(customer_ids, experiment_group))

# Generate orders
order_ids = np.arange(1, NUM_ORDERS + 1)
customers = np.random.choice(customer_ids, NUM_ORDERS)
order_values = np.abs(np.random.normal(loc=10, scale=2, size=NUM_ORDERS))  # Normally distributed around 10 and positive
order_delivery_times = np.abs(np.random.normal(loc=30, scale=5, size=NUM_ORDERS))  # Normally distributed around 30 minutes and positive
order_city_codes = np.random.randint(1, 3, NUM_ORDERS)  # Random city codes between 1 and 2

# Create DataFrame
data = {
    'order_id': order_ids,
    'customer_id': customers,
    'experiment_group': [customer_group_mapping[customer_id] for customer_id in customers],
    'order_value': order_values,
    'order_delivery_time_in_minutes': order_delivery_times,
    'order_city_code': order_city_codes
}

df = pd.DataFrame(data)
df.order_city_code = df.order_city_code.astype(str)

# Show the first few rows of the DataFrame
print(df.head())

from cluster_experiments.analysis_plan import AnalysisPlan
from cluster_experiments.metric import SimpleMetric
from cluster_experiments.dimension import Dimension
from cluster_experiments.variant import Variant
from cluster_experiments.hypothesis_test import HypothesisTest

dimension__city_code = Dimension(
    name='order_city_code',
    values=['1','2']
)

metric__order_value = SimpleMetric(
    alias='AOV',
    name='order_value'
)

metric__delivery_time = SimpleMetric(
    alias='AVG DT',
    name='order_delivery_time_in_minutes'
)

test__order_value = HypothesisTest(
    metric=metric__order_value,
    analysis_type="clustered_ols",
    analysis_config={"cluster_cols":["customer_id"]},
    dimensions=[dimension__city_code]
)

test__delivery_time = HypothesisTest(
    metric=metric__delivery_time,
    analysis_type="gee",
    analysis_config={"cluster_cols":["customer_id"]}
)

variants = [
    Variant('control', is_control=True),
    Variant('treatment_1', is_control=False),
    Variant('treatment_2', is_control=False)
]

analysis_plan = AnalysisPlan(
    tests=[test__order_value, test__delivery_time],
    variants=variants,
    variant_col='experiment_group',
    alpha=0.01
)

results = analysis_plan.analyze(exp_data=df)

print(results)

@ludovico-lanni marked this pull request as draft on July 21, 2024, 15:42
@codecov-commenter commented on Jul 21, 2024:


Codecov Report

Attention: Patch coverage is 32.70440% with 214 lines in your changes missing coverage. Please review.

Project coverage is 83.16%. Comparing base (d4ea384) to head (0cc21d3).

Files with missing lines                             Patch %   Missing
cluster_experiments/inference/analysis_plan.py         0.00%        71
cluster_experiments/experiment_analysis.py            28.57%        55
cluster_experiments/inference/hypothesis_test.py      29.85%        47
cluster_experiments/inference/metric.py               51.16%        21
cluster_experiments/inference/dimension.py            52.17%        11
cluster_experiments/inference/variant.py              58.33%         5
cluster_experiments/inference/analysis_results.py     84.00%         4


Additional details and impacted files
@@             Coverage Diff             @@
##             main     #185       +/-   ##
===========================================
- Coverage   96.77%   83.16%   -13.62%     
===========================================
  Files          10       16        +6     
  Lines        1179     1497      +318     
===========================================
+ Hits         1141     1245      +104     
- Misses         38      252      +214     


@david26694 (Owner) commented:

hey, I like the flexibility in here, but I can also think of many cases where we use the same dimensions and analysis for the whole analysis plan. Therefore, I would also allow for this API:

import pandas as pd
import numpy as np

# Constants
NUM_ORDERS = 10000
NUM_CUSTOMERS = 3000
EXPERIMENT_GROUPS = ['control', 'treatment_1', 'treatment_2']
GROUP_SIZE = NUM_CUSTOMERS // len(EXPERIMENT_GROUPS)

# Seed for reproducibility
np.random.seed(42)

# Generate customer_ids
customer_ids = np.arange(1, NUM_CUSTOMERS + 1)

# Shuffle and split customer_ids into experiment groups
np.random.shuffle(customer_ids)
experiment_group = np.repeat(EXPERIMENT_GROUPS, GROUP_SIZE)
experiment_group = np.concatenate((experiment_group, np.random.choice(EXPERIMENT_GROUPS, NUM_CUSTOMERS - len(experiment_group))))

# Assign customers to groups
customer_group_mapping = dict(zip(customer_ids, experiment_group))

# Generate orders
order_ids = np.arange(1, NUM_ORDERS + 1)
customers = np.random.choice(customer_ids, NUM_ORDERS)
order_values = np.abs(np.random.normal(loc=10, scale=2, size=NUM_ORDERS))  # Normally distributed around 10 and positive
order_delivery_times = np.abs(np.random.normal(loc=30, scale=5, size=NUM_ORDERS))  # Normally distributed around 30 minutes and positive
order_city_codes = np.random.randint(1, 3, NUM_ORDERS)  # Random city codes between 1 and 2

# Create DataFrame
data = {
    'order_id': order_ids,
    'customer_id': customers,
    'experiment_group': [customer_group_mapping[customer_id] for customer_id in customers],
    'order_value': order_values,
    'order_delivery_time_in_minutes': order_delivery_times,
    'order_city_code': order_city_codes
}

df = pd.DataFrame(data)
df.order_city_code = df.order_city_code.astype(str)

# Show the first few rows of the DataFrame
print(df.head())

from cluster_experiments.analysis_plan import AnalysisPlan
from cluster_experiments.metric import SimpleMetric
from cluster_experiments.dimension import Dimension
from cluster_experiments.variant import Variant
from cluster_experiments.hypothesis_test import HypothesisTest

dimension__city_code = Dimension(
    name='order_city_code',
    values=['1','2']
)

metric__order_value = SimpleMetric(
    alias='AOV',
    name='order_value'
)

metric__delivery_time = SimpleMetric(
    alias='AVG DT',
    name='order_delivery_time_in_minutes'
)

variants = [
    Variant('control', is_control=True),
    Variant('treatment_1', is_control=False),
    Variant('treatment_2', is_control=False)
]

# this next call could also be named, e.g.:
#   analysis_plan = SimpleAnalysisPlan(...)
# or
#   analysis_plan = AnalysisPlan.from_raw(...)
analysis_plan = AnalysisPlan.from_metrics(
    metrics=[metric__delivery_time, metric__order_value],
    variants=variants,
    variant_col='experiment_group',
    alpha=0.01,
    dimensions=[dimension__city_code],
    analysis_type="clustered_ols",
    analysis_config={"cluster_cols":["customer_id"]},
)

results = analysis_plan.analyze(exp_data=df)

print(results)

wdyt?

@david26694 (Owner) left a comment:

Super good job, Ludo! Some general comments:

  • I'd move all of this into an experiment analysis folder or something like that; the code is starting to feel too scattered.
  • I understand we are still missing the stacking of scorecards, unit tests, and cupac support. Anything else?
  • Have a look at the interface comment above, please.

But again, super good job, and super fast!

Review comments were left on:
  • cluster_experiments/analysis_plan.py
  • cluster_experiments/analysis_results.py
  • cluster_experiments/dimension.py
  • cluster_experiments/experiment_analysis.py
  • cluster_experiments/variant.py
@david26694 (Owner) commented:

I guess we're also missing:

  • init from config
  • adding washover in analysis

@ludovico-lanni (Collaborator, Author) commented on Aug 2, 2024:

I covered all of the points we discussed :)

The only big thing missing now is unit tests.
About initialising everything from config, I believe we can do it in a second iteration, as it is not a hard requirement and the interface already looks quite simple to use. wdyt?

Check this interface now:

#%%
import pandas as pd
import numpy as np

# Constants
NUM_ORDERS = 10000
NUM_CUSTOMERS = 3000
EXPERIMENT_GROUPS = ['control', 'treatment_1', 'treatment_2']
GROUP_SIZE = NUM_CUSTOMERS // len(EXPERIMENT_GROUPS)

# Seed for reproducibility
np.random.seed(42)

# Generate customer_ids
customer_ids = np.arange(1, NUM_CUSTOMERS + 1)

# Shuffle and split customer_ids into experiment groups
np.random.shuffle(customer_ids)
experiment_group = np.repeat(EXPERIMENT_GROUPS, GROUP_SIZE)
experiment_group = np.concatenate((experiment_group, np.random.choice(EXPERIMENT_GROUPS, NUM_CUSTOMERS - len(experiment_group))))

# Assign customers to groups
customer_group_mapping = dict(zip(customer_ids, experiment_group))

# Generate orders
order_ids = np.arange(1, NUM_ORDERS + 1)
customers = np.random.choice(customer_ids, NUM_ORDERS)
order_values = np.abs(np.random.normal(loc=10, scale=2, size=NUM_ORDERS))  # Normally distributed around 10 and positive
order_delivery_times = np.abs(np.random.normal(loc=30, scale=5, size=NUM_ORDERS))  # Normally distributed around 30 minutes and positive
order_city_codes = np.random.randint(1, 3, NUM_ORDERS)  # Random city codes between 1 and 2

# Create DataFrame
data = {
    'order_id': order_ids,
    'customer_id': customers,
    'experiment_group': [customer_group_mapping[customer_id] for customer_id in customers],
    'order_value': order_values,
    'order_delivery_time_in_minutes': order_delivery_times,
    'order_city_code': order_city_codes
}

df = pd.DataFrame(data)
df.order_city_code = df.order_city_code.astype(str)

pre_exp_df = df.assign(
    order_value = lambda df: df['order_value'] + np.random.normal(loc=0, scale=1, size=NUM_ORDERS),
    order_delivery_time_in_minutes = lambda df: df['order_delivery_time_in_minutes'] + np.random.normal(loc=0, scale=2, size=NUM_ORDERS)
).sample(int(NUM_ORDERS/3))

# Show the first few rows of the DataFrame
print(df.head())
print(pre_exp_df.head())

from cluster_experiments.inference.analysis_plan import AnalysisPlan
from cluster_experiments.inference.metric import SimpleMetric
from cluster_experiments.inference.dimension import Dimension
from cluster_experiments.inference.variant import Variant
from cluster_experiments.inference.hypothesis_test import HypothesisTest
from cluster_experiments import TargetAggregation

dimension__city_code = Dimension(
    name='order_city_code',
    values=['1','2']
)

metric__order_value = SimpleMetric(
    alias='AOV',
    name='order_value'
)

metric__delivery_time = SimpleMetric(
    alias='AVG DT',
    name='order_delivery_time_in_minutes'
)

test__order_value = HypothesisTest(
    metric=metric__order_value,
    analysis_type="clustered_ols",
    analysis_config={"cluster_cols":["customer_id"]},
    dimensions=[dimension__city_code]
)

cupac__model = TargetAggregation(agg_col="customer_id", target_col="order_delivery_time_in_minutes")

test__delivery_time = HypothesisTest(
    metric=metric__delivery_time,
    analysis_type="gee",
    analysis_config={"cluster_cols":["customer_id"], "covariates":["estimate_order_delivery_time_in_minutes"]},
    cupac_config={"cupac_model":cupac__model,
                  "target_col":"order_delivery_time_in_minutes"}
)

variants = [
    Variant('control', is_control=True),
    Variant('treatment_1', is_control=False),
    Variant('treatment_2', is_control=False)
]

analysis_plan = AnalysisPlan(
    tests=[test__order_value, test__delivery_time],
    variants=variants,
    variant_col='experiment_group',
    alpha=0.01
)

results = analysis_plan.analyze(exp_data=df, pre_exp_data=pre_exp_df)

print(results)

results_df = results.to_dataframe()

#%%

simple_analysis_plan = AnalysisPlan.from_metrics(
    metrics=[metric__delivery_time, metric__order_value],
    variants=variants,
    variant_col='experiment_group',
    alpha=0.01,
    dimensions=[dimension__city_code],
    analysis_type="clustered_ols",
    analysis_config={"cluster_cols":["customer_id"]},
)

simple_results = simple_analysis_plan.analyze(exp_data=df, verbose=True)

simple_results_df = simple_results.to_dataframe()

@david26694 (Owner) left a comment:

Looks very good, left some suggestions

Review comments were left on:
  • cluster_experiments/inference/metric.py
  • cluster_experiments/inference/analysis_plan.py

for treatment_variant in self.treatment_variants:
    for dimension in test.dimensions:
        for dimension_value in dimension.iterate_dimension_values():
@david26694 (Owner) commented on this block:
wdyt about adding the query of the dimension in the Dimension class? I think it's cleaner; the HypothesisTest shouldn't need to know what a dimension is.

@ludovico-lanni (Collaborator, Author) replied:
I would prefer to keep it as is, because the test does know what a dimension is: it takes a list of dimensions as an attribute when the test object is instantiated. This way we can use a prepare_data() method at the test level.

@ludovico-lanni (Collaborator, Author) added:

ok, I see your point below about self.dimensions not being utilised in hypothesis_test directly. I'll leave this as an open topic, as I'm not sure of a better solution.

@david26694 (Owner) replied:

let's see if this makes sense: we change AnalysisPlan to have this:

    def __init__(
        self,
        tests: List[HypothesisTest],
        variants: List[Variant],
        dimensions: List[Dimension],
        variant_col: str = "treatment",
        alpha: float = 0.05,
    ):

and the filter is done on the dimension, changing this loop to:

            for treatment_variant in self.treatment_variants:
                for dimension in self.dimensions:
                    for dimension_value in dimension.iterate_dimension_values():

Two questions:

  • Does this make sense or am I missing something?
  • Would this change the main interfaces (if it doesn't, we could leave it for later)?

Review comment on cluster_experiments/inference/analysis_plan.py (resolved)
self.metric = metric
self.analysis_type = analysis_type
self.analysis_config = analysis_config or {}
self.dimensions = [DefaultDimension()] + (dimensions or [])
@david26694 (Owner) commented on this block:
this is not used, do we need it?

@ludovico-lanni (Collaborator, Author) replied:

oh, but this is used in the analysis_plan code, because in the most generic case each test can have a different list of dimensions to slice on. This is a good question; I'm not sure there's a better design.

Review comment on cluster_experiments/inference/analysis_results.py (resolved)
@david26694 (Owner) commented:

> About initialising everything from config, I believe we can do it in a second iteration, as it is not a hard requirement and the interface already looks quite simple to use. wdyt?

agree! But have a look at the experiment_analysis comment then.

@ludovico-lanni (Collaborator, Author) commented:

Addressed all the points you reviewed. Regarding the handling of the analysis class, to allow for flexibility and extensibility I proposed a simple enough solution: a custom mapper that the user can opt into. Check it out at the end of this script.

#%%
import pandas as pd
import numpy as np

# Constants
NUM_ORDERS = 10000
NUM_CUSTOMERS = 3000
EXPERIMENT_GROUPS = ['control', 'treatment_1', 'treatment_2']
GROUP_SIZE = NUM_CUSTOMERS // len(EXPERIMENT_GROUPS)

# Seed for reproducibility
np.random.seed(42)

# Generate customer_ids
customer_ids = np.arange(1, NUM_CUSTOMERS + 1)

# Shuffle and split customer_ids into experiment groups
np.random.shuffle(customer_ids)
experiment_group = np.repeat(EXPERIMENT_GROUPS, GROUP_SIZE)
experiment_group = np.concatenate((experiment_group, np.random.choice(EXPERIMENT_GROUPS, NUM_CUSTOMERS - len(experiment_group))))

# Assign customers to groups
customer_group_mapping = dict(zip(customer_ids, experiment_group))

# Generate orders
order_ids = np.arange(1, NUM_ORDERS + 1)
customers = np.random.choice(customer_ids, NUM_ORDERS)
order_values = np.abs(np.random.normal(loc=10, scale=2, size=NUM_ORDERS))  # Normally distributed around 10 and positive
order_delivery_times = np.abs(np.random.normal(loc=30, scale=5, size=NUM_ORDERS))  # Normally distributed around 30 minutes and positive
order_city_codes = np.random.randint(1, 3, NUM_ORDERS)  # Random city codes between 1 and 2

# Create DataFrame
data = {
    'order_id': order_ids,
    'customer_id': customers,
    'experiment_group': [customer_group_mapping[customer_id] for customer_id in customers],
    'order_value': order_values,
    'order_delivery_time_in_minutes': order_delivery_times,
    'order_city_code': order_city_codes
}

df = pd.DataFrame(data)
df.order_city_code = df.order_city_code.astype(str)

pre_exp_df = df.assign(
    order_value = lambda df: df['order_value'] + np.random.normal(loc=0, scale=1, size=NUM_ORDERS),
    order_delivery_time_in_minutes = lambda df: df['order_delivery_time_in_minutes'] + np.random.normal(loc=0, scale=2, size=NUM_ORDERS)
).sample(int(NUM_ORDERS/3))

# Show the first few rows of the DataFrame
print(df.head())
print(pre_exp_df.head())

from cluster_experiments.inference.analysis_plan import AnalysisPlan
from cluster_experiments.inference.metric import SimpleMetric
from cluster_experiments.inference.dimension import Dimension
from cluster_experiments.inference.variant import Variant
from cluster_experiments.inference.hypothesis_test import HypothesisTest
from cluster_experiments import TargetAggregation

dimension__city_code = Dimension(
    name='order_city_code',
    values=['1','2']
)

metric__order_value = SimpleMetric(
    alias='AOV',
    name='order_value'
)

metric__delivery_time = SimpleMetric(
    alias='AVG DT',
    name='order_delivery_time_in_minutes'
)

test__order_value = HypothesisTest(
    metric=metric__order_value,
    analysis_type="clustered_ols",
    analysis_config={"cluster_cols":["customer_id"]},
    dimensions=[dimension__city_code]
)

cupac__model = TargetAggregation(agg_col="customer_id", target_col="order_delivery_time_in_minutes")

test__delivery_time = HypothesisTest(
    metric=metric__delivery_time,
    analysis_type="gee",
    analysis_config={"cluster_cols":["customer_id"], "covariates":["estimate_order_delivery_time_in_minutes"]},
    cupac_config={"cupac_model":cupac__model,
                  "target_col":"order_delivery_time_in_minutes"}
)

variants = [
    Variant('control', is_control=True),
    Variant('treatment_1', is_control=False),
    Variant('treatment_2', is_control=False)
]

analysis_plan = AnalysisPlan(
    tests=[test__order_value, test__delivery_time],
    variants=variants,
    variant_col='experiment_group',
    alpha=0.01
)

results = analysis_plan.analyze(exp_data=df, pre_exp_data=pre_exp_df)

print(results)

results_df = results.to_dataframe()

#%% Run a simple analysis plan with two metrics and one dimension

simple_analysis_plan = AnalysisPlan.from_metrics(
    metrics=[metric__delivery_time, metric__order_value],
    variants=variants,
    variant_col='experiment_group',
    alpha=0.01,
    dimensions=[dimension__city_code],
    analysis_type="clustered_ols",
    analysis_config={"cluster_cols":["customer_id"]},
)

simple_results = simple_analysis_plan.analyze(exp_data=df, verbose=True)

simple_results_df = simple_results.to_dataframe()

#%% Run a simple analysis plan with one metric and one dimension and by using a custom ExperimentAnalysis class

from cluster_experiments.experiment_analysis import ClusteredOLSAnalysis


class CustomExperimentAnalysis(ClusteredOLSAnalysis):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

custom_simple_analysis_plan = AnalysisPlan.from_metrics(
    metrics=[metric__order_value],
    variants=variants,
    variant_col='experiment_group',
    alpha=0.01,
    dimensions=[dimension__city_code],
    analysis_type="custom_clustered_ols",
    analysis_config={"cluster_cols":["customer_id"]},
    custom_analysis_type_mapper={"custom_clustered_ols": CustomExperimentAnalysis}
)

custom_simple_results = custom_simple_analysis_plan.analyze(exp_data=df, verbose=True)

custom_simple_results_df = custom_simple_results.to_dataframe()

@ludovico-lanni (Collaborator, Author) commented:

We are still missing the unit tests. Should we proceed? Are we happy with the interface? @david26694

@david26694 (Owner) commented:

> We are still missing the unit tests. Should we proceed? Are we happy with the interface? @david26694

happy with the interface! also, feel free to remove python 3.8 from the github workflow :)

@david26694 (Owner) left a comment:

Yep, tests are missing, plus the general scaffolding of the feature (adding it to the README and docs, bumping the library version in setup.py).
