
Support for closed markets benchmarks + a little cleaning up #17

Merged
9 commits merged into main from peter/benchmark-closed-markets on Feb 20, 2024

Conversation

kongzii
Contributor

@kongzii kongzii commented Feb 16, 2024

Summary by CodeRabbit

  • New Features
    • Enhanced prediction capabilities in benchmarking agents with new methods.
    • Added MarketResolution enum for improved prediction evaluation based on market resolution.
    • Introduced new enums and fields in market and prediction models for advanced filtering and resolution handling.
  • Refactor
    • Refined prediction and market evaluation logic for clarity and efficiency.
    • Simplified prediction object structure in test scenarios.
  • Tests
    • Updated test cases to align with the new prediction and market model changes.


coderabbitai bot commented Feb 16, 2024

Warning

Rate Limit Exceeded

@kongzii has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 5 minutes and 1 second before requesting another review.

How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.
Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.
Please see our FAQ for further information.

Commits reviewed: files that changed from the base of the PR, between 5f7bc81 and 60d6e2c.

Walkthrough

The updates focus on enhancing the prediction market agent tooling, particularly around agent benchmarking and market analysis. Key modifications include the introduction of new methods and properties for more nuanced prediction capabilities and market analysis, the addition of enums for market filtering and resolution, and refinements in how predictions are evaluated and reported. These changes aim to provide more detailed, flexible, and accurate tools for analyzing and interacting with prediction markets.

Changes

File(s) Summary
.../benchmark/agents.py
.../benchmark/benchmark.py
Replaced evaluate with is_predictable in AbstractBenchmarkedAgent. Added new methods and updated prediction logic in RandomAgent and FixedAgent. Refactored prediction evaluation, adjusted outcome comparison, simplified ratio computation, and updated expected returns calculation.
.../benchmark/utils.py Added MarketResolution enum, validators, and properties to models. Updated functions to handle new fields and logic. Added filter_ parameter to get_markets function.
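To make the `filter_` addition concrete, here is a minimal sketch of what such a filter might look like. The enum names, values, and dict-based market shape are illustrative assumptions; the real definitions live in `benchmark/utils.py` and may differ.

```python
from enum import Enum


# Hypothetical shapes for illustration only; see benchmark/utils.py
# for the real enum and the actual get_markets signature.
class MarketFilter(str, Enum):
    OPEN = "open"
    RESOLVED = "resolved"


def filter_markets(markets: list, filter_: MarketFilter) -> list:
    """Keep only markets matching the requested filter.

    The trailing underscore in `filter_` avoids shadowing the
    built-in `filter` function.
    """
    want_resolved = filter_ is MarketFilter.RESOLVED
    return [m for m in markets if (m.get("resolution") is not None) == want_resolved]
```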


Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit-tests for this file.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit tests for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository from git and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit tests.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

@kongzii kongzii marked this pull request as ready for review February 19, 2024 08:12

@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 7

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between b04007c and aa5451b.
Files selected for processing (4)
  • prediction_market_agent_tooling/benchmark/agents.py (3 hunks)
  • prediction_market_agent_tooling/benchmark/benchmark.py (7 hunks)
  • prediction_market_agent_tooling/benchmark/utils.py (7 hunks)
  • tests/test_benchmark.py (4 hunks)
Additional comments: 7
prediction_market_agent_tooling/benchmark/agents.py (3)
  • 16-20: The introduction of the is_predictable method in place of evaluate is a significant change. Ensure that all subclasses and usages of the previously existing evaluate method have been updated accordingly to prevent any runtime errors.
  • 28-32: The check_and_predict method provides a streamlined way to check predictability before making a prediction. This is a good practice as it encapsulates the predictability check and prediction within a single method, reducing the risk of errors and improving code readability.
  • 58-74: The check_and_predict_restricted method effectively combines the predictability check with the restricted prediction process. This method ensures that predictions adhere to the specified time restrictions, which is crucial for maintaining the integrity of predictions in scenarios with time-sensitive information.
prediction_market_agent_tooling/benchmark/benchmark.py (4)
  • 16-16: The addition of MarketResolution to the import statements indicates the introduction of a new enum to handle market resolution statuses. This change aligns with the PR objectives to refine benchmarking logic based on market resolution status.
  • 272-274: The comparison logic for determining the correct outcome percentage has been updated to use the probable_resolution field from both prediction and market objects. This change is part of the refinement of benchmarking logic based on market resolution status and is a logical update to ensure outcome comparison is aligned with the probable resolution of markets.
  • 289-300: The method _compute_precision_and_recall_percentages now uses the probable_resolution field to calculate precision and recall percentages for yes and no outcomes. This update is part of the benchmarking logic refinement and ensures that the evaluation metrics are accurately calculated based on the probable resolution of markets. The logic appears correct and aligns with the objectives of enhancing the benchmarking process.
  • 352-352: The method _compute_ratio_evaluated_as_answerable now calculates the proportion of predictions evaluated as answerable based on the is_predictable field. This change supports the enhancement of agent predictability by accurately reflecting the proportion of markets where agents could make predictions.
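The precision/recall computation described above can be sketched as follows. This is a simplified stand-in for `_compute_precision_and_recall_percentages`, comparing predicted against actual `probable_resolution` values for one outcome; the real method's signature and return shape may differ.

```python
def precision_recall_for(predictions, actuals, positive):
    """Precision and recall for one resolution value (e.g. 'yes'),
    given parallel lists of predicted and actual probable resolutions.

    Returns None for a metric whose denominator is zero.
    """
    tp = sum(p == positive and a == positive for p, a in zip(predictions, actuals))
    predicted_pos = sum(p == positive for p in predictions)
    actual_pos = sum(a == positive for a in actuals)
    precision = tp / predicted_pos if predicted_pos else None
    recall = tp / actual_pos if actual_pos else None
    return precision, recall
```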

Comment on lines +138 to +144
prediction = (
    agent.check_and_predict(market_question=market.question)
    if not market.is_resolved
    else agent.check_and_predict_restricted(
        market_question=market.question,
        time_restriction_up_to=market.created_time,  # TODO: Add support for resolved_at and any time in between.
    )
)

The logic within the get_prediction_result method has been updated to differentiate between resolved and unresolved markets, using check_and_predict for unresolved markets and check_and_predict_restricted for resolved ones. This change is part of enhancing agent predictability across market conditions. However, there is a TODO comment about adding support for resolved_at and any time in between, which suggests the handling of different time restrictions is not yet complete.

Consider implementing the handling for resolved_at and other time restrictions to fully support the new prediction capabilities in restricted or closed market scenarios.
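One way the TODO could be addressed is sketched below. The `resolved_at` field and the helper name are assumptions (the PR currently uses only `created_time`), so treat this as an illustration of the idea rather than the project's API.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def pick_time_restriction(
    created_time: datetime,
    resolved_at: Optional[datetime] = None,  # assumed field name, not in the PR
) -> datetime:
    """Choose a data cutoff for benchmarking on a resolved market.

    Falls back to created_time (the PR's current behavior). When a
    resolution time is known, samples a point in between, so the
    benchmark can probe predictions made mid-lifetime of the market.
    """
    if resolved_at is None or resolved_at <= created_time:
        return created_time
    span = (resolved_at - created_time).total_seconds()
    return created_time + timedelta(seconds=random.uniform(0, span))
```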

Comment on lines +392 to +405
f"{p.outcome_prediction.p_yes:.2f} [{p.outcome_prediction.probable_resolution.value}]"
if p.is_predictable
and p.outcome_prediction # Is answerable and answered
else "N/A"
if not p.evaluation
and not p.outcome_prediction # Not evaluated for some reason
else "S"
if p.evaluation
and not p.evaluation.is_predictable # Skipped (evaluated to be not predictable)
else "F"
if p.evaluation
and p.evaluation.is_predictable
and not p.outcome_prediction # Failed (no prediction)
else should_not_happen(
f"Unexpected case in get_markets_summary() for {p}."
else (
"S"
if not p.is_predictable # Skipped (evaluated to be not predictable)
else (
"F"
if p.is_predictable
and not p.outcome_prediction # Failed (no prediction)
else should_not_happen(
f"Unexpected case in get_markets_summary() for {p}."
)
)

The get_markets_summary method has been updated to include the probable_resolution in the market summary generation. This change enhances the summary information by providing insights into the resolution status of markets. However, the handling of unexpected cases using should_not_happen might not be the best approach for error handling. It's recommended to log unexpected cases or handle them more gracefully to avoid potential crashes or uninformative error messages.

Consider improving the error handling in the get_markets_summary method to log unexpected cases or handle them more gracefully, ensuring the application remains robust and informative in case of errors.

- else should_not_happen(f"Unexpected case in get_markets_summary() for {p}.")
+ else "Unexpected case"  # Log or handle unexpected cases more gracefully

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
f"{p.outcome_prediction.p_yes:.2f} [{p.outcome_prediction.probable_resolution.value}]"
if p.is_predictable
and p.outcome_prediction # Is answerable and answered
else "N/A"
if not p.evaluation
and not p.outcome_prediction # Not evaluated for some reason
else "S"
if p.evaluation
and not p.evaluation.is_predictable # Skipped (evaluated to be not predictable)
else "F"
if p.evaluation
and p.evaluation.is_predictable
and not p.outcome_prediction # Failed (no prediction)
else should_not_happen(
f"Unexpected case in get_markets_summary() for {p}."
else (
"S"
if not p.is_predictable # Skipped (evaluated to be not predictable)
else (
"F"
if p.is_predictable
and not p.outcome_prediction # Failed (no prediction)
else should_not_happen(
f"Unexpected case in get_markets_summary() for {p}."
)
)
f"{p.outcome_prediction.p_yes:.2f} [{p.outcome_prediction.probable_resolution.value}]"
if p.is_predictable
and p.outcome_prediction # Is answerable and answered
else (
"S"
if not p.is_predictable # Skipped (evaluated to be not predictable)
else (
"F"
if p.is_predictable
and not p.outcome_prediction # Failed (no prediction)
else "Unexpected case" # Log or handle unexpected cases more gracefully

Comment on lines +434 to +440
     if prediction.outcome_prediction.probable_resolution == MarketResolution.YES
     and market.yes_outcome_price > 0
     else 0
 )
 no_shares = (
     bet_units / 0.5  # market.no_outcome_price
-    if prediction.outcome_prediction.p_yes <= buy_yes_threshold
+    if prediction.outcome_prediction.probable_resolution == MarketResolution.NO

The calculate_expected_returns method has been updated to include logic based on the probable_resolution field. This change is part of the updates to expected returns calculation based on the probable resolution of markets. The logic appears correct and aligns with the objectives of refining the benchmarking process. However, the hardcoded bet units and the assumption of a 50/50 market state might limit the flexibility and accuracy of the expected returns calculation.

Consider adding support for dynamic bet sizes and more accurate share price calculations to improve the flexibility and accuracy of the expected returns calculation.
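A sketch of the suggested direction, replacing the hardcoded `/ 0.5` with the market's actual outcome price. The function names and the one-unit payout per winning share are assumptions for illustration:

```python
def shares_bought(bet_units: float, outcome_price: float) -> float:
    """Shares received when spending `bet_units` at the market's actual
    outcome price, instead of assuming a 50/50 market (price = 0.5)."""
    if outcome_price <= 0:
        raise ValueError("outcome price must be positive")
    return bet_units / outcome_price


def expected_return_pct(bet_units: float, outcome_price: float, p_win: float) -> float:
    """Expected percentage return, assuming each winning share pays out 1 unit."""
    expected_value = p_win * shares_bought(bet_units, outcome_price)
    return 100.0 * (expected_value - bet_units) / bet_units
```

For example, betting 10 units at a price of 0.5 with a 50% win probability yields a 0% expected return, while the same bet at a price of 0.4 with a 60% win probability is positive.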

Comment on lines +34 to +44
def is_predictable_restricted(
self,
market_question: str,
time_restriction_up_to: datetime,
) -> bool:
"""
Override if the agent can decide to not predict the question, before doing the hard work.

Data used for the evaluation must be restricted to the time_restriction_up_to.
"""
return True

The is_predictable_restricted method introduces time-based restrictions for predictions. It's crucial to ensure that all data used for prediction respects the time_restriction_up_to parameter to avoid leaking future information into the prediction process.

Consider adding a concrete implementation example or guidance in the method's documentation to help developers understand how to enforce the time restriction correctly.
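One possible shape for such guidance is sketched below. The subclass, the `fetch_articles` helper, and the article objects are hypothetical; the point is only that every data source must be filtered by the cutoff.

```python
from datetime import datetime


class NewsBasedAgent:
    """Hypothetical agent showing one way to honor the cutoff;
    fetch_articles and the article shape are illustrative only."""

    def fetch_articles(self, query: str) -> list:
        return []  # stub: a real agent would query a news source here

    def is_predictable_restricted(
        self, market_question: str, time_restriction_up_to: datetime
    ) -> bool:
        articles = self.fetch_articles(market_question)
        # Discard anything published after the cutoff, so no future
        # information leaks into the evaluation.
        usable = [a for a in articles if a.published_at <= time_restriction_up_to]
        return len(usable) > 0
```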

Comment on lines +46 to +55
def predict_restricted(
self,
market_question: str,
time_restriction_up_to: datetime,
) -> Prediction:
"""
Predict the outcome of the market question.

Data used for the prediction must be restricted to the time_restriction_up_to.
"""

The predict_restricted method correctly raises NotImplementedError, indicating that subclasses must provide their own implementation. This is consistent with the abstract nature of the AbstractBenchmarkedAgent class. However, it would be beneficial to include documentation or examples on how to implement this method, especially regarding the handling of time_restriction_up_to.

Would you like me to provide an example implementation or further guidance on handling the time_restriction_up_to parameter?

Comment on lines +79 to +93
    def predict(self, market_question: str) -> Prediction:
        p_yes, confidence = random.random(), random.random()
        return Prediction(
-           evaluation=evaluated,
            outcome_prediction=OutcomePrediction(
                p_yes=p_yes,
                confidence=confidence,
                info_utility=None,
            ),
        )

    def predict_restricted(
        self, market_question: str, time_restriction_up_to: datetime
    ) -> Prediction:
        return self.predict(market_question)


The RandomAgent class's predict_restricted method simply delegates to the predict method without considering the time_restriction_up_to parameter. This might not align with the intended use of predict_restricted, which is supposed to restrict data based on time. Ensure that the implementation of predict_restricted in RandomAgent (and other agents if applicable) correctly handles the time restriction.

Consider implementing a time-restricted version of the prediction logic in RandomAgent.predict_restricted to ensure it adheres to the intended behavior of making predictions with time-based restrictions.
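A random agent has no data to restrict, so a fully time-aware implementation is arguably out of scope, but it could at least derive its RNG seed from the cutoff so restricted runs are reproducible. The sketch below illustrates that idea; it is not the PR's implementation (which simply delegates to predict()).

```python
import random
from datetime import datetime


def restricted_random_p_yes(market_question: str, time_restriction_up_to: datetime) -> float:
    """Deterministic pseudo-random p_yes: seeding from the question and
    the cutoff makes restricted benchmark runs repeatable."""
    seed = f"{market_question}|{time_restriction_up_to.isoformat()}"
    return random.Random(seed).random()
```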

Comment on lines +102 to +115
    def predict(self, market_question: str) -> Prediction:
        p_yes, confidence = 1.0 if self.fixed_answer else 0.0, 1.0
        return Prediction(
-           evaluation=evaluated,
            outcome_prediction=OutcomePrediction(
                p_yes=p_yes,
                confidence=confidence,
                info_utility=None,
            ),
        )

    def predict_restricted(
        self, market_question: str, time_restriction_up_to: datetime
    ) -> Prediction:
        return self.predict(market_question)

Similar to the RandomAgent, the FixedAgent class's predict_restricted method does not account for the time_restriction_up_to parameter and directly calls the predict method. This could lead to incorrect behavior if the prediction logic is supposed to be time-sensitive.

It's recommended to adjust the FixedAgent.predict_restricted method to ensure it respects the time_restriction_up_to parameter, providing accurate predictions based on the specified time restrictions.

@@ -13,51 +13,84 @@ def __init__(self, agent_name: str, max_workers: t.Optional[int] = None):
         self.agent_name = agent_name
         self.max_workers = max_workers  # Limit the number of workers that can run this worker in parallel threads

-    def evaluate(self, market_question: str) -> EvaluatedQuestion:
-        raise NotImplementedError
+    def is_predictable(self, market_question: str) -> bool:
Contributor Author


Reduced the evaluate-research-predict pattern to just the is-predictable-then-predict pattern, where is_predictable now defaults to True. So agents are free to be implemented however they want.

return Prediction(is_predictable=is_predictable)
return self.predict(market_question=market_question)

def is_predictable_restricted(
Contributor Author


Plus, there are _restricted versions of them; they are called instead of the original ones when benchmarking on resolved markets.


@property
def is_resolved(self) -> bool:
return self.resolution is not None
Contributor Author


Removed the original is_resolved and changed it to this property to not have two sources of the same information.

else MarketResolution.YES
if self.p_yes > 0.5
else MarketResolution.NO
)
Contributor Author


I added probable_resolution, which can always be used, no matter whether the market is open or resolved.

The motivation is that even when a market is resolved, a final p_yes > 0.5 does not guarantee that the market actually resolved to YES.

-    def binary_answer(self) -> bool:
-        return self.p_yes > 0.5
+    def probable_resolution(self) -> MarketResolution:
+        return MarketResolution.YES if self.p_yes > 0.5 else MarketResolution.NO
Contributor Author


Added it to the outcome prediction as well, both for consistency and so we don't have to repeat the > 0.5 comparison everywhere.
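The pattern being discussed can be sketched as follows. The enum values and the class layout are illustrative assumptions; the real definitions are in benchmark/utils.py.

```python
from enum import Enum


class MarketResolution(str, Enum):  # illustrative values; see utils.py for the real enum
    YES = "yes"
    NO = "no"


class OutcomePrediction:
    def __init__(self, p_yes: float):
        self.p_yes = p_yes

    @property
    def probable_resolution(self) -> MarketResolution:
        # Centralizes the > 0.5 threshold, so callers compare enum
        # values rather than repeating raw-probability comparisons.
        return MarketResolution.YES if self.p_yes > 0.5 else MarketResolution.NO
```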

@@ -233,7 +298,3 @@ def get_llm_api_call_cost(
         model_cost += model_costs[model]["completion_tokens"] * completion_tokens
         model_cost /= 1000
         return model_cost
-
-
-def should_not_happen(message: str, E: t.Type[Exception] = RuntimeError) -> t.NoReturn:
Contributor Author


This was a duplicate; it's defined in the other utils module as well.


@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between aa5451b and a80b892.
Files selected for processing (1)
  • prediction_market_agent_tooling/benchmark/utils.py (8 hunks)
Files skipped from review as they are similar to previous changes (1)
  • prediction_market_agent_tooling/benchmark/utils.py


@coderabbitai coderabbitai bot left a comment


Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between a80b892 and 5f7bc81.
Files selected for processing (1)
  • prediction_market_agent_tooling/benchmark/utils.py (8 hunks)
Files skipped from review as they are similar to previous changes (1)
  • prediction_market_agent_tooling/benchmark/utils.py

@kongzii kongzii merged commit d226831 into main Feb 20, 2024
6 checks passed
@kongzii kongzii deleted the peter/benchmark-closed-markets branch February 20, 2024 14:54