Optionally filter out invalid markets #482
Conversation
Walkthrough: The changes introduce a new boolean attribute, `allow_invalid_questions`.
```python
@pytest.mark.skipif(not RUN_PAID_TESTS, reason="This test costs money to run.")
@pytest.mark.parametrize(
    "question, invalid",
```
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (7)
tests/tools/test_is_invalid.py (2)
7-81: LGTM: Well-structured test function with appropriate decorators.

The test function is well-designed, with `skipif` used for cost management and `parametrize` for multiple test cases. This structure allows for comprehensive testing of the `is_invalid` function across various scenarios.

Consider improving the error message in the assertion:

```diff
- assert (
-     is_invalid(question=question) == invalid
- ), f"Question is not evaluated correctly."
+ assert (
+     is_invalid(question=question) == invalid
+ ), f"Question '{question}' is not evaluated correctly. Expected: {invalid}, Got: {is_invalid(question=question)}"
```

This change will provide more context in case of test failures, making debugging easier.
🧰 Tools
🪛 Ruff
81-81: f-string without any placeholders. Remove extraneous `f` prefix. (F541)
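For illustration, a minimal runnable sketch combining the Ruff F541 fix with the richer failure message suggested above. Note that `is_invalid` is stubbed with a toy heuristic here, since the real function in the PR is LLM-backed:

```python
# Sketch only: `is_invalid` is a stand-in heuristic, NOT the repository's
# LLM-backed implementation.
def is_invalid(question: str) -> bool:
    return "beat me" in question

def check(question: str, invalid: bool) -> None:
    # The f-string now actually interpolates values (fixing F541) and
    # reports both expected and actual results on failure.
    result = is_invalid(question=question)
    assert result == invalid, (
        f"Question {question!r} is not evaluated correctly. "
        f"Expected: {invalid}, Got: {result}"
    )

check("Will someone beat me?", True)
check("Russian nuke in space?", False)
```

A failing case now produces a message naming the question and both values, which makes parametrized failures much easier to triage.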
9-77: LGTM: Comprehensive set of test cases covering various scenarios.

The test cases provide good coverage of different question types and edge cases. The comments on some test cases help explain the reasoning behind the expected outcomes.
Consider the following suggestions to further improve the test cases:
- Group similar test cases together (e.g., time-related questions, violence-related questions) for better organization.
- Add more edge cases, such as questions with ambiguous wording or borderline validity.
- Include test cases for non-English questions to ensure language independence.
Example of grouping similar test cases:
```python
@pytest.mark.parametrize(
    "question, invalid",
    [
        # Time-related questions
        ("Russian nuke in space?", False),
        ("Russian nuke in space by March 31?", False),
        ("Russian nuke in space in 2024?", False),
        ("Russian nuke in space in 24 days?", True),
        # Violence-related questions
        ("Will Hezbollah leader Hassan Nasrallah be confirmed dead by 6 October 2024?", True),
        ("Will Netanyahu die before the end of October 7th 2024", True),
        ("Will Travis James Mullis be executed in Texas on 1 October 2024?", False),
        ("Will the second nitrogen gas execution in Alabama be carried out by 5 October 2024?", False),
        ("Will someone beat me?", True),
        # Other categories...
    ],
)
```

This organization will make it easier to maintain and expand the test suite in the future.
prediction_market_agent_tooling/tools/utils.py (2)
25-26: Improved documentation for `LLM_SUPER_LOW_TEMPERATURE`.
The updated comment provides valuable context by referencing a specific experiment. This helps developers understand when to use this constant versus a zero value.
Consider adding a brief summary of the experiment's findings directly in the comment to save developers time. For example:
```python
# Experiment (https://github.com/gnosis/prediction-market-agent/pull/438) showed that using LLM_SUPER_LOW_TEMPERATURE produces more consistent results than 0 for gpt-4-turbo.
```
27-28: Addition of `LLM_SEED` for consistent results.

The introduction of `LLM_SEED` is a good practice for ensuring reproducibility in LLM-based operations. The comment clearly explains its purpose.

Consider using a more unique seed value instead of 0. While 0 is a valid seed, a more distinctive number (e.g., a project-specific constant or a hash of a string) could help avoid potential conflicts with other systems that might use 0 as a default seed. For example:

```python
LLM_SEED = 48273  # Project-specific seed for consistent results
```

It might also be beneficial to add a comment about whether this seed should be configurable in production environments.
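To illustrate the reproducibility idea, here is a hypothetical sketch of centralizing these sampling settings so every call site shares them. The constant values and the helper name are illustrative, not the repository's actual code:

```python
# Hypothetical sketch: one place defines the sampling settings that make
# repeated LLM calls as deterministic as the API allows.
LLM_SUPER_LOW_TEMPERATURE = 0.00000001  # illustrative near-zero value
LLM_SEED = 48273  # the reviewer's suggested project-specific seed

def completion_params(engine: str = "gpt-4o") -> dict[str, object]:
    """Return the keyword arguments shared by reproducible LLM calls."""
    return {
        "model": engine,
        "temperature": LLM_SUPER_LOW_TEMPERATURE,
        "seed": LLM_SEED,
    }
```

Call sites would then unpack `completion_params()` into the client call, so changing the seed or temperature policy touches exactly one place.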
prediction_market_agent_tooling/deploy/agent.py (2)
299-299: LGTM: New attribute added with appropriate default value.

The new `allow_invalid_questions` attribute is added with a default value of `False`, which is a safe choice. This attribute will control whether invalid questions are allowed in the market verification process.

Consider adding a docstring to explain the purpose and usage of this attribute, especially how it interacts with the `verify_market` method.
408-410: LGTM: New check for invalid questions added.

The new check for invalid questions is correctly implemented, using the `allow_invalid_questions` attribute and the `is_invalid` function. This addition aligns with the PR objectives to enhance the decision-making process for agents operating within the prediction market framework.

Consider simplifying the condition for better readability:

```diff
- if not self.allow_invalid_questions and is_invalid(market.question):
-     return False
-
+ return self.allow_invalid_questions or not is_invalid(market.question)
```

This change combines the new check with the existing return statement, making the method more concise and easier to understand at a glance.
🧰 Tools
🪛 Ruff
408-411: Return the negated condition directly. Inline condition. (SIM103)
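To make the gating logic concrete, here is a minimal runnable sketch of the verification flow discussed above. Only the `allow_invalid_questions` logic is kept; the class name is a stand-in and `is_invalid` is a toy stub, since the real check is LLM-backed:

```python
# Minimal sketch: the agent skips markets whose question is invalid,
# unless allow_invalid_questions is enabled. Names are stand-ins.
def is_invalid(question: str) -> bool:
    return "beat me" in question  # toy heuristic, not the real check

class DeployableAgentSketch:
    allow_invalid_questions: bool = False

    def verify_market(self, question: str) -> bool:
        # Ruff's SIM103 form: return the combined condition directly.
        return self.allow_invalid_questions or not is_invalid(question)

agent = DeployableAgentSketch()
assert agent.verify_market("Russian nuke in space?") is True
assert agent.verify_market("Will someone beat me?") is False

agent.allow_invalid_questions = True
assert agent.verify_market("Will someone beat me?") is True
```

The toggle keeps the default behavior safe (filter invalid markets) while letting a deliberately permissive agent opt in.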
prediction_market_agent_tooling/tools/is_invalid.py (1)
67-69: Improve the function docstring for clarity.

The current docstring does not accurately describe the function's purpose. Update it to reflect that the function determines if a question is invalid according to predefined criteria.
Suggested docstring:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
`pyproject.toml` is excluded by `!**/*.toml`
📒 Files selected for processing (4)
- prediction_market_agent_tooling/deploy/agent.py (3 hunks)
- prediction_market_agent_tooling/tools/is_invalid.py (1 hunks)
- prediction_market_agent_tooling/tools/utils.py (1 hunks)
- tests/tools/test_is_invalid.py (1 hunks)
🧰 Additional context used
🪛 Ruff
prediction_market_agent_tooling/deploy/agent.py
408-411: Return the negated condition directly. Inline condition. (SIM103)
tests/tools/test_is_invalid.py
81-81: f-string without any placeholders. Remove extraneous `f` prefix. (F541)
🔇 Additional comments (8)
tests/tools/test_is_invalid.py (1)
1-4: LGTM: Imports are appropriate and well-structured.

The imports are correctly placed and include all necessary components for the test file.
prediction_market_agent_tooling/tools/utils.py (1)
Line range hint 1-283: Overall assessment of changes in utils.py.

The changes in this file, while minor, contribute positively to the project:

- The updated comment for `LLM_SUPER_LOW_TEMPERATURE` provides better context for its usage.
- The addition of `LLM_SEED` improves reproducibility for LLM operations.

These changes align well with the PR objectives of clarifying functionality and improving the decision-making process for agents in the prediction market framework. The improved documentation and focus on consistent results will benefit developers working with this tooling.
prediction_market_agent_tooling/deploy/agent.py (2)
69-69: LGTM: New import added correctly.

The new import for the `is_invalid` function is correctly placed with other imports from the same package. This function will be used to check if a question is invalid in the market verification process.
69-69: Summary: Effective implementation of invalid market filtering.

The changes introduced in this PR successfully implement the ability to filter out invalid markets, as per the PR objectives. Here's a summary of the key changes:

- The `is_invalid` function is imported to check for invalid questions.
- A new `allow_invalid_questions` attribute is added to control the filtering behavior.
- The `verify_market` method is updated to use these new components.

These changes enhance the decision-making process for agents by allowing them to avoid participating in markets with invalid questions. The implementation is flexible, as it can be toggled using the `allow_invalid_questions` attribute.

Overall, these changes improve the robustness of the prediction market agent tooling without introducing any apparent issues or bugs.
Also applies to: 299-299, 408-410
prediction_market_agent_tooling/tools/is_invalid.py (4)
81-81: Verify the attribute for OpenAI API key.

Ensure that `openai_api_key_secretstr_v1` is the correct attribute to access the OpenAI API key from `APIKeys`. If the attribute is incorrect, it may cause authentication issues.
61-61: ⚠️ Potential issue: Verify the model name 'gpt-4o'.

The default value for the `engine` parameter is set to `'gpt-4o'`. Please verify that `'gpt-4o'` is the correct model name. If it's a typo, consider correcting it to `'gpt-4'` or the intended model name.
71-72: ⚠️ Potential issue: Correct the import statement for `ChatOpenAI`.

The module `langchain_openai` does not exist. To import `ChatOpenAI`, use the correct module path from the LangChain library.

Apply this diff to fix the imports:
Likely invalid or redundant comment.
84-90: ⚠️ Potential issue: Adjust the prompt formatting and LLM invocation.
The current approach to formatting the prompt and invoking the LLM may not align with the LangChain API. Update the code to correctly format the prompt and call the LLM to obtain the completion.
Apply this diff to adjust the code:
Note: Ensure that `callbacks` is the correct parameter based on the return value of `get_langfuse_langchain_config()`.

Likely invalid or redundant comment.
```
- The market's question is about immoral violence, dead or assatination.
- The violant event can be caused by a single conscious being.
- The violant event is done ilegally.
```
Correct spelling errors in the prompt text
There are several spelling mistakes in the prompt that may affect the clarity of instructions provided to the language model:
- Line 23: "dead or assatination" should be "death or assassination".
- Lines 24-25: "The violant event" should be "The violent event".
- Line 25: "ilegally" should be "illegally".
Apply the following diff to fix the spelling errors:
Committable suggestion was skipped due to low confidence.
```python
llm = ChatOpenAI(
    model=engine,
    temperature=temperature,
    seed=seed,
    api_key=APIKeys().openai_api_key_secretstr_v1,
)
```
Update `ChatOpenAI` initialization with correct parameters.

The initialization of `ChatOpenAI` includes incorrect parameter names and unsupported parameters:

- Use `model_name` instead of `model`.
- The `seed` parameter is not supported by `ChatOpenAI`.
- Use `openai_api_key` instead of `api_key`.
Apply this diff to correct the initialization:
Committable suggestion was skipped due to low confidence.
```python
logger.error("langchain not installed, skipping is_invalid")
return True
```
Reconsider handling of ImportError.

When the `langchain` library is not installed, the function logs an error and returns `True`, indicating the question is invalid. This behavior might not be appropriate, as it could lead to incorrect results. Consider raising an exception or returning a value that indicates the check could not be performed.
Suggestion:
Committable suggestion was skipped due to low confidence.
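One possible shape for this suggestion is a small helper that fails loudly when the optional dependency is missing, instead of silently treating every question as invalid. This is a hypothetical sketch, not the repository's actual code, and the helper name is invented:

```python
import importlib

def require_optional(module_name: str):
    """Import an optional dependency or raise a clear, actionable error.

    Hypothetical helper: is_invalid would call this instead of catching
    ImportError and returning True.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name} is required for is_invalid; install it "
            "rather than letting the check silently return True"
        ) from exc
```

Raising makes the misconfiguration visible at deploy time, whereas the current behavior would quietly filter out every market.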
```
- Valid: Will Trump win the 2020 US presidential election? (The main source of uncertainty is the vote of US citizens, not a potential murder of a presidential candidate).
- Which could give an incentive only to specific participants to commit an immoral violent action, but are in practice unlikely.
- Valid: Will the US be engaged in a military conflict with a UN member state in 2021? (It’s unlikely for the US to declare war in order to win a bet on this market).
- Valid: Will Derek Chauvin go to jail for the murder of George Flyod? (It’s unlikely that the jurors would collude to make a wrong verdict in order to win this market).
```
Fix typo in "George Floyd"
In line 36, correct the spelling of "George Flyod" to "George Floyd".
Apply the following diff:
Committable suggestion was skipped due to low confidence.
niiice! just a couple typos, but otherwise lgtm
```python
def is_invalid(
    question: str,
    engine: str = "gpt-4o",
    temperature: float = LLM_SUPER_LOW_TEMPERATURE,
```
🥶
```
Follow a chain of thought to evaluate if the question is invalid:

First, write the parts of the following question:
```
What do you mean by this bit of the prompt?
```
@@ -22,8 +22,10 @@
T = TypeVar("T")

# t=0 is mathematically impossible and it's not clear how OpenAI (and others) handle it, as a result, even with t=0, gpt-4-turbo produces very different outputs,
# it seems that using a very low temperature is the best way to have as consistent outputs as possible: https://community.openai.com/t/why-the-api-output-is-inconsistent-even-after-the-temperature-is-set-to-0/329541/12
# see this experiment to figure out if you should use LLM_SUPER_LOW_TEMPERATURE or just 0: https://github.com/gnosis/prediction-market-agent/pull/438.
```
this would defo be worth making a twitter thread out of!
Co-authored-by: Evan Griffiths <[email protected]>
We have two functions that seem similar (I was thinking about merging them first), but they serve a bit different purpose:

- `is_invalid` - this evaluates if the question should be resolved as invalid in the end; theoretically, we can have an agent that doesn't care about it, e.g. Create "Invalid agent" prediction-market-agent#505
- `is_predictable_binary` - this evaluates if the question is predictable by the agent

To show the difference:

- `Will X be alive at DD.MM.YYYY?` --> this is predictable, but it's invalid because it's violent
- `What will be the price of GNO at DD.MM.YYYY?` --> this isn't predictable because our agents support binary markets only, but it's a valid question
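The split described above can be sketched as two independent gates. Both checks are stubbed with toy heuristics here (in the repository they are LLM-backed), and `should_trade` is an invented name for illustration:

```python
# Sketch: an agent trades a market only if the question is predictable as a
# binary market AND (invalid questions are allowed or the question is valid).
def is_invalid(question: str) -> bool:  # stub for the LLM-backed check
    return "alive" in question  # toy proxy for violent/death-related questions

def is_predictable_binary(question: str) -> bool:  # stub
    return not question.startswith("What")  # toy proxy: open-ended -> not binary

def should_trade(question: str, allow_invalid_questions: bool = False) -> bool:
    if not is_predictable_binary(question):
        return False
    return allow_invalid_questions or not is_invalid(question)

# Predictable but invalid: filtered out unless explicitly allowed.
assert should_trade("Will X be alive at DD.MM.YYYY?") is False
assert should_trade("Will X be alive at DD.MM.YYYY?", allow_invalid_questions=True) is True
# Valid but not a binary question: filtered out regardless.
assert should_trade("What will be the price of GNO at DD.MM.YYYY?") is False
assert should_trade("Will it rain tomorrow?") is True
```

Keeping the two checks separate lets an "Invalid agent" flip one gate without touching the other.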