Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced testing, added workflow caching, added lambda, fixed deployment-failing state_update.py bug #11

Conversation

olsonadr
Copy link
Collaborator

@olsonadr olsonadr commented May 25, 2022

Merging feature branch "3-split-similarity-lambda-into-sub-lambdas" into main repo!

Major changes here are:

  • Added virtual environment and npm caching in Actions workflows that drastically speeds up repeated work (cached per-branch, keyed by versions and file-hashes)
  • Fixed issue with state_update.py failing on long input text by truncating to 2000 words (max is 4096 but GitHub Actions runs out of memory, and we thought optimizing that to raise this limit could be another PR).
  • Expanded testing to ensure that AWS architecture matches what is expected (must modify expected architecture template in tests/unit/testing_materials/expected_template.json before arbitrarily changing this (hopefully enforcing test-based-dev)
  • Renamed lambda similar to text-to-db-similar
  • Created new lambda (using test-based-dev) embed-to-db-similar that does the same as above but with embedding input (potentially to be updated in soon to be PR based on your issue API Does Not Return a Web Format #9 )
  • Refactored and cleaned some code for readability, error handling, and documentation

Next steps are:

  • Add 3 lambdas to: (wanted to start with what we have and get things to you fast, and I'll refocus on these next two tonight or tomorrow morning)
    • generate embedding from input text
    • to get cosine sim between to embeddings directly
    • get cosine sim between to input texts directly (generating embeddings under the hood)
  • Potentially switch up embed formatting in api responses
  • Work to raise the truncation word limit (2000->4096)

As a warning, the first test runs may take some time as there are more lambdas and more tests, and first deployment takes a very long time (~30 mins) because it has to delete the infrastructure associated with the similar lambda and replace it with 2 new lambdas

broceni and others added 30 commits April 29, 2022 16:12
…ting in similar.py. Updated state.csv, and renamed old one for testing purposes.
…y, cleaned up testrun.py and formatted apitest.py
olsonadr and others added 11 commits May 24, 2022 14:57
- GET request query strings were wrapped in multiple quotes, so we now literal_eval up to 10 times
- GET requests were being passed with {  \"  } in the url instead of {  "  } around the text/embed parameter

Also added...
- Added descriptive exceptions for bad statusCode's on api testing
…ine model max length, and thus truncate abnormally long tensors of tokens, solving an indexing error encountered while updating incident 195. Removed manual truncation in similar.py. Refactored the names of a couple constants in similar.py.
@olsonadr olsonadr changed the title Enhanced testing, added lambda, fixed deployment-failing state_update.py bug Enhanced testing, added workflow caching, added lambda, fixed deployment-failing state_update.py bug May 25, 2022
@olsonadr
Copy link
Collaborator Author

Im not sure why testing failed immediately on AWS credentials here. Do you expect this?

@smcgregor
Copy link
Contributor

Pull requests likely are not getting the environment variables since that is a security risk. Should I squash/merge this to main and we sort it out there?

@@ -0,0 +1,154 @@
# Imports
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is used mostly for dev and scratch work, and generated the embeddings used in the example invokes with embeddings in the testing_materials folder

@@ -0,0 +1,192 @@
# Imports
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new file looks very similar to text-to-db-similar.py so I'll throw a comment on each of the changes for ease of review

dataframe.loc[i, "incident_id"]
) for i in range(len(state))]

# Process input text for text-to-db-similar computation
Copy link
Collaborator Author

@olsonadr olsonadr May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has a typo. It should be embed-to-db-similar not text-to-db-similar.

Regardless, this function is the same as process_input_text from text-to-db-similar, but with an embedding (list of floats) rather than a string as input. It differs in that it skips the step of processing the input text into an embedding


# Get input from body or query string
try:
if ('embed' in event):
Copy link
Collaborator Author

@olsonadr olsonadr May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in the following lines it looks for 'embed' instead of 'text' in the input request's body

result['body']['warnings'].append(
f'Zero results requested with the "num" value of 0. Use value <0 for maximum possible.')

# Handle unicode in event_text and parse it to a list (removing up to 10 levels of nested quotes, just in case)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it decodes the embed_text string from the body of the request into a literal list by using literal_eval, but it must do this a couple times because AWS passing around can sometimes add a couple layers of quotes around the string


# Found event_text, use it and return result
try:
print(type(embed))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a leftover debug print that can be removed

# Found event_text, use it and return result
try:
print(type(embed))
res = process_input_list(embed, best_of)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it calls process_input_list instead of process_input_text

@@ -1,15 +1,48 @@
# General helper imports
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the testing I mention in the PR that uses expected_template.json

assert type(resource['Properties']) == dict

# Check that the app stack template contains this resource
template.has_resource_properties(type=resource['Type'], props=resource['Properties'])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses an approximate comparison to ensure that anytime we have specified some shape or value in a JSON object in the provided expected_template.json file, it must match what is generated by our app.py stack (specifically looking for the Type and Properties attributes)

@olsonadr
Copy link
Collaborator Author

olsonadr commented May 25, 2022

Pull requests likely are not getting the environment variables since that is a security risk. Should I squash/merge this to main and we sort it out there?

Sorry, I was adding clarifying comments to make your life easier if you were going to step through, but you can feel free @smcgregor !

@smcgregor
Copy link
Contributor

When does the Lambda URL change? Every build? Only when asked to?

@olsonadr
Copy link
Collaborator Author

olsonadr commented May 25, 2022

My understanding is that the API URL (which each lambda has a route on) shouldn't change unless the entire stack is destroyed and rebuilt (or potentially when requested), but that if you rename a lambda, it renames it's route's endpoint on the API. Sorry, I didn't have your UI work in mind when making that change and should have warned you more directly about that.

Are you seeing the whole URL change beyond the route to the lambda itself?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants