Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/collection api #181

Draft
wants to merge 44 commits into
base: develop
Choose a base branch
from
Draft

Feature/collection api #181

wants to merge 44 commits into from

Conversation

lukavdplas
Copy link
Contributor

This adds an API for collections which are saved in the triplestore. close #159, close #160

It adds list, retrieve, create, update, and delete endpoints for collections. (Which replace the existing API for collections.)

This is implemented as a "regular" JSON API, but collections are identified by an URI instead of an integer.

This update only implements the API on the backend, so it will break the frontend (which expects a slightly different API for collections). I expect adjusting the frontend will go smoothly, but I did not want to bloat this PR.

In the meantime, you can test the browsable API at /api/collections/.

This update also does not include a data migration or adjust the existing data migration script; I'll add that in a later PR.

Other changes:

  • Projects now explicitly store their URI in the database. This makes lookups easier.
  • The core URL configuration is moved from the vre app to the project module edpop.
  • Adds defined namespaces (implement defined namespaces #180)

@lukavdplas lukavdplas marked this pull request as draft July 16, 2024 14:13
Comment on lines 7 to 18
def _name_to_slug(name: str) -> str:
no_spaces = lambda s: re.sub(r'\s+', '_', s)
clean = lambda s: re.sub(r'[^a-z0-9\-_]', '', s)
strip_start = lambda s: re.sub(r'^\W+', '', s)
strip_end = lambda s: re.sub(r'\W+$', '', s)

return strip_end(strip_start(clean(no_spaces(str.lower(name)))))


def collection_uri(name: str):
id = _name_to_slug(name)
return URIRef(settings.RDF_NAMESPACE_ROOT + 'collections/' + id)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructs the URI for a collection based on the "human" name, i.e. "My collection" will be stored as my_collection. This simplifies the frontend UI, which is nice for an initial version, though it may lead to some annoyances long-term.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What annoyances?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URIs need to be unique, so the request will fail if the URI already exists. This isn't really a problem right now, but if a group of people is enthusiastically adding data on the same topics, conflicts may become more common.

I think there are some minor annoyances that can pop up if you get these warnings:

  • You can get a response that the URI is taken while the existing collection is privately owned by another user. This makes technical sense, but it feels odd.
  • Names don't need to be unique but URIs do, so this can feel un-intuitive. For instance, if "My collection" / my_collection exists, you can get the name you want by creating "My collection 2" / my_collection_2, and then renaming it to "My collection" (which will preserve the URI my_collection_2).
  • "my collection" and "My Collection :)" would both be encoded as my_collection, so the user may feel like they don't have full control over the URI. Which is annoying if their collection gets rejected on the basis of the URI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe number slugs automatically, then?

@lukavdplas
Copy link
Contributor Author

Current status: this PR is ready for review; I've made it a draft because it should only be merged in combination with a frontend update.

Request for review: I recommend exploring the browsable API and/or looking at api_test.py to evaluate the actual endpoints. I'm particularly interested if you foresee any difficulties in the translation to frontend functions, or edge cases that can lead to issues.

The technical implementation is still a bit experimental, but quite well-tested, so it should work well enough for a first version. Suggestions/comments are welcome but addressing them may be better left until later.

@jgonggrijp
Copy link
Member

@lukavdplas I merged develop into a local copy of your branch so I could use Docker to try the browsable API. There were a few minor merge conflicts that were easy to fix and I had to update backend/requirements.{in,txt} to replace the feature/json-ld branch by develop for the restframework_rdf package, because the former branch was already merged.

Unfortunately, I cannot get past the migrations of the projects app. It gets stuck on number 2, log below. Does this look familiar to you and if so, do you know how to address it? The Project class does have a graph method.

2024-07-17 17:14:53 System check identified no issues (0 silenced).
2024-07-17 17:14:54 Operations to perform:
2024-07-17 17:14:54   Apply all migrations: admin, auth, authtoken, contenttypes, projects, sessions, vre
2024-07-17 17:14:54 Running migrations:
2024-07-17 17:14:54 Traceback (most recent call last):
2024-07-17 17:14:54   File "/usr/src/app/backend/manage.py", line 22, in <module>
2024-07-17 17:14:54     execute_from_command_line(sys.argv)
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
2024-07-17 17:14:54     utility.execute()
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
2024-07-17 17:14:54     self.fetch_command(subcommand).run_from_argv(self.argv)
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
2024-07-17 17:14:54     self.execute(*args, **cmd_options)
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
2024-07-17 17:14:54     output = self.handle(*args, **options)
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 106, in wrapper
2024-07-17 17:14:54     res = handle_func(*args, **kwargs)
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/core/management/commands/migrate.py", line 356, in handle
2024-07-17 17:14:54     post_migrate_state = executor.migrate(
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/executor.py", line 135, in migrate
2024-07-17 17:14:54     state = self._migrate_all_forwards(
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/executor.py", line 167, in _migrate_all_forwards
2024-07-17 17:14:54     state = self.apply_migration(
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/executor.py", line 252, in apply_migration
2024-07-17 17:14:54     state = migration.apply(state, schema_editor)
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/migration.py", line 132, in apply
2024-07-17 17:14:54     operation.database_forwards(
2024-07-17 17:14:54   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/operations/special.py", line 193, in database_forwards
2024-07-17 17:14:54     self.code(from_state.apps, schema_editor)
2024-07-17 17:14:54   File "/usr/src/app/backend/projects/migrations/0002_researchgroups_to_projects.py", line 22, in research_groups_to_projects
2024-07-17 17:14:54     store_project_graph(Project, project, True)
2024-07-17 17:14:54   File "/usr/src/app/backend/projects/signals.py", line 21, in store_project_graph
2024-07-17 17:14:54     g = instance.graph()
2024-07-17 17:14:54 AttributeError: 'Project' object has no attribute 'graph'
2024-07-17 17:14:54   Applying projects.0002_researchgroups_to_projects...

@lukavdplas
Copy link
Contributor Author

It does. This happens because migrations use a reconstruction of the model based on the migration history, rather than the actual class. So this migration tried to access the custom method graph(), but that isn't available during the migration.

This is caused by a mistake in the source code, I'll update it and let you know.

@lukavdplas
Copy link
Contributor Author

Okay, I've updated it and it looks like the migrations run fine now. I'm not very happy with the solution (importing the actual model class during the migration) but it works.

@jgonggrijp
Copy link
Member

Getting one for Collection as well:

2024-07-18 14:48:00 System check identified no issues (0 silenced).
2024-07-18 14:48:00 Operations to perform:
2024-07-18 14:48:00   Apply all migrations: admin, auth, authtoken, contenttypes, projects, sessions, vre
2024-07-18 14:48:00 Running migrations:
2024-07-18 14:48:01   Applying projects.0002_researchgroups_to_projects... OK
2024-07-18 14:48:01   Applying projects.0003_project_uri... OK
2024-07-18 14:48:01   Applying projects.0004_fill_project_uri... OK
2024-07-18 14:48:01   Applying projects.0005_alter_project_uri... OK
2024-07-18 14:48:01 Traceback (most recent call last):
2024-07-18 14:48:01   File "/usr/src/app/backend/manage.py", line 22, in <module>
2024-07-18 14:48:01     execute_from_command_line(sys.argv)
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
2024-07-18 14:48:01     utility.execute()
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
2024-07-18 14:48:01     self.fetch_command(subcommand).run_from_argv(self.argv)
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
2024-07-18 14:48:01     self.execute(*args, **cmd_options)
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
2024-07-18 14:48:01     output = self.handle(*args, **options)
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 106, in wrapper
2024-07-18 14:48:01     res = handle_func(*args, **kwargs)
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/core/management/commands/migrate.py", line 356, in handle
2024-07-18 14:48:01     post_migrate_state = executor.migrate(
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/executor.py", line 135, in migrate
2024-07-18 14:48:01     state = self._migrate_all_forwards(
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/executor.py", line 167, in _migrate_all_forwards
2024-07-18 14:48:01     state = self.apply_migration(
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/executor.py", line 252, in apply_migration
2024-07-18 14:48:01     state = migration.apply(state, schema_editor)
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/migration.py", line 132, in apply
2024-07-18 14:48:01     operation.database_forwards(
2024-07-18 14:48:01   File "/usr/local/lib/python3.9/site-packages/django/db/migrations/operations/special.py", line 193, in database_forwards
2024-07-18 14:48:01     self.code(from_state.apps, schema_editor)
2024-07-18 14:48:01   File "/usr/src/app/backend/vre/migrations/0006_annotation_context_collection_context.py", line 46, in save_collection_managing_group_as_context
2024-07-18 14:48:01     matching_name = obj.managing_group.filter(name=obj.name)
2024-07-18 14:48:01 AttributeError: 'Collection' object has no attribute 'name'
2024-07-18 14:48:01   Applying vre.0006_annotation_context_collection_context...

@lukavdplas
Copy link
Contributor Author

Ah, that should be fixed now! Thanks for trying all this out btw 😅

Copy link
Member

@jgonggrijp jgonggrijp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit confused by this comment:

Current status: this PR is ready for review; I've made it a draft because it should only be merged in combination with a frontend update.

The pull request compares to feature/rdf-modeling-utils, not develop. Surely, you are planning to build the feature/rdf-modeling-utils branch incrementally, and the current branch does not need to be complete with frontend adaptations before you merge it?

Side note: even on develop, I wouldn't necessarily be against you merging this without corresponding frontend changes. We just did something like that in #174. As far as I'm concerned, only on master/main does everything need to be completely consistent and bug-free. The purpose of release branches is to ensure this.

Anyway, back to the actual changes. I like the cleanness of the code, but I would prefer more docstrings and comments to explain why things are the way they are. I also noticed a bug.

I created a new collection with the slug extra through the browsable API. The URI of this collection is http://localhost:8000/rdf/collections/extra. Create, retrieve, update and delete all work. However:

  • In the triplestore, the slug of the associated project is duplicated. Its URI is http://localhost:8000/rdf/project/Dev/Dev everywhere (so at least the URI is consistent). Note that project is singular as well. By contrast, my project that was migrated from a group has a plural projects and contains the slug only once: http://localhost:8000/rdf/projects/dev_g.
  • I updated my newly created collection to associate it with the dev_g project instead. However, the corresponding object node in the triplestore now held the URI http://localhost:8000/rdf/project/dev_g/dev_g.

Some more details in the comments. Keep up the sweet refactoring!


# try to create the same collection again
fail_response = post_collection(client, project.name)
assert is_client_error(fail_response.status_code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once database synchronization is added, I would also want to test here that there is no duplicate entry in the database.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression was that if you add triples that already exist in blazegraph (in the same graph), it will have no effect. That is, the same triple won't be stored twice. Since the request to make a collection is idempotent, and this test makes the same request twice, I don't think you could check if the second time was executed?

What does make sense to me is to create a different collection in the second request (e.g. with a different description), to check that the create request doesn't store the new data. (This scenario is also why the api should reject the request.)

Correct me if I'm wrong about blazegraph here, though!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right about Blazegraph, but I meant the representation of the collection in the PostgreSQL database. Hence "once database synchronization is added".

backend/collect/api_test.py Outdated Show resolved Hide resolved
backend/collect/rdf_models.py Show resolved Hide resolved
backend/collect/utils.py Outdated Show resolved Hide resolved
backend/collect/utils.py Show resolved Hide resolved
Comment on lines +6 to +7
('Test!!', 'test'),
('Test test test', 'test_test_test'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test cases are quite conservative. Add some Bobby Tables and some chimpanzee speak as well.

backend/projects/migrations/0004_fill_project_uri.py Outdated Show resolved Hide resolved
from triplestore.utils import Quads


class CollectionsField(RDFQuadField):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a docstring to explain the purpose of this class.

Comment on lines +12 to +30
def get(self, instance: RDFModel):
return [
s
for (s, p, o, g) in self._stored_quads(instance)
]

def _stored_quads(self, instance: RDFModel) -> Quads:
store = settings.RDFLIB_STORE
results = store.query(f'''
SELECT ?col WHERE {{
?col a edpopcol:Collection ;
as:context <{instance.uri}> .
}}
''', initNs={'as': AS, 'edpopcol': EDPOPCOL})

return [
(result, AS.context, instance.uri, Graph(store, result))
for (result, ) in results
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but it looks to me as if the query in _stored_quads gets the information that get is looking for. However, _stored_quads decorates this information with additional data, which get then peels off again. Wouldn't it be more straightforward and efficient to put the query in get and then call get from _stored_quads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the confusion. The reason is that _stored_quads is also used to update the triplestore. When you call save() on the model, the field will compare _stored_quads() with _quads_to_store() to update the graph. _stored_quads() is also used when you call delete().

So _stored_quads() returns the representation that is saved in the triplestore, which is needed to make updates. get() translates the graph representation to one that is more convenient in the model.

This is implemented on the parent class (RDFQuadField), but it looks convoluted in isolation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but you are in full control here because you are overriding both methods. What stops you from writing this?

Suggested change
def get(self, instance: RDFModel):
return [
s
for (s, p, o, g) in self._stored_quads(instance)
]
def _stored_quads(self, instance: RDFModel) -> Quads:
store = settings.RDFLIB_STORE
results = store.query(f'''
SELECT ?col WHERE {{
?col a edpopcol:Collection ;
as:context <{instance.uri}> .
}}
''', initNs={'as': AS, 'edpopcol': EDPOPCOL})
return [
(result, AS.context, instance.uri, Graph(store, result))
for (result, ) in results
]
def get(self, instance: RDFModel):
store = settings.RDFLIB_STORE
results = store.query(f'''
SELECT ?col WHERE {{
?col a edpopcol:Collection ;
as:context <{instance.uri}> .
}}
''', initNs={'as': AS, 'edpopcol': EDPOPCOL})
return [
s
for (s, ) in results
]
def _stored_quads(self, instance: RDFModel) -> Quads:
store = settings.RDFLIB_STORE
return [
(result, AS.context, instance.uri, Graph(store, result))
for result in self.get(instance)
]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. Nothing, but to explain why it wasn't written like that: in general, _stored_triples/_stored_quads may contain information that is ignored by the model (and thus not retrievable from the output of get()).

For instance, you could write this field to be agnostic about the graph context. In that case, _stored_quads should include the graph the as:context triple was stored in, because that information is necessary for save/delete operations. But get() can filter it out.

So in general, it makes sense to write _stored_triples/_stored_quads first, and then convert that to an internal value in get(). In this specific case, though, you're right that the dependency can flow either way.

backend/projects/signals.py Show resolved Hide resolved
Base automatically changed from feature/rdf-modelling-utils to develop August 5, 2024 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

2 participants