Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rest_api: filter, exclude, transform API responses #495

Merged

Conversation

francescomucio
Copy link
Contributor

@francescomucio francescomucio commented Jun 12, 2024

Tell us what you do here

As per the associated issue, rest_api: Allow the REST API config object to exclude rows, columns, and transform data:

  • when specifying a resource it will be possible to pass:
    • a row_filter, in the form of a function or lambda
    • a list of columns (exclude_columns), as jsonpaths, that need to be removed from the output
    • a transform, in the form of a function or lambda

I have added the additional properties to typing.py and added the code to handle them.

Related Issues

This solves this issue

Additional Context

sources/rest_api/__init__.py Outdated Show resolved Hide resolved
sources/rest_api/__init__.py Outdated Show resolved Hide resolved
sources/rest_api/__init__.py Outdated Show resolved Hide resolved
sources/rest_api/__init__.py Outdated Show resolved Hide resolved
@burnash burnash self-assigned this Jun 17, 2024
@willi-mueller willi-mueller changed the title handle_filter_exclude_transform rest_api: filter, exclude, transform API responses Jun 21, 2024
@burnash
Copy link
Collaborator

burnash commented Jul 25, 2024

@francescomucio thank you for your contribution and for proposing the enhancement to the rest_api configuration. These are valuable suggestions and they can can certainly improve the flexibility of resource configuration.

However, I see a few potential drawbacks with the proposed interface:

  1. Flexibility in defining the precedence of filters and transforms: the current approach of specifying row_filter, exclude_columns, and transform separately might lead to confusion regarding the order of operations. It's crucial to have a clear order of applying these operations to ensure predictable results.

  2. Consistency with existing dlt resource methods: one possible improvement is to align with the existing DltResource's methods add_map() and add_filter(). This could make the configuration more intuitive for users who are already familiar with the dlt. (Most likely the implementation should also be based on add_map() and add_filter() where possilbe.)

To address these issues, I suggest an alternative format that puts all operations in one operations list. This makes it clear and flexible by showing the exact order of operations. Taking the example from #494, e.g.:

{
    "name": "my_nicely_named_resource",
    "endpoint": {
        "path": "endpoint_name"
    },
    "operations": [
        {"filter": lambda x: x["id"] == 3},
        {"map": "delete_fields", "fields": ["id", "another_column"]},
        {
            "map": "rename_fields",  
            "fields": {
                "user_id": "my_user_id",
                "timestamp": "my_timestamp"
            }
        },
        {"map": my_function}
    ]
}

Some other pre-defined possible filters:

{"filter": "range", "field": "date", "from": "2021-01-01", "to": "2021-12-31"}

or

{"filter": "in_set", "field": "category", "values": ["tech", "finance", "health"]}

and so on.

Again, thank you for the suggestion and valuable input. What do you think about this? Looking forward to your feedback.

@francescomucio
Copy link
Contributor Author

I like the idea, I like that it's possible to transform/map the data before or after a filter and doing both operations multiple times.

To start I would keep it simple, just with map or filter

    "operations": [
        {"filter": lambda x: x["id"] == 3},
        {"map": my_function}
    ]

and then extend with additional functionalities, each with its own key word (nothing that cannot be handled passing a callable, just as sugar for the developers):

   "operations": [
       {"delete_fields": ["id", "another_column"]},
       {"rename_fields": {
               "user_id": "my_user_id",
               "timestamp": "my_timestamp"
           }
       }
   ]

What do you think?

@rudolfix
Copy link
Contributor

Overall this looks good

  1. We call those functions transforms not operations. Maybe we could rename to transform?
  2. The structure as a list of functions is OK
  3. I agree with Francesco: let's start with python functions. If we want to implement a few standard transformations we keep them here: dlt.sources.helpers.transform I'm totally for adding functions to rename or delete fields
  4. If we do (3) we can skip discussion how to declare functions with arguments in dicts/yaml which we then should reuse everywhere

@burnash
Copy link
Collaborator

burnash commented Jul 30, 2024

I'm not very firm on "operations", but I was looking for a keyword that could work for both "map" and "filter". In my opinion "transform" is could be a bit specific so it's hard to fit "filter" under it. But I may be wrong. So the alternatives to "operations" could be:

  1. "transform" / "transformations"
  2. "processing_steps" to better align with underlying API.

Other than that also agreed with @francescomucio we can start with Python functions.

@burnash burnash added the enhancement New feature or request label Aug 5, 2024
@francescomucio
Copy link
Contributor Author

francescomucio commented Aug 12, 2024

I went with processing_steps, I have also added an initial test case for a filter, I will add one for a map and maybe something a bit more complex. Also probably a test to remove columns could be nice :)

@burnash @willi-mueller please take a look at it

Comment on lines 259 to 265
def process(resource, processing_steps) -> Any:
for step in processing_steps:
if "filter" in step:
resource.add_filter(step["filter"])
if "map" in step:
resource.add_map(step["map"])
return resource
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

Copy link
Collaborator

@burnash burnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, @francescomucio! I've listed a couple of minor improvements.
Also, let's extend the tests to cover "map" and using processing steps in a child resource. A good improvement is to also extend the "offline"/"mocked" tests (https://github.com/dlt-hub/verified-sources/blob/master/tests/rest_api/test_rest_api_source_offline.py). Let me know if you have capacity for this.

sources/rest_api/__init__.py Outdated Show resolved Hide resolved
Comment on lines 263 to 265
# row_filter: Optional[Callable[[Any], bool]]
# transform: Optional[Callable[[Any], Any]]
# exclude_columns: Optional[List[jsonpath.TJsonPath]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the commented code and implement these ready-to-use operation as a follow-up PR

Copy link
Collaborator

@burnash burnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the tests. I've to questions about the yield in the comments.

method=method,
path=path,
params=params,
json=json,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a typo or there's a reason to remove json param?

@@ -278,11 +289,10 @@ def paginate_resource(
incremental_cursor_transform,
)

yield from client.paginate(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to keep the from statement.

@francescomucio francescomucio force-pushed the 494_restapi_filter_exclude_transform branch from c6b3ea2 to c65d26f Compare August 15, 2024 13:40
Copy link
Collaborator

@burnash burnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. Please see my comment about the mutable default.

sources/rest_api/__init__.py Outdated Show resolved Hide resolved
@francescomucio francescomucio force-pushed the 494_restapi_filter_exclude_transform branch from 621a53f to d376292 Compare August 15, 2024 20:38
Copy link
Collaborator

@burnash burnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thank you @francescomucio!

The examples from the tests are very good we'll use them in a follow-up documentation PR

@burnash burnash merged commit 687e7dd into dlt-hub:master Aug 16, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rest_api: Allow the REST API config object to exclude rows, columns, and transform data
5 participants