Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

creates a single source in extract for all resource instances passed as list #1535

Merged
merged 3 commits into from
Sep 14, 2024

Conversation

rudolfix
Copy link
Collaborator

@rudolfix rudolfix commented Jul 2, 2024

Description

We discovered peculiar problem with rest_api when users were passing a list of resources to the run function that contained a resource and a transformer:

     page = 1

    @dlt.resource(name="pages")
    def gen_pages():
        nonlocal page
        while True:
            yield {"page": page}
            if page == 10:
                return
            page += 1

    @dlt.transformer(name="subpages")
    def get_subpages(page_item):
        yield from [
            {
                "page": page_item["page"],
                "subpage": subpage,
            }
            for subpage in range(1, 11)
        ]

    pipeline = dlt.pipeline("test_resource_transformer_standalone", destination="duckdb")
    # here we must combine resources and transformers using the same instance
    info = pipeline.run([gen_pages, gen_pages | get_subpages])

in the case above only last page is passed to the transformer (see the commits for tests and details)
the root cause is that each resource in the list was packaged in a separate source and extracted separately. that prevented any DAG optimizations and gen_pages was extracted twice.

here we change the behavior where a single dlt source is used to extract all the resources in the list

@rudolfix rudolfix self-assigned this Jul 2, 2024
Copy link

netlify bot commented Jul 2, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 1f26f72
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66e494729541520008ee968d

@rudolfix rudolfix requested a review from burnash July 3, 2024 08:49
@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Jul 3, 2024
sh-rp
sh-rp previously approved these changes Jul 3, 2024
Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, i was going to suggest doing this for another reason anyway: parallelism will not work on a list of resources in the current version, now it will :)

PS: Maybe a small note about this behavior is warranted in the docs, I'm not sure.

@rudolfix
Copy link
Collaborator Author

rudolfix commented Jul 3, 2024

@sh-rp I'll add this to our release notes as one of the changes

@rudolfix
Copy link
Collaborator Author

rudolfix commented Jul 3, 2024

@sh-rp also you are partially right with the parallelism! we have tests that are passing a list of many resources with the same names. and those tests are failing. We'd need to package them in separate sources and execute them one by one

@rudolfix rudolfix removed the sprint Marks group of tasks with core team focus at this moment label Jul 8, 2024
@rudolfix rudolfix added sprint Marks group of tasks with core team focus at this moment breaking This issue introduces breaking change and removed breaking This issue introduces breaking change labels Aug 21, 2024
@rudolfix rudolfix merged commit 12e2f12 into devel Sep 14, 2024
58 checks passed
@rudolfix rudolfix deleted the rfix/extracts-list-of-resources-single-source branch September 14, 2024 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sprint Marks group of tasks with core team focus at this moment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants