# Added deploy with modal #1805

Open · wants to merge 30 commits into `devel` from `docs/how-to-deploy-using-modal`

## Conversation

dat-a-man (Collaborator)

Description

Added deploy with modal

@dat-a-man self-assigned this on Sep 13, 2024
netlify bot commented Sep 13, 2024

Deploy Preview for dlt-hub-docs canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | e5d9a30 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/6708a44ecb306a000807e990 |

@dat-a-man added the `documentation` label (Improvements or additions to documentation) on Sep 13, 2024

### Capturing deletes

One limitation of our simple approach above is that it does not capture updates or deletions of data. This isn’t a hard requirement yet for our use cases, but it appears that `dlt` does have a [Postgres CDC replication feature](https://dlthub.com/docs/dlt-ecosystem/verified-sources/pg_replication) that we are considering.
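
For a sense of what that feature looks like, here is a hedged sketch based on the linked pg_replication docs; the slot and publication names are placeholders, and the exact signatures should be checked against the verified-source template scaffolded by `dlt init pg_replication <destination>`:

```py
import dlt
from pg_replication import replication_resource  # module from the scaffolded template

pipeline = dlt.pipeline(
    pipeline_name="pg_cdc",
    destination="snowflake",
    dataset_name="replica",
)

# Streams inserts, updates, AND deletes from a Postgres replication
# slot/publication (both assumed to exist already, e.g. created via the
# template's init_replication helper)
changes = replication_resource(slot_name="dlt_slot", pub_name="dlt_pub")
pipeline.run(changes)
```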
burnash (Collaborator):
Please use relative links for the pages in the docs. E.g. ./dlt-ecosystem/...

dat-a-man (Collaborator, author):
Thanks @burnash. Updated the link. There's one thing, though: the doc is not showing in the GitHub deploy preview here, but when using "npm" locally it shows fine.

@dat-a-man force-pushed the docs/how-to-deploy-using-modal branch 2 times, most recently from 0333c54 to 8a49dce on September 16, 2024 08:27
@dat-a-man assigned adrianbr and unassigned adrianbr on Sep 16, 2024
burnash (Collaborator) left a comment:
Very good content, @dat-a-man. I've added some suggestions to improve the style.


## Introduction to Modal

[Modal](https://modal.com/blog/analytics-stack) is a serverless platform designed for developers. It allows you to run and deploy code in the cloud without managing infrastructure.
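
As a minimal illustration of that model (the app and function names here are arbitrary), a Python function can be decorated and executed remotely:

```py
import modal

app = modal.App("hello-modal")

@app.function()
def square(x: int) -> int:
    # Runs in a Modal container in the cloud, not on your machine
    return x * x

@app.local_entrypoint()
def main() -> None:
    # Invoked with `modal run hello.py`; .remote() ships the call to Modal
    print(square.remote(7))
```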
burnash (Collaborator):
I think the link from Modal should go to https://modal.com/. I can see that the blog post is already linked from another section below.

dat-a-man (Collaborator, author) commented Sep 16, 2024:
Yes, thanks! Corrected.


## Building Data Pipelines with `dlt`

**`dlt`** is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
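
For instance, a minimal sketch (names assumed) where dlt infers the schema from plain Python data and loads it into DuckDB:

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="mydata",
)

data = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
# dlt infers column names and types, creates the table, and evolves it on later runs
info = pipeline.run(data, table_name="users")
print(info)
```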
burnash (Collaborator):
Suggested change
**`dlt`** is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
dlt is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.

Let's tone down the formatting here


**`dlt`** is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.

### How does `dlt` integrate with Modal for pipeline orchestration?
burnash (Collaborator):
Suggested change
### How does `dlt` integrate with Modal for pipeline orchestration?
### How does dlt integrate with Modal for pipeline orchestration?

Throughout the docs, please use plain "dlt" (no backticks) when referring to dlt as a project. Use backticks only when referring to dlt as code (e.g., `dlt` the Python module in a script, or `dlt` the command in a command-line context).

dat-a-man (Collaborator, author):
Done, thanks!


To know more, please refer to [Modal's documentation](https://modal.com/docs).

## Building Data Pipelines with `dlt`
burnash (Collaborator):
Suggested change
## Building Data Pipelines with `dlt`
## Building data pipelines with dlt
Throughout the docs, please use sentence-case capitalization.

dat-a-man (Collaborator, author):
Noted, thanks!


### How does `dlt` integrate with Modal for pipeline orchestration?

To illustrate setting up a pipeline in Modal, we’ll be using the following example: [Building a cost-effective analytics stack with Modal, dlt, and dbt.](https://modal.com/blog/analytics-stack)
burnash (Collaborator):
Suggested change
To illustrate setting up a pipeline in Modal, we'll be using the following example: [Building a cost-effective analytics stack with Modal, dlt, and dbt.](https://modal.com/blog/analytics-stack)
As an example of how to set up a pipeline in Modal, we'll use the [building a cost-effective analytics stack with Modal, dlt, and dbt.](https://modal.com/blog/analytics-stack) case study.

dat-a-man (Collaborator, author):
Done, thanks!


Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:

1. Run the `dlt` SQL database setup to initialize their `sql_database_pipeline.py` template:
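
(For reference, that setup step is the `dlt init sql_database snowflake` CLI command, with the destination assumed here; it scaffolds `sql_database_pipeline.py` in the working directory.)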
burnash (Collaborator) commented Sep 18, 2024:
Suggested change
1. Run the `dlt` SQL database setup to initialize their `sql_database_pipeline.py` template:
1. Run the `dlt init` CLI command to initialize the SQL database source and set up the `sql_database_pipeline.py` template:

dat-a-man (Collaborator, author):
updated

burnash (Collaborator):
Thank you! However, I don't see these changes on GitHub. Is there a chance you haven't pushed the updates to GitHub?


## How to run dlt on Modal

Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:
burnash (Collaborator):
Suggested change
Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:
Here’s a dlt project setup to copy data from our Postgres read replica into Snowflake:

dat-a-man (Collaborator, author):
updated


As an example of how to set up a pipeline in Modal, we'll use the [Building a cost-effective analytics stack with Modal, dlt, and dbt](https://modal.com/blog/analytics-stack) case study.

The example demonstrates automating a workflow to load data from Postgres to Snowflake using `dlt`.
burnash (Collaborator):
Suggested change
The example demonstrates automating a workflow to load data from Postgres to Snowflake using `dlt`.
The example demonstrates automating a workflow to load data from Postgres to Snowflake using dlt.

dat-a-man (Collaborator, author):
done.

burnash (Collaborator) left a comment:
Hi @dat-a-man, thanks for the updates. Please see my review comments.


Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:

1. Run the `dlt` SQL database setup to initialize their `sql_database_pipeline.py` template:
burnash (Collaborator):
It's also not clear what we do with `sql_database_pipeline.py`. Are we discarding it, or adding the code below to it?

dat-a-man (Collaborator, author):
updated

```py
app = modal.App("dlt-postgres-pipeline", image=image)
```

3. Wrap the provided `load_table_from_database` with the Modal Function decorator, Modal Secrets containing your database credentials, and a daily cron schedule
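
A hedged sketch of what that wrapping might look like; the image setup mirrors the snippet above, the secret names and schedule are assumptions, and `load_table_from_database` is the function from `sql_database_pipeline.py`:

```py
import modal

# Container image with the pipeline's dependencies (package list assumed)
image = modal.Image.debian_slim().pip_install("dlt[snowflake]")
app = modal.App("dlt-postgres-pipeline", image=image)

@app.function(
    secrets=[
        modal.Secret.from_name("postgres-secrets"),  # assumed secret names
        modal.Secret.from_name("snowflake-secrets"),
    ],
    schedule=modal.Cron("0 6 * * *"),  # daily cron schedule (time assumed)
)
def load_table_from_database(table: str) -> None:
    ...  # the dlt pipeline code from step 4 goes here
```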
burnash (Collaborator):
If we take `load_table_from_database` from `sql_database_pipeline.py`, we should note that; otherwise it may be unclear.

dat-a-man (Collaborator, author):
added the context

```py
pass
```

4. Write your `dlt` pipeline:
burnash (Collaborator):
Where should the user put the code from this section? Does it still go to `sql_database_pipeline.py`?

dat-a-man (Collaborator, author):
It goes to `sql_database_pipeline.py`.

4. Write your `dlt` pipeline:
```py
# Modal Secrets are loaded as environment variables, which are used here to create the SQLAlchemy connection string
pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
```
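
A hedged sketch of how the rest of the pipeline might continue from that connection string (table and pipeline names are placeholders, not the case study's actual values):

```py
import os

import dlt
from dlt.sources.sql_database import sql_database

pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'

# Select the tables to replicate from the read replica (names assumed)
source = sql_database(credentials=pg_url, table_names=["users"])

pipeline = dlt.pipeline(
    pipeline_name="postgres_to_snowflake",
    destination="snowflake",
    dataset_name="raw",
)
print(pipeline.run(source, write_disposition="merge"))
```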
burnash (Collaborator):
Use the dlt-native way to configure the connection with environment variables: https://dlthub.com/docs/devel/general-usage/credentials/setup#environment-variables. That should eliminate the need for manual connection-string construction and the use of ConnectionStringCredentials.
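
A hedged sketch of that dlt-native style; the placeholder value stands in for a Modal Secret mounted as an environment variable:

```py
import os

from dlt.sources.sql_database import sql_database

# With SOURCES__SQL_DATABASE__CREDENTIALS set in the environment,
# dlt resolves the credentials itself: no manual string construction,
# no ConnectionStringCredentials.
os.environ.setdefault(
    "SOURCES__SQL_DATABASE__CREDENTIALS",
    "postgresql://user:password@host:5432/db",  # placeholder for the mounted secret
)

source = sql_database()  # credentials picked up from the environment
```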

dat-a-man (Collaborator, author):
I added a note about this in step 3; I tested it, too, and it worked for source creds.

kning:
hey original author here :). are you saying it's better practice to define the sql connection string as a single env variable and then reassign the env variable in the pipeline? e.g.

  1. Set a Modal secret like POSTGRES_CREDENTIAL_STRING = 'postgresql://sdfsd:sdlfkj' (this gets mounted as an env variable)
  2. In the pipeline, call os.environ["TASK_SOURCES__SQL_DATABASE__CREDENTIALS"] = os.environ["POSTGRES_CREDENTIAL_STRING"]?

AstrakhantsevaAA (Contributor) commented Oct 3, 2024:
hey @kning! I would say it's a matter of taste: if you prefer a connection string, use it; if not, don't. dlt supports both. In this example, I think, Anton wants to reduce the amount of code and unnecessary manipulations. For example, in this case you can avoid this:

`credentials = ConnectionStringCredentials(pg_url)`

and

`destination=dlt.destinations.snowflake(snowflake_url),`

```py
info = pipeline.run(source_1, write_disposition="merge")
print(info)
```

burnash (Collaborator):
Looks like the next step is missing: how does this code end up on Modal? How are runs triggered?

dat-a-man (Collaborator, author):
added step 5

kning:
This runs the pipeline once, but it might be worth adding that you need to run `modal deploy` to actually schedule the pipeline.
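
(For example, with the file name assumed: `modal run sql_database_pipeline.py` executes the pipeline once, while `modal deploy sql_database_pipeline.py` deploys the app so the cron schedule takes effect.)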

kning commented Sep 30, 2024:

the more i think about it actually, maybe it makes sense to write a really pared-down example for this space that is runnable end-to-end for the user (e.g. using duckdb), and link out to our blog post for a "real-world example". happy to help contribute a pared-down example

kning commented Sep 30, 2024:

here's a simpler gist that should just work if you run `modal run dlt_example.py`, and will deploy a daily scheduled job with `modal deploy dlt_example.py`.

i think this section will have better engagement if the user can simply copy-paste a script and it works immediately; we can adapt this to your docs style and perhaps just link out to the original blog post as a more detailed, real-world example of dlt (i also need to update that one to be compatible with 1.1.0).

lmk what you think! also happy to chat, i know i've shared a lot of info here haha.

https://gist.github.com/kning/6a2af9e08ebaad0e486968f98c1939be
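
For a sense of the shape of such a pared-down example, here is a hedged sketch along the same lines (not the gist itself; names and schedule are assumptions):

```py
import dlt
import modal

image = modal.Image.debian_slim().pip_install("dlt[duckdb]")
app = modal.App("dlt-example", image=image)

@app.function(schedule=modal.Cron("0 6 * * *"))  # schedule applies once deployed
def load() -> None:
    pipeline = dlt.pipeline(pipeline_name="example", destination="duckdb", dataset_name="demo")
    # Load a trivial in-memory dataset so the script runs end-to-end with no external services
    print(pipeline.run([{"id": 1}, {"id": 2}], table_name="items"))

# `modal run dlt_example.py` runs load() once; `modal deploy dlt_example.py`
# keeps it running daily on Modal.
```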

AstrakhantsevaAA (Contributor):
@kning hey! Thanks for your thoughts here, your idea with testing is great! We actually practice this: you can find here some getting-started snippets that we test on every CI/CD run. We can also add your gist to our testing process; we just need to understand what we call a successful run. We could run `modal run dlt_example.py` on every CI/CD run and stop it immediately if it ran without errors. Is that enough, or should it be deployed as well?

kning commented Oct 3, 2024:

running `modal run dlt_example.py` should be sufficient, but you'd also need to set up a Modal account and set the `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` variables in your CI environment.

also checked out the snippets, are those ever surfaced in the docs? i guess i'd expect them to be synced with the snippets on this page, but it looks different.

AstrakhantsevaAA (Contributor):
@kning

> MODAL_TOKEN_ID and MODAL_TOKEN_SECRET variables in your CI environment.

It shouldn't be a problem.

> i guess i'd expect it to be synced with the snippets on this page but it looks different.

We changed our docs significantly recently; the getting started page was removed and replaced with an intro. You can find a relevant example here: doc and snippets.

kning commented Oct 3, 2024:

i see. how do you think we should move forward then with the modal snippet? ideally i'd like to see a "deploy with modal" page that explains how to create a modal account, plus the runnable code snippet (which should also be run regularly somehow to ensure that it's correct), and finally a link to the blog post for a "real-world" example. but i guess from what i understand, the code on the docs page and the CI/CD snippets are managed separately?

AstrakhantsevaAA (Contributor):
@dat-a-man will do that: he will create a snippet file with the example and use tags to ingest this snippet into the doc page; here we will run the modal command to test it.

kning commented Oct 9, 2024:

amazing, looks way cleaner now, thanks!

Makefile (outdated)
@@ -65,6 +65,7 @@ lint-and-test-snippets:
poetry run mypy --config-file mypy.ini docs/website docs/examples docs/tools --exclude docs/tools/lint_setup --exclude docs/website/docs_processed
poetry run flake8 --max-line-length=200 docs/website docs/examples docs/tools
cd docs/website/docs && poetry run pytest --ignore=node_modules
modal run docs/walkthroughs/deploy-a-pipeline/deploy-with-modal-snippets.py
Contributor:

Suggested change
modal run docs/walkthroughs/deploy-a-pipeline/deploy-with-modal-snippets.py
modal run docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal-snippets.py
