Doc 302 new etl tutorial #25320
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
--- | ||
title: Build an ETL Pipeline | ||
description: Learn how to build an ETL pipeline with Dagster | ||
last_update: | ||
date: 2024-08-10 | ||
author: Pedram Navid | ||
--- | ||
|
||
# Build your first ETL pipeline | ||
|
||
Welcome to this hands-on tutorial where you'll learn how to build an ETL pipeline with Dagster while exploring key parts of Dagster. | ||
If you haven't already, complete the [Quick Start](/getting-started/quickstart) tutorial to get familiar with Dagster. | ||
|
||
## What you'll learn | ||
|
||
- Setting up a Dagster project with the recommended project structure | ||
- Creating Assets and using Resources to connect to external systems | ||
- Adding metadata to your assets | ||
- Building dependencies between assets | ||
- Running a pipeline by materializing assets | ||
- Adding schedules, sensors, and partitions to your assets | ||
|
||
## Step 1: Set up your Dagster environment | ||
|
||
First, set up a new Dagster project. | ||
|
||
1. Open your terminal and create a new directory for your project: | ||
|
||
```bash title="Create a new directory" | ||
mkdir dagster-etl-tutorial | ||
cd dagster-etl-tutorial | ||
``` | ||
|
||
2. Create a virtual environment and activate it: | ||
|
||
```bash title="Create a virtual environment" | ||
python -m venv venv | ||
source venv/bin/activate | ||
# On Windows, use `venv\Scripts\activate` | ||
``` | ||
|
||
3. Install Dagster and the required dependencies: | ||
|
||
```bash title="Install Dagster and dependencies" | ||
pip install dagster dagster-webserver pandas | ||
``` | ||
|
||
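Before moving on, it can help to confirm that the packages installed above are importable from the active virtual environment. The following is a minimal sketch using only the standard library. The helper name `find_missing` is hypothetical, and the sketch assumes the import names `dagster`, `dagster_webserver`, and `pandas` for the three pip packages.

```python
from importlib.util import find_spec


def find_missing(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed to correspond to the packages installed with pip above.
    missing = find_missing(["dagster", "dagster_webserver", "pandas"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All tutorial dependencies are importable.")
```

If anything is reported missing, the most likely cause is that the virtual environment was not activated before running `pip install`.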
## Step 2: Copy the data files | ||
|
||
Next we will get the raw data for the project. | ||
|
||
|
||
1. Create a new folder for the raw data: | ||
|
||
```bash title="Create the data directory" | ||
mkdir data | ||
cd data | ||
``` | ||
|
||
2. Copy the raw CSV files: | ||
|
||
|
||
```bash title="Copy the csv files" | ||
curl -L -o products.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/products.csv | ||
|
||
curl -L -o sales_reps.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_reps.csv | ||
|
||
curl -L -o sales_data.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_data.csv | ||
|
||
``` | ||
3. Copy the sample request JSON file: | ||
|
||
|
||
```bash title="Create the sample request" | ||
mkdir sample_request | ||
cd sample_request | ||
curl -L -o request.json https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sample_request/request.json | ||
|
||
|
||
# Navigate back to the project root directory | ||
cd ../.. | ||
``` | ||
|
||
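For readers without `curl` (for example on some Windows setups), the same files can be fetched from Python. This is a sketch only, not part of the original tutorial: the helper names `plan_downloads` and `download_all` are hypothetical, while the base URL and file names are copied from the `curl` commands above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the curl commands above.
BASE_URL = (
    "https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/"
    "examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/"
    "etl_tutorial/data/"
)

# Paths relative to the data/ directory, matching the files fetched above.
DATA_FILES = [
    "products.csv",
    "sales_reps.csv",
    "sales_data.csv",
    "sample_request/request.json",
]


def plan_downloads(dest_dir="data"):
    """Return (url, local_path) pairs without touching the network."""
    return [(BASE_URL + name, Path(dest_dir) / name) for name in DATA_FILES]


def download_all(dest_dir="data"):
    """Download every file, creating parent directories as needed."""
    for url, path in plan_downloads(dest_dir):
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, path)
        print(f"Saved {path}")
```

Calling `download_all()` from the project root should produce the same `data/` layout as the shell steps, including the nested `sample_request/` folder.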
|
||
## What you've learned | ||
|
||
- Set up a Python virtual environment and installed Dagster | ||
- Copied the raw data for the project | ||
|
||
## Next steps | ||
|
||
- Continue this tutorial with [setting up your Dagster project](/tutorial/dagster-project-setup) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
--- | ||
title: Dagster Project Setup | ||
description: Learn how to set up a Dagster project from scratch | ||
last_update: | ||
date: 2024-10-16 | ||
author: Alex Noonan | ||
--- | ||
|
||
# Dagster Project Setup | ||
|
||
## What you'll learn | ||
|
||
- Setting up a Dagster project with the recommended project structure | ||
|
||
|
||
## Step 1: Create Dagster Project Files | ||
|
||
Dagster needs several project files to run. These files are common in Python package management and help manage project configuration and dependencies. | ||
|
||
The setup.cfg file is an INI-style configuration file that contains option defaults for setup.py commands. | ||
|
||
1. Create Config file | ||
|
||
```bash title="Create Config file" | ||
echo -e "[metadata]\nname = dagster_etl_tutorial" > setup.cfg | ||
``` | ||
|
||
2. Create Setup Python File | ||
|
||
The setup.py file is a build script for configuring Python packages. In a Dagster project, you use setup.py to define any Python packages your project depends on, including Dagster itself. | ||
||
|
||
```bash title="Create Setup file" | ||
echo > setup.py | ||
``` | ||
|
||
|
||
Open setup.py and add the following code: | ||
|
||
|
||
```python title="setup.py" | ||
from setuptools import find_packages, setup | ||
|
||
setup( | ||
    name="dagster_etl_tutorial", | ||
    packages=find_packages(exclude=["dagster_etl_tutorial_tests"]), | ||
    install_requires=[ | ||
        "dagster", | ||
        "dagster-cloud", | ||
        # dagster-duckdb provides the DuckDBResource imported in definitions.py | ||
        "dagster-duckdb", | ||
        "duckdb", | ||
    ], | ||
    extras_require={"dev": ["dagster-webserver", "pytest"]}, | ||
) | ||
``` | ||
3. Create TOML file | ||
|
||
The pyproject.toml file is a configuration file that specifies package core metadata in a static, tool-agnostic way. | ||
|
||
|
||
```bash title="Create Pyproject file" | ||
echo > pyproject.toml | ||
``` | ||
|
||
Open pyproject.toml and add the following: | ||
|
||
```toml | ||
[build-system] | ||
requires = ["setuptools"] | ||
build-backend = "setuptools.build_meta" | ||
|
||
[tool.dagster] | ||
module_name = "etl_tutorial.definitions" | ||
code_location_name = "etl_tutorial" | ||
``` | ||
|
||
4. Create Dagster Python Module and Definitions file | ||
|
||
|
||
## Create the Python definitions file | ||
|
||
1. Create ETL tutorial directory | ||
|
||
```bash title="Create the tutorial directory" | ||
mkdir etl_tutorial | ||
cd etl_tutorial | ||
``` | ||
|
||
2. Create Dagster Definitions File | ||
|
||
In this guide we will use a simplified project structure to focus on core Dagster concepts. To accomplish this, all of our code will be in one definitions file. | ||
|
||
|
||
```bash title="Create definitions.py file" | ||
echo > definitions.py | ||
``` | ||
|
||
## Materializing the Assets | ||
|
||
At this point your project should look like this. | ||
|
||
``` | ||
dagster-etl-tutorial/ | ||
├── etl_tutorial/ | ||
│   └── definitions.py | ||
├── data/ | ||
│   ├── products.csv | ||
│   ├── sales_data.csv | ||
│   ├── sales_reps.csv | ||
│   └── sample_request/ | ||
│       └── request.json | ||
├── pyproject.toml | ||
├── setup.cfg | ||
└── setup.py | ||
``` | ||
The project structure shouldn't change much from here. This is a good point to run Dagster locally, see what the asset graph looks like, and materialize the assets. | ||
|
||
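To compare a local checkout against the layout above, a short script can report which expected files are absent. This is a minimal sketch. The name `missing_paths` is hypothetical, and the path list is transcribed from the tree above.

```python
from pathlib import Path

# Paths relative to the project root, transcribed from the tree above.
EXPECTED_PATHS = [
    "etl_tutorial/definitions.py",
    "data/products.csv",
    "data/sales_data.csv",
    "data/sales_reps.csv",
    "data/sample_request/request.json",
    "pyproject.toml",
    "setup.cfg",
    "setup.py",
]


def missing_paths(project_root="."):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

Running `missing_paths()` from the project root returns an empty list when every file is in place. Any path it does return points at a step that was skipped or run from the wrong directory.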
|
||
|
||
## What you've learned | ||
|
||
- Created the project configuration files: setup.cfg, setup.py, and pyproject.toml | ||
- Created the Python module and the definitions.py file | ||
|
||
## Next steps | ||
|
||
- Continue this tutorial with your [first asset](/tutorial/your-first-asset) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
--- | ||
title: Your First Asset | ||
|
||
description: Get the project data and create your first Asset | ||
last_update: | ||
date: 2024-10-16 | ||
author: Alex Noonan | ||
--- | ||
|
||
# Your First Software Defined Asset | ||
|
||
Now that we have the raw data files and the Dagster project set up, let's create some assets that load those CSV files into DuckDB. | ||
|
||
|
||
## What you'll learn | ||
|
||
- Creating our initial definitions object | ||
|
||
- Adding a DuckDB resource | ||
- Building some basic software-defined assets | ||
|
||
|
||
## Building the definitions object | ||
|
||
The definitions object [need docs reference] in Dagster serves as the central configuration point for defining and organizing various components within a Dagster project. It acts as a container that holds all the necessary configurations for a code location, ensuring that everything is organized and easily accessible. | ||
|
||
There's a minor typo in the word "componenets". It should be spelled "components". Additionally, consider adding a link to the Dagster documentation for the Spotted by Graphite Reviewer |
||
|
||
1. Create the Definitions object and DuckDB resource | ||
|
||
Open the definitions.py file and add the following import statements and definitions object. | ||
|
||
|
||
```python | ||
import json | ||
import os | ||
|
||
from dagster_duckdb import DuckDBResource | ||
|
||
import dagster as dg | ||
|
||
defs = dg.Definitions( | ||
assets=[], | ||
resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")}, | ||
) | ||
``` | ||
|
||
## Loading raw data | ||
|
||
1. Products Asset | ||
|
||
We need to create an asset that creates a DuckDB table for the products CSV. Additionally, we should add metadata to help categorize this asset and give us a preview of what it looks like in the Dagster UI. | ||
|
||
|
||
<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/> | ||
|
||
You'll notice that this asset has metadata for its compute kind and is part of the ingestion group. Additionally, at the end we add the row count and a preview of what the table looks like. | ||
|
||
|
||
2. Sales Reps Asset | ||
|
||
This code is very similar to the products asset, but this time it's focused on sales reps. | ||
|
||
<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/> | ||
|
||
3. Sales Data Asset | ||
|
||
The same pattern applies to the sales data. | ||
|
||
<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/> | ||
|
||
4. Bringing our assets into the Definitions object | ||
|
||
Now, to pull these assets into our definitions object, add them to the empty list in the assets parameter. | ||
|
||
```python | ||
defs = dg.Definitions( | ||
    assets=[ | ||
        products, | ||
        sales_reps, | ||
        sales_data, | ||
    ], | ||
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")}, | ||
) | ||
``` | ||
|
||
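The ingestion pattern used by the three assets above (load a CSV into a table, then report a row count and a short preview) can be illustrated without Dagster or DuckDB. The sketch below is not the tutorial's asset code: it swaps in the standard library's `csv` and `sqlite3` modules as a stand-in for DuckDB, and the function name `load_csv_with_summary` and the sample columns are hypothetical.

```python
import csv
import io
import sqlite3


def load_csv_with_summary(conn, table, csv_text, preview_rows=10):
    """Load CSV text into a table and return row-count and preview metadata."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)

    # Identifiers cannot be bound as parameters, so quote them explicitly.
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)

    count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    preview = conn.execute(
        f'SELECT * FROM "{table}" LIMIT ?', (preview_rows,)
    ).fetchall()
    return {"row_count": count, "preview": preview}


# Hypothetical sample data; the real products.csv columns may differ.
sample = "product_id,name\n1,Widget\n2,Gadget\n"
summary = load_csv_with_summary(sqlite3.connect(":memory:"), "products", sample)
print(summary["row_count"], summary["preview"])
```

In the tutorial's assets, the analogous row count and preview are attached to the asset as metadata so they show up in the Dagster UI.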
## What you've learned | ||
|
||
- Created a Dagster Definitions object | ||
- Built our ingestion assets | ||
|
||
|
||
|
||
## Next steps | ||
|
||
- Continue this tutorial with your [Asset Dependencies] |
There's a minor typo in the word "configurationa". It should be "configuration". This small correction will improve the readability of the documentation.
Spotted by Graphite Reviewer