Doc 302 new etl tutorial #25320

Draft
wants to merge 9 commits into
base: master
87 changes: 87 additions & 0 deletions docs/docs-beta/docs/tutorial/01-etl-tutorial-introduction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
---
title: Build an ETL Pipeline
description: Learn how to build an ETL pipeline with Dagster
last_update:
date: 2024-08-10
author: Pedram Navid
---

# Build your first ETL pipeline

Welcome to this hands-on tutorial where you'll learn how to build an ETL pipeline with Dagster while exploring key parts of Dagster.
If you haven't already, complete the [Quick Start](/getting-started/quickstart) tutorial to get familiar with Dagster.

## What you'll learn

- Setting up a Dagster project with the recommended project structure
- Creating Assets and using Resources to connect to external systems
- Adding metadata to your assets
- Building dependencies between assets
- Running a pipeline by materializing assets
- Adding schedules, sensors, and partitions to your assets

## Step 1: Set up your Dagster environment

First, set up a new Dagster project.

1. Open your terminal and create a new directory for your project:

```bash title="Create a new directory"
mkdir dagster-etl-tutorial
cd dagster-etl-tutorial
```

2. Create a virtual environment and activate it:

```bash title="Create a virtual environment"
python -m venv venv
source venv/bin/activate
# On Windows, use `venv\Scripts\activate`
```

3. Install Dagster and the required dependencies:

```bash title="Install Dagster and dependencies"
pip install dagster dagster-webserver pandas
```
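Before moving on, it can help to confirm that the packages landed in the active virtual environment. The snippet below is a minimal sketch and not part of the tutorial's project files; `missing_packages` is a hypothetical helper name, and it only checks that each module can be found, not that the installed versions are compatible.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for the packages installed above; the dagster-webserver
    # distribution is imported as dagster_webserver.
    required = ["dagster", "dagster_webserver", "pandas"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it with the virtual environment activated so that it inspects the same interpreter that `pip` installed into.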

## Step 2: Copy the data files

Next we will get the raw data for the project.

1. Create a new folder for the raw data:

```bash title="Create the data directory"
mkdir data
cd data
```

2. Copy the raw CSV files:

```bash title="Copy the CSV files"
curl -L -o products.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/products.csv

curl -L -o sales_reps.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_reps.csv

curl -L -o sales_data.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_data.csv

```
3. Copy the sample request JSON file:

```bash title="Create the sample request"
mkdir sample_request
cd sample_request
curl -L -o request.json https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sample_request/request.json

# Navigate back to the project root directory
cd ../..
```
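To sanity-check the downloads, the following sketch reads each CSV header and counts the data rows using only the standard library. It assumes the files have a header row and that the script runs from the project root; `summarize_csv` is a hypothetical helper name, and no particular column names are assumed.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, data_row_count) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count


if __name__ == "__main__":
    # File names match the curl commands above.
    for name in ["products.csv", "sales_reps.csv", "sales_data.csv"]:
        csv_path = Path("data") / name
        if csv_path.exists():
            columns, rows = summarize_csv(csv_path)
            print(f"{name}: {rows} rows, columns={columns}")
        else:
            print(f"{name}: not found")
```

A file reported as "not found", or one with zero rows, usually means a `curl` command ran in the wrong directory or the download failed.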


## What you've learned

- Set up a Python virtual environment and installed Dagster
- Copied the raw data for the project

## Next steps

- Continue this tutorial with [setting up your Dagster project](/tutorial/dagster-project-setup)
125 changes: 125 additions & 0 deletions docs/docs-beta/docs/tutorial/02-dagster-project-setup.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
---
title: Dagster Project Setup
description: Learn how to set up a Dagster project from scratch
last_update:
date: 2024-10-16
author: Alex Noonan
---

# Dagster Project Setup

## What you'll learn

- Setting up a Dagster project with the recommended project structure


## Step 1: Create Dagster Project Files

Dagster needs several project files to run. These files are common in Python package management and help manage project configuration and dependencies.

The setup.cfg file is an INI-style configuration file that contains option defaults for setup.py commands.

1. Create the config file

```bash title="Create Config file"
printf "[metadata]\nname = dagster_etl_tutorial\n" > setup.cfg
```

2. Create the setup.py file

The setup.py file is a build script for configuring Python packages. In a Dagster project, you use setup.py to define any Python packages your project depends on, including Dagster itself.


```bash title="Create Setup file"
echo > setup.py
```


Open `setup.py` and add the following code:


```python title="Setup.py"
from setuptools import find_packages, setup

setup(
    name="dagster_etl_tutorial",
    packages=find_packages(exclude=["dagster_etl_tutorial_tests"]),
    install_requires=[
        "dagster",
        "dagster-cloud",
        "duckdb",
    ],
    extras_require={"dev": ["dagster-webserver", "pytest"]},
)
```
3. Create the TOML file

The pyproject.toml file is a configuration file that specifies core package metadata in a static, tool-agnostic way.


```bash title="Create Pyproject file"
echo > pyproject.toml
```

Open the file and add the following:

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[tool.dagster]
module_name = "etl_tutorial.definitions"
code_location_name = "etl_tutorial"
```

4. Create the Dagster Python module and definitions file, as described in the next section.
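
The `module_name` entry in `pyproject.toml` tells Dagster which Python module to import when it loads the code location. As a rough illustration of that kind of lookup (this is not Dagster's actual loader), the sketch below imports a module by its dotted name and returns a module-level attribute. The helper name `load_attribute` and the default attribute name `defs` are assumptions made for this example.

```python
import importlib


def load_attribute(module_name, attribute="defs"):
    """Import `module_name` and return one of its module-level attributes."""
    module = importlib.import_module(module_name)
    try:
        return getattr(module, attribute)
    except AttributeError:
        raise LookupError(f"{module_name} has no attribute {attribute!r}") from None


if __name__ == "__main__":
    # Demonstrated with a standard-library module so the sketch runs anywhere;
    # in the tutorial project the module would be the one named in pyproject.toml.
    dumps = load_attribute("json", "dumps")
    print(dumps({"ok": True}))
```

The practical consequence is that the dotted name in `module_name` must match an importable package directory and file in the project; a mismatch surfaces as an import error when the code location loads.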


## Step 2: Create the Python definitions file

1. Create the ETL tutorial package directory

```bash title="Create the tutorial directory"
mkdir etl_tutorial
cd etl_tutorial
```

2. Create the Dagster definitions file

In this guide we will use a simplified project structure to focus on core Dagster concepts. To accomplish this, all of our code will be in one definitions file.


```bash title="Create definitions.py file"
echo > definitions.py
```

## Materializing the Assets

At this point your project should look like this.

```
dagster-etl-tutorial/
├── etl_tutorial/
│   └── definitions.py
├── data/
│   ├── products.csv
│   ├── sales_data.csv
│   ├── sales_reps.csv
│   └── sample_request/
│       └── request.json
├── pyproject.toml
├── setup.cfg
└── setup.py
```
The project structure shouldn't change much from here. We are now at the right point to run Dagster locally, see what our asset graph looks like, and materialize the assets.
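
As an optional check, the sketch below compares a directory against the file list in the tree above and reports anything missing. It uses only the standard library; `EXPECTED_PATHS` and `missing_paths` are hypothetical names introduced for this example.

```python
from pathlib import Path

# Relative paths taken from the project tree above.
EXPECTED_PATHS = [
    "etl_tutorial/definitions.py",
    "data/products.csv",
    "data/sales_data.csv",
    "data/sales_reps.csv",
    "data/sample_request/request.json",
    "pyproject.toml",
    "setup.cfg",
    "setup.py",
]


def missing_paths(project_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Project layout matches the expected structure.")
```

Run it from the `dagster-etl-tutorial` root directory so the relative paths resolve correctly.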



## What you've learned

- Created the `setup.cfg`, `setup.py`, and `pyproject.toml` project files
- Created the Python module and `definitions.py` file

## Next steps

- Continue this tutorial with your [first asset](/tutorial/your-first-asset)
86 changes: 86 additions & 0 deletions docs/docs-beta/docs/tutorial/03-your-first-asset.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
---
title: Your First Asset
description: Get the project data and create your first Asset
last_update:
date: 2024-10-16
author: Alex Noonan
---

# Your First Software-Defined Asset

Now that we have the raw data files and the Dagster project set up, let's create some assets that load those CSV files into DuckDB.

## What you'll learn

- Creating our initial Definitions object
- Adding a DuckDB resource
- Building some basic software-defined assets

## Building the Definitions object

The Definitions object [need docs reference] in Dagster serves as the central configuration point for defining and organizing the various components within a Dagster project. It acts as a container that holds all the necessary configurations for a code location, ensuring that everything is organized and easily accessible.

1. Creating the Definitions object and DuckDB resource

Open the definitions.py file and add the following import statements and Definitions object. The `dagster_duckdb` import requires the `dagster-duckdb` package to be installed in your environment.

```python
import json
import os

from dagster_duckdb import DuckDBResource

import dagster as dg

defs = dg.Definitions(
    assets=[],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## Loading raw data

1. Products Asset

We need to create an asset that creates a DuckDB table for the products CSV. Additionally, we should add metadata to help categorize this asset and give us a preview of what it looks like in the Dagster UI.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/>

You'll notice that the asset declares its compute kind as metadata and is assigned to the ingestion group. Additionally, at the end we add the row count and a preview of what the table looks like.

2. Sales Reps Asset

This code will be very similar to the products asset, but this time it's focused on sales reps.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/>

3. Sales Data Asset

The same pattern applies to the sales data.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/>

4. Bringing our assets into the Definitions object

To pull these assets into our Definitions object, add them to the empty list in the `assets` parameter.

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```
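
The ingestion assets above all follow the same shape: load a CSV into a table, then report a row count and a preview. The sketch below illustrates that shape only. It swaps in the standard library's `sqlite3` for DuckDB so that it runs without extra dependencies, and it returns a plain dictionary rather than Dagster metadata; `load_csv_with_metadata` is a hypothetical name, and the tutorial's real assets live in the referenced `definitions.py`.

```python
import csv
import sqlite3


def load_csv_with_metadata(conn, csv_path, table_name, preview_rows=5):
    """Load a CSV into a table and return its row count and a small preview."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)

    # Table and column names come from trusted local files in this sketch;
    # they are quoted but not otherwise validated.
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)

    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]
    preview = conn.execute(
        f'SELECT * FROM "{table_name}" LIMIT ?', (preview_rows,)
    ).fetchall()
    return {"row_count": row_count, "columns": header, "preview": preview}
```

Calling `load_csv_with_metadata(sqlite3.connect(":memory:"), "data/products.csv", "products")` from the project root would return the values that, in the Dagster assets, are attached as materialization metadata and shown in the UI.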

## What you've learned

- Created a Dagster Definitions object
- Built our ingestion assets



## Next steps

- Continue this tutorial with your [Asset Dependencies]
62 changes: 0 additions & 62 deletions docs/docs-beta/docs/tutorial/tutorial-etl.md

This file was deleted.

6 changes: 5 additions & 1 deletion docs/docs-beta/sidebars.ts
Original file line number Diff line number Diff line change
@@ -11,7 +11,11 @@ const sidebars: SidebarsConfig = {
type: 'category',
label: 'Tutorial',
collapsed: false,
items: ['tutorial/tutorial-etl'],
items: [
'tutorial/01-etl-tutorial-introduction',
'tutorial/02-dagster-project-setup',
'tutorial/03-your-first-asset',
],
},
{
type: 'category',
2 changes: 1 addition & 1 deletion docs/docs-beta/src/theme/MDXComponents.tsx
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// Import the original mapper
import MDXComponents from '@theme-original/MDXComponents';
import { PyObject } from '../components/PyObject';
import {PyObject} from '../components/PyObject';
import CodeExample from '../components/CodeExample';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';