-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dagster-aws] [docs] add docs for PipesEMRClient #25011
Open
danielgafni
wants to merge
2
commits into
master
Choose a base branch
from
08-30-_dagster-aws_docs_add_docs_for_pipesemrclient
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Binary file not shown.
Binary file not shown.
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,199 @@ | ||
--- | ||
title: "Integrating AWS EMR with Dagster Pipes | Dagster Docs" | ||
description: "Learn to integrate Dagster Pipes with AWS EMR to launch external code from Dagster assets." | ||
--- | ||
|
||
# AWS EMR & Dagster Pipes | ||
|
||
This tutorial gives a short overview on how to use [Dagster Pipes](/concepts/dagster-pipes) with [AWS EMR](https://aws.amazon.com/emr/). | ||
|
||
The [dagster-aws](/\_apidocs/libraries/dagster-aws) integration library provides the <PyObject object="PipesEMRClient" module="dagster_aws.pipes" /> resource, which can be used to launch AWS EMR jobs from Dagster assets and ops. Dagster can receive regular events such as logs, asset checks, or asset materializations from jobs launched with this client. Using it requires minimal code changes to your EMR jobs. | ||
|
||
--- | ||
|
||
## Prerequisites | ||
|
||
- **In the Dagster environment**, you'll need to: | ||
|
||
- Install the following packages: | ||
|
||
```shell | ||
pip install dagster dagster-webserver dagster-aws | ||
``` | ||
|
||
Refer to the [Dagster installation guide](/getting-started/install) for more info. | ||
|
||
- **AWS authentication credentials configured.** If you don't have this set up already, refer to the [boto3 quickstart](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html). | ||
|
||
- **In AWS**: | ||
|
||
- An existing AWS account | ||
- Prepared infrastructure such as S3 buckets, IAM roles, and other resources required for your EMR job | ||
|
||
--- | ||
|
||
## Step 1: Install the dagster-pipes module in your EMR environment | ||
|
||
Choose one of the [options](https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#python-package-management) to install `dagster-pipes` in the EMR environment. | ||
|
||
For example, this `Dockerfile` can be used to package all required dependencies into a single [PEX](https://docs.pex-tool.org/) file (in practice, the most straightforward way to package Python dependencies for EMR jobs): | ||
|
||
```Dockerfile file=/guides/dagster/dagster_pipes/emr/Dockerfile | ||
# this Dockerfile can be used to create a venv archive for PySpark on AWS EMR | ||
|
||
FROM amazonlinux:2 AS builder | ||
|
||
RUN yum install -y python3 | ||
|
||
WORKDIR /build | ||
|
||
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv | ||
|
||
ENV VIRTUAL_ENV=/build/.venv | ||
ENV PATH="$VIRTUAL_ENV/bin:$PATH" | ||
|
||
RUN uv python install --python-preference only-managed 3.9.16 && uv python pin 3.9.16 | ||
|
||
RUN uv venv .venv | ||
|
||
RUN --mount=type=cache,target=/root/.cache/uv \ | ||
uv pip install pex dagster-pipes boto3 pyspark | ||
|
||
RUN pex dagster-pipes boto3 pyspark -o /output/venv.pex && chmod +x /output/venv.pex | ||
|
||
# test imports | ||
RUN /output/venv.pex -c "import dagster_pipes, pyspark, boto3;" | ||
|
||
FROM scratch AS export | ||
|
||
COPY --from=builder /output/venv.pex /venv.pex | ||
``` | ||
|
||
The build can be launched with: | ||
|
||
```shell | ||
DOCKER_BUILDKIT=1 docker build --output type=local,dest=./output . | ||
``` | ||
|
||
Then, upload the produced `output/venv.pix` file to an S3 bucket: | ||
|
||
```shell | ||
aws s3 cp output/venv.pex s3://your-bucket/venv.pex | ||
``` | ||
|
||
Finally, use the `--files` and `spark.pyspark.python` options to specify the path to the PEX file in the `spark-submit` command: | ||
|
||
```shell | ||
spark-submit ... --files s3://your-bucket/venv.pex --conf spark.pyspark.python=./venv.pex | ||
``` | ||
|
||
--- | ||
|
||
## Step 2: Add dagster-pipes to the EMR job script | ||
|
||
Call `open_dagster_pipes` in the EMR script to create a context that can be used to send messages to Dagster: | ||
|
||
```python file=/guides/dagster/dagster_pipes/emr/script.py | ||
import boto3 | ||
from dagster_pipes import PipesS3MessageWriter, open_dagster_pipes | ||
from pyspark.sql import SparkSession | ||
|
||
|
||
def main(): | ||
with open_dagster_pipes( | ||
message_writer=PipesS3MessageWriter(client=boto3.client("s3")) | ||
) as pipes: | ||
pipes.log.info("Hello from AWS EMR!") | ||
|
||
spark = SparkSession.builder.appName("HelloWorld").getOrCreate() | ||
|
||
df = spark.createDataFrame( | ||
[(1, "Alice", 34), (2, "Bob", 45), (3, "Charlie", 56)], | ||
["id", "name", "age"], | ||
) | ||
|
||
# calculate a really important statistic | ||
avg_age = float(df.agg({"age": "avg"}).collect()[0][0]) | ||
|
||
# attach it to the asset materialization in Dagster | ||
pipes.report_asset_materialization( | ||
metadata={"average_age": {"raw_value": avg_age, "type": "float"}}, | ||
data_version="alpha", | ||
) | ||
|
||
spark.stop() | ||
|
||
print("Hello from stdout!") | ||
|
||
|
||
if __name__ == "__main__": | ||
main() | ||
``` | ||
|
||
--- | ||
|
||
## Step 3: Create an asset using the PipesEMRClient to launch the job | ||
|
||
In the Dagster asset/op code, use the `PipesEMRClient` resource to launch the job: | ||
|
||
```python file=/guides/dagster/dagster_pipes/emr/dagster_code.py startafter=start_asset_marker endbefore=end_asset_marker | ||
import os | ||
|
||
import boto3 | ||
from dagster_aws.pipes import PipesEMRClient, PipesS3MessageReader | ||
from mypy_boto3_emr.type_defs import InstanceFleetTypeDef | ||
|
||
from dagster import AssetExecutionContext, asset | ||
|
||
|
||
@asset | ||
def emr_pipes_asset(context: AssetExecutionContext, pipes_emr_client: PipesEMRClient): | ||
return pipes_emr_client.run( | ||
context=context, | ||
# see full reference here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr/client/run_job_flow.html#EMR.Client.run_job_flow | ||
run_job_flow_params={}, | ||
).get_materialize_result() | ||
``` | ||
|
||
This will launch the AWS EMR job and wait for it completion. If the job fails, the Dagster process will raise an exception. If the Dagster process is interrupted while the job is still running, the job will be terminated. | ||
|
||
EMR application steps `stdout` and `stderr` will be forwarded to the Dagster process. | ||
|
||
--- | ||
|
||
## Step 4: Create Dagster definitions | ||
|
||
Next, add the `PipesEMRClient` resource to your project's <PyObject object="Definitions" /> object: | ||
|
||
```python file=/guides/dagster/dagster_pipes/emr/dagster_code.py startafter=start_definitions_marker endbefore=end_definitions_marker | ||
from dagster import Definitions # noqa | ||
|
||
|
||
defs = Definitions( | ||
assets=[emr_pipes_asset], | ||
resources={ | ||
"pipes_emr_client": PipesEMRClient( | ||
message_reader=PipesS3MessageReader( | ||
client=boto3.client("s3"), bucket=os.environ["DAGSTER_PIPES_BUCKET"] | ||
) | ||
) | ||
}, | ||
) | ||
``` | ||
|
||
Dagster will now be able to launch the AWS EMR job from the `emr_asset` asset, and receive logs and events from the job. | ||
|
||
--- | ||
|
||
## Related | ||
|
||
<ArticleList> | ||
<ArticleListItem | ||
title="Dagster Pipes" | ||
href="/concepts/dagster-pipes" | ||
></ArticleListItem> | ||
<ArticleListItem | ||
title="AWS EMR Pipes API reference" | ||
href="/_apidocs/libraries/dagster-aws#dagster_aws.pipes.PipesEMRClient" | ||
></ArticleListItem> | ||
</ArticleList> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
28 changes: 28 additions & 0 deletions
28
examples/docs_snippets/docs_snippets/guides/dagster/dagster_pipes/emr/Dockerfile
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# this Dockerfile can be used to create a venv archive for PySpark on AWS EMR | ||
|
||
FROM amazonlinux:2 AS builder | ||
|
||
RUN yum install -y python3 | ||
|
||
WORKDIR /build | ||
|
||
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv | ||
|
||
ENV VIRTUAL_ENV=/build/.venv | ||
ENV PATH="$VIRTUAL_ENV/bin:$PATH" | ||
|
||
RUN uv python install --python-preference only-managed 3.9.16 && uv python pin 3.9.16 | ||
|
||
RUN uv venv .venv | ||
|
||
RUN --mount=type=cache,target=/root/.cache/uv \ | ||
uv pip install pex dagster-pipes boto3 pyspark | ||
|
||
RUN pex dagster-pipes boto3 pyspark -o /output/venv.pex && chmod +x /output/venv.pex | ||
|
||
# test imports | ||
RUN /output/venv.pex -c "import dagster_pipes, pyspark, boto3;" | ||
|
||
FROM scratch AS export | ||
|
||
COPY --from=builder /output/venv.pex /venv.pex |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops... was missing