
Commit

fix documentation (#26)
Co-authored-by: theodore.meynard <[email protected]>
theopinard and theodoremeynard authored Apr 22, 2024
1 parent a1b032a commit 3c8b807
Showing 5 changed files with 123 additions and 117 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@ Check out this blogpost if you want to [understand deeper its design motivation]

![ddataflow overview](docs/ddataflow.png)

You can find our documentation in the [docs folder](https://github.com/getyourguide/DDataFlow/tree/main/docs). And see the complete code reference [here](https://code.getyourguide.com/DDataFlow/ddataflow/ddataflow.html).
You can find our documentation under this [link](https://code.getyourguide.com/DDataFlow/).

## Features

Binary file removed ddataflow.png
114 changes: 112 additions & 2 deletions docs/index.md
@@ -2,8 +2,118 @@

DDataFlow is an end-to-end testing and local development solution for machine learning and data pipelines built with PySpark.

## Features

It allows you to:
- Read a subset of your data to speed up pipeline runs during tests
- Write artifacts to a test location so you don't pollute production
- Download data to enable development on your local machine

Below is the DDataFlow integration manual.
If you want to know how to use DDataFlow on your local machine, jump to [this section](local_development.md).
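
For orientation, here is a minimal sketch of what pipeline code looks like once the integration described below is in place. This is an illustrative example, not part of the library reference: the `ddataflow_config` module and the `events` source come from the sections that follow, and the aggregation is just an arbitrary transformation.

```py
# minimal sketch, assuming ddataflow_config.py exists as described below
from ddataflow_config import ddataflow_client

# show whether DDataflow is currently enabled or disabled
ddataflow_client.print_status()

# returns a sampled 'events' DataFrame when DDataflow is enabled
# (e.g. via `export ENABLE_DDATAFLOW=true`), the full production table otherwise
events = ddataflow_client.source("events")

# the rest of the pipeline is regular PySpark code (illustrative aggregation)
events.groupBy("event_name").count().show()
```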

## Install DDataFlow

```sh
pip install ddataflow
```

## Mapping your data sources

DDataflow is declarative and completely configurable through a single configuration provided at DDataflow startup. To create a configuration for your project, simply run:

```shell
ddataflow setup_project
```

You can also use this config in a notebook, with databricks-connect, or in the repository with db-rocket. Example config below:

```python
# later, save this script as ddataflow_config.py to follow our convention
from datetime import datetime, timedelta

import pyspark.sql.functions as F

from ddataflow import DDataflow

start_time = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
end_time = datetime.now().strftime("%Y-%m-%d")

config = {
    "data_sources": {
        # data sources define how to access data
        "events": {
            "source": lambda spark: spark.table("events"),
            # here we define the spark query to reduce the size of the data
            # the filtering strategy will most likely depend on the domain
            "filter": lambda df: df.filter(F.col("date") >= start_time)
            .filter(F.col("date") <= end_time)
            .filter(F.col("event_name").isin(["BookAction", "ActivityCardImpression"])),
        },
        "ActivityCardImpression": {
            # sources can also be parquet files
            "source": lambda spark: spark.read.parquet(
                f"dbfs:/events/eventname/date={start_time}/"
            )
        },
    },
    "project_folder_name": "myproject",
}

# initialize the application and validate the configuration
ddataflow_client = DDataflow(**config)
```

## Replace the sources

In your code, replace the calls to the original data sources with the ones provided by DDataflow.

```py
spark.table('events') #...
spark.read.parquet("dbfs:/mnt/analytics/cleaned/v1/ActivityCardImpression") # ...
```

Replace with the following:

```py
from ddataflow_config import ddataflow_client

ddataflow_client.source('events')
ddataflow_client.source("ActivityCardImpression")
```

It's not a problem if you don't map all data sources: any source you don't map will keep reading from the production tables and might just be slower. From this point on, you can use DDataflow to run your pipelines on the sampled data instead of the full data.
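
As an illustrative sketch of partial mapping, assuming a `spark` session is available (as in a Databricks notebook) and using a hypothetical `bookings` table and join key that are not part of the config above:

```py
from ddataflow_config import ddataflow_client

# mapped source: sampled when DDataflow is enabled
events = ddataflow_client.source("events")

# unmapped source: always reads the full production table (hypothetical table name)
bookings = spark.table("bookings")

# the rest of the pipeline stays unchanged (hypothetical join key)
joined = events.join(bookings, on="user_id", how="left")
```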

**Note: By default, DDataflow is DISABLED, so the calls will go to production, which, if done wrong, can lead to writing trash data.**

To enable DDataFlow, you can either export an environment variable, without changing the code:

```shell
# in shell or in the CICD pipeline
export ENABLE_DDATAFLOW=true
# run your pipeline as normal
python conduction_time_predictor/train.py
```

Or you can enable it programmatically in Python:

```py
ddataflow_client.enable()
```

At any point in time you can check if the tool is enabled or disabled by running:

```py
ddataflow_client.print_status()
```

## Writing data

To write data, we advise you to use the same code as in production and just write to a different destination.
DDataflow provides the `path` function, which returns a staging path when DDataflow is enabled.

```py
final_path = ddataflow_client.path('/mnt/my/production/path')
# final_path=/mnt/my/production/path when ddataflow is DISABLED
# final_path=$DDATAFLOW_FOLDER/project_name/mnt/my/production/path when ddataflow is ENABLED
```
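
As a usage sketch, assuming the `ddataflow_config` module from above and reusing the sampled `events` source as the DataFrame to persist (the parquet format is an illustrative choice):

```py
from ddataflow_config import ddataflow_client

# any DataFrame produced by your pipeline; here we reuse the sampled events source
result_df = ddataflow_client.source("events")

output_path = ddataflow_client.path("/mnt/my/production/path")

# the same write call works in production and in tests; only the destination changes
result_df.write.mode("overwrite").parquet(output_path)
```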

And you are good to go!
112 changes: 0 additions & 112 deletions docs/integrator_manual.md

This file was deleted.

12 changes: 10 additions & 2 deletions mkdocs.yml
@@ -1,15 +1,23 @@
site_name: DDataflow
site_url: https://example.com/
site_url: https://code.getyourguide.com/DDataFlow/
repo_url: https://github.com/getyourguide/DDataFlow/
edit_uri: edit/main/docs/

theme:
name: material
icon:
edit: material/pencil
repo: fontawesome/brands/github
features:
- content.action.edit



markdown_extensions:
- pymdownx.superfences

nav:
- 'index.md'
- 'integrator_manual.md'
- 'local_development.md'
- 'sampling.md'
- API Reference:
