
Commit

fix documentation (#26)
Co-authored-by: theodore.meynard <[email protected]>
theopinard and theodoremeynard authored Apr 22, 2024
1 parent a1b032a commit 3c8b807
Showing 5 changed files with 123 additions and 117 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@ Check out this blogpost if you want to [understand deeper its design motivation]

![ddataflow overview](docs/ddataflow.png)

You can find our documentation in the [docs folder](https://github.com/getyourguide/DDataFlow/tree/main/docs). And see the complete code reference [here](https://code.getyourguide.com/DDataFlow/ddataflow/ddataflow.html).
You can find our documentation under this [link](https://code.getyourguide.com/DDataFlow/).

## Features

Binary file removed ddataflow.png
114 changes: 112 additions & 2 deletions docs/index.md
@@ -2,8 +2,118 @@

DDataFlow is an end-to-end testing and local development solution for machine learning and data pipelines built with PySpark.

## Features

It allows you to:
- Read a subset of your data to speed up pipeline runs during tests
- Write artifacts to a test location so you don't pollute production
- Download data to enable development on your local machine

Below is the DDataFlow integration manual.
If you want to know how to use DDataFlow on your local machine, jump to [this section](local_development.md).
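
For orientation, here is a minimal sketch of what pipeline code looks like once the integration described below is in place. This is an illustrative example, not part of the library reference: the `ddataflow_config` module and the `events` source come from the sections that follow, and the aggregation is just an arbitrary transformation.

```py
# minimal sketch, assuming ddataflow_config.py exists as described below
from ddataflow_config import ddataflow_client

# show whether DDataflow is currently enabled or disabled
ddataflow_client.print_status()

# returns a sampled 'events' DataFrame when DDataflow is enabled
# (e.g. via `export ENABLE_DDATAFLOW=true`), the full production table otherwise
events = ddataflow_client.source("events")

# the rest of the pipeline is regular PySpark code (illustrative aggregation)
events.groupBy("event_name").count().show()
```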

## Install DDataFlow

```sh
pip install ddataflow
```

## Mapping your data sources

DDataflow is declarative and completely configurable through a single configuration provided at DDataflow startup. To create a configuration for your project, simply run:

```shell
ddataflow setup_project
```

You can also use this config in a notebook, with databricks-connect, or in the repository with db-rocket. Example config below:

```python
# later, save this script as ddataflow_config.py to follow our convention
from datetime import datetime, timedelta

import pyspark.sql.functions as F

from ddataflow import DDataflow

start_time = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
end_time = datetime.now().strftime("%Y-%m-%d")

config = {
    "data_sources": {
        # data sources define how to access data
        "events": {
            "source": lambda spark: spark.table("events"),
            # here we define the spark query to reduce the size of the data
            # the filtering strategy will most likely depend on the domain
            "filter": lambda df: df.filter(F.col("date") >= start_time)
            .filter(F.col("date") <= end_time)
            .filter(F.col("event_name").isin(["BookAction", "ActivityCardImpression"])),
        },
        "ActivityCardImpression": {
            # sources can also be parquet files
            "source": lambda spark: spark.read.parquet(
                f"dbfs:/events/eventname/date={start_time}/"
            )
        },
    },
    "project_folder_name": "myproject",
}

# initialize the application and validate the configuration
ddataflow_client = DDataflow(**config)
```

## Replace the sources

In your code, replace the calls to the original data sources with the ones provided by DDataflow.

```py
spark.table('events') #...
spark.read.parquet("dbfs:/mnt/analytics/cleaned/v1/ActivityCardImpression") # ...
```

Replace with the following:

```py
from ddataflow_config import ddataflow_client

ddataflow_client.source('events')
ddataflow_client.source("ActivityCardImpression")
```

It's not a problem if you don't map all data sources: any source you don't map will keep reading from the production tables and might just be slower. From this point on, you can use DDataflow to run your pipelines on the sampled data instead of the full data.
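
As an illustrative sketch of partial mapping, assuming a `spark` session is available (as in a Databricks notebook) and using a hypothetical `bookings` table and join key that are not part of the config above:

```py
from ddataflow_config import ddataflow_client

# mapped source: sampled when DDataflow is enabled
events = ddataflow_client.source("events")

# unmapped source: always reads the full production table (hypothetical table name)
bookings = spark.table("bookings")

# the rest of the pipeline stays unchanged (hypothetical join key)
joined = events.join(bookings, on="user_id", how="left")
```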

**Note: By default, DDataflow is DISABLED, so the calls will go to production, which, if done wrong, can lead to writing trash data.**

To enable DDataFlow, you can either export an environment variable, without changing the code:

```shell
# in shell or in the CICD pipeline
export ENABLE_DDATAFLOW=true
# run your pipeline as normal
python conduction_time_predictor/train.py
```

Or you can enable it programmatically in Python:

```py
ddataflow_client.enable()
```

At any point in time you can check if the tool is enabled or disabled by running:

```py
ddataflow_client.print_status()
```

## Writing data

To write data, we advise you to use the same code as in production and just write to a different destination.
DDataflow provides the `path` function, which returns a staging path when DDataflow is enabled.

```py
final_path = ddataflow_client.path('/mnt/my/production/path')
# final_path=/mnt/my/production/path when ddataflow is DISABLED
# final_path=$DDATAFLOW_FOLDER/project_name/mnt/my/production/path when ddataflow is ENABLED
```
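
As a usage sketch, assuming the `ddataflow_config` module from above and reusing the sampled `events` source as the DataFrame to persist (the parquet format is an illustrative choice):

```py
from ddataflow_config import ddataflow_client

# any DataFrame produced by your pipeline; here we reuse the sampled events source
result_df = ddataflow_client.source("events")

output_path = ddataflow_client.path("/mnt/my/production/path")

# the same write call works in production and in tests; only the destination changes
result_df.write.mode("overwrite").parquet(output_path)
```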

And you are good to go!
112 changes: 0 additions & 112 deletions docs/integrator_manual.md

This file was deleted.

12 changes: 10 additions & 2 deletions mkdocs.yml
@@ -1,15 +1,23 @@
site_name: DDataflow
site_url: https://example.com/
site_url: https://code.getyourguide.com/DDataFlow/
repo_url: https://github.com/getyourguide/DDataFlow/
edit_uri: edit/main/docs/

theme:
name: material
icon:
edit: material/pencil
repo: fontawesome/brands/github
features:
- content.action.edit



markdown_extensions:
- pymdownx.superfences

nav:
- 'index.md'
- 'integrator_manual.md'
- 'local_development.md'
- 'sampling.md'
- API Reference:
