first ten pages
sh-rp committed Sep 16, 2024
1 parent edb164c commit ee80356
Showing 10 changed files with 298 additions and 444 deletions.
2 changes: 1 addition & 1 deletion docs/website/docs/_book-onboarding-call.md
@@ -1 +1 @@
<a href="https://calendar.app.google/EMZRS6YhM11zTGQw7">book a call</a> with a dltHub Solutions Engineer
<a href="https://calendar.app.google/EMZRS6YhM11zTGQw7">Book a call</a> with a dltHub Solutions Engineer
110 changes: 38 additions & 72 deletions docs/website/docs/build-a-pipeline-tutorial.md
@@ -7,33 +7,31 @@ keywords: [getting started, quick start, basics]
# Building data pipelines with `dlt`, from basic to advanced

This in-depth overview will take you through the main areas of pipelining with `dlt`. Go to the
related pages you are instead looking for the [quickstart](getting-started.md), or the
related pages if you are instead looking for the [quickstart](getting-started.md), or the
[walkthroughs](walkthroughs).

## Why build pipelines with `dlt`?

`dlt` offers functionality to support the entire extract and load process. Let's look at the high level diagram:
`dlt` offers functionality to support the entire extract and load process. Let's look at the high-level diagram:

![dlt source resource pipe diagram](/img/dlt-high-level.png)

First, we have a `pipeline` function that can infer a schema from data and load the data to the destination.
We can use this pipeline with JSON data, dataframes, or other iterable objects such as generator functions.

First, we have a `pipeline` function, that can infer a schema from data and load the data to the destination.
We can use this pipeline with json data, dataframes, or other iterable objects such as generator functions.

This pipeline provides effortless loading via a schema discovery, versioning and evolution
engine that ensures you can "just load" any data with row and column level lineage.
This pipeline provides effortless loading via a schema discovery, versioning, and evolution
engine that ensures you can "just load" any data with row and column-level lineage.

By utilizing a `dlt pipeline`, we can easily adapt and structure data as it evolves, reducing the time spent on
maintenance and development.

This allows our data team to focus on leveraging the data and driving value, while ensuring
This allows our data team to focus on leveraging the data and driving value while ensuring
effective governance through timely notifications of any changes.

For extract, `dlt` also provides `source` and `resource` decorators that enable defining
how extracted data should be loaded, while supporting graceful,
scalable extraction via micro-batching and parallelism.
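For example, a minimal sketch of a resource feeding a pipeline (the `players` table name and sample rows are made up for illustration):

```py
import dlt

# a resource yields rows lazily, so large extractions can be micro-batched
@dlt.resource(table_name="players", write_disposition="append")
def players():
    for player_id in range(100):
        yield {"id": player_id, "name": f"player_{player_id}"}

# the pipeline infers a schema from the yielded rows and loads them to the destination
pipeline = dlt.pipeline(destination="duckdb", dataset_name="demo_data")
pipeline.run(players())
```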


## The simplest pipeline: 1 liner to load data with schema evolution

```py
@@ -77,11 +75,11 @@ The data you can pass to it should be iterable: lists of rows, generators, or `dlt` sources work
just fine.

If you want to configure how the data is loaded, you can choose between `write_disposition`s
such as `replace`, `append` and `merge` in the pipeline function.
such as `replace`, `append`, and `merge` in the pipeline function.

Here is an example where we load some data to duckdb by `upserting` or `merging` on the id column found in the data.
Here is an example where we load some data to DuckDB by `upserting` or `merging` on the id column found in the data.
In this example, we also run a dbt package and then load the outcomes of the load jobs into their respective tables.
This will enable us to log when schema changes occurred and match them to the loaded data for lineage, granting us both column and row level lineage.
This will enable us to log when schema changes occurred and match them to the loaded data for lineage, granting us both column and row-level lineage.
We also send a schema change alert to a Slack channel to which, hopefully, both the producer and consumer are subscribed.
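A minimal sketch of the merge part of that example (the `users` table and sample rows are illustrative; the dbt run and Slack alert shown in the full example are omitted here):

```py
import dlt

pipeline = dlt.pipeline(destination="duckdb", dataset_name="crm")

data = [
    {"id": 1, "name": "Alice", "status": "active"},
    {"id": 2, "name": "Bob", "status": "churned"},
]

# upsert on the id column: rows with an existing id are updated, new ids are inserted
load_info = pipeline.run(data, table_name="users", write_disposition="merge", primary_key="id")
print(load_info)
```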

```py
@@ -182,7 +180,7 @@ the correct order, accounting for any dependencies and transformations.
When deploying to Airflow, the internal DAG is unpacked into Airflow tasks in such a way as to ensure
consistency and allow granular loading.

## Defining Incremental Loading
## Defining incremental loading

[Incremental loading](general-usage/incremental-loading.md) is a crucial concept in data pipelines that involves loading only new or changed
data instead of reloading the entire dataset. This approach provides several benefits, including
Expand Down Expand Up @@ -227,15 +225,15 @@ incrementally, deduplicating it, and performing the necessary merge operations.
Advanced state management in `dlt` allows you to store and retrieve values across pipeline runs
by persisting them at the destination but accessing them in a dictionary in code. This enables you
to track and manage incremental loading effectively. By leveraging the pipeline state, you can
preserve information, such as last values, checkpoints or column renames, and utilize them later in
preserve information, such as last values, checkpoints, or column renames, and utilize them later in
the pipeline.
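A small sketch of reading and writing such state inside a resource (the `last_page` key is an arbitrary example):

```py
import dlt

@dlt.resource
def pages():
    state = dlt.current.resource_state()
    # pick up where the previous run left off; defaults to 0 on the first run
    last_page = state.setdefault("last_page", 0)
    for page in range(last_page, last_page + 10):
        yield {"page": page}
    # persist the new checkpoint so the next run continues from here
    state["last_page"] = last_page + 10

dlt.pipeline(destination="duckdb", dataset_name="state_demo").run(pages())
```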

## Transforming the Data
## Transforming the data

Data transformation plays a crucial role in the data loading process. You can perform
transformations both before and after loading the data. Here's how you can achieve it:

### Before Loading
### Before loading

Before loading the data, you have the flexibility to perform transformations using Python. You can
leverage Python's extensive libraries and functions to manipulate and preprocess the data as needed.
@@ -249,7 +247,7 @@ consistent mapping. The `dummy_source` generates dummy data with an `id` and `name`
column, and the `add_map` function applies the `pseudonymize_name` transformation to each
record.
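A sketch of that pattern, with `dummy_source` reduced to a single resource and a salted hash chosen as one possible deterministic transformation:

```py
import dlt
import hashlib

def pseudonymize_name(row):
    # deterministic hash so the same name always maps to the same pseudonym
    row["name"] = hashlib.sha256(("my_salt" + row["name"]).encode("utf-8")).hexdigest()
    return row

@dlt.resource(table_name="dummy_data")
def dummy_source():
    for i in range(3):
        yield {"id": i, "name": f"user_{i}"}

pipeline = dlt.pipeline(destination="duckdb", dataset_name="pseudo_demo")
# add_map applies the transformation to each record before it is loaded
pipeline.run(dummy_source().add_map(pseudonymize_name))
```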

### After Loading
### After loading

For transformations after loading the data, you have several options available:

@@ -316,13 +314,9 @@ with pipeline.sql_client() as client:
```

In this example, the `execute_sql` method of the SQL client allows you to execute SQL
statements. The statement inserts a row with values into the `customers` table.
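A short sketch of that statement (the `customers` columns are assumed for illustration, and the table is expected to exist from a previous load):

```py
import dlt

pipeline = dlt.pipeline(destination="duckdb", dataset_name="crm")

with pipeline.sql_client() as client:
    # values are passed as parameters so the driver escapes them
    client.execute_sql(
        "INSERT INTO customers VALUES (%s, %s, %s)", 10001, "Alice", "alice@example.com"
    )
```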

#### [Using Pandas](dlt-ecosystem/transformations/pandas.md)
statements. The statement inserts a row with values into the `customers` table.

#### [Using Pandas](dlt-ecosystem/transformations/pandas.md)

You can fetch query results as Pandas data frames and perform transformations using Pandas
functionalities. Here's an example of reading data from the `issues` table in DuckDB and
counting reaction types using Pandas:
You can fetch query results as Pandas data frames and perform transformations using Pandas functionalities. Here's an example of reading data from the `issues` table in DuckDB and counting reaction types using Pandas:

```py
pipeline = dlt.pipeline(
@@ -341,90 +335,62 @@ with pipeline.sql_client() as client:
counts = reactions.sum(0).sort_values(0, ascending=False)
```
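The fetch-as-DataFrame step follows this general pattern (the reaction column names are assumptions about the example's GitHub data):

```py
import dlt

pipeline = dlt.pipeline(destination="duckdb", dataset_name="github_data")

with pipeline.sql_client() as client:
    with client.execute_query(
        'SELECT "reactions__heart", "reactions__laugh", "reactions__hooray" FROM issues'
    ) as cursor:
        # read the full result set into a Pandas DataFrame
        reactions = cursor.df()

counts = reactions.sum(0).sort_values(0, ascending=False)
```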

By leveraging these transformation options, you can shape and manipulate the data before or after
loading it, allowing you to meet specific requirements and ensure data quality and consistency.
By leveraging these transformation options, you can shape and manipulate the data before or after loading it, allowing you to meet specific requirements and ensure data quality and consistency.

## Adjusting the automated normalisation
## Adjusting the automated normalization

To streamline the process, `dlt` recommends attaching schemas to sources implicitly instead of
creating them explicitly. You can provide a few global schema settings and let the table and column
schemas be generated from the resource hints and the data itself. The `dlt.source` decorator accepts a
schema instance that you can create and modify within the source function. Additionally, you can
store schema files with the source Python module and have them automatically loaded and used as the
schema for the source.
To streamline the process, `dlt` recommends attaching schemas to sources implicitly instead of creating them explicitly. You can provide a few global schema settings and let the table and column schemas be generated from the resource hints and the data itself. The `dlt.source` decorator accepts a schema instance that you can create and modify within the source function. Additionally, you can store schema files with the source Python module and have them automatically loaded and used as the schema for the source.

By adjusting the automated normalization process in `dlt`, you can ensure that the generated database
schema meets your specific requirements and aligns with your preferred naming conventions, data
types, and other customization needs.
By adjusting the automated normalization process in `dlt`, you can ensure that the generated database schema meets your specific requirements and aligns with your preferred naming conventions, data types, and other customization needs.

### Customizing the Normalization Process
### Customizing the normalization process

Customizing the normalization process in `dlt` allows you to adapt it to your specific requirements.

You can adjust table and column names, configure column properties, define data type autodetectors,
apply performance hints, specify preferred data types, or change how ids are propagated in the
unpacking process.
You can adjust table and column names, configure column properties, define data type autodetectors, apply performance hints, specify preferred data types, or change how ids are propagated in the unpacking process.

These customization options enable you to create a schema that aligns with your desired naming
conventions, data types, and overall data structure. With `dlt`, you have the flexibility to tailor
the normalization process to meet your unique needs and achieve optimal results.
These customization options enable you to create a schema that aligns with your desired naming conventions, data types, and overall data structure. With `dlt`, you have the flexibility to tailor the normalization process to meet your unique needs and achieve optimal results.
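A sketch of a few such adjustments on a resource and source (the specific hints and nesting level are illustrative):

```py
import dlt

@dlt.resource(
    table_name="events",
    # prefer an explicit data type for a column instead of relying on inference
    columns={"created_at": {"data_type": "timestamp"}},
)
def events():
    yield {"id": 1, "created_at": "2024-09-16T12:00:00Z", "payload": {"kind": "click"}}

# limit how deeply nested objects are unpacked into child tables
@dlt.source(max_table_nesting=1)
def event_source():
    return events()
```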

Read more about how to configure [schema generation.](general-usage/schema.md)

### Exporting and Importing Schema Files
### Exporting and importing schema files

`dlt` allows you to export and import schema files, which contain the structure and instructions for
processing and loading the data. Exporting schema files enables you to modify them directly, making
adjustments to the schema as needed. You can then import the modified schema files back into `dlt` to
use them in your pipeline.
`dlt` allows you to export and import schema files, which contain the structure and instructions for processing and loading the data. Exporting schema files enables you to modify them directly, making adjustments to the schema as needed. You can then import the modified schema files back into `dlt` to use them in your pipeline.

Read more: [Adjust a schema docs.](walkthroughs/adjust-a-schema.md)

## Governance Support in `dlt` Pipelines
## Governance support in `dlt` pipelines

`dlt` pipelines offer robust governance support through three key mechanisms: pipeline metadata
utilization, schema enforcement and curation, and schema change alerts.
`dlt` pipelines offer robust governance support through three key mechanisms: pipeline metadata utilization, schema enforcement and curation, and schema change alerts.

### Pipeline Metadata
### Pipeline metadata

`dlt` pipelines leverage metadata to provide governance capabilities. This metadata includes load IDs,
which consist of a timestamp and pipeline name. Load IDs enable incremental transformations and data
vaulting by tracking data loads and facilitating data lineage and traceability.
`dlt` pipelines leverage metadata to provide governance capabilities. This metadata includes load IDs, which consist of a timestamp and pipeline name. Load IDs enable incremental transformations and data vaulting by tracking data loads and facilitating data lineage and traceability.
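For example, every loaded row carries a `_dlt_load_id` column that joins back to the `_dlt_loads` table, so recent loads can be inspected roughly like this (verify the exact column set against your destination):

```py
import dlt

pipeline = dlt.pipeline(destination="duckdb", dataset_name="crm")

with pipeline.sql_client() as client:
    # list the load packages dlt recorded at the destination
    loads = client.execute_sql(
        "SELECT load_id, status, inserted_at FROM _dlt_loads ORDER BY inserted_at DESC"
    )
```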

Read more about [lineage](general-usage/destination-tables.md#data-lineage).

### Schema Enforcement and Curation
### Schema enforcement and curation

`dlt` empowers users to enforce and curate schemas, ensuring data consistency and quality. Schemas
define the structure of normalized data and guide the processing and loading of data. By adhering to
predefined schemas, pipelines maintain data integrity and facilitate standardized data handling
practices.
`dlt` empowers users to enforce and curate schemas, ensuring data consistency and quality. Schemas define the structure of normalized data and guide the processing and loading of data. By adhering to predefined schemas, pipelines maintain data integrity and facilitate standardized data handling practices.

Read more: [Adjust a schema docs.](walkthroughs/adjust-a-schema.md)

### Schema evolution

`dlt` enables proactive governance by alerting users to schema changes. When modifications occur in
the source data’s schema, such as table or column alterations, `dlt` notifies stakeholders, allowing
them to take necessary actions, such as reviewing and validating the changes, updating downstream
processes, or performing impact analysis.
`dlt` enables proactive governance by alerting users to schema changes. When modifications occur in the source data’s schema, such as table or column alterations, `dlt` notifies stakeholders, allowing them to take necessary actions, such as reviewing and validating the changes, updating downstream processes, or performing impact analysis.

These governance features in `dlt` pipelines contribute to better data management practices,
compliance adherence, and overall data governance, promoting data consistency, traceability, and
control throughout the data processing lifecycle.
These governance features in `dlt` pipelines contribute to better data management practices, compliance adherence, and overall data governance, promoting data consistency, traceability, and control throughout the data processing lifecycle.

### Scaling and finetuning

`dlt` offers several mechanism and configuration options to scale up and finetune pipelines:
`dlt` offers several mechanisms and configuration options to scale up and finetune pipelines:

- Running extraction, normalization and load in parallel.
- Running extraction, normalization, and load in parallel.
- Writing sources and resources that are run in parallel via thread pools and async execution.
- Finetune the memory buffers, intermediary file sizes and compression options.
- Finetune the memory buffers, intermediary file sizes, and compression options.

Read more about [performance.](reference/performance.md)
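As one possible sketch, these knobs can be set through dlt's configuration layers, for example via environment variables (the values are illustrative):

```py
import os

# parallel workers for the extract, normalize, and load stages
os.environ["EXTRACT__WORKERS"] = "3"
os.environ["NORMALIZE__WORKERS"] = "4"
os.environ["LOAD__WORKERS"] = "4"
# flush in-memory buffers to intermediary files after this many items
os.environ["DATA_WRITER__BUFFER_MAX_ITEMS"] = "5000"
```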

### Other advanced topics

`dlt` is a constantly growing library that supports many features and use cases needed by the
community. [Join our Slack](https://dlthub.com/community)
to find recent releases or discuss what you can build with `dlt`.
`dlt` is a constantly growing library that supports many features and use cases needed by the community. [Join our Slack](https://dlthub.com/community) to find recent releases or discuss what you can build with `dlt`.
32 changes: 11 additions & 21 deletions docs/website/docs/general-usage/glossary.md
@@ -8,56 +8,46 @@ keywords: [glossary, resource, source, pipeline]

## [Source](source)

Location that holds data with certain structure. Organized into one or more resources.
A location that holds data with a certain structure, organized into one or more resources.

- If endpoints in an API are the resources, then the API is the source.
- If tabs in a spreadsheet are the resources, then the source is the spreadsheet.
- If tables in a database are the resources, then the source is the database.

Within this documentation, **source** refers also to the software component (i.e. Python function)
that **extracts** data from the source location using one or more resource components.
Within this documentation, **source** also refers to the software component (i.e., Python function) that **extracts** data from the source location using one or more resource components.

## [Resource](resource)

A logical grouping of data within a data source, typically holding data of similar structure and
origin.
A logical grouping of data within a data source, typically holding data of similar structure and origin.

- If the source is an API, then a resource is an endpoint in that API.
- If the source is a spreadsheet, then a resource is a tab in that spreadsheet.
- If the source is a database, then a resource is a table in that database.

Within this documentation, **resource** refers also to the software component (i.e. Python function)
that **extracts** the data from source location.
Within this documentation, **resource** also refers to the software component (i.e., Python function) that **extracts** the data from the source location.

## [Destination](../dlt-ecosystem/destinations)

The data store where data from the source is loaded (e.g. Google BigQuery).
The data store where data from the source is loaded (e.g., Google BigQuery).

## [Pipeline](pipeline)

Moves the data from the source to the destination, according to instructions provided in the schema
(i.e. extracting, normalizing, and loading the data).
Moves the data from the source to the destination, according to instructions provided in the schema (i.e., extracting, normalizing, and loading the data).

## [Verified source](../walkthroughs/add-a-verified-source)

A Python module distributed with `dlt init` that allows creating pipelines that extract data from a
particular **Source**. Such module is intended to be published in order for others to use it to
build pipelines.
A Python module distributed with `dlt init` that allows creating pipelines that extract data from a particular **Source**. Such a module is intended to be published in order for others to use it to build pipelines.

A source must be published to become "verified": which means that it has tests, test data,
demonstration scripts, documentation and the dataset produces was reviewed by a data engineer.
A source must be published to become "verified," which means that it has tests, test data, demonstration scripts, documentation, and the dataset produced was reviewed by a data engineer.

## [Schema](schema)

Describes the structure of normalized data (e.g. unpacked tables, column types, etc.) and provides
instructions on how the data should be processed and loaded (i.e. it tells `dlt` about the content
of the data and how to load it into the destination).
Describes the structure of normalized data (e.g., unpacked tables, column types, etc.) and provides instructions on how the data should be processed and loaded (i.e., it tells `dlt` about the content of the data and how to load it into the destination).

## [Config](credentials/setup#secrets.toml-and-config.toml)

A set of values that are passed to the pipeline at run time (e.g. to change its behavior locally vs.
in production).
A set of values that are passed to the pipeline at runtime (e.g., to change its behavior locally vs. in production).

## [Credentials](credentials/complex_types)

A subset of configuration whose elements are kept secret and never shared in plain text.
A subset of configuration whose elements are kept secret and never shared in plain text.