Docs: fix broken code detection/highlighting #1810

Merged · merged 2 commits · Sep 16, 2024
Changes from 1 commit
84 changes: 38 additions & 46 deletions docs/website/docs/walkthroughs/add-incremental-configuration.md
@@ -6,8 +6,8 @@ slug: sql-incremental-configuration
---

# Add incremental configuration to SQL resources
Incremental loading is the act of loading only new or changed data and not old records that have already been loaded.
For example, a bank loading only the latest transactions or a company updating its database with new or modified user
information. In this article, we’ll discuss a few incremental loading strategies.

:::important
@@ -30,7 +30,7 @@ In this guide, we will discuss various incremental loading methods using `dlt`,

### 1. Full load (replace)

A full load strategy completely overwrites the existing data with the new dataset. This is useful when you want to
refresh the entire table with the latest data.

:::note
@@ -40,15 +40,14 @@ This strategy technically does not load only new data but instead reloads all da
Here’s a walkthrough:

1. The initial table, named "contact", in the SQL source looks like this:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 |
| 2 | Bob | 2024-07-02 |

Collaborator (Author): @dat-a-man A new line after a numbered list item was the reason why the code block didn't render properly (see the screenshot)

Collaborator: Hey @burnash, thanks for looking into it. Do you see some snippets and tables broken in deploy preview here?

Collaborator (Author): @dat-a-man no, I think I fixed all of them for this page

2. The Python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline. Please note the `write_disposition = "replace"` used below. (A full sketch of the collapsed snippet appears at the end of this section.)

```py
def load_full_table_resource() -> None:
"""Load a full table, replacing existing data."""
@@ -71,23 +70,22 @@
```

3. After running the `dlt` pipeline, the data loaded into the BigQuery "contact" table looks like:

| Row | id | name | created_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Alice | 2024-07-01 | 1721878309.021546 | tgyMM73iMz0cQg |
| 2 | 2 | Bob | 2024-07-02 | 1721878309.021546 | 88P0bD796pXo/Q |

4. Next, the "contact" table in the SQL source is updated—two new rows are added, and the row with `id = 2` is removed. The updated data source ("contact" table) now presents itself as follows:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 |
| 3 | Charlie | 2024-07-03 |
| 4 | Dave | 2024-07-04 |

5. The "contact" table created in BigQuery after running the pipeline again:

| Row | id | name | created_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Alice | 2024-07-01 | 1721878309.021546 | S5ye6fMhYECZA |
@@ -97,7 +95,7 @@ The updated data source ("contact" table) now presents itself as follows:
**What happened?**

After running the pipeline, the original data in the "contact" table (Alice and Bob) is completely replaced with the new
updated table with data “Charlie” and “Dave” added and “Bob” removed. This strategy is useful for scenarios where the entire
dataset needs to be refreshed/replaced with the latest information.
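
Because the diff collapses most of the snippet from step 2, here is a minimal, self-contained sketch of what such a replace-mode pipeline can look like. It is not the verbatim content of the collapsed lines: the pipeline and dataset names are placeholders, and it assumes the built-in `sql_database` source (`dlt.sources.sql_database`) with BigQuery credentials supplied via `.dlt/secrets.toml`.

```py
import dlt
from dlt.sources.sql_database import sql_database  # assumed built-in SQL source


def load_full_table_resource() -> None:
    """Load a full table, replacing existing data."""
    # Pipeline and dataset names are illustrative; credentials come from secrets.toml
    pipeline = dlt.pipeline(
        pipeline_name="sql_to_bigquery",
        destination="bigquery",
        dataset_name="dlt_contacts",
    )

    # Select only the "contact" table from the SQL source
    source = sql_database().with_resources("contact")

    # "replace" rebuilds the destination table with the freshly extracted rows
    info = pipeline.run(source, write_disposition="replace")
    print(info)


if __name__ == "__main__":
    load_full_table_resource()
```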

### 2. Append new records based on incremental ID
@@ -107,16 +105,14 @@ This strategy appends only new records to the table based on an incremental ID.
Here’s a walkthrough:

1. The initial table, named "contact", in the SQL source looks like this:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 |
| 2 | Bob | 2024-07-02 |

2. The Python code demonstrates loading data from an SQL source into BigQuery using an incremental variable, `id`. This variable tracks new or updated records in the `dlt` pipeline. Please note the `write_disposition = "append"` used below. (A reconstructed sketch of the collapsed snippet appears at the end of this section.)

```py
def load_incremental_id_table_resource() -> None:
"""Load a table incrementally based on an ID."""
@@ -138,23 +134,23 @@ used below.
```

3. After running the `dlt` pipeline, the data loaded into BigQuery "contact" table looks like:

| Row | id | name | created_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Alice | 2024-07-01 | 1721878309.021546 | YQfmAu8xysqWmA |
| 2 | 2 | Bob | 2024-07-02 | 1721878309.021546 | Vcb5KKah/RpmQw |

4. Next, the "contact" table in the SQL source is updated—two new rows are added, and the row with `id = 2` is removed.
The updated data source now presents itself as follows:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 |
| 3 | Charlie | 2024-07-03 |
| 4 | Dave | 2024-07-04 |

5. The "contact" table created in BigQuery after running the pipeline again:

| Row | id | name | created_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Alice | 2024-07-01 | 1721878309.021546 | OW9ZyAzkXg4D4w |
@@ -164,26 +160,23 @@ The updated data source now presents itself as follows:

**What happened?**

In this scenario, the pipeline appends new records (Charlie and Dave) to the existing data (Alice and Bob) without affecting the pre-existing entries. This strategy is ideal when only new data needs to be added, preserving the historical data.
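
As above, the collapsed snippet from step 2 can only be reconstructed approximately. A minimal sketch, assuming the same `sql_database` source and illustrative names: the incremental cursor on `id` is applied with `apply_hints`, and `write_disposition="append"` adds only the new rows.

```py
import dlt
from dlt.sources.sql_database import sql_database  # assumed built-in SQL source


def load_incremental_id_table_resource() -> None:
    """Load a table incrementally based on an ID."""
    pipeline = dlt.pipeline(
        pipeline_name="sql_to_bigquery",  # illustrative name
        destination="bigquery",
        dataset_name="dlt_contacts",
    )

    # Track the highest "id" seen so far; only rows with a greater id are extracted
    source = sql_database().with_resources("contact")
    source.contact.apply_hints(
        incremental=dlt.sources.incremental("id", initial_value=0)
    )

    # Append the new rows; existing rows in the destination are left untouched
    info = pipeline.run(source, write_disposition="append")
    print(info)
```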

### 3. Append new records based on timestamp ("created_at")

This strategy appends only new records to the table based on a date/timestamp field. It is useful for scenarios where records are created with a timestamp, and you want to load only those records created after a certain date.

Here’s a walkthrough:

1. The initial dataset, named "contact", in the SQL source looks like this:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 00:00:00 |
| 2 | Bob | 2024-07-02 00:00:00 |

2. The Python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline. Please note the `write_disposition = "append"`, with `created_at` being used as the incremental parameter. (See the sketch at the end of this section for a reconstructed version of the collapsed snippet.)

```py
def load_incremental_timestamp_table_resource() -> None:
"""Load a table incrementally based on created_at timestamp."""
@@ -208,23 +201,23 @@ note the `write_disposition = "append"`, with `created_at` being used as the inc
```

3. After running the `dlt` pipeline, the data loaded into BigQuery "contact" table looks like:

| Row | id | name | created_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Alice | 2024-07-01 00:00:00 UTC | 1721878309.021546 | 5H8ca6C89umxHA |
| 2 | 2 | Bob | 2024-07-02 00:00:00 UTC | 1721878309.021546 | M61j4aOSqs4k2w |

4. Next, the "contact" table in the SQL source is updated—two new rows are added, and the row with `id = 2` is removed.
The updated data source now presents itself as follows:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 00:00:00 |
| 3 | Charlie | 2024-07-03 00:00:00 |
| 4 | Dave | 2024-07-04 00:00:00 |

5. The "contact" table created in BigQuery after running the pipeline again:

| Row | id | name | created_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Alice | 2024-07-01 00:00:00 UTC | 1721878309.021546 | Petj6R+B/63sWA |
@@ -239,21 +232,20 @@ retaining the existing data (Alice and Bob). This approach is useful for loading
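
The step 2 snippet for this strategy is also largely collapsed in the diff; the sketch below shows the same pattern with the cursor moved to the `created_at` column. The `initial_value` and the pipeline/dataset names are assumptions, not the original values.

```py
import dlt
from datetime import datetime
from dlt.sources.sql_database import sql_database  # assumed built-in SQL source


def load_incremental_timestamp_table_resource() -> None:
    """Load a table incrementally based on created_at timestamp."""
    pipeline = dlt.pipeline(
        pipeline_name="sql_to_bigquery",  # illustrative name
        destination="bigquery",
        dataset_name="dlt_contacts",
    )

    # Only rows with created_at greater than the last stored cursor value are extracted;
    # the initial_value below is an assumed starting point
    source = sql_database().with_resources("contact")
    source.contact.apply_hints(
        incremental=dlt.sources.incremental(
            "created_at", initial_value=datetime(2024, 4, 1, 0, 0, 0)
        )
    )

    info = pipeline.run(source, write_disposition="append")
    print(info)
```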

### 4. Merge (Update/Insert) records based on timestamp ("last_modified_at") and ID

This strategy merges records based on a composite key of ID and a timestamp field. It updates existing records and inserts
new ones as necessary.

Here’s a walkthrough:

1. The initial dataset, named ‘contact’, in the SQL source looks like this:

| id | name | last_modified_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 00:00:00 |
| 2 | Bob | 2024-07-02 00:00:00 |

2. The Python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline. Please note the `write_disposition = "merge"`, with `last_modified_at` being used as the incremental parameter. (A full sketch of the collapsed snippet appears at the end of this section.)

```py
def load_merge_table_resource() -> None:
"""Merge (update/insert) records based on last_modified_at timestamp and ID."""
@@ -266,7 +258,7 @@ note the `write_disposition = "merge"`, with `last_modified_at` being used as th
# Merge records, 'contact' table, based on ID and last_modified_at timestamp
source = sql_database().with_resources("contact")
source.contact.apply_hints(incremental=dlt.sources.incremental(
"last_modified_at", initial_value=datetime(2024, 4, 1, 0, 0, 0)),
"last_modified_at", initial_value=datetime(2024, 4, 1, 0, 0, 0)),
primary_key="id")

# Run the pipeline
@@ -279,21 +271,21 @@ note the `write_disposition = "merge"`, with `last_modified_at` being used as th
```

3. After running the `dlt` pipeline, the data loaded into BigQuery ‘contact’ table looks like:

| Row | id | name | last_modified_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Alice | 2024-07-01 00:00:00 UTC | 1721878309.021546 | ObbVlxcly3VknQ |
| 2 | 2 | Bob | 2024-07-02 00:00:00 UTC | 1721878309.021546 | Vrlkus/haaKlEg |

4. Next, the "contact" table in the SQL source is updated—“Alice” is updated to “Alice Updated”, and a new row “Hank” is added:

| id | name | last_modified_at |
| --- | --- | --- |
| 1 | Alice Updated | 2024-07-08 00:00:00 |
| 3 | Hank | 2024-07-08 00:00:00 |

5. The "contact" table created in BigQuery after running the pipeline again:

| Row | id | name | last_modified_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
| 1 | 2 | Bob | 2024-07-02 00:00:00 UTC | 1721878309.021546 | Cm+AcDZLqXSDHQ |
@@ -302,9 +294,9 @@ note the `write_disposition = "merge"`, with `last_modified_at` being used as th

**What happened?**

The pipeline updates the record for Alice with the new data, including the updated `last_modified_at` timestamp, and adds a
new record for Hank. This method is beneficial when you need to ensure that records are both updated and inserted based on a
specific timestamp and ID.
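
For completeness, here is a hedged sketch of the merge pipeline from step 2, built around the `apply_hints` call that is visible in the diff. The pipeline and dataset names are again placeholders, and the source setup is assumed to match the earlier examples.

```py
import dlt
from datetime import datetime
from dlt.sources.sql_database import sql_database  # assumed built-in SQL source


def load_merge_table_resource() -> None:
    """Merge (update/insert) records based on last_modified_at timestamp and ID."""
    pipeline = dlt.pipeline(
        pipeline_name="sql_to_bigquery",  # illustrative name
        destination="bigquery",
        dataset_name="dlt_contacts",
    )

    # last_modified_at drives the incremental cursor; primary_key="id" lets the
    # merge disposition update existing rows instead of appending duplicates
    source = sql_database().with_resources("contact")
    source.contact.apply_hints(
        incremental=dlt.sources.incremental(
            "last_modified_at", initial_value=datetime(2024, 4, 1, 0, 0, 0)
        ),
        primary_key="id",
    )

    info = pipeline.run(source, write_disposition="merge")
    print(info)
```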

The examples provided explain how to use `dlt` to achieve different incremental loading scenarios, highlighting the changes
before and after running each pipeline.