diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/advanced.md b/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/advanced.md index be08e9ff44..e1eeca0ee9 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/advanced.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/advanced.md @@ -32,10 +32,10 @@ The filesystem ensures consistent file representation across bucket types and of #### `FileItem` fields -- `file_url` - complete URL of the file (e.g. `s3://bucket-name/path/file`). This field serves as a primary key. +- `file_url` - complete URL of the file (e.g., `s3://bucket-name/path/file`). This field serves as a primary key. - `file_name` - name of the file from the bucket URL. - `relative_path` - set when doing `glob`, is a relative path to a `bucket_url` argument. -- `mime_type` - file's mime type. It is sourced from the bucket provider or inferred from its extension. +- `mime_type` - file's MIME type. It is sourced from the bucket provider or inferred from its extension. - `modification_date` - file's last modification time (format: `pendulum.DateTime`). - `size_in_bytes` - file size. - `file_content` - content, provided upon request. @@ -90,7 +90,7 @@ example_xls = filesystem( bucket_url=BUCKET_URL, file_glob="../directory/example.xlsx" ) | read_excel("example_table") # Pass the data through the transformer to read the "example_table" sheet. -pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb", dataset_name="example_xls_data",) +pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb", dataset_name="example_xls_data") # Execute the pipeline and load the extracted data into the "duckdb" destination. load_info = pipeline.run(example_xls.with_name("example_xls_data")) # Print the loading information. @@ -119,7 +119,7 @@ def read_xml(items: Iterator[FileItemDict]) -> Iterator[TDataItems]: for file_obj in items: # Open the file object. with file_obj.open() as file: - # Parse the file to dict records + # Parse the file to dict records. yield xmltodict.parse(file.read()) # Set up the pipeline to fetch a specific XML file from a filesystem (bucket). @@ -143,14 +143,14 @@ You can get an fsspec client from the filesystem resource after it was extracted from dlt.sources.filesystem import filesystem, read_csv from dlt.sources.filesystem.helpers import fsspec_from_resource -# get filesystem source +# Get filesystem source. gs_resource = filesystem("gs://ci-test-bucket/") -# extract files +# Extract files. pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb") pipeline.run(gs_resource | read_csv()) -# get fs client +# Get fs client. fs_client = fsspec_from_resource(gs_resource) -# do any operation +# Do any operation. 
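# The client is a standard fsspec filesystem, so other common calls work here
# as well, for example (commented out; the file name below is hypothetical):
# fs_client.info("ci-test-bucket/standard_source/samples")
# fs_client.open("ci-test-bucket/standard_source/samples/example.csv")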
fs_client.ls("ci-test-bucket/standard_source/samples") ``` @@ -166,31 +166,32 @@ from dlt.common.storages.fsspec_filesystem import FileItemDict from dlt.sources.filesystem import filesystem def _copy(item: FileItemDict) -> FileItemDict: - # instantiate fsspec and copy file + # Instantiate fsspec and copy file dest_file = os.path.join(local_folder, item["file_name"]) - # create dest folder + # Create destination folder os.makedirs(os.path.dirname(dest_file), exist_ok=True) - # download file + # Download file item.fsspec.download(item["file_url"], dest_file) - # return file item unchanged + # Return file item unchanged return item BUCKET_URL = "gs://ci-test-bucket/" -# use recursive glob pattern and add file copy step +# Use recursive glob pattern and add file copy step downloader = filesystem(BUCKET_URL, file_glob="**").add_map(_copy) -# NOTE: you do not need to load any data to execute extract, below we obtain +# NOTE: You do not need to load any data to execute extract; below, we obtain # a list of files in a bucket and also copy them locally listing = list(downloader) print(listing) -# download to table "listing" +# Download to table "listing" pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb") load_info = pipeline.run( downloader.with_name("listing"), write_disposition="replace" ) -# pretty print the information on data that was loaded +# Pretty print the information on data that was loaded print(load_info) print(listing) print(pipeline.last_trace.last_normalize_info) -``` \ No newline at end of file +``` + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/basic.md b/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/basic.md index 6eb02b4edf..5ae7de82da 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/basic.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/basic.md @@ -10,7 +10,7 @@ Filesystem source allows loading files from remote locations (AWS S3, Google Clo To load unstructured data (`.pdf`, `.txt`, e-mail), please refer to the [unstructured data source](https://github.com/dlt-hub/verified-sources/tree/master/sources/unstructured_data). -## How Filesystem source works? +## How filesystem source works The Filesystem source doesn't just give you an easy way to load data from both remote and local files — it also comes with a powerful set of tools that let you customize the loading process to fit your specific needs. @@ -54,7 +54,7 @@ To get started with your data pipeline, follow these steps: dlt init filesystem duckdb ``` - [dlt init command](../../../reference/command-line-interface) will initialize + The [dlt init command](../../../reference/command-line-interface) will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/filesystem_pipeline.py) with the filesystem as the source and [duckdb](../../destinations/duckdb.md) as the destination. @@ -66,6 +66,8 @@ To get started with your data pipeline, follow these steps: ## Configuration + + ### Get credentials @@ -145,7 +147,7 @@ You don't need any credentials for the local filesystem. To provide credentials to the filesystem source, you can use [any method available](../../../general-usage/credentials/setup#available-config-providers) in `dlt`. One of the easiest ways is to use configuration files. The `.dlt` folder in your working directory -contains two files: `config.toml` and `secrets.toml`. Sensitive information, like passwords and +contains two files: `config.toml` and `secrets.toml`. 
Sensitive information, like passwords and access tokens, should only be put into `secrets.toml`, while any other configuration, like the path to a bucket, can be specified in `config.toml`. @@ -212,7 +214,7 @@ bucket_url="gs:////" Learn how to set up SFTP credentials for each authentication method in the [SFTP section](../../destinations/filesystem#sftp). -For example, in case of key-based authentication, you can configure the source the following way: +For example, in the case of key-based authentication, you can configure the source the following way: ```toml # secrets.toml @@ -229,7 +231,7 @@ bucket_url = "sftp://[hostname]/[path]" -You can use both native local filesystem paths and `file://` URI. Absolute, relative, and UNC Windows paths are supported. +You can use both native local filesystem paths and the `file://` URI. Absolute, relative, and UNC Windows paths are supported. You could provide an absolute filepath: @@ -239,7 +241,7 @@ You could provide an absolute filepath: bucket_url='file://Users/admin/Documents/csv_files' ``` -Or skip the schema and provide the local path in a format native for your operating system. For example, for Windows: +Or skip the schema and provide the local path in a format native to your operating system. For example, for Windows: ```toml [sources.filesystem] @@ -250,7 +252,7 @@ bucket_url='~\Documents\csv_files\' -You can also specify the credentials using Environment variables. The name of the corresponding environment +You can also specify the credentials using environment variables. The name of the corresponding environment variable should be slightly different from the corresponding name in the `toml` file. Simply replace dots `.` with double underscores `__`: @@ -260,7 +262,7 @@ export SOURCES__FILESYSTEM__AWS_SECRET_ACCESS_KEY = "Please set me up!" ``` :::tip -`dlt` supports more ways of authorizing with the cloud storage, including identity-based +`dlt` supports more ways of authorizing with cloud storage, including identity-based and default credentials. To learn more about adding credentials to your pipeline, please refer to the [Configuration and secrets section](../../../general-usage/credentials/complex_types#gcp-credentials). ::: @@ -310,7 +312,7 @@ or taken from the config: Full list of `filesystem` resource parameters: * `bucket_url` - full URL of the bucket (could be a relative path in the case of the local filesystem). -* `credentials` - cloud storage credentials of `AbstractFilesystem` instance (should be empty for the local filesystem). We recommend not to specify this parameter in the code, but put it in secrets file instead. +* `credentials` - cloud storage credentials of `AbstractFilesystem` instance (should be empty for the local filesystem). We recommend not specifying this parameter in the code, but putting it in a secrets file instead. * `file_glob` - file filter in glob format. Defaults to listing all non-recursive files in the bucket URL. * `files_per_page` - number of files processed at once. The default value is `100`. * `extract_content` - if true, the content of the file will be read and returned in the resource. The default value is `False`. @@ -332,15 +334,15 @@ filesystem_pipe = filesystem( #### Available transformers -- `read_csv()` - process `csv` files using `pandas` -- `read_jsonl()` - process `jsonl` files chuck by chunk -- `read_parquet()` - process `parquet` files using `pyarrow` -- `read_csv_duckdb()` - this transformer process `csv` files using DuckDB, which usually shows better performance, than `pandas`. 
+- `read_csv()` - processes `csv` files using `pandas` +- `read_jsonl()` - processes `jsonl` files chunk by chunk +- `read_parquet()` - processes `parquet` files using `pyarrow` +- `read_csv_duckdb()` - this transformer processes `csv` files using DuckDB, which usually shows better performance than `pandas`. :::tip We advise that you give each resource a [specific name](../../../general-usage/resource#duplicate-and-rename-resources) -before loading with `pipeline.run`. This will make sure that data goes to a table with the name you +before loading with `pipeline.run`. This will ensure that data goes to a table with the name you want and that each pipeline uses a [separate state for incremental loading.](../../../general-usage/state#read-and-write-pipeline-state-in-a-resource) ::: @@ -366,7 +368,7 @@ import dlt from dlt.sources.filesystem import filesystem, read_csv filesystem_pipe = filesystem(bucket_url="file://Users/admin/Documents/csv_files", file_glob="*.csv") | read_csv() -# tell dlt to merge on date +# Tell dlt to merge on date filesystem_pipe.apply_hints(write_disposition="merge", merge_key="date") # We load the data into the table_name table @@ -380,19 +382,19 @@ print(load_info) Here are a few simple ways to load your data incrementally: 1. [Load files based on modification date](#load-files-based-on-modification-date). Only load files that have been updated since the last time `dlt` processed them. `dlt` checks the files' metadata (like the modification date) and skips those that haven't changed. -2. [Load new records based on a specific column](#load-new-records-based-on-a-specific-column). You can load only the new or updated records by looking at a specific column, like `updated_at`. Unlike the first method, this approach would read all files every time and then filter the records which was updated. -3. [Combine loading only updated files and records](#combine-loading-only-updated-files-and-records). Finally, you can combine both methods. It could be useful if new records could be added to existing files, so you not only want to filter the modified files, but modified records as well. +2. [Load new records based on a specific column](#load-new-records-based-on-a-specific-column). You can load only the new or updated records by looking at a specific column, like `updated_at`. Unlike the first method, this approach would read all files every time and then filter the records which were updated. +3. [Combine loading only updated files and records](#combine-loading-only-updated-files-and-records). Finally, you can combine both methods. It could be useful if new records could be added to existing files, so you not only want to filter the modified files, but also the modified records. #### Load files based on modification date -For example, to load only new CSV files with [incremental loading](../../../general-usage/incremental-loading) you can use `apply_hints` method. +For example, to load only new CSV files with [incremental loading](../../../general-usage/incremental-loading), you can use the `apply_hints` method. 
```py import dlt from dlt.sources.filesystem import filesystem, read_csv -# This configuration will only consider new csv files +# This configuration will only consider new CSV files new_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv") -# add incremental on modification time +# Add incremental on modification time new_files.apply_hints(incremental=dlt.sources.incremental("modification_date")) pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb") @@ -402,13 +404,13 @@ print(load_info) #### Load new records based on a specific column -In this example we load only new records based on the field called `updated_at`. This method may be useful if you are not able to -filter files by modification date because for example, all files are modified each time new record is appeared. +In this example, we load only new records based on the field called `updated_at`. This method may be useful if you are not able to +filter files by modification date because, for example, all files are modified each time a new record appears. ```py import dlt from dlt.sources.filesystem import filesystem, read_csv -# We consider all csv files +# We consider all CSV files all_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv") # But filter out only updated records @@ -425,11 +427,11 @@ print(load_info) import dlt from dlt.sources.filesystem import filesystem, read_csv -# This configuration will only consider modified csv files +# This configuration will only consider modified CSV files new_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv") new_files.apply_hints(incremental=dlt.sources.incremental("modification_date")) -# And in each modified file we filter out only updated records +# And in each modified file, we filter out only updated records filesystem_pipe = (new_files | read_csv()) filesystem_pipe.apply_hints(incremental=dlt.sources.incremental("updated_at")) pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb") @@ -459,7 +461,7 @@ print(load_info) ``` :::tip -You could also use `file_glob` to filter files by names. It works very well in simple cases, for example, filtering by extention: +You could also use `file_glob` to filter files by names. It works very well in simple cases, for example, filtering by extension: ```py from dlt.sources.filesystem import filesystem @@ -493,8 +495,8 @@ print(load_info) Windows supports paths up to 255 characters. When you access a path longer than 255 characters, you'll see a `FileNotFound` exception. - To go over this limit, you can use [extended paths](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry). - **Note that Python glob does not work with extended UNC paths**, so you will not be able to use them +To go over this limit, you can use [extended paths](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry). +**Note that Python glob does not work with extended UNC paths**, so you will not be able to use them ```toml [sources.filesystem] @@ -514,4 +516,5 @@ function to configure the resource correctly. Use `**` to include recursive file filesystem supports full Python [glob](https://docs.python.org/3/library/glob.html#glob.glob) functionality, while cloud storage supports a restricted `fsspec` [version](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.glob). 
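
For a quick check, you can list what a given `bucket_url` and `file_glob` actually match before running the full pipeline. Below is a minimal sketch; the bucket URL and extension are placeholders for your own setup, with credentials configured as shown above:

```py
from dlt.sources.filesystem import filesystem

# "**/*.csv" descends into all subfolders; a plain "*.csv" only matches the top level.
files = filesystem(bucket_url="s3://bucket_name", file_glob="**/*.csv")

# Each item is a FileItem dict; an empty list usually points to a wrong bucket_url or file_glob.
print(list(files))
```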
- \ No newline at end of file + + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/index.md b/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/index.md index 1441931340..0aaa07b0c3 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/index.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/filesystem/index.md @@ -12,8 +12,9 @@ The Filesystem source allows seamless loading of files from the following locati * remote filesystem (via SFTP) * local filesystem -The Filesystem source natively supports `csv`, `parquet`, and `jsonl` files and allows customization for loading any type of structured files. +The Filesystem source natively supports `csv`, `parquet`, and `jsonl` files and allows customization for loading any type of structured file. import DocCardList from '@theme/DocCardList'; - \ No newline at end of file + + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/advanced.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/advanced.md index 27d2cc0b6e..26add81def 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/advanced.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/advanced.md @@ -9,15 +9,15 @@ keywords: [rest api, restful api] - `config`: The REST API configuration dictionary. - `name`: An optional name for the source. - `section`: An optional section name in the configuration file. -- `max_table_nesting`: Sets the maximum depth of nested table above which the remaining nodes are loaded as structs or JSON. -- `root_key` (bool): Enables merging on all resources by propagating root foreign key to nested tables. This option is most useful if you plan to change write disposition of a resource to disable/enable merge. Defaults to False. +- `max_table_nesting`: Sets the maximum depth of nested tables above which the remaining nodes are loaded as structs or JSON. +- `root_key` (bool): Enables merging on all resources by propagating the root foreign key to nested tables. This option is most useful if you plan to change the write disposition of a resource to disable/enable merge. Defaults to False. - `schema_contract`: Schema contract settings that will be applied to this resource. - `spec`: A specification of configuration and secret values required by the source. ### Response actions The `response_actions` field in the endpoint configuration allows you to specify how to handle specific responses or all responses from the API. For example, responses with specific status codes or content substrings can be ignored. -Additionally, all responses or only responses with specific status codes or content substrings can be transformed with a custom callable, such as a function. This callable is passed on to the requests library as a [response hook](https://requests.readthedocs.io/en/latest/user/advanced/#event-hooks). The callable can modify the response object and has to return it for the modifications to take effect. +Additionally, all responses or only responses with specific status codes or content substrings can be transformed with a custom callable, such as a function. This callable is passed on to the requests library as a [response hook](https://requests.readthedocs.io/en/latest/user/advanced/#event-hooks). The callable can modify the response object and must return it for the modifications to take effect. :::caution Experimental Feature This is an experimental feature and may change in future releases. 
@@ -55,7 +55,7 @@ from requests.models import Response from dlt.common import json def set_encoding(response, *args, **kwargs): - # sets the encoding in case it's not correctly detected + # Sets the encoding in case it's not correctly detected response.encoding = 'windows-1252' return response @@ -99,7 +99,7 @@ In this example, the resource will set the correct encoding for all responses fi ```py def set_encoding(response, *args, **kwargs): - # sets the encoding in case it's not correctly detected + # Sets the encoding in case it's not correctly detected response.encoding = 'windows-1252' return response @@ -122,3 +122,4 @@ source_config = { ``` In this example, the resource will set the correct encoding for all responses. More callables can be added to the list of response_actions. + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md index 121769a11a..03214950f4 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md @@ -62,7 +62,7 @@ pipeline = dlt.pipeline( load_info = pipeline.run(source) ``` -Running this pipeline will create two tables in the DuckDB: `posts` and `comments` with the data from the respective API endpoints. The `comments` resource will fetch comments for each post by using the `id` field from the `posts` resource. +Running this pipeline will create two tables in DuckDB: `posts` and `comments` with the data from the respective API endpoints. The `comments` resource will fetch comments for each post by using the `id` field from the `posts` resource. ## Setup @@ -132,9 +132,11 @@ github_token = "your_github_token" ## Source configuration + + ### Quick example -Let's take a look at the GitHub example in `rest_api_pipeline.py` file: +Let's take a look at the GitHub example in the `rest_api_pipeline.py` file: ```py from dlt.sources.rest_api import RESTAPIConfig, rest_api_resources @@ -206,14 +208,14 @@ def load_github() -> None: The declarative resource configuration is defined in the `config` dictionary. It contains the following key components: -1. `client`: Defines the base URL and authentication method for the API. In this case it uses token-based authentication. The token is stored in the `secrets.toml` file. +1. `client`: Defines the base URL and authentication method for the API. In this case, it uses token-based authentication. The token is stored in the `secrets.toml` file. 2. `resource_defaults`: Contains default settings for all [resources](#resource-configuration). In this example, we define that all resources: - Have `id` as the [primary key](../../../general-usage/resource#define-schema) - Use the `merge` [write disposition](../../../general-usage/incremental-loading#choosing-a-write-disposition) to merge the data with the existing data in the destination. - - Send a `per_page` query parameter with each request to 100 to get more results per page. + - Send a `per_page=100` query parameter with each request to get more results per page. -3. `resources`: A list of [resources](#resource-configuration) to be loaded. Here, we have two resources: `issues` and `issue_comments`, which correspond to the GitHub API endpoints for [repository issues](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues) and [issue comments](https://docs.github.com/en/rest/issues/comments?apiVersion=2022-11-28#list-issue-comments). 
Note that we need a in issue number to fetch comments for each issue. This number is taken from the `issues` resource. More on this in the [resource relationships](#define-resource-relationships) section. +3. `resources`: A list of [resources](#resource-configuration) to be loaded. Here, we have two resources: `issues` and `issue_comments`, which correspond to the GitHub API endpoints for [repository issues](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues) and [issue comments](https://docs.github.com/en/rest/issues/comments?apiVersion=2022-11-28#list-issue-comments). Note that we need an issue number to fetch comments for each issue. This number is taken from the `issues` resource. More on this in the [resource relationships](#define-resource-relationships) section. Let's break down the configuration in more detail. @@ -227,7 +229,6 @@ from dlt.sources.rest_api import RESTAPIConfig ``` ::: - The configuration object passed to the REST API Generic Source has three main elements: ```py @@ -297,7 +298,7 @@ Both `resource1` and `resource2` will have the `per_page` parameter set to 100. This is a list of resource configurations that define the API endpoints to be loaded. Each resource configuration can be: - a dictionary with the [resource configuration](#resource-configuration). -- a string. In this case, the string is used as the both as the endpoint path and the resource name, and the resource configuration is taken from the `resource_defaults` configuration if it exists. +- a string. In this case, the string is used as both the endpoint path and the resource name, and the resource configuration is taken from the `resource_defaults` configuration if it exists. ### Resource configuration @@ -337,7 +338,7 @@ The endpoint configuration defines how to query the API endpoint. Quick example: The fields in the endpoint configuration are: - `path`: The path to the API endpoint. -- `method`: The HTTP method to be used. Default is `GET`. +- `method`: The HTTP method to be used. The default is `GET`. - `params`: Query parameters to be sent with each request. For example, `sort` to order the results or `since` to specify [incremental loading](#incremental-loading). This is also used to define [resource relationships](#define-resource-relationships). - `json`: The JSON payload to be sent with the request (for POST and PUT requests). - `paginator`: Pagination configuration for the endpoint. See the [pagination](#pagination) section for more details. @@ -398,7 +399,7 @@ from dlt.sources.helpers.rest_client.paginators import JSONLinkPaginator ``` :::note -Currently pagination is supported only for GET requests. To handle POST requests with pagination, you need to implement a [custom paginator](../../../general-usage/http/rest-client.md#custom-paginator). +Currently, pagination is supported only for GET requests. To handle POST requests with pagination, you need to implement a [custom paginator](../../../general-usage/http/rest-client.md#custom-paginator). ::: These are the available paginators: @@ -407,9 +408,9 @@ These are the available paginators: | ------------ | -------------- | ----------- | | `json_link` | [JSONLinkPaginator](../../../general-usage/http/rest-client.md#jsonresponsepaginator) | The link to the next page is in the body (JSON) of the response.
*Parameters:* | | `header_link` | [HeaderLinkPaginator](../../../general-usage/http/rest-client.md#headerlinkpaginator) | The links to the next page are in the response headers.
*Parameters:* | -| `offset` | [OffsetPaginator](../../../general-usage/http/rest-client.md#offsetpaginator) | The pagination is based on an offset parameter. With total items count either in the response body or explicitly provided.
*Parameters:* | -| `page_number` | [PageNumberPaginator](../../../general-usage/http/rest-client.md#pagenumberpaginator) | The pagination is based on a page number parameter. With total pages count either in the response body or explicitly provided.
*Parameters:* | -| `cursor` | [JSONResponseCursorPaginator](../../../general-usage/http/rest-client.md#jsonresponsecursorpaginator) | The pagination is based on a cursor parameter. The value of the cursor is in the response body (JSON).
*Parameters:* | +| `offset` | [OffsetPaginator](../../../general-usage/http/rest-client.md#offsetpaginator) | The pagination is based on an offset parameter, with the total items count either in the response body or explicitly provided.
*Parameters:* | +| `page_number` | [PageNumberPaginator](../../../general-usage/http/rest-client.md#pagenumberpaginator) | The pagination is based on a page number parameter, with the total pages count either in the response body or explicitly provided.
*Parameters:* | +| `cursor` | [JSONResponseCursorPaginator](../../../general-usage/http/rest-client.md#jsonresponsecursorpaginator) | The pagination is based on a cursor parameter, with the value of the cursor in the response body (JSON).
*Parameters:* | | `single_page` | SinglePagePaginator | The response will be interpreted as a single-page response, ignoring possible pagination metadata. | | `auto` | `None` | Explicitly specify that the source should automatically detect the pagination method. | @@ -431,7 +432,7 @@ rest_api.config_setup.register_paginator("custom_paginator", CustomPaginator) ### Data selection -The `data_selector` field in the endpoint configuration allows you to specify a JSONPath to select the data from the response. By default, the source will try to detect locations of the data automatically. +The `data_selector` field in the endpoint configuration allows you to specify a JSONPath to select the data from the response. By default, the source will try to detect the locations of the data automatically. Use this field when you need to specify the location of the data in the response explicitly. @@ -481,7 +482,6 @@ You can use the following endpoint configuration: Read more about [JSONPath syntax](https://github.com/h2non/jsonpath-ng?tab=readme-ov-file#jsonpath-syntax) to learn how to write selectors. - ### Authentication For APIs that require authentication to access their endpoints, the REST API source supports various authentication methods, including token-based authentication, query parameters, basic authentication, and custom authentication. The authentication configuration is specified in the `auth` field of the [client](#client) either as a dictionary or as an instance of the [authentication class](../../../general-usage/http/rest-client.md#authentication). @@ -510,7 +510,7 @@ Available authentication types: | Authentication class | String Alias (`type`) | Description | | ------------------- | ----------- | ----------- | -| [BearTokenAuth](../../../general-usage/http/rest-client.md#bearer-token-authentication) | `bearer` | Bearer token authentication. | +| [BearerTokenAuth](../../../general-usage/http/rest-client.md#bearer-token-authentication) | `bearer` | Bearer token authentication. | | [HTTPBasicAuth](../../../general-usage/http/rest-client.md#http-basic-authentication) | `http_basic` | Basic HTTP authentication. | | [APIKeyAuth](../../../general-usage/http/rest-client.md#api-key-authentication) | `api_key` | API key authentication with key defined in the query parameters or in the headers. | | [OAuth2ClientCredentials](../../../general-usage/http/rest-client.md#oauth20-authorization) | N/A | OAuth 2.0 authorization with a temporary access token obtained from the authorization server. | @@ -537,7 +537,7 @@ from dlt.sources.helpers.rest_client.auth import BearerTokenAuth config = { "client": { - "auth": BearTokenAuth(dlt.secrets["your_api_token"]), + "auth": BearerTokenAuth(dlt.secrets["your_api_token"]), }, # ... } @@ -551,7 +551,7 @@ Available authentication types: | `type` | Authentication class | Description | | ----------- | ------------------- | ----------- | -| `bearer` | [BearTokenAuth](../../../general-usage/http/rest-client.md#bearer-token-authentication) | Bearer token authentication.
Parameters: | +| `bearer` | [BearerTokenAuth](../../../general-usage/http/rest-client.md#bearer-token-authentication) | Bearer token authentication.
Parameters: | | `http_basic` | [HTTPBasicAuth](../../../general-usage/http/rest-client.md#http-basic-authentication) | Basic HTTP authentication.
Parameters: | | `api_key` | [APIKeyAuth](../../../general-usage/http/rest-client.md#api-key-authentication) | API key authentication with key defined in the query parameters or in the headers.
Parameters: | @@ -572,10 +572,9 @@ rest_api.config_setup.register_auth("custom_auth", CustomAuth) } ``` - ### Define resource relationships -When you have a resource that depends on another resource, you can define the relationship using the `resolve` configuration. With it you link a path parameter in the child resource to a field in the parent resource's data. +When you have a resource that depends on another resource, you can define the relationship using the `resolve` configuration. With it, you link a path parameter in the child resource to a field in the parent resource's data. In the GitHub example, the `issue_comments` resource depends on the `issues` resource. The `issue_number` parameter in the `issue_comments` endpoint configuration is resolved from the `number` field of the `issues` resource: @@ -653,7 +652,7 @@ You can include data from the parent resource in the child resource by using the } ``` -This will include the `id`, `title`, and `created_at` fields from the `issues` resource in the `issue_comments` resource data. The name of the included fields will be prefixed with the parent resource name and an underscore (`_`) like so: `_issues_id`, `_issues_title`, `_issues_created_at`. +This will include the `id`, `title`, and `created_at` fields from the `issues` resource in the `issue_comments` resource data. The names of the included fields will be prefixed with the parent resource name and an underscore (`_`) like so: `_issues_id`, `_issues_title`, `_issues_created_at`. ### Define a resource which is not a REST endpoint @@ -661,7 +660,7 @@ Sometimes, we want to request endpoints with specific values that are not return Thus, you can also include arbitrary dlt resources in your `RESTAPIConfig` instead of defining a resource for every path! In the following example, we want to load the issues belonging to three repositories. -Instead of defining now three different issues resources, one for each of the paths `dlt-hub/dlt/issues/`, `dlt-hub/verified-sources/issues/`, `dlt-hub/dlthub-education/issues/`, we have a resource `repositories` which yields a list of repository names which will be fetched by the dependent resource `issues`. +Instead of defining three different issues resources, one for each of the paths `dlt-hub/dlt/issues/`, `dlt-hub/verified-sources/issues/`, `dlt-hub/dlthub-education/issues/`, we have a resource `repositories` which yields a list of repository names that will be fetched by the dependent resource `issues`. ```py from dlt.sources.rest_api import RESTAPIConfig @@ -830,7 +829,7 @@ For example, if we query the endpoint with `https://api.example.com/posts?create } ``` -To enable the incremental loading for this endpoint, you can use the following endpoint configuration: +To enable incremental loading for this endpoint, you can use the following endpoint configuration: ```py { @@ -851,7 +850,7 @@ So in our case, the next request will be made to `https://api.example.com/posts? Let's break down the configuration. -1. We explicitly set `data_selector` to `"results"` to select the list of posts from the response. This is optional, if not set, dlt will try to auto-detect the data location. +1. We explicitly set `data_selector` to `"results"` to select the list of posts from the response. This is optional; if not set, dlt will try to auto-detect the data location. 2. We define the `created_since` parameter as an incremental parameter with the following fields: ```py @@ -865,7 +864,7 @@ Let's break down the configuration. 
``` - `type`: The type of the parameter definition. In this case, it must be set to `incremental`. -- `cursor_path`: The JSONPath to the field within each item in the list. The value of this field will be used in the next request. In the example above our items look like `{"id": 1, "title": "Post 1", "created_at": "2024-01-26"}` so to track the created time we set `cursor_path` to `"created_at"`. Note that the JSONPath starts from the root of the item (dict) and not from the root of the response. +- `cursor_path`: The JSONPath to the field within each item in the list. The value of this field will be used in the next request. In the example above, our items look like `{"id": 1, "title": "Post 1", "created_at": "2024-01-26"}` so to track the created time, we set `cursor_path` to `"created_at"`. Note that the JSONPath starts from the root of the item (dict) and not from the root of the response. - `initial_value`: The initial value for the cursor. This is the value that will initialize the state of incremental loading. In this case, it's `2024-01-25`. The value type should match the type of the field in the data item. ### Incremental loading using the `incremental` field @@ -906,7 +905,7 @@ The full available configuration for the `incremental` field is: The fields are: - `start_param` (str): The name of the query parameter to be used as the start condition. If we use the example above, it would be `"created_since"`. -- `end_param` (str): The name of the query parameter to be used as the end condition. This is optional and can be omitted if you only need to track the start condition. This is useful when you need to fetch data within a specific range and the API supports end conditions (like `created_before` query parameter). +- `end_param` (str): The name of the query parameter to be used as the end condition. This is optional and can be omitted if you only need to track the start condition. This is useful when you need to fetch data within a specific range and the API supports end conditions (like the `created_before` query parameter). - `cursor_path` (str): The JSONPath to the field within each item in the list. This is the field that will be used to track the incremental loading. In the example above, it's `"created_at"`. - `initial_value` (str): The initial value for the cursor. This is the value that will initialize the state of incremental loading. - `end_value` (str): The end value for the cursor to stop the incremental loading. This is optional and can be omitted if you only need to track the start condition. If you set this field, `initial_value` needs to be set as well. @@ -920,7 +919,7 @@ If you encounter issues with incremental loading, see the [troubleshooting secti If you need to transform the values in the cursor field before passing them to the API endpoint, you can specify a callable under the key `convert`. For example, the API might return UNIX epoch timestamps but expects to be queried with an ISO 8601 date. To achieve that, we can specify a function that converts from the date format returned by the API to the date format required for API requests. -In the following examples, `1704067200` is returned from the API in the field `updated_at` but the API will be called with `?created_since=2024-01-01`. +In the following examples, `1704067200` is returned from the API in the field `updated_at`, but the API will be called with `?created_since=2024-01-01`. Incremental loading using the `params` field: ```py @@ -963,7 +962,7 @@ This also provides details on the HTTP requests. 
#### Getting validation errors -When you running the pipeline and getting a `DictValidationException`, it means that the [source configuration](#source-configuration) is incorrect. The error message provides details on the issue including the path to the field and the expected type. +When you are running the pipeline and getting a `DictValidationException`, it means that the [source configuration](#source-configuration) is incorrect. The error message provides details on the issue, including the path to the field and the expected type. For example, if you have a source configuration like this: @@ -1015,7 +1014,7 @@ If incorrect data is received from an endpoint, check the `data_selector` field #### Getting insufficient data or incorrect pagination -Check the `paginator` field in the configuration. When not explicitly specified, the source tries to auto-detect the pagination method. If auto-detection fails, or the system is unsure, a warning is logged. For production environments, we recommend to specify an explicit paginator in the configuration. See the [pagination](#pagination) section for more details. Some APIs may have non-standard pagination methods, and you may need to implement a [custom paginator](../../../general-usage/http/rest-client.md#implementing-a-custom-paginator). +Check the `paginator` field in the configuration. When not explicitly specified, the source tries to auto-detect the pagination method. If auto-detection fails, or the system is unsure, a warning is logged. For production environments, we recommend specifying an explicit paginator in the configuration. See the [pagination](#pagination) section for more details. Some APIs may have non-standard pagination methods, and you may need to implement a [custom paginator](../../../general-usage/http/rest-client.md#implementing-a-custom-paginator). #### Incremental loading not working @@ -1023,11 +1022,11 @@ See the [troubleshooting guide](../../../general-usage/incremental-loading.md#tr #### Getting HTTP 404 errors -Some API may return 404 errors for resources that do not exist or have no data. Manage these responses by configuring the `ignore` action in [response actions](./advanced#response-actions). +Some APIs may return 404 errors for resources that do not exist or have no data. Manage these responses by configuring the `ignore` action in [response actions](./advanced#response-actions). ### Authentication issues -If experiencing 401 (Unauthorized) errors, this could indicate: +If you are experiencing 401 (Unauthorized) errors, this could indicate: - Incorrect authorization credentials. Verify credentials in the `secrets.toml`. Refer to [Secret and configs](../../../general-usage/credentials/setup#understanding-the-exceptions) for more information. - An incorrect authentication type. Consult the API documentation for the proper method. See the [authentication](#authentication) section for details. For some APIs, a [custom authentication method](../../../general-usage/http/rest-client.md#custom-authentication) may be required. @@ -1037,3 +1036,4 @@ If experiencing 401 (Unauthorized) errors, this could indicate: The `rest_api` source uses the [RESTClient](../../../general-usage/http/rest-client.md) class for HTTP requests. Refer to the RESTClient [troubleshooting guide](../../../general-usage/http/rest-client.md#troubleshooting) for debugging tips. For further assistance, join our [Slack community](https://dlthub.com/community). We're here to help! 
+ diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/index.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/index.md index dd9a77e297..f92d38f87e 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/index.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/index.md @@ -11,8 +11,9 @@ You can use the REST API source to extract data from any REST API. Using a [decl * how to handle [pagination](./basic.md#pagination), * [authentication](./basic.md#authentication). -dlt will take care of the rest: unnesting the data, inferring the schema etc, and writing to the destination. +dlt will take care of the rest: unnesting the data, inferring the schema, etc., and writing to the destination. import DocCardList from '@theme/DocCardList'; - \ No newline at end of file + + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/sql_database/advanced.md b/docs/website/docs/dlt-ecosystem/verified-sources/sql_database/advanced.md index 708b195456..74012b4311 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/sql_database/advanced.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/sql_database/advanced.md @@ -6,30 +6,28 @@ keywords: [sql connector, sql database pipeline, sql database] import Header from '../_source-info-header.md'; -# Advanced Usage +# Advanced usage
-## Incremental Loading +## Incremental loading Efficient data management often requires loading only new or updated data from your SQL databases, rather than reprocessing the entire dataset. This is where incremental loading comes into play. Incremental loading uses a cursor column (e.g., timestamp or auto-incrementing ID) to load only data newer than a specified initial value, enhancing efficiency by reducing processing time and resource use. Read [here](../../../walkthroughs/sql-incremental-configuration) for more details on incremental loading with `dlt`. - #### How to configure -1. **Choose a Cursor Column**: Identify a column in your SQL table that can serve as a reliable indicator of new or updated rows. Common choices include timestamp columns or auto-incrementing IDs. -1. **Set an Initial Value**: Choose a starting value for the cursor to begin loading data. This could be a specific timestamp or ID from which you wish to start loading data. +1. **Choose a cursor column**: Identify a column in your SQL table that can serve as a reliable indicator of new or updated rows. Common choices include timestamp columns or auto-incrementing IDs. +1. **Set an initial value**: Choose a starting value for the cursor to begin loading data. This could be a specific timestamp or ID from which you wish to start loading data. 1. **Deduplication**: When using incremental loading, the system automatically handles the deduplication of rows based on the primary key (if available) or row hash for tables without a primary key. -1. **Set end_value for backfill**: Set `end_value` if you want to backfill data from -certain range. -1. **Order returned rows**. Set `row_order` to `asc` or `desc` to order returned rows. +1. **Set end_value for backfill**: Set `end_value` if you want to backfill data from a certain range. +1. **Order returned rows**: Set `row_order` to `asc` or `desc` to order returned rows. #### Examples 1. **Incremental loading with the resource `sql_table`**. - Consider a table "family" with a timestamp column `last_modified` that indicates when a row was last modified. To ensure that only rows modified after midnight (00:00:00) on January 1, 2024, are loaded, you would set `last_modified` timestamp as the cursor as follows: + Consider a table "family" with a timestamp column `last_modified` that indicates when a row was last modified. To ensure that only rows modified after midnight (00:00:00) on January 1, 2024, are loaded, you would set the `last_modified` timestamp as the cursor as follows: ```py import dlt @@ -62,10 +60,10 @@ certain range. from dlt.sources.sql_database import sql_database source = sql_database().with_resources("family") - #using the "last_modified" field as an incremental field using initial value of midnight January 1, 2024 + # Using the "last_modified" field as an incremental field using initial value of midnight January 1, 2024 source.family.apply_hints(incremental=dlt.sources.incremental("updated", initial_value=pendulum.DateTime(2022, 1, 1, 0, 0, 0))) - #running the pipeline + # Running the pipeline pipeline = dlt.pipeline(destination="duckdb") info = pipeline.run(source, write_disposition="merge") print(info) @@ -87,31 +85,31 @@ table = sql_table().parallelize() ``` ## Column reflection -Column reflection is the automatic detection and retrieval of column metadata like column names, constraints, data types etc. Columns and their data types are reflected with SQLAlchemy. The SQL types are then mapped to `dlt` types. 
+Column reflection is the automatic detection and retrieval of column metadata like column names, constraints, data types, etc. Columns and their data types are reflected with SQLAlchemy. The SQL types are then mapped to `dlt` types. Depending on the selected backend, some of the types might require additional processing. The `reflection_level` argument controls how much information is reflected: - `reflection_level = "minimal"`: Only column names and nullability are detected. Data types are inferred from the data. -- `reflection_level = "full"`: Column names, nullability, and data types are detected. For decimal types we always add precision and scale. **This is the default.** +- `reflection_level = "full"`: Column names, nullability, and data types are detected. For decimal types, we always add precision and scale. **This is the default.** - `reflection_level = "full_with_precision"`: Column names, nullability, data types, and precision/scale are detected, also for types like text and binary. Integer sizes are set to bigint and to int for all other types. -If the SQL type is unknown or not supported by `dlt`, then, in the pyarrow backend, the column will be skipped, whereas in the other backends the type will be inferred directly from the data irrespective of the `reflection_level` specified. In the latter case, this often means that some types are coerced to strings and `dataclass` based values from sqlalchemy are inferred as `json` (JSON in most destinations). +If the SQL type is unknown or not supported by `dlt`, then, in the pyarrow backend, the column will be skipped, whereas in the other backends the type will be inferred directly from the data irrespective of the `reflection_level` specified. In the latter case, this often means that some types are coerced to strings and `dataclass` based values from sqlalchemy are inferred as `json` (JSON in most destinations). :::tip -If you use reflection level **full** / **full_with_precision** you may encounter a situation where the data returned by sqlalchemy or pyarrow backend does not match the reflected data types. Most common symptoms are: -1. The destination complains that it cannot cast one type to another for a certain column. For example `connector-x` returns TIME in nanoseconds +If you use reflection level **full** / **full_with_precision**, you may encounter a situation where the data returned by sqlalchemy or pyarrow backend does not match the reflected data types. The most common symptoms are: +1. The destination complains that it cannot cast one type to another for a certain column. For example, `connector-x` returns TIME in nanoseconds and BigQuery sees it as bigint and fails to load. -2. You get `SchemaCorruptedException` or other coercion error during the `normalize` step. -In that case you may try **minimal** reflection level where all data types are inferred from the returned data. From our experience this prevents +2. You get `SchemaCorruptedException` or another coercion error during the `normalize` step. +In that case, you may try **minimal** reflection level where all data types are inferred from the returned data. From our experience, this prevents most of the coercion problems. ::: -You can also override the sql type by passing a `type_adapter_callback` function. This function takes a `SQLAlchemy` data type as input and returns a new type (or `None` to force the column to be inferred from the data) as output. +You can also override the SQL type by passing a `type_adapter_callback` function. 
This function takes a `SQLAlchemy` data type as input and returns a new type (or `None` to force the column to be inferred from the data) as output. This is useful, for example, when: -- You're loading a data type which is not supported by the destination (e.g. you need JSON type columns to be coerced to string) -- You're using a sqlalchemy dialect which uses custom types that don't inherit from standard sqlalchemy types. -- For certain types you prefer `dlt` to infer data type from the data and you return `None` +- You're loading a data type that is not supported by the destination (e.g., you need JSON type columns to be coerced to string). +- You're using a sqlalchemy dialect that uses custom types that don't inherit from standard sqlalchemy types. +- For certain types, you prefer `dlt` to infer the data type from the data and you return `None`. In the following example, when loading timestamps from Snowflake, you ensure that they get translated into standard sqlalchemy `timestamp` columns in the resultant schema: @@ -136,10 +134,11 @@ source = sql_database( dlt.pipeline("demo").run(source) ``` -## Configuring with toml/environment variables +## Configuring with TOML/environment variables + You can set most of the arguments of `sql_database()` and `sql_table()` directly in the `.toml` files and/or as environment variables. `dlt` automatically injects these values into the pipeline script. -This is particularly useful with `sql_table()` because you can maintain a separate configuration for each table (below we show **secrets.toml** and **config.toml**, you are free to combine them into one): +This is particularly useful with `sql_table()` because you can maintain a separate configuration for each table (below we show **secrets.toml** and **config.toml**; you are free to combine them into one): The examples below show how you can set arguments in any of the `.toml` files (`secrets.toml` or `config.toml`): 1. Specifying connection string: @@ -147,7 +146,7 @@ The examples below show how you can set arguments in any of the `.toml` files (` [sources.sql_database] credentials="mssql+pyodbc://loader.database.windows.net/dlt_data?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server" ``` -2. Setting parameters like backend, chunk_size, and incremental column for the table `chat_message`: +2. Setting parameters like backend, `chunk_size`, and incremental column for the table `chat_message`: ```toml [sources.sql_database.chat_message] backend="pandas" @@ -156,7 +155,7 @@ The examples below show how you can set arguments in any of the `.toml` files (` [sources.sql_database.chat_message.incremental] cursor_path="updated_at" ``` - This is especially useful with `sql_table()` in a situation where you may want to run this resource for multiple tables. Setting parameters like this would then give you a clean way of maintaing separate configurations for each table. + This is especially useful with `sql_table()` in a situation where you may want to run this resource for multiple tables. Setting parameters like this would then give you a clean way of maintaining separate configurations for each table. 3. Handling separate configurations for database and individual tables When using the `sql_database()` source, you can separately configure the parameters for the database and for the individual tables. 
@@ -171,13 +170,13 @@ The examples below show how you can set arguments in any of the `.toml` files (` cursor_path="updated_at" ``` - The resulting source created below will extract data using **pandas** backend with **chunk_size** 1000. The table **chat_message** will load data incrementally using **updated_at** column. All the other tables will not use incremental loading, and will instead load the full data. + The resulting source created below will extract data using the **pandas** backend with **chunk_size** 1000. The table **chat_message** will load data incrementally using the **updated_at** column. All the other tables will not use incremental loading and will instead load the full data. ```py database = sql_database() ``` -You'll be able to configure all the arguments this way (except adapter callback function). [Standard dlt rules apply]((/general-usage/credentials/setup). +You'll be able to configure all the arguments this way (except the adapter callback function). [Standard dlt rules apply](../../../general-usage/credentials/setup). It is also possible to set these arguments as environment variables [using the proper naming convention](../../../general-usage/credentials/setup#naming-convention): ```sh @@ -186,3 +185,4 @@ SOURCES__SQL_DATABASE__BACKEND=pandas SOURCES__SQL_DATABASE__CHUNK_SIZE=1000 SOURCES__SQL_DATABASE__CHAT_MESSAGE__INCREMENTAL__CURSOR_PATH=updated_at ``` + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/sql_database/configuration.md b/docs/website/docs/dlt-ecosystem/verified-sources/sql_database/configuration.md index 6de2a02b31..4236d656eb 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/sql_database/configuration.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/sql_database/configuration.md @@ -10,11 +10,11 @@ import Header from '../_source-info-header.md';
-## Configuring the SQL Database source +## Configuring the SQL database source -`dlt` sources are python scripts made up of source and resource functions that can be easily customized. The SQL Database verified source has the following built-in source and resource: -1. `sql_database`: a `dlt` source which can be used to load multiple tables and views from a SQL database -2. `sql_table`: a `dlt` resource that loads a single table from the SQL database +`dlt` sources are Python scripts made up of source and resource functions that can be easily customized. The SQL Database verified source has the following built-in source and resource: +1. `sql_database`: a `dlt` source that can be used to load multiple tables and views from a SQL database. +2. `sql_table`: a `dlt` resource that loads a single table from the SQL database. Read more about sources and resources here: [General usage: source](../../../general-usage/source.md) and [General usage: resource](../../../general-usage/resource.md). @@ -106,13 +106,13 @@ We intend our sources to be fully hackable. Feel free to change the source code ### Connection string format `sql_database` uses SQLAlchemy to create database connections and reflect table schemas. You can pass credentials using -[database urls](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls), which has the general format: +[database URLs](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls), which have the general format: ```py "dialect+database_type://username:password@server:port/database_name" ``` -For example, to connect to a MySQL database using the `pymysql` dialect you can use the following connection string: +For example, to connect to a MySQL database using the `pymysql` dialect, you can use the following connection string: ```py "mysql+pymysql://rfamro:PWD@mysql-rfam-public.ebi.ac.uk:4497/Rfam" ``` @@ -123,17 +123,16 @@ Database-specific drivers can be passed into the connection string using query p "mssql+pyodbc://username:password@server/database?driver=ODBC+Driver+17+for+SQL+Server" ``` - ### Passing connection credentials to the `dlt` pipeline There are several options for adding your connection credentials into your `dlt` pipeline: -#### 1. Setting them in `secrets.toml` or as environment variables (Recommended) - -You can set up credentials using [any method](../../../general-usage/credentials/setup#available-config-providers) supported by `dlt`. We recommend using `.dlt/secrets.toml` or the environment variables. See Step 2 of the [setup](./setup) for how to set credentials inside `secrets.toml`. For more information on passing credentials read [here](../../../general-usage/credentials/setup). +#### 1. Setting them in `secrets.toml` or as environment variables (recommended) +You can set up credentials using [any method](../../../general-usage/credentials/setup#available-config-providers) supported by `dlt`. We recommend using `.dlt/secrets.toml` or the environment variables. See Step 2 of the [setup](./setup) for how to set credentials inside `secrets.toml`. For more information on passing credentials, read [here](../../../general-usage/credentials/setup). #### 2. Passing them directly in the script + It is also possible to explicitly pass credentials inside the source. 
Example: ```py @@ -152,8 +151,11 @@ It is recommended to configure credentials in `.dlt/secrets.toml` and to not inc ::: ### Other connection options + #### Using SqlAlchemy Engine as credentials + You are able to pass an instance of SqlAlchemy Engine instead of credentials: + ```py from dlt.sources.sql_database import sql_table from sqlalchemy import create_engine @@ -161,24 +163,20 @@ from sqlalchemy import create_engine engine = create_engine("mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam") table = sql_table(engine, table="chat_message", schema="data") ``` -This engine is used by `dlt` to open database connections and can work across multiple threads so is compatible with `parallelize` setting of dlt sources and resources. +This engine is used by `dlt` to open database connections and can work across multiple threads, so it is compatible with the `parallelize` setting of dlt sources and resources. ## Configuring the backend -Table backends convert streams of rows from database tables into batches in various formats. The default backend `SQLAlchemy` follows standard `dlt` behavior of -extracting and normalizing Python dictionaries. We recommend this for smaller tables, initial development work, and when minimal dependencies or a pure Python environment is required. This backend is also the slowest. Other backends make use of the structured data format of the tables and provide significant improvement in speeds. For example, the `PyArrow` backend converts rows into `Arrow` tables, which results in -good performance and preserves exact data types. We recommend using this backend for larger tables. +Table backends convert streams of rows from database tables into batches in various formats. The default backend, `SQLAlchemy`, follows standard `dlt` behavior of extracting and normalizing Python dictionaries. We recommend this for smaller tables, initial development work, and when minimal dependencies or a pure Python environment is required. This backend is also the slowest. Other backends make use of the structured data format of the tables and provide a significant improvement in speed. For example, the `PyArrow` backend converts rows into `Arrow` tables, which results in good performance and preserves exact data types. We recommend using this backend for larger tables. ### SQLAlchemy -The `SQLAlchemy` backend (the default) yields table data as a list of Python dictionaries. This data goes through the regular extract -and normalize steps and does not require additional dependencies to be installed. It is the most robust (works with any destination, correctly represents data types) but also the slowest. You can set `reflection_level="full_with_precision"` to pass exact data types to `dlt` schema. +The `SQLAlchemy` backend (the default) yields table data as a list of Python dictionaries. This data goes through the regular extract and normalize steps and does not require additional dependencies to be installed. It is the most robust (works with any destination, correctly represents data types) but also the slowest. You can set `reflection_level="full_with_precision"` to pass exact data types to the `dlt` schema. ### PyArrow -The `PyArrow` backend yields data as `Arrow` tables. It uses `SQLAlchemy` to read rows in batches but then immediately converts them into `ndarray`, transposes it, and sets it as columns in an `Arrow` table. This backend always fully -reflects the database table and preserves original types (i.e.
**decimal** / **numeric** data will be extracted without loss of precision). If the destination loads parquet files, this backend will skip `dlt` normalizer and you can gain two orders of magnitude (20x - 30x) speed increase. +The `PyArrow` backend yields data as `Arrow` tables. It uses `SQLAlchemy` to read rows in batches but then immediately converts them into `ndarray`, transposes it, and sets it as columns in an `Arrow` table. This backend always fully reflects the database table and preserves original types (i.e., **decimal** / **numeric** data will be extracted without loss of precision). If the destination loads parquet files, this backend will skip the `dlt` normalizer, and you can gain two orders of magnitude (20x - 30x) speed increase. Note that if `pandas` is installed, we'll use it to convert `SQLAlchemy` tuples into `ndarray` as it seems to be 20-30% faster than using `numpy` directly. @@ -207,21 +205,20 @@ info = pipeline.run(sql_alchemy_source) print(info) ``` -### pandas +### Pandas The `pandas` backend yields data as DataFrames using the `pandas.io.sql` module. `dlt` uses `PyArrow` dtypes by default as they generate more stable typing. With the default settings, several data types will be coerced to dtypes in the yielded data frame: -* **decimal** is mapped to double so it is possible to lose precision +* **decimal** is mapped to double, so it is possible to lose precision * **date** and **time** are mapped to strings * all types are nullable :::note -`dlt` will still use the data types reflected from the source database when creating destination tables. How the type differences resulting from the `pandas` backend are reconciled / parsed is up to the destination. Most of the destinations will be able to parse date/time strings and convert doubles into decimals (Please note that you'll still lose precision on decimals with default settings.). **However we strongly suggest -not to use the** `pandas` **backend if your source tables contain date, time, or decimal columns** +`dlt` will still use the data types reflected from the source database when creating destination tables. How the type differences resulting from the `pandas` backend are reconciled/parsed is up to the destination. Most destinations will be able to parse date/time strings and convert doubles into decimals (note that you'll still lose precision on decimals with the default settings). **However, we strongly suggest not using the** `pandas` **backend if your source tables contain date, time, or decimal columns.** ::: -Internally dlt uses `pandas.io.sql._wrap_result` to generate `pandas` frames. To adjust [pandas-specific settings,](https://pandas.pydata.org/docs/reference/api/pandas.read_sql_table.html) pass it in the `backend_kwargs` parameter. For example, below we set `coerce_float` to `False`: +Internally, `dlt` uses `pandas.io.sql._wrap_result` to generate `pandas` frames. To adjust [pandas-specific settings](https://pandas.pydata.org/docs/reference/api/pandas.read_sql_table.html), pass them in the `backend_kwargs` parameter. For example, below we set `coerce_float` to `False`: ```py import dlt @@ -252,22 +249,22 @@ print(info) ``` ### ConnectorX -The [`ConnectorX`](https://sfu-db.github.io/connector-x/intro.html) backend completely skips `SQLALchemy` when reading table rows, in favor of doing that in rust. This is claimed to be significantly faster than any other method (validated only on postgres).
With the default settings it will emit `PyArrow` tables, but you can configure this by specifying the `return_type` in `backend_kwargs`. (See the [`ConnectorX` docs](https://sfu-db.github.io/connector-x/api.html) for a full list of configurable parameters.) +The [`ConnectorX`](https://sfu-db.github.io/connector-x/intro.html) backend completely skips `SQLAlchemy` when reading table rows, in favor of doing that in Rust. This is claimed to be significantly faster than any other method (validated only on PostgreSQL). With the default settings, it will emit `PyArrow` tables, but you can configure this by specifying the `return_type` in `backend_kwargs`. (See the [`ConnectorX` docs](https://sfu-db.github.io/connector-x/api.html) for a full list of configurable parameters.) There are certain limitations when using this backend: -* it will ignore `chunk_size`. `ConnectorX` cannot yield data in batches. -* in many cases it requires a connection string that differs from the `SQLAlchemy` connection string. Use the `conn` argument in `backend_kwargs` to set this. -* it will convert **decimals** to **doubles**, so you will lose precision. -* nullability of the columns is ignored (always true) -* it uses different mappings for each data type. (Check [here](https://sfu-db.github.io/connector-x/databases.html) for more details.) -* JSON fields (at least those coming from postgres) are double wrapped in strings. To unwrap this, you can pass the in-built transformation function `unwrap_json_connector_x` (for example, with `add_map`): +* It will ignore `chunk_size`. `ConnectorX` cannot yield data in batches. +* In many cases, it requires a connection string that differs from the `SQLAlchemy` connection string. Use the `conn` argument in `backend_kwargs` to set this. +* It will convert **decimals** to **doubles**, so you will lose precision. +* Nullability of the columns is ignored (always true). +* It uses different mappings for each data type. (Check [here](https://sfu-db.github.io/connector-x/databases.html) for more details.) +* JSON fields (at least those coming from PostgreSQL) are double-wrapped in strings. To unwrap this, you can pass the built-in transformation function `unwrap_json_connector_x` (for example, with `add_map`; see the usage sketch at the end of this section): ```py from dlt.sources.sql_database.helpers import unwrap_json_connector_x ``` :::note -`dlt` will still use the data types refected from the source database when creating destination tables. It is up to the destination to reconcile / parse type differences. Please note that you'll still lose precision on decimals with default settings. +`dlt` will still use the data types reflected from the source database when creating destination tables. It is up to the destination to reconcile/parse type differences. Please note that you'll still lose precision on decimals with default settings. ::: ```py @@ -286,7 +283,7 @@ unsw_table = sql_table( backend="connectorx", # keep source data types reflection_level="full_with_precision", - # just to demonstrate how to setup a separate connection string for connectorx + # just to demonstrate how to set up a separate connection string for connectorx backend_kwargs={"conn": "postgresql://loader:loader@localhost:5432/dlt_data"} ) @@ -305,4 +302,5 @@ info = pipeline.run( ) print(info) ``` -With the dataset above and a local postgres instance, the `ConnectorX` backend is 2x faster than the `PyArrow` backend. +With the dataset above and a local PostgreSQL instance, the `ConnectorX` backend is 2x faster than the `PyArrow` backend.
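For the JSON limitation listed above, here is a minimal usage sketch of `unwrap_json_connector_x` wired into a resource with `add_map`. The table name `events`, the JSON column `payload`, and the pipeline name are hypothetical, and the sketch assumes `unwrap_json_connector_x` accepts the JSON column name and returns a transform function suitable for `add_map`; adapt the names and connection strings to your own setup.

```py
import dlt
from dlt.sources.sql_database import sql_table
from dlt.sources.sql_database.helpers import unwrap_json_connector_x

# Hypothetical table "events" with a JSON column "payload", read via the ConnectorX backend.
events = sql_table(
    credentials="postgresql://loader:loader@localhost:5432/dlt_data",
    table="events",
    backend="connectorx",
    # ConnectorX often needs its own connection string, passed via `conn`.
    backend_kwargs={"conn": "postgresql://loader:loader@localhost:5432/dlt_data"},
)

# Assumption: unwrap_json_connector_x takes the JSON column name and returns a
# transform that unwraps the double-wrapped JSON strings emitted by ConnectorX.
events.add_map(unwrap_json_connector_x("payload"))

pipeline = dlt.pipeline(pipeline_name="unwrap_json_pipeline", destination="duckdb")
info = pipeline.run(events)
print(info)
```

The transform is attached with `add_map` so it runs on the Arrow tables as they are extracted, before anything is written to the destination.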
+ diff --git a/docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md b/docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md index d9aae62f94..65c937ef77 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md @@ -31,7 +31,7 @@ pipeline and hide many intricacies of correctly setting up the connection to you ### Querying the data using the `dlt` SQL client Execute any SQL query and get results following the Python -[dbapi](https://peps.python.org/pep-0249/) spec. Below we fetch data from the customers table: +[dbapi](https://peps.python.org/pep-0249/) spec. Below, we fetch data from the customers table: ```py pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm") @@ -40,17 +40,17 @@ with pipeline.sql_client() as client: "SELECT id, name, email FROM customers WHERE id = %s", 10 ) as cursor: - # get all data from the cursor as list of rows + # get all data from the cursor as a list of rows print(cursor.fetchall()) ``` -In the above, we used `dbapi` parameters placeholders and fetched the data using `fetchall` method +In the above, we used `dbapi` parameter placeholders and fetched the data using the `fetchall` method that reads all the rows from the cursor. ### Querying data into a data frame -You can fetch results of any SQL query as a data frame. If the destination is supporting that -natively (i.e. BigQuery and DuckDB), `dlt` uses the native method. Thanks to that, reading data +You can fetch the results of any SQL query as a data frame. If the destination supports that +natively (i.e., BigQuery and DuckDB), `dlt` uses the native method. Thanks to that, reading data frames may be really fast! The example below reads GitHub reactions data from the `issues` table and counts reaction types. @@ -65,18 +65,18 @@ with pipeline.sql_client() as client: with client.execute_query( 'SELECT "reactions__+1", "reactions__-1", reactions__laugh, reactions__hooray, reactions__rocket FROM issues' ) as table: - # calling `df` on a cursor, returns the data as a DataFrame + # calling `df` on a cursor returns the data as a DataFrame reactions = table.df() counts = reactions.sum(0).sort_values(0, ascending=False) ``` -The `df` method above returns all the data in the cursor as data frame. You can also fetch data in -chunks by passing `chunk_size` argument to the `df` method. +The `df` method above returns all the data in the cursor as a data frame. You can also fetch data in +chunks by passing the `chunk_size` argument to the `df` method. ### Access destination native connection The native connection to your destination like BigQuery `Client` or DuckDB `DuckDBPyConnection` is -available in case you want to do anything special. Below we take the native connection to `duckdb` +available in case you want to do anything special. Below, we take the native connection to `duckdb` to get `DuckDBPyRelation` from a query: ```py @@ -90,7 +90,7 @@ with pipeline.sql_client() as client: rel.limit(3).show() ``` -## Data Quality Dashboards +## Data quality dashboards After deploying a `dlt` pipeline, you might ask yourself: How can we know if the data is and remains high quality? @@ -108,38 +108,21 @@ any gaps or loading issues. ### Data usage as monitoring -Setting up monitoring is a good idea. However, in practice, often by the time you notice something -is wrong through reviewing charts, someone in the business has likely already noticed something is -wrong. 
That is, if there is usage of the data, then that usage will act as sort of monitoring. +Setting up monitoring is a good idea. However, in practice, often by the time you notice something is wrong through reviewing charts, someone in the business has likely already noticed something is wrong. That is, if there is usage of the data, then that usage will act as a sort of monitoring. -### Plotting main metrics on the line charts +### Plotting main metrics on line charts -In cases where data is not being used that much (e.g. only one marketing analyst is using some data -alone), then it is a good idea to have them plot their main metrics on "last 7 days" line charts, so -it's visible to them that something may be off when they check their metrics. +In cases where data is not being used much (e.g., only one marketing analyst is using some data alone), then it is a good idea to have them plot their main metrics on "last 7 days" line charts, so it's visible to them that something may be off when they check their metrics. -It's important to think about granularity here. A daily line chart, for example, would not catch -hourly issues well. Typically, you will want to match the granularity of the time dimension -(day/hour/etc.) of the line chart with the things that could go wrong, either in the loading process -or in the tracked process. +It's important to think about granularity here. A daily line chart, for example, would not catch hourly issues well. Typically, you will want to match the granularity of the time dimension (day/hour/etc.) of the line chart with the things that could go wrong, either in the loading process or in the tracked process. -If a dashboard is the main product of an analyst, they will generally watch it closely. Therefore, -it's probably not necessary for a data engineer to include monitoring in their daily activities in -these situations. +If a dashboard is the main product of an analyst, they will generally watch it closely. Therefore, it's probably not necessary for a data engineer to include monitoring in their daily activities in these situations. ## Tools to create dashboards -[Metabase](https://www.metabase.com/), [Looker Studio](https://lookerstudio.google.com/u/0/), and -[Streamlit](https://streamlit.io/) are some common tools that you might use to set up dashboards to -explore data. It's worth noting that while many tools are suitable for exploration, different tools -enable your organization to achieve different things. Some organizations use multiple tools for -different scopes: - -- Tools like [Metabase](https://www.metabase.com/) are intended for data democratization, where the - business user can change the dimension or granularity to answer follow-up questions. -- Tools like [Looker Studio](https://lookerstudio.google.com/u/0/) and - [Tableau](https://www.tableau.com/) are intended for minimal interaction curated dashboards that - business users can filter and read as-is with limited training. -- Tools like [Streamlit](https://streamlit.io/) enable powerful customizations and the building of - complex apps by Python-first developers, but they generally do not support self-service out of the - box. +[Metabase](https://www.metabase.com/), [Looker Studio](https://lookerstudio.google.com/u/0/), and [Streamlit](https://streamlit.io/) are some common tools that you might use to set up dashboards to explore data. It's worth noting that while many tools are suitable for exploration, different tools enable your organization to achieve different things. 
Some organizations use multiple tools for different scopes: + +- Tools like [Metabase](https://www.metabase.com/) are intended for data democratization, where the business user can change the dimension or granularity to answer follow-up questions. +- Tools like [Looker Studio](https://lookerstudio.google.com/u/0/) and [Tableau](https://www.tableau.com/) are intended for minimal-interaction, curated dashboards that business users can filter and read as-is with limited training. +- Tools like [Streamlit](https://streamlit.io/) enable powerful customizations and the building of complex apps by Python-first developers, but they generally do not support self-service out of the box. + diff --git a/docs/website/docs/examples/index.md b/docs/website/docs/examples/index.md index 5be3fd1632..b0b16e274d 100644 --- a/docs/website/docs/examples/index.md +++ b/docs/website/docs/examples/index.md @@ -1,14 +1,15 @@ --- title: Code Examples -description: A list of comprehensive code examples that teach you how to solve real world problem. +description: A list of comprehensive code examples that teach you how to solve real-world problems. keywords: ['examples'] --- import DocCardList from '@theme/DocCardList'; -A list of comprehensive code examples that teach you how to solve a real world problem. +A list of comprehensive code examples that teach you how to solve real-world problems. :::info If you want to share your example, follow this [contributing](https://github.com/dlt-hub/dlt/tree/devel/docs/examples/CONTRIBUTING.md) tutorial. ::: - \ No newline at end of file + + diff --git a/docs/website/docs/reference/command-line-interface.md b/docs/website/docs/reference/command-line-interface.md index 14fadba74d..e29b43bcba 100644 --- a/docs/website/docs/reference/command-line-interface.md +++ b/docs/website/docs/reference/command-line-interface.md @@ -9,37 +9,37 @@ keywords: [command line interface, cli, dlt init] ```sh dlt init ``` -This command creates new dlt pipeline script that loads data from `source` to `destination` to it. When you run the command: -1. It creates basic project structure if the current folder is empty. Adds `.dlt/config.toml` and `.dlt/secrets.toml` and `.gitignore` files. -2. It checks if `source` argument is matching one of our [verified sources](../dlt-ecosystem/verified-sources/) and if it is so, [it adds it to the project](../walkthroughs/add-a-verified-source.md). -3. If the `source` is unknown it will use a [generic template](https://github.com/dlt-hub/python-dlt-init-template) to [get you started](../walkthroughs/create-a-pipeline.md). +This command creates a new dlt pipeline script that loads data from `source` to `destination`. When you run the command: +1. It creates a basic project structure if the current folder is empty, adding `.dlt/config.toml`, `.dlt/secrets.toml`, and `.gitignore` files. +2. It checks if the `source` argument matches one of our [verified sources](../dlt-ecosystem/verified-sources/) and, if so, [adds it to the project](../walkthroughs/add-a-verified-source.md). +3. If the `source` is unknown, it will use a [generic template](https://github.com/dlt-hub/python-dlt-init-template) to [get you started](../walkthroughs/create-a-pipeline.md). 4. It will rewrite the pipeline scripts to use your `destination`. 5. It will create sample config and credentials in `secrets.toml` and `config.toml` for the specified source and destination. -6. It will create `requirements.txt` with dependencies required by source and destination.
If one exists, it will print instructions what to add to it. +6. It will create `requirements.txt` with dependencies required by the source and destination. If one exists, it will print instructions on what to add to it. -This command can be used several times in the same folders to add more sources, destinations and pipelines. It will also update the verified source code to the newest -version if run again with existing `source` name. You are warned if files will be overwritten or if `dlt` version needs upgrade to run particular pipeline. +This command can be used several times in the same folder to add more sources, destinations, and pipelines. It will also update the verified source code to the newest version if run again with an existing `source` name. You are warned if files will be overwritten or if the `dlt` version needs an upgrade to run a particular pipeline. -### Specify your own "verified sources" repository. -You can use `--location ` option to specify your own repository with sources. Typically you would [fork ours](https://github.com/dlt-hub/verified-sources) and start customizing and adding sources ie. to use them for your team or organization. You can also specify a branch with `--branch ` ie. to test a version being developed. +### Specify your own "verified sources" repository +You can use the `--location` option to specify your own repository with sources. Typically, you would [fork ours](https://github.com/dlt-hub/verified-sources) and start customizing and adding sources, e.g., to use them for your team or organization. You can also specify a branch with `--branch`, e.g., to test a version being developed. ### List all sources ```sh dlt init --list-sources ``` -Shows all available verified sources and their short descriptions. For each source, checks if your local `dlt` version requires update +Shows all available verified sources and their short descriptions. For each source, it checks if your local `dlt` version requires an update and prints the relevant warning. ## `dlt deploy` -This command prepares your pipeline for deployment and gives you step by step instruction how to accomplish it. To enabled this functionality please first execute +This command prepares your pipeline for deployment and gives you step-by-step instructions on how to accomplish it. To enable this functionality, please first execute ```sh pip install "dlt[cli]" ``` -that will add additional packages to current environment. +that will add additional packages to the current environment. > 💡 We ask you to install those dependencies separately to keep our core library small and make it work everywhere. -### github-action +### `github-action` ```sh dlt deploy