Updated source/filesystem docs with explanations for bucket URLs
dat-a-man committed Jun 3, 2024
1 parent 829b558 commit 9e1f54c
88 changes: 59 additions & 29 deletions docs/website/docs/dlt-ecosystem/verified-sources/filesystem.md
To access these, you'll need secret credentials:
To get AWS keys for S3 access:

1. Access IAM in AWS Console.
2. Select "Users", choose a user, and open "Security credentials".
3. Click "Create access key" for AWS ID and Secret Key.

For more info, see
[AWS official documentation.](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html)
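
Once created, the key pair typically goes into `.dlt/secrets.toml`. A minimal sketch, assuming dlt's standard AWS credential field names under `[sources.filesystem.credentials]`:

```toml
[sources.filesystem.credentials] # use [sources.readers.credentials] for the "readers" source
aws_access_key_id = "Please set me up!"     # the access key ID created above
aws_secret_access_key = "Please set me up!" # the matching secret access key
```
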
To get GCS/GDrive access:

1. Log in to [console.cloud.google.com](http://console.cloud.google.com/).
2. Create a [service account](https://cloud.google.com/iam/docs/service-accounts-create#creating).
3. Enable "Cloud Storage API" / "Google Drive API"; see
[Google's guide](https://support.google.com/googleapi/answer/6158841?hl=en).
4. In IAM & Admin > Service Accounts, find your account, click the three-dot menu > "Manage Keys" >
"ADD KEY" > "CREATE" to get a JSON credential file.
5. Grant the service account appropriate permissions for cloud storage access.

For more info, see how to
[create a service account](https://support.google.com/a/answer/7378726?hl=en).
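
The downloaded JSON key can then be copied into `.dlt/secrets.toml`. A minimal sketch, assuming dlt's service account credential fields (`project_id`, `private_key`, `client_email`):

```toml
[sources.filesystem.credentials] # use [sources.readers.credentials] for the "readers" source
project_id = "Please set me up!"   # "project_id" from the JSON key file
private_key = "Please set me up!"  # "private_key" from the JSON key file, including the BEGIN/END lines
client_email = "Please set me up!" # "client_email" from the JSON key file
```
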
To obtain Azure blob storage access:

1. Go to Azure Portal (portal.azure.com).
2. Select "Storage accounts" > your storage.
3. Click "Settings" > "Access keys".
4. View account name and two keys (primary/secondary). Keep keys confidential.

For more info, see
[Azure official documentation](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal).
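
The account name and one of the two keys then go into `.dlt/secrets.toml`. A minimal sketch, assuming the `azure_storage_account_name` and `azure_storage_account_key` fields:

```toml
[sources.filesystem.credentials] # use [sources.readers.credentials] for the "readers" source
azure_storage_account_name = "Please set me up!" # the storage account name shown in the portal
azure_storage_account_key = "Please set me up!"  # either the primary or the secondary key
```
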
To get started with your data pipeline, follow these steps:

1. Initialize the pipeline by running `dlt init filesystem duckdb`,
   with filesystem as the [source](../../general-usage/source) and
   [duckdb](../destinations/duckdb.md) as the [destination](../destinations).

2. If you'd like to use a different destination, simply replace `duckdb` with the name of your
preferred [destination](../destinations).
3. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.
For more information, read the dlt guide on how to add a verified source.

1. In the `.dlt` folder, open `secrets.toml` and enter the credentials for your chosen bucket (see the examples above). The generated file contains placeholders such as:

```toml
azure_storage_account_key="Please set me up!"
```

2. Finally, enter credentials for your chosen destination as per the [docs](../destinations/).

3. You can pass the bucket URL and glob pattern or use `config.toml`. For local filesystems, use
   `file://` or skip the scheme and provide the local path in a format native to your operating system, as in the sketch below.
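
For example, a local setup could look like the following minimal sketch; the Windows path is illustrative, and the `file_glob` key is an assumption about how a glob pattern may be set in config:

```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url='C:\a\b\c'   # a native Windows path, written as a TOML literal string
# file_glob="**/*.csv"  # assumed: narrow down which files the source picks up
```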

A TOML literal string (single quotes) was used above to conveniently write the backslashes without the need to escape them.
For remote file systems, you need to add the scheme; it will be used to determine the protocol being
used. The protocols that can be used are:
- For Azure blob storage
```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url="az://<container_name>/<path_to_files>/"
```
- `az://` indicates the Azure Blob Storage protocol.
- `container_name` is the name of the container.
- `path_to_files/` is a directory path within the container.
:::caution
For Azure, use adlfs>=2023.9.0. Older versions mishandle globs.
:::
- For Google Drive
```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url="gdrive://<folder_name>/<subfolder_or_file_path>/"
```
- `gdrive://` indicates the Google Drive protocol.
- `folder_name` refers to a folder within Google Drive.
- `subfolder_or_file_path/` is a subfolder or file path within the `folder_name` folder.
- For Google Cloud Storage
```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url="gs://<bucket_name>/<path_to_files>/"
```
- `gs://` indicates the Google Cloud Storage protocol.
- `bucket_name` is the name of the bucket.
- `path_to_files/` is a directory path within the bucket.
- For AWS S3
```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url="s3://<bucket_name>/<path_to_files>/"
```
- `s3://` indicates the AWS S3 protocol.
- `bucket_name` is the name of the bucket.
- `path_to_files/` is a directory path within the bucket.
### Use local file system paths
You can use both native local file system paths and paths in the form of a `file:` URI. Absolute, relative, and UNC Windows paths are supported.
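
As a short sketch of those variants (the concrete paths are placeholders):

```toml
[sources.filesystem]
bucket_url='\\?\C:\a\b\c'                     # an extended-length (UNC-style) absolute Windows path
# bucket_url='csv_files\'                     # a relative Windows path
# bucket_url="file:///Users/admin/csv_files/" # an absolute path expressed as a file: URI
```
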
1. Install the required dependencies:

```sh
pip install -r requirements.txt
```

2. Install optional modules:

- For AWS S3:
```sh
pip install s3fs
```
- GCS storage: No separate module needed.

3. You're now ready to run the pipeline! To get started, run the following command:
```sh
python filesystem_pipeline.py
```
4. Once the pipeline has finished running, you can verify that everything loaded correctly by using
the following command:
```sh
dlt pipeline <pipeline_name> show
```

```py
fs_client.ls("ci-test-bucket/standard_source/samples")
```
<!--@@@DLT_TUBA filesystem-->
