# Docs: Updated filesystem docs with explanations for bucket URLs #1435

Merged · 3 commits · Jun 10, 2024 · Changes from 1 commit
## docs/website/docs/dlt-ecosystem/verified-sources/filesystem.md (88 changes: 59 additions, 29 deletions)
To access these, you'll need secret credentials:
To get AWS keys for S3 access:

1. Access IAM in AWS Console.
2. Select "Users", choose a user, and open "Security credentials".
3. Click "Create access key" for AWS ID and Secret Key.

For more info, see
[AWS official documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html).
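
Once created, the key pair is typically placed in the pipeline's `.dlt/secrets.toml`. A minimal sketch, assuming the standard dlt AWS credential field names:

```toml
# Sketch of .dlt/secrets.toml for S3 access (field names follow dlt's AWS credentials)
[sources.filesystem.credentials] # use [sources.readers.credentials] for the "readers" source
aws_access_key_id = "Please set me up!"     # the access key ID created above
aws_secret_access_key = "Please set me up!" # the matching secret access key
```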
To get GCS/GDrive access:

1. Log in to [console.cloud.google.com](http://console.cloud.google.com/).
2. Create a [service account](https://cloud.google.com/iam/docs/service-accounts-create#creating).
3. Enable "Cloud Storage API" / "Google Drive API"; see
[Google's guide](https://support.google.com/googleapi/answer/6158841?hl=en).
4. In IAM & Admin > Service Accounts, find your account, click the three-dot menu > "Manage Keys" >
"ADD KEY" > "CREATE" to get a JSON credential file.
5. Grant the service account appropriate permissions for cloud storage access.

For more info, see how to
[create a service account](https://support.google.com/a/answer/7378726?hl=en).
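
The downloaded JSON key is then mirrored into `.dlt/secrets.toml`. A minimal sketch, assuming dlt's usual service-account credential fields; copy each value from the JSON file:

```toml
# Sketch of .dlt/secrets.toml for GCS / Google Drive access
[sources.filesystem.credentials] # use [sources.readers.credentials] for the "readers" source
project_id = "Please set me up!"   # "project_id" from the JSON key file
client_email = "Please set me up!" # "client_email" from the JSON key file
private_key = "Please set me up!"  # "private_key" from the JSON key file
```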
To obtain Azure blob storage access:

1. Go to Azure Portal (portal.azure.com).
1. Select "Storage accounts" > your storage.
1. Click "Settings" > "Access keys".
1. View account name and two keys (primary/secondary). Keep keys confidential.
2. Select "Storage accounts" > your storage.
3. Click "Settings" > "Access keys".
4. View account name and two keys (primary/secondary). Keep keys confidential.

For more info, see
[Azure official documentation](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal).
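
The account name and one of the two keys then go into `.dlt/secrets.toml`; a minimal sketch, assuming the standard dlt Azure credential keys (the `azure_storage_account_key` entry also appears in the generated secrets template shown further below):

```toml
# Sketch of .dlt/secrets.toml for Azure Blob Storage access
[sources.filesystem.credentials] # use [sources.readers.credentials] for the "readers" source
azure_storage_account_name = "Please set me up!" # storage account name
azure_storage_account_key = "Please set me up!"  # primary or secondary access key
```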
To get started with your data pipeline, follow these steps:
with filesystem as the [source](../../general-usage/source) and
[duckdb](../destinations/duckdb.md) as the [destination](../destinations).

2. If you'd like to use a different destination, simply replace `duckdb` with the name of your
preferred [destination](../destinations).

3. After running this command, a new directory will be created with the necessary files and
configuration settings to get started (a sketch of the init command is shown below).
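
For reference, with `filesystem` as the source and `duckdb` as the destination (as described above), the init command would look roughly like this:

```sh
dlt init filesystem duckdb
```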

For more information, read the guide on how to add a verified source.
In `.dlt/secrets.toml` (excerpt):

```toml
azure_storage_account_key="Please set me up!"
```

2. Finally, enter credentials for your chosen destination as per the [docs](../destinations/).

3. You can pass the bucket URL and glob pattern or use `config.toml`. For local filesystems, use
`file://`, or skip the schema and provide the local path in a format native to your operating system.
**Contributor:** Just make this a bit more clear. I understand that for a local filesystem I have to use `file://`, or that the code shown below would also work; please make that explicit.

**Contributor:** I would say: "You can use `file://` as follows: (code example)", or define it as follows:

```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url='~\Documents\csv_files\'
file_glob="*"
```

**Collaborator (Author):** Updated as per the comments.

```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url='~\Documents\csv_files\'
file_glob="*"
```

The example uses a Windows path to the current user's Documents folder; a literal TOML string (single quotes) was used so the backslashes don't need to be escaped.

For remote file systems, you need to add the schema; it is used to determine the protocol. The following protocols are supported (a usage sketch follows the list):

- For Azure Blob Storage
```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url="az://<container_name>/<path_to_files>/"
```

- `az://` indicates the Azure Blob Storage protocol.
- `container_name` is the name of the container.
- `path_to_files/` is a directory path within the container.

:::caution
For Azure, use adlfs>=2023.9.0. Older versions mishandle globs.
:::

- For Google Drive
```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url="gdrive://<folder_name>/<subfolder_or_file_path>/"
```

- `gdrive://` indicates the Google Drive protocol.
- `folder_name` refers to a folder within Google Drive.
- `subfolder_or_file_path/` is a sub-folder or directory path within that folder.

- For Google Storage
```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url="gs://<bucket_name>/<path_to_files>/"
```

- `gs://` indicates the Google Cloud Storage protocol.
- `bucket_name` is the name of the bucket.
- `path_to_files/` is a directory path within the bucket.

- For AWS S3
```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
bucket_url="s3://<bucket_name>/<path_to_files>/"
```

- `s3://` indicates the AWS S3 protocol.
- `bucket_name` is the name of the bucket.
- `path_to_files/` is a directory path within the bucket.
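
To illustrate how a configured `bucket_url` is consumed, here is a minimal pipeline sketch. It assumes the `readers` helper scaffolded by `dlt init filesystem duckdb`, that credentials live in `.dlt/secrets.toml`, and purely illustrative names such as `standard_filesystem` and `csv_files`:

```py
# Minimal sketch: load CSV files matched by a bucket URL + glob into duckdb.
# Assumes the `readers` source from the scaffolded filesystem verified source.
import dlt
from filesystem import readers  # created by `dlt init filesystem duckdb`

pipeline = dlt.pipeline(
    pipeline_name="standard_filesystem",  # illustrative name
    destination="duckdb",
    dataset_name="filesystem_data",
)

# bucket_url can also be omitted here and resolved from config.toml instead
csv_files = readers(
    bucket_url="s3://my-bucket/csv_files/", file_glob="*.csv"
).read_csv()

load_info = pipeline.run(csv_files.with_name("csv_files"))
print(load_info)
```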

### Use local file system paths
You can use both native local file system paths and the `file:` URI form. Absolute, relative, and UNC Windows paths are supported.
For example, `bucket_url = '\\?\C:\a\b\c'` is an extended-length Windows path.
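
As a sketch, a local directory could be configured in any of the following forms; the paths are illustrative, so adjust them to your machine:

```toml
[sources.filesystem] # use [sources.readers.credentials] for the "readers" source
# bucket_url = 'C:\Users\me\Documents\csv_files\'          # native Windows path (literal string, no escaping)
# bucket_url = "file:///C:/Users/me/Documents/csv_files/"  # the same folder as a file: URI
bucket_url = "/var/local/csv_files"                        # absolute POSIX path
file_glob = "*"
```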
```sh
pip install -r requirements.txt
```

2. Install optional modules:

- For AWS S3:
```sh
pip install s3fs
```
- GCS storage: No separate module needed.

3. You're now ready to run the pipeline! To get started, run the following command:

```sh
python filesystem_pipeline.py
```

4. Once the pipeline has finished running, you can verify that everything loaded correctly by using
the following command:

```sh
dlt pipeline <pipeline_name> show
```

```py
fs_client.ls("ci-test-bucket/standard_source/samples")
```

<!--@@@DLT_TUBA filesystem-->