
Commit

Updates documentation for the Avro codec and S3 sink. Resolves #3162. (#3211)

Signed-off-by: David Venable <[email protected]>
(cherry picked from commit 3d67468)
dlvenable authored and github-actions[bot] committed Aug 22, 2023
1 parent 47ecf22 commit 9da31df
Showing 2 changed files with 14 additions and 151 deletions.
87 changes: 9 additions & 78 deletions data-prepper-plugins/avro-codecs/README.md
@@ -1,89 +1,20 @@
# Avro Sink/Output Codec
# Avro codecs

This is an implementation of the Avro sink codec, which serializes Data Prepper events into Avro records and writes them to the underlying OutputStream.
This project provides [Apache Avro](https://avro.apache.org/) support for Data Prepper. It includes an input codec, an output codec, and common libraries that can be used by other projects using Avro.

## Usages
## Usage

The Avro output codec can be configured with sink plugins (e.g., the S3 sink) in the pipeline configuration file.
For usage information, see the Data Prepper documentation:

## Configuration Options
* [S3 source](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/)
* [S3 sink](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/)

```
pipeline:
  ...
  sink:
    - s3:
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
          sts_header_overrides:
        max_retries: 5
        bucket: bucket_name
        object_key:
          path_prefix: vpc-flow-logs/%{yyyy}/%{MM}/%{dd}/
        threshold:
          event_count: 2000
          maximum_size: 50mb
          event_collect_timeout: 15s
        codec:
          avro:
            schema: >
              {
                "type" : "record",
                "namespace" : "org.opensearch.dataprepper.examples",
                "name" : "VpcFlowLog",
                "fields" : [
                  { "name" : "version", "type" : ["null", "string"]},
                  { "name" : "srcport", "type": ["null", "int"]},
                  { "name" : "dstport", "type": ["null", "int"]},
                  { "name" : "accountId", "type" : ["null", "string"]},
                  { "name" : "interfaceId", "type" : ["null", "string"]},
                  { "name" : "srcaddr", "type" : ["null", "string"]},
                  { "name" : "dstaddr", "type" : ["null", "string"]},
                  { "name" : "start", "type": ["null", "int"]},
                  { "name" : "end", "type": ["null", "int"]},
                  { "name" : "protocol", "type": ["null", "int"]},
                  { "name" : "packets", "type": ["null", "int"]},
                  { "name" : "bytes", "type": ["null", "int"]},
                  { "name" : "action", "type": ["null", "string"]},
                  { "name" : "logStatus", "type" : ["null", "string"]}
                ]
              }
            exclude_keys:
              - s3
        buffer_type: in_memory
```

## AWS Configuration

### Codec Configuration:

1) `schema`: A JSON string that the user can provide directly in the YAML file. The codec parses the Avro schema from this string.
2) `exclude_keys`: The event keys to exclude when converting events into Avro records.

### Note:

1) The user can provide only one schema at a time, i.e., through only one of the ways provided in the codec configuration.
2) If the user wants the tags to be part of the resulting Avro data and has set `tagsTargetKey` in the configuration file, the user must also modify the schema to accommodate the tags. Another field has to be provided in the `schema.json` file:

`{
  "name": "yourTagsTargetKey",
  "type": {
    "type": "array",
    "items": "string"
  }
}`
3) If the user doesn't provide a schema, the codec will auto-generate one from the first event in the buffer (see the sketch below).
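
For illustration only, a minimal sketch of an S3 sink relying on this auto-generation: no `schema` key is set under the `avro` codec, and the bucket name and path prefix are hypothetical.

```
sink:
  - s3:
      # aws credential settings omitted
      bucket: example-bucket
      object_key:
        path_prefix: events/%{yyyy}/%{MM}/%{dd}/
      threshold:
        event_count: 2000
        event_collect_timeout: 15s
      codec:
        avro:           # no schema provided; one is auto-generated from the first buffered event
      buffer_type: in_memory
```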

## Developer Guide

This plugin is compatible with Java 11. See below:

- [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md)
- [monitoring](https://github.com/opensearch-project/data-prepper/blob/main/docs/monitoring.md)
See the [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md) guide for general information on contributions.

The integration tests for this plugin do not run as part of the Data Prepper build.
They are included only with the S3 source or S3 sink for now.

The following command runs the integration tests:

```
./gradlew :data-prepper-plugins:s3-sink:integrationTest -Dtests.s3sink.region=<your-aws-region> -Dtests.s3sink.bucket=<your-bucket>
```
See the README files for those projects for information on running those tests.
78 changes: 5 additions & 73 deletions data-prepper-plugins/s3-sink/README.md
@@ -1,83 +1,15 @@
# S3 Sink
# S3 sink

This is the Data Prepper S3 sink plugin, which sends records to an S3 bucket via the S3Client.
The `s3` sink saves batches of events to Amazon Simple Storage Service (Amazon S3) objects.

## Usages
## Usage

The `s3` sink should be configured as part of the Data Prepper pipeline YAML file.

## Configuration Options

```
pipeline:
  ...
  sink:
    - s3:
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
          sts_header_overrides:
        max_retries: 5
        bucket: bucket_name
        object_key:
          path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/
        threshold:
          event_count: 2000
          maximum_size: 50mb
          event_collect_timeout: 15s
        codec:
          ndjson:
        buffer_type: in_memory
```

## AWS Configuration

- `region` (Optional) : The AWS region to use for credentials. Defaults to [standard SDK behavior to determine the region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).

- `sts_role_arn` (Optional) : The AWS STS role to assume for requests to S3. If not provided, the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html) is used.

- `sts_external_id` (Optional) : The external ID to attach to AssumeRole requests.

- `max_retries` (Optional) : The maximum number of times a single request to Amazon S3 will be retried. Defaults to `5`.

- `bucket` (Required) : The name of the S3 bucket to write to.

- `object_key` (Optional) : Contains `path_prefix` and `file_pattern`. Defaults to the S3 object name `events-%{yyyy-MM-dd'T'hh-mm-ss}` in the bucket root.

- `path_prefix` (Optional) : The key prefix (directory structure) within the bucket under which objects are stored. Defaults to none.

## Threshold Configuration

- `event_count` (Required) : The maximum number of events to collect before writing an object to the S3 bucket as part of the threshold.

- `maximum_size` (Optional) : A string representing the maximum number of bytes to collect before writing an object to the S3 bucket as part of the threshold. Defaults to `50mb`.

- `event_collect_timeout` (Required) : A string representing how long events should be collected before being written to the S3 bucket as part of the threshold. Duration values are strings that support ISO 8601 notation ("PT20.345S", "PT15M", etc.) as well as simple notation for seconds ("60s") and milliseconds ("1500ms"); see the sketch below.
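
As a sketch of the two duration notations (values arbitrary), the following `threshold` block uses the simple notation, with the ISO 8601 equivalent noted in a comment:

```
threshold:
  event_count: 2000
  maximum_size: 50mb
  event_collect_timeout: 15s   # simple notation; the ISO 8601 equivalent is PT15S
```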

## Buffer Type Configuration

- `buffer_type` (Optional) : Where records are stored temporarily before being flushed to the S3 bucket. Possible values are `local_file` and `in_memory`. Defaults to `in_memory`. See the sketch below for the `local_file` option.
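
For example, a sketch that switches the sink to the local-file buffer (all other sink settings omitted):

```
sink:
  - s3:
      # ... bucket, threshold, and codec settings as above ...
      buffer_type: local_file
```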

## Metrics

### Counters

* `s3SinkObjectsSucceeded` - The number of S3 objects that the S3 sink has successfully written to S3.
* `s3SinkObjectsFailed` - The number of S3 objects that the S3 sink failed to write to S3.
* `s3SinkObjectsEventsSucceeded` - The number of records that the S3 sink has successfully written to S3.
* `s3SinkObjectsEventsFailed` - The number of records that the S3 sink has failed to write to S3.

### Distribution Summaries

* `s3SinkObjectSizeBytes` - Measures the distribution of the S3 request's payload size in bytes.
For information on usage, see the [s3 sink documentation](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/).


## Developer Guide

This plugin is compatible with Java 11. See below:

- [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md)
- [monitoring](https://github.com/opensearch-project/data-prepper/blob/main/docs/monitoring.md)
See the [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md) guide for general information on contributions.

The integration tests for this plugin do not run as part of the Data Prepper build.
