Updated big5 to support the full 1 TiB corpus.
Signed-off-by: Govind Kamat <[email protected]>
gkamat committed Oct 15, 2024
1 parent 4458cd3 commit bae2578
Showing 2 changed files with 11 additions and 12 deletions.
14 changes: 6 additions & 8 deletions big5/README.md
@@ -1,6 +1,6 @@
## Big5 OSB Workload

This repository contains the "Big5" workload for benchmarking OpenSearch using OpenSearch Benchmark. The "Big5" workload focuses on five essential areas in OpenSearch performance and querying: Text Querying, Sorting, Date Histogram, Range Queries, and Terms Aggregation.
This repository contains the **_Big5_** workload for benchmarking OpenSearch using OpenSearch Benchmark. This workload focuses on five essential areas in OpenSearch performance and querying: Text Querying, Sorting, Date Histogram, Range Queries, and Terms Aggregation.

This workload is derived from the Elasticsearch vs. OpenSearch comparison benchmark. It has been modified to conform to OpenSearch Benchmark terminology and comply with OpenSearch features.

@@ -45,7 +45,7 @@ This workload allows the following parameters to be specified using `--workload-
* `bulk_indexing_clients` (default: 8): Number of clients that issue bulk indexing requests.
* `bulk_size` (default: 5000): The number of documents in each bulk during indexing.
* `cluster_health` (default: "green"): The minimum required cluster health.
* `corpus_size` (default: "100"): The size of the data corpus to use in GiB. The currently provided sizes are 100, 1000 and 60. Note that there are [certain considerations when using the 1000 GiB (1 TiB) data corpus](#considerations-when-using-the-1-tb-data-corpus).
* `corpus_size` (default: "100"): The size of the data corpus to use in GiB. The currently provided sizes are 100, 880 and 1000. Note that there are [certain considerations when using the 1000 GiB (~1 TiB) data corpus](#considerations-when-using-larger-data-corpora).
* `document_compressed_size_in_bytes`: If specifying an alternate data corpus, the compressed size of the corpus.
* `document_count`: If specifying an alternate data corpus, the number of documents in that corpus.
* `document_file`: If specifying an alternate data corpus, the file name of the corpus.
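
> Editor's note: for context on the parameters above, a minimal invocation overriding them might look like the sketch below. This is a hedged example, not part of the commit: the target host, corpus size, and client counts are illustrative assumptions.

```
# Hypothetical example: run the big5 workload against an existing cluster,
# overriding corpus_size and the bulk indexing settings described above.
opensearch-benchmark execute-test \
  --workload=big5 \
  --pipeline=benchmark-only \
  --target-hosts=localhost:9200 \
  --workload-params="corpus_size:1000,bulk_indexing_clients:8,bulk_size:5000"
```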
@@ -180,17 +180,16 @@ Running range-auto-date-histo-with-metrics [
------------------------------------------------------
```

### Considerations when Using the 1 TB Data Corpus
### Considerations when Using Larger Data Corpora

*Caveat*: This corpus is being made available as a feature that is currently in beta test. Some points to note when carrying out performance runs using this corpus:
There are several points to note when carrying out performance runs using large data corpora:

* Due to CloudFront download size limits, the uncompressed size of the 1 TB corpus is actually 0.95 TB (~0.9 TiB). This [issue has been noted](https://github.com/opensearch-project/opensearch-benchmark/issues/543) and will be resolved in due course.
* Use an external data store to record metrics. Using the in-memory store will likely result in the system running out of memory and becoming unresponsive, resulting in inaccurate performance numbers.
* Use a load generation host with sufficient disk space to hold the corpus.
* Use a load generation host with at least 8 cores and 32 GB of memory, with sufficient disk space to hold the corpus.
* Ensure the target cluster has adequate storage and at least 3 data nodes.
* Specify an appropriate shard count and number of replicas so that shards are evenly distributed and appropriately sized.
* Running the workload requires an instance type with at least 8 cores and 32 GB memory.
* Install the `pbzip2` decompressor to speed up decompression of the corpus.
* If you are using an older version of OSB, install the `pbzip2` decompressor to speed up decompression of the corpus.
* Set the client timeout to a sufficiently large value, since some queries take a long time to complete.
* Allow sufficient time for the workload to run. _Approximate_ times for the various steps involved, using an 8-core loadgen host:
- 15 minutes to download the corpus
@@ -199,7 +198,6 @@ Running range-auto-date-histo-with-metrics [
- 30 minutes for the force-merge
- 8 hours to run the set of included queries
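
> Editor's note: the sketch below illustrates two of the considerations above (the `pbzip2` recommendation for older OSB versions and the client timeout). The package-manager command and timeout value are assumptions to adapt to your environment, not values prescribed by this commit.

```
# Hypothetical setup on a Debian/Ubuntu load-generation host.
sudo apt-get install -y pbzip2   # parallel bzip2; speeds up corpus decompression on older OSB versions

# Raise the client timeout (in seconds) so long-running queries do not fail the run.
opensearch-benchmark execute-test \
  --workload=big5 \
  --pipeline=benchmark-only \
  --target-hosts=localhost:9200 \
  --workload-params="corpus_size:1000" \
  --client-options="timeout:7200"
```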

More details will be added in due course.

### License

9 changes: 5 additions & 4 deletions big5/workload.json
@@ -28,16 +28,17 @@
"compressed-bytes": 6023614688,
"uncompressed-bytes": 107321418111
}
{% elif corpus_size == 1000 %}
{% elif corpus_size == 880 %}
{
"source-file": "documents-1000.json.bz2",
"source-file": "documents-880.json.bz2",
"document-count": 1020000000,
"compressed-bytes": 53220934846,
"uncompressed-bytes": 943679382267
}
{% elif corpus_size == "1000-full" %}
{% elif corpus_size == 1000 %}
{
"source-file": "documents-1000-full.json.bz2",
"source-file": "documents-1000.json.bz2",
"source-file-parts": [ { "name": "documents-1000-part0", "size": 20189061054 }, { "name": "documents-1000-part1", "size": 20189061054 }, { "name": "documents-1000-part2", "size": 20189061055 } ],
"document-count": 1160800000,
"compressed-bytes": 60567183163,
"uncompressed-bytes": 1073936121222
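
> Editor's note: as a quick sanity check of the new multi-part corpus definition, the three `source-file-parts` sizes should sum to the corpus's `compressed-bytes` value (60567183163). A small shell sketch:

```
# Sum the part sizes listed in workload.json; prints 60567183163.
echo $((20189061054 + 20189061054 + 20189061055))
```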
