Is your feature request related to a problem? Please describe.
Currently, our ETL job runs every 30 minutes and writes a file to S3, which triggers the OpenSearch ingestion pipeline. Because the ETL completion time varies, it is difficult to choose a refresh_interval at the index level that works consistently for all scenarios.
As a result, there is a delay before the data becomes searchable, even though ingestion into OpenSearch is complete.
Describe the solution you'd like
We propose adding a new configuration option for HTTP post-processor hooks in the Data Prepper pipeline definition. It would let us specify an HTTP POST endpoint and call the refresh API (/index-name/_refresh) once pipeline ingestion has completed.
Currently, the processors available in the pipeline definition only run before data is ingested into OpenSearch.
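For illustration, the call such a post-ingestion hook would need to make can be sketched in Python. This is only a minimal sketch of the HTTP request, not a Data Prepper feature: the function names `build_refresh_request` and `refresh_index` are hypothetical, the endpoint and index name are placeholders, and authentication (which most clusters require) is left out.

```python
import urllib.request
from urllib.parse import quote


def build_refresh_request(endpoint: str, index: str) -> urllib.request.Request:
    """Build a POST request to the refresh API of a single index.

    `endpoint` is the cluster base URL and `index` the index name;
    both are placeholders supplied by the caller.
    """
    url = f"{endpoint.rstrip('/')}/{quote(index, safe='')}/_refresh"
    return urllib.request.Request(url, method="POST")


def refresh_index(endpoint: str, index: str, timeout: float = 10.0) -> int:
    """Send the refresh request and return the HTTP status code.

    No authentication is attached here; a real cluster will usually need it.
    """
    request = build_refresh_request(endpoint, index)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A hook of this kind would call something like `refresh_index("https://localhost:9200", "my-index")` once the sink reports that all records from the file have been written.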
Describe alternatives you've considered (Optional)
Provide a refresh option in the pipeline's index settings that internally refreshes the index after the pipeline has executed.
Additional context
N/A
@SavvasSriAnushaVeeramachineni, thank you for opening this issue. I understand that you'd like Data Prepper to automatically call the _refresh API for every updated index.
Can you clarify what will try making that call? Are you using S3-scan? Do you want the completion of the scan to trigger the refresh?
"As a result of this behavior - there is a delay in the data being available even though the ingestion to OpenSearch is complete."
What is your delay?
Also, have you tried using the default refresh_interval to let OpenSearch handle it?
@dlvenable Thanks for replying!
Regarding "Can you clarify what will try making that call? Are you using S3-scan? Do you want the completion of the scan to trigger the refresh?":
We are using S3-SQS processing.
I want the _refresh API to be called after all the records in the CSV file have been ingested at the sink (OpenSearch).
We would like the data to appear in search results immediately after the pipeline has ingested it into OpenSearch.
Regarding "What is your delay?":
The delay is around 30 minutes.
Regarding "Also, have you tried using the default refresh_interval to let OpenSearch handle it?":
Yes, I have tried setting refresh_interval to 30 minutes, but ingestion does not always complete within the 30-minute window. If a refresh cycle has just completed and the pipeline inserts data one minute later, we still have to wait nearly another 30 minutes for the data to appear in search results.
We don't want to lower refresh_interval either, as that would increase the load and computational cost on the index, and the scheduler that inserts data into S3 only runs every 30 minutes.
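The waiting time described above can be worked through with a small Python sketch. It assumes an idealized model in which refreshes fire at exact multiples of the interval, which is a simplification rather than a statement about how OpenSearch schedules refreshes; the function name `minutes_until_searchable` is hypothetical.

```python
def minutes_until_searchable(ingest_done_min: int, refresh_interval_min: int) -> int:
    """Minutes between ingestion finishing and the next scheduled refresh.

    Assumption: refreshes happen at exact multiples of the interval,
    counted in minutes from time 0.
    """
    return (-ingest_done_min) % refresh_interval_min


# Ingestion finishing at different offsets after the last refresh,
# with a 30-minute refresh_interval.
for done in (1, 10, 29):
    print(done, minutes_until_searchable(done, 30))
```

Under this model, ingestion that finishes one minute after a refresh waits 29 minutes, so the delay depends entirely on where the ETL completion time lands within the interval.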