Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forbid fenced Container to stop ConcurrentContainer #3537

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

LokeshAlamuri
Copy link
Contributor

[DRAFT]
Fixes #GH-3448

Issue: Fenced Child Container could stop the running ConcurrentContainer
Fix: Configure KafkaMessageListenerContainer (KMLC) to use ConcurrentMessagleListenerContainerRef instead of
ConcurrentContainer. Internally, ConcurrentContainerRef checks if KMLC is fenced when stop operations are called on Concurrent Container. If KMLC is fenced, suppress the stop related operations. If KMLC is not fenced, delegate the stop call to ConcurrentContainer.

[DRAFT]
Fixes #spring-projectsGH-3448
spring-projects#3448

Issue: Fenced Child Container could stop the running ConcurrentContainer
Fix: Configure KafkaMessageListenerContainer (KMLC) to use ConcurrentMessagleListenerContainerRef instead ofConcurrentContainer.
Internally, ConcurrentContainerRef checks if KMLC is fenced when stop operations are called on Concurrent Container. If KMLC is fenced, suppress the `stop` related operations.
If KMLC is not fenced, delegate the stop call to ConcurrentContainer.
Copy link
Member

@artembilan artembilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry. That's too many changes.
What is really the problem?
And why would KMLC call the stop on its parent at all?

@sobychacko
Copy link
Contributor

sobychacko commented Oct 7, 2024

@LokeshAlamuri Thanks for the PR. Have you run into it as a bug in your application? We see that you have a test on the related issue. Is that enough to reproduce the issue? As @artembilan mentioned, this PR has many changes that touch the critical elements of the framework. Since we are about to go into an RC phase in our release cycle, we may not want to introduce any potential regression issues in core components unless they are critical bugs. Case in point - a recent change in this area accidentally slowed down one of our tests in a downstream framework, which we only noticed a few weeks later due to the subtle nature of the issue. Because of that reason, I would vote to revisit this issue in the next minor version after 3.3.x, but we can discuss it. Thanks!

@LokeshAlamuri
Copy link
Contributor Author

LokeshAlamuri commented Oct 12, 2024

I'm sorry. That's too many changes. What is really the problem? And why would KMLC call the stop on its parent at all?

sobychacko artembilan Thanks a lot for allocating your time and reviewing the issue and draft version of PR I have submitted. As mentioned in the issue I have reported, this is possible in the following scenario.

Configuration:

  1. CommonContainerStoppingErrorHandler is chosen to stop the ConcurrentContainer on failure.
  2. Chosen concurreny is 2.

Scenario:

  1. ConcurrentContainer is started.
  2. ConcurrentContainer is stopped.
  3. ConcurrentContainer is started. Assume, ConcurrentContainer is started immediately before all the KMLC are stopped.
  4. Now if one of the KMLC of previous run has some error while processing previous record, it would stop the running ConcurrentContainer. This is not supposed to happen.

This is really a complex issue which would require changes in multiple files and APIs as well. This PR I have submitted could help in understanding the design issue and possible fix. The issue is not only with the stop but other APIs as well. I personally feel, ConcurrentContainer instance is freely passed around ErrorHandlers and Events. This is causing the real issue.

@LokeshAlamuri
Copy link
Contributor Author

Have you run into it as a bug in your application? We see that you have a test on the related issue. Is that enough to reproduce the issue?

Yes, the Junits I have submitted would be 100% sufficient to reproduce the issue.

@LokeshAlamuri
Copy link
Contributor Author

this PR has many changes that touch the critical elements of the framework. Since we are about to go into an RC phase in our release cycle, we may not want to introduce any potential regression issues in core components unless they are critical bugs. Case in point - a recent change in this area accidentally slowed down one of our tests in a downstream framework, which we only noticed a few weeks later due to the subtle nature of the issue. Because of that reason, I would vote to revisit this issue in the next minor version after 3.3.x, but we can discuss it. Thanks!

I shall follow your suggestions and provide the information required.

@artembilan
Copy link
Member

@LokeshAlamuri ,

can you confirm, please, that this piece of code from the KafkaMessageListenerContainer is guilty in the picture:

		/**
		 * Handle exceptions thrown by the consumer outside of message listener
		 * invocation (e.g. commit exceptions).
		 * @param e the exception.
		 */
		protected void handleConsumerException(Exception e) {
			if (e instanceof RetriableCommitFailedException) {
				this.logger.error(e, "Commit retries exhausted");
				return;
			}
			try {
				if (this.commonErrorHandler != null) {
						this.commonErrorHandler.handleOtherException(e, this.consumer,
								KafkaMessageListenerContainer.this.thisOrParentContainer, this.isBatchListener);
				}
				else {
					this.logger.error(e, "Consumer exception");
				}
			}
			catch (Exception ex) {
				this.logger.error(ex, "Consumer exception");
			}
		}

So, whenever the KMLC handles an exception, it calls that handleOtherException() providing its KafkaMessageListenerContainer.this.thisOrParentContainer.
Can it be just replaced into just this whenever the KMLC is fenced?

@LokeshAlamuri
Copy link
Contributor Author

LokeshAlamuri commented Oct 15, 2024

We can do that. But, still the container could be fenced when it is inside error handler but not at the point where planned fenced condition is verified. In that case, suggested fix will not work. This is a bit design issue, where problem is not just with errorhandlers and stop API. It could be a pause call on ConcurrentContainer and inside the EventListeners also.

@artembilan
Copy link
Member

This is a bit design issue, where problem is not just with errorhandlers and stop API

I see, so even if we check for fenced just before calling those API, there is still a race condition when we call API, but fenced is set after that, and target user code would suffer from the problem you are describing.

So, you probably right: something like proxy for the container to push to those APIs where we would check for fenced on each container hooks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants