Feature/add failure flag to tasklets #909
jenkins build this please
This flag is shared between the threads, if any thread fails, this is set and before the new threads get dispatched or run, this flag is checked. In case it is set, the program aborts.
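To make the description concrete, here is a minimal sketch of the idea rather than the code in this PR. `MiniRunner`, `runAll()` and the one-thread-per-job layout are hypothetical stand-ins chosen for illustration; only the `std::atomic<bool>` flag and the "check before dispatching" rule come from the description above. Unlike the PR, the sketch stops dispatching instead of aborting the program.

```cpp
#include <atomic>
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>
#include <vector>

// Hypothetical stand-in for the runner; this is NOT Opm::TaskletRunner.
class MiniRunner
{
public:
    // Runs each job on its own thread, but refuses to start new work once
    // an earlier job has failed. Returns the number of jobs that were started.
    int runAll(const std::vector<std::function<void()>>& jobs)
    {
        int started = 0;
        for (const auto& job : jobs) {
            if (failureFlag_.load(std::memory_order_acquire)) {
                break; // an earlier job failed: do not dispatch more work
            }
            ++started;
            std::thread worker([this, &job]() {
                try {
                    job();
                }
                catch (const std::exception&) {
                    // Record the failure so that every other thread can see it.
                    failureFlag_.store(true, std::memory_order_release);
                }
            });
            worker.join(); // joined immediately to keep the sketch deterministic
        }
        return started;
    }

    bool failure() const
    {
        return failureFlag_.load(std::memory_order_acquire);
    }

private:
    std::atomic<bool> failureFlag_{false};
};
```

With three jobs where the second one throws, this sketch starts two jobs and leaves the flag set; the third job is never dispatched.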
aae22ff
to
a2b9b9e
jenkins build this please
a2b9b9e
to
6c2ecf2
jenkins build this please
opm/models/parallel/tasklets.hh
Outdated
std::cerr << "Failure flag of the TaskletRunner is set. Not dispatching new tasklets.\n";
exit(EXIT_FAILURE);
This is much better than continuing.
Still, it might result in rather ungraceful behavior if there is an exit(EXIT_FAILURE) statement executed on one process. All processes might print an MPI error message about another process that died. It might be hard to find the real error message.
Also: Should we refactor to use OpmLog while at it?
Do you think it is possible to throw here and, somewhere up the stack, make sure that all processes stop more gracefully?
I'll check that
Both are possible:
- "exit(EXIT_FAILURE)" version:
Newton its= 3, linearizations= 4 (0.0sec), linear its= 16 (0.1sec)
ERROR: Uncaught std::exception when running tasklet: Can not open EclFile: ../../opm-data/spe1/SPE1CASE1.UNRST.
Failure flag of the TaskletRunner is set. Exiting thread.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[28130,1],0]
Exit code: 1
--------------------------------------------------------------------------
- "throw" version:
ERROR: Uncaught std::exception when running tasklet: Can not open EclFile: ../../opm-data/spe1/SPE1CASE1.UNRST.
Failure flag of the TaskletRunner is set. Exiting thread.
terminate called after throwing an instance of 'std::runtime_error'
what(): Failure flag of the TaskletRunner is set. Not dispatching new tasklets.
[samwise:101326] *** Process received signal ***
[samwise:101326] Signal: Aborted (6)
[samwise:101326] Signal code: (-6)
[samwise:101326] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7fb74025b050]
[samwise:101326] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c)[0x7fb7402a9e2c]
[samwise:101326] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12)[0x7fb74025afb2]
[samwise:101326] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fb740245472]
[samwise:101326] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9d919)[0x7fb74009d919]
[samwise:101326] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e1a)[0x7fb7400a8e1a]
[samwise:101326] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e85)[0x7fb7400a8e85]
[samwise:101326] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa90d8)[0x7fb7400a90d8]
[samwise:101326] [ 8] ./bin/flow(+0x4169a0)[0x55c9aca949a0]
[samwise:101326] [ 9] ./bin/flow(+0x41689e)[0x55c9aca9489e]
[samwise:101326] [10] ./bin/flow(+0x7422da)[0x55c9acdc02da]
[samwise:101326] [11] ./bin/flow(+0x73add9)[0x55c9acdb8dd9]
[samwise:101326] [12] ./bin/flow(+0x730c61)[0x55c9acdaec61]
[samwise:101326] [13] ./bin/flow(+0x722f12)[0x55c9acda0f12]
[samwise:101326] [14] ./bin/flow(+0x710836)[0x55c9acd8e836]
[samwise:101326] [15] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3)[0x7fb7400d44a3]
[samwise:101326] [16] /lib/x86_64-linux-gnu/libc.so.6(+0x89134)[0x7fb7402a8134]
[samwise:101326] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc)[0x7fb7403287dc]
[samwise:101326] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node samwise exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Either option is fine for me, let me know what you prefer
Interesting failure for python_basic:
This probably appeared previously and was not noticed. The question is whether this is a fatal failure or something that should be ignored. Once that is sorted out, I'll merge.
Yes! Indeed! I'll look into that!
jenkins build this opm-simulators=5474 please
That is very strange: the case SPE1CASE1.DATA runs through when I call flow without python, but the test failed.
I like the atomic<bool>, although if we're looking very narrowly at the use case in opm-simulators I don't think we actually need it since there's only ever one thread/request active at any one time. We should keep it nonetheless.
We might however consider a more polling-based interface. In particular, I would prefer it if the TaskletRunner did not call std::exit() on the simulator's behalf. If we had a predicate along the lines of
bool TaskletRunner::failure() const { return failureFlag_.load(...); }
then the client code in EclGenericWriter::doWriteOutput() might be rewritten as
this->taskletRunner_->barrier();
if (this->taskletRunner_->failure()) {
    throw SomeException { ... };
}
auto tasklet = std::make_shared<EclWriteTasklet>(...);
this->taskletRunner_->dispatch(std::move(tasklet));
Then we'd guard the calling code in EclWriter::writeOutput() (a function which is run on all ranks) by the OPM_BEGIN_PARALLEL_TRY_CATCH() and OPM_END_PARALLEL_TRY_CATCH() macros from DeferredLoggingErrorHelpers.hpp. I.e., we'd have something along the lines of
OPM_BEGIN_PARALLEL_TRY_CATCH()
if (this->collectOnIORank_.isIORank()) { ... }
OPM_END_PARALLEL_TRY_CATCH("File output failure: ", simulator_.gridView().comm())
Finally, I'll second @blattms' comment that we should try to move away from <iostream>-based logging here, but that could/should be follow-up work.
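For illustration, the polling idea can be sketched in isolation roughly as follows. Everything here is hypothetical: `PollingRunner`, `markFailed()`, `writeOutput()` and `runStep()` are made-up names, and the sketch uses a plain try/catch where the real code would presumably rely on the OPM parallel try/catch macros mentioned above. The only point is that the runner reports the failure and the caller decides how to react.

```cpp
#include <atomic>
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical runner that only exposes the predicate; NOT Opm::TaskletRunner.
class PollingRunner
{
public:
    // Stand-in for a worker thread recording that a task threw.
    void markFailed() { failureFlag_.store(true, std::memory_order_release); }

    bool failure() const { return failureFlag_.load(std::memory_order_acquire); }

private:
    std::atomic<bool> failureFlag_{false};
};

// Hypothetical client: polls the runner and throws instead of letting the
// runner call std::exit() on its behalf.
void writeOutput(const PollingRunner& runner)
{
    if (runner.failure()) {
        throw std::runtime_error("Previous output task failed; not dispatching new work");
    }
    // ... dispatch the next output task here ...
}

// Hypothetical driver: converts the exception into a status code and a message
// so that the caller can shut down in an orderly way.
int runStep(const PollingRunner& runner, std::string& message)
{
    try {
        writeOutput(runner);
        return 0;
    }
    catch (const std::exception& e) {
        message = e.what();
        return 1;
    }
}
```

A healthy runner makes `runStep()` return 0 with an empty message; once `markFailed()` has been called it returns 1 and carries the exception text, with no call to `std::exit()` from inside the runner.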
jenkins build this please
@bska: Can you have another look here?
I might personally consider using one of the STL lock guards like std::unique_lock or std::lock_guard instead of directly calling the std::mutex::lock() and std::mutex::unlock() member functions, but in such simple test code it's probably a wash. On the other hand, I don't quite understand why we're using an owning raw pointer for the TaskletRunner object in the new unit test. Would you care to elaborate on the reason for this?
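As a small illustration of the lock guard remark, and not taken from the unit test in this PR, the sketch below protects a shared counter with std::lock_guard. `SharedCounter` and `countWithThreads()` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counter shared by several threads.
struct SharedCounter
{
    std::mutex mutex;
    int value = 0;

    void increment()
    {
        // The guard locks the mutex here and unlocks it when it goes out of
        // scope, including when an exception propagates out of this function.
        std::lock_guard<std::mutex> guard(mutex);
        ++value;
    }
};

// Starts numThreads threads that each increment the counter
// incrementsPerThread times, then returns the final value.
int countWithThreads(int numThreads, int incrementsPerThread)
{
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&counter, incrementsPerThread]() {
            for (int i = 0; i < incrementsPerThread; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return counter.value;
}
```

The practical difference from explicit lock()/unlock() calls is that no code path can forget the unlock; for a short test body the two are otherwise equivalent.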
Thanks, I've added comments here: cedd34a. Is that OK now? Otherwise I can also remove the assertion and outputs.
This is definitely an improvement, but I guess I still don't quite understand why the runner object needs to be a raw pointer instead of a smart pointer like std::unique_ptr<>. Does the following not work?
std::unique_ptr<Opm::TaskletRunner> runner{};
// ...
void execute()
{
    const int numWorkers = 2;
    runner = std::make_unique<Opm::TaskletRunner>(numWorkers);
    BOOST_REQUIRE_LT(runner->workerThreadIndex(), 0);
    BOOST_REQUIRE_EQ(runner->numWorkerThreads(), numWorkers);
    // ...
}
BOOST_AUTO_TEST_SUITE(Tasklets)
BOOST_AUTO_TEST_CASE(TASKLET_FAILURE)
{
    // As before...
}
BOOST_AUTO_TEST_SUITE_END()
…out why they are created on the heap
cedd34a
to
52b6860
Yes, thanks! I've changed that now in f1b4182
jenkins build this please
Thanks a lot for the updates. This looks good to me now and I'll merge into master.
As suggested by OPM/opm-common#3870 (comment), adding a failureFlag to the TaskletRunner will solve #871