Feature/add failure flag to tasklets #909

lisajulia · 2024-07-08T15:55:39Z

As suggested by OPM/opm-common#3870 (comment), adding a failureFlag to the TaskletRunner, will solve #871

lisajulia · 2024-07-08T15:55:52Z

jenkins build this please

This flag is shared between the threads, if any thread fails, this is set and before the new threads get dispatched or run, this flag is checked. In case it is set, the program aborts.

lisajulia · 2024-07-09T04:58:51Z

jenkins build this please

lisajulia · 2024-07-09T07:44:58Z

jenkins build this please

blattms · 2024-07-09T07:47:16Z

opm/models/parallel/tasklets.hh

+            std::cerr << "Failure flag of the TaskletRunner is set. Not dispatching new tasklets.\n";
+            exit(EXIT_FAILURE);


This is much better than continuing.
Still, it might result in rather ungraceful behavior if there is an exit(EXIT_FAILURE) statement executed on one process. All processes might print an MPI error message about another process that died. It might be hard to find the real error message.

Also: Should we refactor to use OpmLog while at it?

Do you think it is possible to throw here and somewhere up the stack make sure that all processes stop more graceful?

I'll check that

Both are possible:

"exit(EXIT_FAILURE)" version:

Newton its= 3, linearizations= 4 (0.0sec), linear its= 16 (0.1sec) ERROR: Uncaught std::exception when running tasklet: Can not open EclFile: ../../opm-data/spe1/SPE1CASE1.UNRST. Failure flag of the TaskletRunner is set. Exiting thread. -------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[28130,1],0] Exit code: 1 --------------------------------------------------------------------------

thow-version:

ERROR: Uncaught std::exception when running tasklet: Can not open EclFile: ../../opm-data/spe1/SPE1CASE1.UNRST. Failure flag of the TaskletRunner is set. Exiting thread. terminate called after throwing an instance of 'std::runtime_error' what(): Failure flag of the TaskletRunner is set. Not dispatching new tasklets. [samwise:101326] *** Process received signal *** [samwise:101326] Signal: Aborted (6) [samwise:101326] Signal code: (-6) [samwise:101326] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7fb74025b050] [samwise:101326] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c)[0x7fb7402a9e2c] [samwise:101326] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12)[0x7fb74025afb2] [samwise:101326] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fb740245472] [samwise:101326] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9d919)[0x7fb74009d919] [samwise:101326] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e1a)[0x7fb7400a8e1a] [samwise:101326] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e85)[0x7fb7400a8e85] [samwise:101326] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa90d8)[0x7fb7400a90d8] [samwise:101326] [ 8] ./bin/flow(+0x4169a0)[0x55c9aca949a0] [samwise:101326] [ 9] ./bin/flow(+0x41689e)[0x55c9aca9489e] [samwise:101326] [10] ./bin/flow(+0x7422da)[0x55c9acdc02da] [samwise:101326] [11] ./bin/flow(+0x73add9)[0x55c9acdb8dd9] [samwise:101326] [12] ./bin/flow(+0x730c61)[0x55c9acdaec61] [samwise:101326] [13] ./bin/flow(+0x722f12)[0x55c9acda0f12] [samwise:101326] [14] ./bin/flow(+0x710836)[0x55c9acd8e836] [samwise:101326] [15] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3)[0x7fb7400d44a3] [samwise:101326] [16] /lib/x86_64-linux-gnu/libc.so.6(+0x89134)[0x7fb7402a8134] [samwise:101326] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc)[0x7fb7403287dc] [samwise:101326] *** End of error message *** -------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun noticed that process rank 0 with PID 0 on node samwise exited on signal 6 (Aborted). --------------------------------------------------------------------------

Either option is fine for me, let me know what you prefer

blattms · 2024-07-09T09:38:01Z

Interesting failure for python_basic:

Error: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'
ERROR: Uncaught std::exception when running tasklet: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'.

Probably this appeared previously and was not noticed. Question is whether this is a fatal failure or something that should be ignored.

Once that is sorted out, I'll merge.

lisajulia · 2024-07-09T09:38:32Z

Interesting failure for python_basic:
Error: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'
ERROR: Uncaught std::exception when running tasklet: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'.
Probably this appeared previously and was not noticed. Question is whether this is a fatal failure or something that should be ignored.

Once that is sorted out, I'll merge.

Yes! Indeed! I'll look into that!

lisajulia · 2024-07-09T15:28:29Z

jenkins build this opm-simulators=5474 please

lisajulia · 2024-07-09T15:33:20Z

Interesting failure for python_basic:
Error: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'
ERROR: Uncaught std::exception when running tasklet: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'.
Probably this appeared previously and was not noticed. Question is whether this is a fatal failure or something that should be ignored.

Once that is sorted out, I'll merge.

That is very strange, the case SPE1CASE1.DATA runs through when I call flow without python, but the test failed.
Anyhow: removing the keyword UNIFOUT solved it, now the python test runs through as well.

lisajulia · 2024-07-11T05:59:20Z

@bska and @blattms: Can you have a look a this? Thanks! :)

bska

I like the atomic<bool> although if we're looking very narrowly at the use case in opm-simulators I don't think actually need it since there's only ever one thread/request active at any one time. We should keep it nonetheless.

We might however consider a more polling-based interface. In particular, I would prefer it if the TaskletRunner did not call std::exit() on the simulator's behalf. If we had a predicate along the lines of

bool TaskletRunner::failure() const { return failureFlag_.load(...); }

then the client code in EclGenericWriter::doWriteOutput() might be rewritten as

this->taskletRunner_->barrier();
if (this->taskletRunner_->failure()) {
    throw SomeException { ... };
}

auto tasklet = std::make_shared<EclWriteTasklet>(...);
this->taskletRunner_->dispatch(std::move(tasklet));

Then we'd guard the calling code in EclWriter::writeOutput()–a function which is run on all ranks–by the OPM_BEGIN_PARALLEL_TRY_CATCH() and OPM_END_PARALLEL_TRY_CATCH() macros from DeferredLoggingErrorHelpers.hpp. I.e., we'd have something along the lines of

OPM_BEGIN_PARALLEL_TRY_CATCH()
if (this->collectOnIORank_.isIORank()) { ... }
OPM_END_PARALLEL_TRY_CATCH("File output failure: ", simulator_.gridView().comm())

Finally, I'll second @blattms' comment that we should try to move away from <iostream> based logging here, but that could/should be follow-up work.

…hanism

lisajulia · 2024-07-12T13:15:15Z

jenkins build this please

lisajulia · 2024-07-15T08:45:26Z

@bska: Can you have another look here?

bska

I might personally consider using one of the STL lock guards like std::unique_lock or std::lock_guard instead of directly calling the std::mutex::lock() and std::mutex::unlock() member functions, but in such simple test code it's probably a wash. On the other hand, I don't quite understand why we're using an owning raw pointer for the TaskletRunner object in the new unit test. Would you care to elaborate on the reason for this?

tests/test_tasklets_failure.cc

lisajulia · 2024-07-15T12:25:51Z

I might personally consider using one of the STL lock guards like std::unique_lock or std::lock_guard instead of directly calling the std::mutex::lock() and std::mutex::unlock() member functions, but in such simple test code it's probably a wash. On the other hand, I don't quite understand why we're using an owning raw pointer for the TaskletRunner object in the new unit test. Would you care to elaborate on the reason for this?

Thanks, I've added comments here: cedd34a, is that ok now? Otherwise I can also remove the assertion and outputs.

bska

This is definitely an improvement, but I guess I still don't quite understand why the runner object needs to be a raw pointer instead of a smart pointer like std::unique_ptr<>. Does the following not work?

std::unique_ptr<Opm::TaskletRunner> runner{};
// ...
void execute()
{
    const int numWorkers = 2;
    runner = std::make_unique<Opm::TaskletRunner>(numWorkers);
    BOOST_REQUIRE_LT(runner->workerThreadIndex(), 0);
    BOOST_REQUIRE_EQ(runner->numWorkerThreads(), numWorkers);
    // ...
}

BOOST_AUTO_TEST_SUITE(Tasklets)
BOOST_AUTO_TEST_CASE(TASKLET_FAILURE)
{
    // As before...
}
BOOST_AUTO_TEST_SUITE_END()

…out why they are created on the heap

lisajulia · 2024-07-15T12:53:27Z

    runner = std::make_unique<Opm::TaskletRunner>(numWorkers);

Yes, thanks! I've changed that now in f1b4182

lisajulia · 2024-07-15T12:57:05Z

jenkins build this please

bska

Thanks a lot for the updates. This looks good to me now and I'll merge into master.

Add an atomic flag to TaskletRunner

920d216

This flag is shared between the threads, if any thread fails, this is set and before the new threads get dispatched or run, this flag is checked. In case it is set, the program aborts.

lisajulia force-pushed the feature/add-failureFlag-to-tasklets branch from aae22ff to a2b9b9e Compare July 8, 2024 18:58

Add test for failing tasklets

6c2ecf2

lisajulia force-pushed the feature/add-failureFlag-to-tasklets branch from a2b9b9e to 6c2ecf2 Compare July 9, 2024 07:44

blattms reviewed Jul 9, 2024

View reviewed changes

bska reviewed Jul 11, 2024

View reviewed changes

Implement a more polling-based interface for the tasklets failure mec…

1d2df29

…hanism

lisajulia requested review from blattms and bska July 12, 2024 13:16

bska reviewed Jul 15, 2024

View reviewed changes

tests/test_tasklets_failure.cc Outdated Show resolved Hide resolved

Change lock mechanism to std::lock_guard

bdeddda

bska reviewed Jul 15, 2024

View reviewed changes

lisajulia added 2 commits July 15, 2024 14:51

Change pointer to tasklet runners to smart pointer and add comment ab…

f1b4182

…out why they are created on the heap

Fix indentation

52b6860

lisajulia force-pushed the feature/add-failureFlag-to-tasklets branch from cedd34a to 52b6860 Compare July 15, 2024 12:52

Change asserts to BOOST_REQUIRE_* in test_tasklets_failure

d13fc3f

bska approved these changes Jul 15, 2024

View reviewed changes

bska merged commit 1d84a06 into OPM:master Jul 15, 2024
1 check passed

lisajulia deleted the feature/add-failureFlag-to-tasklets branch July 15, 2024 15:07

lisajulia mentioned this pull request Jul 16, 2024

Keyword UNIFOUT causes file problems when running flow from Python OPM/opm-simulators#5480

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/add failure flag to tasklets #909

Feature/add failure flag to tasklets #909

lisajulia commented Jul 8, 2024

lisajulia commented Jul 8, 2024

lisajulia commented Jul 9, 2024

lisajulia commented Jul 9, 2024

blattms Jul 9, 2024

lisajulia Jul 9, 2024

lisajulia Jul 9, 2024

lisajulia Jul 9, 2024

blattms commented Jul 9, 2024

lisajulia commented Jul 9, 2024

lisajulia commented Jul 9, 2024

lisajulia commented Jul 9, 2024 •

edited

Loading

lisajulia commented Jul 11, 2024

bska left a comment

lisajulia commented Jul 12, 2024

lisajulia commented Jul 15, 2024

bska left a comment

lisajulia commented Jul 15, 2024

bska left a comment

lisajulia commented Jul 15, 2024

lisajulia commented Jul 15, 2024

bska left a comment

		std::cerr << "Failure flag of the TaskletRunner is set. Not dispatching new tasklets.\n";
		exit(EXIT_FAILURE);

Feature/add failure flag to tasklets #909

Feature/add failure flag to tasklets #909

Conversation

lisajulia commented Jul 8, 2024

lisajulia commented Jul 8, 2024

lisajulia commented Jul 9, 2024

lisajulia commented Jul 9, 2024

blattms Jul 9, 2024

Choose a reason for hiding this comment

lisajulia Jul 9, 2024

Choose a reason for hiding this comment

lisajulia Jul 9, 2024

Choose a reason for hiding this comment

lisajulia Jul 9, 2024

Choose a reason for hiding this comment

blattms commented Jul 9, 2024

lisajulia commented Jul 9, 2024

lisajulia commented Jul 9, 2024

lisajulia commented Jul 9, 2024 • edited Loading

lisajulia commented Jul 11, 2024

bska left a comment

Choose a reason for hiding this comment

lisajulia commented Jul 12, 2024

lisajulia commented Jul 15, 2024

bska left a comment

Choose a reason for hiding this comment

lisajulia commented Jul 15, 2024

bska left a comment

Choose a reason for hiding this comment

lisajulia commented Jul 15, 2024

lisajulia commented Jul 15, 2024

bska left a comment

Choose a reason for hiding this comment

lisajulia commented Jul 9, 2024 •

edited

Loading