Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/add failure flag to tasklets #909

Merged
merged 7 commits into from
Jul 15, 2024

Conversation

lisajulia
Copy link
Contributor

As suggested by OPM/opm-common#3870 (comment), adding a failureFlag to the TaskletRunner, will solve #871

@lisajulia
Copy link
Contributor Author

jenkins build this please

This flag is shared between the threads, if any thread fails, this is set and before the new threads get dispatched or run, this flag is checked.
In case it is set, the program aborts.
@lisajulia lisajulia force-pushed the feature/add-failureFlag-to-tasklets branch from aae22ff to a2b9b9e Compare July 8, 2024 18:58
@lisajulia
Copy link
Contributor Author

jenkins build this please

@lisajulia lisajulia force-pushed the feature/add-failureFlag-to-tasklets branch from a2b9b9e to 6c2ecf2 Compare July 9, 2024 07:44
@lisajulia
Copy link
Contributor Author

jenkins build this please

Comment on lines 228 to 229
std::cerr << "Failure flag of the TaskletRunner is set. Not dispatching new tasklets.\n";
exit(EXIT_FAILURE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better than continuing.
Still, it might result in rather ungraceful behavior if there is an exit(EXIT_FAILURE) statement executed on one process. All processes might print an MPI error message about another process that died. It might be hard to find the real error message.

Also: Should we refactor to use OpmLog while at it?

Do you think it is possible to throw here and somewhere up the stack make sure that all processes stop more graceful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are possible:

  • "exit(EXIT_FAILURE)" version:
 Newton its= 3, linearizations= 4 (0.0sec), linear its= 16 (0.1sec)
ERROR: Uncaught std::exception when running tasklet: Can not open EclFile: ../../opm-data/spe1/SPE1CASE1.UNRST.
Failure flag of the TaskletRunner is set. Exiting thread.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28130,1],0]
  Exit code:    1
--------------------------------------------------------------------------
  • thow-version:
ERROR: Uncaught std::exception when running tasklet: Can not open EclFile: ../../opm-data/spe1/SPE1CASE1.UNRST.
Failure flag of the TaskletRunner is set. Exiting thread.
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failure flag of the TaskletRunner is set. Not dispatching new tasklets.
[samwise:101326] *** Process received signal ***
[samwise:101326] Signal: Aborted (6)
[samwise:101326] Signal code:  (-6)
[samwise:101326] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7fb74025b050]
[samwise:101326] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c)[0x7fb7402a9e2c]
[samwise:101326] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12)[0x7fb74025afb2]
[samwise:101326] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fb740245472]
[samwise:101326] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9d919)[0x7fb74009d919]
[samwise:101326] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e1a)[0x7fb7400a8e1a]
[samwise:101326] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e85)[0x7fb7400a8e85]
[samwise:101326] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa90d8)[0x7fb7400a90d8]
[samwise:101326] [ 8] ./bin/flow(+0x4169a0)[0x55c9aca949a0]
[samwise:101326] [ 9] ./bin/flow(+0x41689e)[0x55c9aca9489e]
[samwise:101326] [10] ./bin/flow(+0x7422da)[0x55c9acdc02da]
[samwise:101326] [11] ./bin/flow(+0x73add9)[0x55c9acdb8dd9]
[samwise:101326] [12] ./bin/flow(+0x730c61)[0x55c9acdaec61]
[samwise:101326] [13] ./bin/flow(+0x722f12)[0x55c9acda0f12]
[samwise:101326] [14] ./bin/flow(+0x710836)[0x55c9acd8e836]
[samwise:101326] [15] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3)[0x7fb7400d44a3]
[samwise:101326] [16] /lib/x86_64-linux-gnu/libc.so.6(+0x89134)[0x7fb7402a8134]
[samwise:101326] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc)[0x7fb7403287dc]
[samwise:101326] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node samwise exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either option is fine for me, let me know what you prefer

@blattms
Copy link
Member

blattms commented Jul 9, 2024

Interesting failure for python_basic:

Error: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'
ERROR: Uncaught std::exception when running tasklet: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'.

Probably this appeared previously and was not noticed. Question is whether this is a fatal failure or something that should be ignored.

Once that is sorted out, I'll merge.

@lisajulia
Copy link
Contributor Author

Interesting failure for python_basic:

Error: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'
ERROR: Uncaught std::exception when running tasklet: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'.

Probably this appeared previously and was not noticed. Question is whether this is a fatal failure or something that should be ignored.

Once that is sorted out, I'll merge.

Yes! Indeed! I'll look into that!

@lisajulia
Copy link
Contributor Author

jenkins build this opm-simulators=5474 please

@lisajulia
Copy link
Contributor Author

lisajulia commented Jul 9, 2024

Interesting failure for python_basic:

Error: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'
ERROR: Uncaught std::exception when running tasklet: [.../opm-common/opm/io/eclipse/EclFile.cpp:245] Could not open file: './SPE1CASE1.UNRST'.

Probably this appeared previously and was not noticed. Question is whether this is a fatal failure or something that should be ignored.

Once that is sorted out, I'll merge.

That is very strange, the case SPE1CASE1.DATA runs through when I call flow without python, but the test failed.
Anyhow: removing the keyword UNIFOUT solved it, now the python test runs through as well.

@lisajulia
Copy link
Contributor Author

@bska and @blattms: Can you have a look a this? Thanks! :)

Copy link
Member

@bska bska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the atomic<bool> although if we're looking very narrowly at the use case in opm-simulators I don't think actually need it since there's only ever one thread/request active at any one time. We should keep it nonetheless.

We might however consider a more polling-based interface. In particular, I would prefer it if the TaskletRunner did not call std::exit() on the simulator's behalf. If we had a predicate along the lines of

bool TaskletRunner::failure() const { return failureFlag_.load(...); }

then the client code in EclGenericWriter::doWriteOutput() might be rewritten as

this->taskletRunner_->barrier();
if (this->taskletRunner_->failure()) {
    throw SomeException { ... };
}

auto tasklet = std::make_shared<EclWriteTasklet>(...);
this->taskletRunner_->dispatch(std::move(tasklet));

Then we'd guard the calling code in EclWriter::writeOutput()–a function which is run on all ranks–by the OPM_BEGIN_PARALLEL_TRY_CATCH() and OPM_END_PARALLEL_TRY_CATCH() macros from DeferredLoggingErrorHelpers.hpp. I.e., we'd have something along the lines of

OPM_BEGIN_PARALLEL_TRY_CATCH()
if (this->collectOnIORank_.isIORank()) { ... }
OPM_END_PARALLEL_TRY_CATCH("File output failure: ", simulator_.gridView().comm())

Finally, I'll second @blattms' comment that we should try to move away from <iostream> based logging here, but that could/should be follow-up work.

@lisajulia
Copy link
Contributor Author

jenkins build this please

@lisajulia lisajulia requested review from blattms and bska July 12, 2024 13:16
@lisajulia
Copy link
Contributor Author

@bska: Can you have another look here?

Copy link
Member

@bska bska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might personally consider using one of the STL lock guards like std::unique_lock or std::lock_guard instead of directly calling the std::mutex::lock() and std::mutex::unlock() member functions, but in such simple test code it's probably a wash. On the other hand, I don't quite understand why we're using an owning raw pointer for the TaskletRunner object in the new unit test. Would you care to elaborate on the reason for this?

tests/test_tasklets_failure.cc Outdated Show resolved Hide resolved
@lisajulia
Copy link
Contributor Author

I might personally consider using one of the STL lock guards like std::unique_lock or std::lock_guard instead of directly calling the std::mutex::lock() and std::mutex::unlock() member functions, but in such simple test code it's probably a wash. On the other hand, I don't quite understand why we're using an owning raw pointer for the TaskletRunner object in the new unit test. Would you care to elaborate on the reason for this?

Thanks, I've added comments here: cedd34a, is that ok now? Otherwise I can also remove the assertion and outputs.

Copy link
Member

@bska bska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely an improvement, but I guess I still don't quite understand why the runner object needs to be a raw pointer instead of a smart pointer like std::unique_ptr<>. Does the following not work?

std::unique_ptr<Opm::TaskletRunner> runner{};
// ...
void execute()
{
    const int numWorkers = 2;
    runner = std::make_unique<Opm::TaskletRunner>(numWorkers);
    BOOST_REQUIRE_LT(runner->workerThreadIndex(), 0);
    BOOST_REQUIRE_EQ(runner->numWorkerThreads(), numWorkers);
    // ...
}

BOOST_AUTO_TEST_SUITE(Tasklets)
BOOST_AUTO_TEST_CASE(TASKLET_FAILURE)
{
    // As before...
}
BOOST_AUTO_TEST_SUITE_END()

@lisajulia lisajulia force-pushed the feature/add-failureFlag-to-tasklets branch from cedd34a to 52b6860 Compare July 15, 2024 12:52
@lisajulia
Copy link
Contributor Author

    runner = std::make_unique<Opm::TaskletRunner>(numWorkers);

Yes, thanks! I've changed that now in f1b4182

@lisajulia
Copy link
Contributor Author

jenkins build this please

Copy link
Member

@bska bska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the updates. This looks good to me now and I'll merge into master.

@bska bska merged commit 1d84a06 into OPM:master Jul 15, 2024
1 check passed
@lisajulia lisajulia deleted the feature/add-failureFlag-to-tasklets branch July 15, 2024 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants