Ocean fails stand-alone decomp test, intel optimized #5219

Closed

mark-petersen opened this issue Oct 6, 2022 · 15 comments · Fixed by #5356

@mark-petersen (Contributor)

MPAS-Ocean nightly on master 4dcc8db fails

ocean/baroclinic_channel/10km/decomp_test
ocean/global_ocean/QU240/PHC/RK4/decomp_test

with the intel optimized compiler and OpenMP. Differences between the 4-processor and 8-processor runs are at most 1e-13.

@mark-petersen (Contributor, Author)

Earlier this week, I thought it failed

ocean/global_ocean/QU240/PHC/decomp_test

with intel optimized, but I don't see that on badger with 4dcc8db now; maybe I'm mixing it up, though.

@mark-petersen (Contributor, Author)

I tested the very first SMEP merge, bb84429 Merge branch 'vanroekel/ocean/add-submesoscale-eddies' (PR #5099):

PASS ocean/baroclinic_channel/10km/decomp_test
FAIL ocean/global_ocean/QU240/PHC/decomp_test
PASS ocean/global_ocean/QU240/PHC/RK4/decomp_test

with intel optimized on badger. This is confusing.

@xylar (Contributor) commented Oct 6, 2022

I'm trying to use bisection to find the cause of this, but the runs just hang when I try to test c63cce2. This may have more to do with a bad node on Anvil or something, but it's certainly not helping me debug...

@xylar (Contributor) commented Oct 7, 2022

@mark-petersen, I agree that #5099 is responsible and that this is probably already fixed in #5216. I'll make sure.

@xylar (Contributor) commented Oct 7, 2022

Nope, the test is still failing after #5216 so there's another threading problem from #5099 that we need to track down.

@xylar (Contributor) commented Oct 10, 2022

Sorry, I wasn't clear in my mind. We're looking for a decomposition problem, not a threading problem. Much trickier in some ways!

@xylar (Contributor) commented Nov 29, 2022

I believe this was introduced by #5183, not by #5099. At least that is what I'm seeing in testing on Chrysalis with Intel and Intel-MPI: test execution passes for #5170 (the previous ocean-related merge) but fails for #5183. No PRs were merged between these two, so it seems likely that #5183 is responsible, though why is not at all clear at this point.

@xylar (Contributor) commented Nov 30, 2022

After rerunning the pr test suite today, this test case runs fine for #5183, so yesterday's failures seem random and unrelated to this issue. Still investigating.

@xylar (Contributor) commented Nov 30, 2022

@mark-petersen and @dengwirda, I now believe this issue was introduced by #5195. While I ran into sporadic execution failures of ocean/baroclinic_channel/10km/decomp_test before that PR, I find consistent comparison (4 vs. 8 processor) failures after it. In my testing, ocean/baroclinic_channel/10km/decomp_test still passes at #5172 but fails starting at #5195. I have not tested #5178 and #5182, which were merged between these two, but I think they are highly unlikely to have caused this because they don't involve any standalone MPAS code.

I'm not at all confident about this but the most likely culprit to my eyes is this new loop:

!$omp parallel
!$omp do schedule(runtime) &
!$omp private(cell1, cell2, k, thicknessSum)
do iEdge = nEdgesOwned+1, nEdgesArray(4)
   cell1 = cellsOnEdge(1,iEdge)
   cell2 = cellsOnEdge(2,iEdge)
   thicknessSum = layerThickEdgeFlux(minLevelEdgeBot(iEdge),iEdge)
   do k = minLevelEdgeBot(iEdge)+1, maxLevelEdgeTop(iEdge)
      thicknessSum = thicknessSum + &
                     layerThickEdgeFlux(k,iEdge)
   enddo
   bottomDepthEdge(iEdge) = thicknessSum &
      - 0.5_RKIND*(sshNew(cell1) + sshNew(cell2))
enddo ! iEdge
!$omp end do
!$omp end parallel

It seems like the OpenMP directives may not cover all the variables they need to. Some were fixed in #5226, but maybe some are still missing?
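
To illustrate the concern, here is a minimal stand-alone sketch (not MPAS code; all names are hypothetical): if a loop-local scalar such as the per-column sum were left out of the private clause, every thread would update one shared copy and the results would vary from run to run.

program private_demo
   implicit none
   integer, parameter :: RKIND = selected_real_kind(12)
   integer, parameter :: nEdges = 1000, nLevels = 40
   real(RKIND) :: thick(nLevels, nEdges), colSum(nEdges), total
   integer :: iEdge, k

   thick = 50.0_RKIND

   ! k and total must be listed as private; the parallel loop index iEdge
   ! is private automatically. Omitting total from this clause would let
   ! threads race on a single shared copy.
   !$omp parallel do private(k, total)
   do iEdge = 1, nEdges
      total = 0.0_RKIND
      do k = 1, nLevels
         total = total + thick(k, iEdge)
      enddo
      colSum(iEdge) = total
   enddo
   !$omp end parallel do

   print *, 'colSum(1) =', colSum(1)   ! expect 40 levels * 50 = 2000
end program private_demo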

I'm still investigating.

@xylar (Contributor) commented Nov 30, 2022

Another possibility is this line:

do iEdge = nEdgesOwned+1, nEdgesArray(4)

It could be that 4 is out of range for nEdgesArray. If so, this would not be the only place where it is indexed out of bounds: the split implicit solver also indexes to config_num_halos + 1, which defaults to 4. I couldn't find any other code that indexes to this halo, so it could be that it isn't guaranteed to exist and doesn't exist for some reason (e.g. small mesh size?) in the baroclinic channel test case.
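
Purely as an illustration (a sketch assuming nEdgesArray holds cumulative edge counts per halo level, as the loop above implies), the hard-coded index could be clamped so it can never run past the end of the array:

! Hypothetical guard (haloLevel is an illustrative local integer): request
! halo level 4, but never index beyond what nEdgesArray actually provides.
haloLevel = min(4, size(nEdgesArray))
do iEdge = nEdgesOwned+1, nEdgesArray(haloLevel)
   ! ... same loop body as quoted above ...
enddo ! iEdge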

@xylar (Contributor) commented Nov 30, 2022

I'm going to quickly try rerunning that test case with an index of nEdgesArray(3) instead of nEdgesArray(4) to see if it passes. Then, we can figure out what the deal is.

@xylar (Contributor) commented Nov 30, 2022

Still fails with nEdgesArray(3) so that's not it.

@mark-petersen (Contributor, Author)

Well, I spent some time on this and did not figure it out. But it appears that layerThickEdgeFlux has a machine-precision mismatch between processors when using intel. If I set this in components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_split.F (line 739):

    layerThickEdgeFlux(:,1:nEdgesAll) = 50.0_RKIND

then I get a decomp test match for the baroclinic channel. This is not a solution, of course, because it overwrites the actual values in the array. I also tried some other things, like an extra halo update and rounding that array, but those didn't fix it.
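
For reference, a minimal sketch of the sort of rounding experiment mentioned above (an assumption about the approach, not the code that was actually tried): truncate the array to a fixed number of decimal places so that differences in the last few bits cannot influence the comparison.

! Hypothetical diagnostic only: round layerThickEdgeFlux to 8 decimal
! places; if the decomp test then matched, the mismatch would be confined
! to round-off level.
layerThickEdgeFlux(:,1:nEdgesAll) = &
   anint(layerThickEdgeFlux(:,1:nEdgesAll) * 1.0e8_RKIND) * 1.0e-8_RKIND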

@mark-petersen (Contributor, Author)

I finally found it. The computation of bottomDepthEdge was separated into two loops: one from 1:nEdgesOwned (with many other calculations) and another from nEdgesOwned+1:nEdgesArray(4) to finish up the halo. Apparently the intel optimized compiler changes the order of floating-point operations on its own. Moving the calculation of bottomDepthEdge entirely into the second loop, and looping over all edges, makes ocean/baroclinic_channel/10km/decomp_test pass:

+++ b/components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_split.F
@@ -775,8 +775,6 @@ module ocn_time_integration_split
-               bottomDepthEdge(iEdge) = thicknessSum &
-                  - 0.5_RKIND*(sshNew(cell1) + sshNew(cell2))

@@ -798,7 +798,7 @@ module ocn_time_integration_split
-            do iEdge = nEdgesOwned+1, nEdgesArray(4)
+            do iEdge = 1, nEdgesArray(4)

Whew!

On a side note, the hard-coded 4 ignores the number of halo layers, which can be set in the namelist. I'll actually change it to

            do iEdge = 1, nEdgesArray(size(nEdgesArray)-1)

which includes all edges within the halo, but not the outside edges of the last halo layer.
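
Putting the pieces together, here is a sketch of the single merged loop with that bound (variable names taken from the halo loop quoted earlier; the exact code is in the fix PR):

!$omp parallel
!$omp do schedule(runtime) &
!$omp private(cell1, cell2, k, thicknessSum)
do iEdge = 1, nEdgesArray(size(nEdgesArray)-1)
   cell1 = cellsOnEdge(1,iEdge)
   cell2 = cellsOnEdge(2,iEdge)
   thicknessSum = layerThickEdgeFlux(minLevelEdgeBot(iEdge),iEdge)
   do k = minLevelEdgeBot(iEdge)+1, maxLevelEdgeTop(iEdge)
      thicknessSum = thicknessSum + layerThickEdgeFlux(k,iEdge)
   enddo
   ! The vertical sum is built with the same loop structure for every edge
   ! on every rank, so the compiler applies the same reordering of the
   ! additions regardless of which partition owns the edge.
   bottomDepthEdge(iEdge) = thicknessSum &
      - 0.5_RKIND*(sshNew(cell1) + sshNew(cell2))
enddo ! iEdge
!$omp end do
!$omp end parallel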

@xylar (Contributor) commented Dec 6, 2022

Very nice detective work, @mark-petersen! I agree that we should not hard-code the halo size so I'm very happy with your recommended solution.

jonbob added a commit that referenced this issue Dec 13, 2022
Move bottomDepthEdge calculation to single loop over all edges (#5356)

After #5195 was merged, the MPAS-Ocean standalone test
ocean/baroclinic_channel/10km/decomp_test failed to match between 4 and
8 partitions, but only for intel optimized. All compass nightly suite
tests passed for gnu debug, gnu optimized, intel debug.

This PR solves the problem by merging the computation of bottomDepthEdge
into a single edge loop. Previously it was split into two loops:
1:nEdgesOwned (with many other calculations) and
nEdgesOwned+1:nEdgesArray(4). The intel optimized compiler must have
changed the order of operations in these two loops for different partitions.

Fixes #5219
[BFB]
jonbob closed this as completed in 0614e7b on Dec 14, 2022