Bugs when transforming mixed-precision kernels #2716

Open
arporter opened this issue Sep 17, 2024 · 6 comments
Assignees
Labels
bug in progress LFRic Issue relates to the LFRic domain NG-ARCH Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH

Comments

@arporter
Member

As found here: #2663 (comment)

The problem seems to be that we rename the interface rather than the actual subroutine that is being called.
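A toy model may make the distinction clearer. This is plain Python, not PSyclone's API: `ModuleModel`, both rename helpers and the `_0` suffix are invented for this sketch, and the routine names are borrowed from the mixed-precision test case. A generic interface maps one name to several precision-specific subroutines, so renaming only the interface leaves the subroutine that is actually called under its original name.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleModel:
    """Hypothetical stand-in for a Fortran module with a generic interface."""
    # interface name -> names of the specific routines it resolves to
    interfaces: dict = field(default_factory=dict)


def rename_interface_only(mod, old, new):
    """Rename the interface itself, leaving the implementations untouched."""
    mod.interfaces[new] = mod.interfaces.pop(old)


def rename_called_routine(mod, iface, target, new):
    """Rename just the implementation that the call resolves to."""
    mod.interfaces[iface] = [new if name == target else name
                             for name in mod.interfaces[iface]]


# Renaming the interface: the 64-bit routine keeps its old name.
mod = ModuleModel({"mixed_code": ["mixed_code_32", "mixed_code_64"]})
rename_interface_only(mod, "mixed_code", "mixed_code_0")
interface_only = dict(mod.interfaces)

# Renaming the routine that is actually called (assumed suffix "_0").
mod = ModuleModel({"mixed_code": ["mixed_code_32", "mixed_code_64"]})
rename_called_routine(mod, "mixed_code", "mixed_code_64", "mixed_code_64_0")
routine_renamed = dict(mod.interfaces)
```

In the first case any transformed copy of `mixed_code_64` would clash with, or be shadowed by, the untouched original; in the second the interface still resolves and only the chosen implementation changes.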

@arporter arporter added bug LFRic Issue relates to the LFRic domain NG-ARCH Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH labels Sep 17, 2024
@arporter arporter self-assigned this Sep 17, 2024
arporter added a commit that referenced this issue Sep 17, 2024
arporter added a commit that referenced this issue Sep 18, 2024
@arporter
Member Author

Fixing the interface renaming was relatively simple and I've now done that. Unfortunately, while testing I discovered that our 'clever' code that attempts to return the correct Routine PSyIR by looking at the signature takes no account of precision! This is because it uses LFRicTypes and it seems that that class doesn't account for precision. I can't remember what the thinking was here - I need to search through the Issues and see what I can find.

@arporter arporter changed the title Bug when transforming mixed-precision kernels Bugs when transforming mixed-precision kernels Sep 18, 2024
@arporter
Member Author

Sergi suggested trying to ModuleInline the kernel first and then apply ACCRoutineTrans to it. However, that doesn't help because KernelModuleInlineTrans uses the same broken code to get the PSyIR of the routine to inline.

@arporter
Member Author

arporter commented Sep 23, 2024

In fact, calling ACCRoutineTrans on the inlined kernel also fails because:

 Transformation Error: routine 'matrix_vector_code_r_double' calls another routine 'MATMUL(matrix(:,:,ik), x_e)' which is not available on the accelerator device and therefore cannot have ACCRoutineTrans applied to it (TODO #342).

I thought we had MATMUL tagged as being available on the GPU?

EDIT: it's not but perhaps it could be?

@arporter
Member Author

arporter commented Oct 2, 2024

I'm making good progress with the plumbing but am now hitting errors because of the plain Symbols that are created in a kernel's symbol table for all of the reserved names. I need to work out why that isn't a problem on master.

arporter added a commit that referenced this issue Oct 2, 2024
arporter added a commit that referenced this issue Oct 3, 2024
arporter added a commit that referenced this issue Oct 3, 2024
@arporter
Member Author

arporter commented Oct 3, 2024

While working on this, I found this test (in lfric_kern_test.py):

def test_get_kernel_schedule_mixed_precision():
    '''
    Test that we can get the correct schedule for a mixed-precision kernel.

    '''
    api_config = Config.get().api_conf(TEST_API)
    _, invoke = get_invoke("26.8_mixed_precision_args.f90", TEST_API,
                           name="invoke_0", dist_mem=False)
    sched = invoke.schedule
    kernels = sched.walk(LFRicKern, stop_type=LFRicKern)
    # 26.8 contains an invoke of five kernels, one each at the following
    # precisions.
    kernel_precisions = ["r_def", "r_solver", "r_tran", "r_bl", "r_phys"]
    # Get the precision (in bytes) for each of these.
    precisions = [api_config.precision_map[name] for
                  name in kernel_precisions]
    # Check that the correct kernel implementation is obtained for each
    # one in the invoke.
    for precision, kern in zip(precisions, kernels):
        sched = kern.get_kernel_schedule()
        assert isinstance(sched, KernelSchedule)
        assert sched.name == f"mixed_code_{8*precision}"

so the precision-matching clearly does work in some situations. It may be that it simply has a bug for scalar arguments?
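For reference, the name the test expects is just the argument precision converted from bytes to bits. A minimal sketch of that arithmetic follows; the byte sizes in `assumed_precision_map` are assumptions for illustration (the real values come from the PSyclone configuration), and `expected_routine_name` is a hypothetical helper.

```python
# Assumed byte sizes for the LFRic kind names; the real values are read
# from the PSyclone configuration and may differ.
assumed_precision_map = {"r_def": 8, "r_solver": 4, "r_tran": 8,
                         "r_bl": 8, "r_phys": 8}


def expected_routine_name(kind_name, precision_map):
    """Build the routine name the test expects for a given kind name."""
    n_bytes = precision_map[kind_name]
    # The test uses the precision in bits as the routine-name suffix.
    return f"mixed_code_{8 * n_bytes}"


names = {kind: expected_routine_name(kind, assumed_precision_map)
         for kind in assumed_precision_map}
```

Under these assumed sizes, `r_def` maps to `mixed_code_64` and `r_solver` to `mixed_code_32`.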

arporter added a commit that referenced this issue Oct 3, 2024
arporter added a commit that referenced this issue Oct 3, 2024
arporter added a commit that referenced this issue Oct 3, 2024
arporter added a commit that referenced this issue Oct 3, 2024
arporter added a commit that referenced this issue Oct 7, 2024
arporter added a commit that referenced this issue Oct 8, 2024
arporter added a commit that referenced this issue Oct 8, 2024
arporter added a commit that referenced this issue Oct 8, 2024
arporter added a commit that referenced this issue Oct 8, 2024
arporter added a commit that referenced this issue Oct 8, 2024
@arporter
Member Author

arporter commented Oct 9, 2024

For the record, the code that we used to use to attempt to resolve the correct routine implementation was:

        # The kernel name corresponds to an interface block. Find which
        # of the routines matches the precision of the arguments.
        for routine in routines:
            try:
                # The validity check for the kernel arguments should raise
                # an exception if the precisions don't match but currently
                # this fails for scalar arguments.
                self.validate_kernel_code_args(routine.symbol_table)
                sched = routine
                break
            except GenerationError:
                pass
        else:
            raise GenerationError(
                f"Failed to find a kernel implementation with an interface"
                f" that matches the invoke of '{self.name}'. (Tried "
                f"routines {[item.name for item in routines]}.)")

I'm putting this here as I'm going to remove it from the code base (LFRicKern.get_kernel_schedule()) for the moment.
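A self-contained toy version of that loop shows how a validity check that skips scalars would lead to the wrong routine being chosen. Everything here is hypothetical: the `GenerationError` class is a local stand-in, `validate_args` and `pick_routine` are made-up helpers, and the `check_scalars` flag exists only to mimic the suspected gap. It is a sketch of one way the failure could arise, not the PSyclone implementation.

```python
class GenerationError(Exception):
    """Local stand-in for PSyclone's exception of the same name."""


def validate_args(expected, routine_args, check_scalars):
    """Raise GenerationError if any checked argument precision differs.

    Each argument is a (kind, precision_in_bytes) pair where kind is
    "field" or "scalar". With check_scalars=False scalars are skipped,
    mimicking the suspected gap in the validity check.
    """
    for (kind, want), (_, got) in zip(expected, routine_args):
        if kind == "scalar" and not check_scalars:
            continue
        if want != got:
            raise GenerationError(f"{kind} precision {got} != {want}")


def pick_routine(expected, routines, check_scalars):
    """Return the name of the first routine that passes validation."""
    for name, args in routines.items():
        try:
            validate_args(expected, args, check_scalars)
            break
        except GenerationError:
            pass
    else:
        # Only reached if no candidate passed validation.
        raise GenerationError(f"No match among {list(routines)}")
    return name


# Hypothetical interface with one scalar argument at two precisions.
routines = {"scalar_code_32": [("scalar", 4)],
            "scalar_code_64": [("scalar", 8)]}
call_site = [("scalar", 8)]

buggy = pick_routine(call_site, routines, check_scalars=False)
fixed = pick_routine(call_site, routines, check_scalars=True)
```

When scalars are skipped, validation never raises, so the first candidate (`scalar_code_32`) is accepted regardless of the call site; with scalars checked, the loop moves on to `scalar_code_64`.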
