Replies: 1 comment
-
After further investigation, I've managed to achieve the behavior I was looking for, using the following script:

```python
import jax
import numpy as np
import jax.numpy as jnp
import functools
from absl import app
from absl import flags
from jax.experimental.shard_map import shard_map
from jax.experimental import mesh_utils, multihost_utils
from jax.sharding import Mesh, PartitionSpec as P, NamedSharding

flags.DEFINE_string("server_addr", "", help="server ip addr")
flags.DEFINE_integer("num_hosts", 1, help="num of hosts")
flags.DEFINE_integer("host_idx", 0, help="index of current host")
FLAGS = flags.FLAGS


def f(x):
    return x


def main(argv):
    jax.distributed.initialize(FLAGS.server_addr, FLAGS.num_hosts, FLAGS.host_idx)
    devices = jax.devices()
    local_devices = jax.local_devices()
    print("host_idx:", FLAGS.host_idx)
    print("devices:", devices)
    print("local_devices:", local_devices)

    # A mesh over ALL devices (both hosts), not just the local ones.
    mesh = Mesh(np.array(devices), ("i",))
    sharding = NamedSharding(mesh, P("i"))
    replicated_sharding = NamedSharding(mesh, P())

    # Each host builds its local slice, then assembles the global array.
    x = 8 * FLAGS.host_idx + jnp.arange(8)
    global_array = multihost_utils.host_local_array_to_global_array(x, mesh, P("i"))
    x_s = shard_map(f, mesh, in_specs=P("i"), out_specs=P("i"))(global_array)
    print("x:", x)
    print(jax.debug.visualize_array_sharding(x))
    print("x_s", multihost_utils.process_allgather(x_s))
    print(jax.debug.visualize_array_sharding(x_s))

    @functools.partial(
        shard_map,
        mesh=mesh,
        in_specs=P("i"),
        out_specs=P("i"),
    )
    def psum_data(data):
        return jax.lax.psum(data, "i")

    p_sum_out = psum_data(global_array)
    print("devices buffers x_s", [shard.data for shard in x_s.addressable_shards])
    print(
        "devices buffers p_sum_out",
        [shard.data for shard in p_sum_out.addressable_shards],
    )
    print(
        "output after taking psum (global_array):",
        multihost_utils.process_allgather(p_sum_out),
    )
    print(
        "output after taking psum (local_array):",
        multihost_utils.global_array_to_host_local_array(p_sum_out, mesh, P("i")),
    )


if __name__ == "__main__":
    app.run(main)
```

The key changes that made this work:
While this approach works, I have some additional questions:
Any insights or best practices would be greatly appreciated. Thank you!
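For anyone who wants to try this pattern without a multi-host cluster, the cross-device `psum` under `shard_map` can be emulated on a single machine by asking XLA for extra CPU devices. This is my own sketch, not part of the script above; the `XLA_FLAGS` trick and the `arange(16)` data are assumptions for illustration:

```python
import os

# Emulate 8 devices on one CPU host (must be set before importing jax).
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import jax
import jax.numpy as jnp
import numpy as np
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

# One mesh axis "i" spanning all 8 (emulated) devices.
mesh = Mesh(np.array(jax.devices()), ("i",))

# 16 elements, sharded 2-per-device along axis "i".
x = jnp.arange(16)
out = shard_map(
    lambda v: jax.lax.psum(v, "i"), mesh=mesh, in_specs=P("i"), out_specs=P("i")
)(x)
# After psum, every shard holds the element-wise sum over all devices: [56, 64].
```

On a real 2-host setup the same structure applies, except that the global array must first be assembled with `multihost_utils.host_local_array_to_global_array`, as in the script above.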
-
Hi everyone! I'm exploring the differences between the `shard_map` and `pmap` functionalities in JAX, particularly in a multi-host setting. I've encountered some behavior that I'd like to understand better and potentially find a solution for.

**Setup: Multi-Host Environment**

Consider a setup with 2 hosts, each having 4 devices.
**Example 1: Using `pmap`**

Here's a basic script using `pmap`:

With `pmap`, the final output of taking `psum` yields `[56 64]` across all shards, as expected.

**Example 2: Attempting to Use `shard_map`**

Now, I tried to achieve the same result using `shard_map`:

**Observed Behavior and Questions**
With `shard_map`, I'm getting different outputs for the shards on each host:

- `[12 16]`
- `[44 48]`
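These per-host numbers line up exactly with a `psum` that runs only over each host's four local devices. A quick NumPy check of that arithmetic (assuming, for illustration, global data equivalent to `arange(16)` laid out two elements per device, with each host owning four consecutive devices):

```python
import numpy as np

x = np.arange(16).reshape(8, 2)  # 8 devices, 2 elements each

global_psum = x.sum(axis=0)     # psum over all 8 devices  -> [56 64]
host0_psum = x[:4].sum(axis=0)  # psum over host 0's devices -> [12 16]
host1_psum = x[4:].sum(axis=0)  # psum over host 1's devices -> [44 48]
```

So the `[56 64]` result requires the reduction to span both hosts, while `[12 16]` / `[44 48]` indicate two independent local reductions.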
It seems like the shards are not aware of each other across hosts when using `shard_map`.

**Questions**
- Is `shard_map` compatible with the behavior I want, i.e., to perform operations across all devices on all hosts? Or is it better to stick to `pmap` for this functionality?
- … `shard_map`?

Any insights or suggestions would be greatly appreciated. Thanks in advance for your help!
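For reference, the `pmap` result described above can be reproduced on a single host by emulating 8 CPU devices. This is a sketch only; the `XLA_FLAGS` trick and the data layout are my assumptions, inferred from the reported `[56 64]` result rather than taken from the original script:

```python
import os

# Emulate 8 devices on one CPU host (must be set before importing jax).
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import jax
import jax.numpy as jnp

x = jnp.arange(16).reshape(8, 2)  # one row per (emulated) device

# psum over the mapped axis: every device ends up with the same totals.
out = jax.pmap(lambda v: jax.lax.psum(v, "i"), axis_name="i")(x)
# out[k] == [56, 64] for every device k
```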