
[WIP] Improve multi-GPU BFS performance #4619

Draft · wants to merge 95 commits into base: branch-24.10

Commits (95)
a0d1f01
add a create_graph_from_edgelist function that takes edge list in mul…
seunghwak Jul 15, 2024
55513ae
update R-mat graph generators to generate edge list in multiple chunks
seunghwak Jul 15, 2024
2163fd8
Merge branch 'branch-24.08' of https://github.com/rapidsai/cugraph in…
seunghwak Jul 15, 2024
a9dfb92
fix build error
seunghwak Jul 15, 2024
e7b33ca
delete unused functions
seunghwak Jul 15, 2024
27ea550
fix build errors
seunghwak Jul 16, 2024
e5e8257
add temporary performance measurement code
seunghwak Jul 19, 2024
7ec5b08
add code to broadcast frontier using a bitmap
seunghwak Jul 19, 2024
d6123ba
resolve merge conflicts
seunghwak Jul 19, 2024
81f51c1
fix build error
seunghwak Jul 19, 2024
69cb4f9
update dataframe buffer utilities
seunghwak Jul 21, 2024
6adcccb
reduce # resizes
seunghwak Jul 21, 2024
bfe21fc
remove debug statement
seunghwak Jul 22, 2024
446435b
rename VertexFrontierBucketType to KeyBucketType
seunghwak Jul 22, 2024
df463ce
update per_v_transform_reduce_incoming|outgoing_e to support reduce_o…
seunghwak Jul 25, 2024
222148d
update kernels to take KeyIterator key_first & key_last
seunghwak Jul 26, 2024
effc69c
update per_v_transform_reduce_incoming_outgoing_e to support key list
seunghwak Jul 28, 2024
d537290
remove pred_op.cuh
seunghwak Jul 29, 2024
cf92885
update per_v_transform_reduce_incoming|outgoing_e to take a predicate
seunghwak Jul 29, 2024
b8d846c
Merge branch 'branch-24.10' of https://github.com/rapidsai/cugraph in…
seunghwak Aug 9, 2024
50ffc33
Merge branch 'branch-24.10' of https://github.com/rapidsai/cugraph in…
seunghwak Aug 9, 2024
a6476b9
split per_v_transform_reduce_incoming_outgoing_e implementation to tw…
seunghwak Aug 9, 2024
ec24758
implement per_v_transform_reduce_if_incoming|outgoing_e
seunghwak Aug 10, 2024
df751e7
update BFS to use per_v_transform_reduce_if_outgoing_e
seunghwak Aug 13, 2024
4661b9b
file rename
seunghwak Aug 13, 2024
0951741
remove transform_reduce_v_frontier_outgoing_e_by_src (this can be bet…
seunghwak Aug 13, 2024
796b928
Merge branch 'branch-24.10' of https://github.com/rapidsai/cugraph in…
seunghwak Aug 13, 2024
7b98e3a
code cleanup, add few FIXMEs to improve performance, and add performa…
seunghwak Aug 15, 2024
3f77ee1
performance tuning for BFS
seunghwak Aug 16, 2024
bb75771
add a utility to find iterator type in dataframe buffer
seunghwak Aug 17, 2024
cfce7bc
minor performance tuning
seunghwak Aug 17, 2024
75d6151
delete unused code
seunghwak Aug 18, 2024
0f88988
add an option to skip edge shuffling in R-mat edge list generation
seunghwak Aug 18, 2024
180ece1
Merge branch 'branch-24.10' of https://github.com/rapidsai/cugraph in…
seunghwak Aug 18, 2024
2efb51e
fix build error
seunghwak Aug 19, 2024
106a6ad
fix documentation error
seunghwak Aug 19, 2024
98419cb
add a query function
seunghwak Aug 19, 2024
8193a91
Merge branch 'branch-24.10' of https://github.com/rapidsai/cugraph in…
seunghwak Aug 20, 2024
9625e0c
bug fix
seunghwak Aug 20, 2024
03cfe0d
bug fix
seunghwak Aug 20, 2024
c15305f
bug fixes
seunghwak Aug 21, 2024
4a1f150
bug fix
seunghwak Aug 21, 2024
29b6834
replace offset vector communication with local computing
seunghwak Aug 21, 2024
fcc75e0
add tmp perf measurement code
seunghwak Aug 21, 2024
710eb88
map GPUs on minor_comm to consecutive GPUs
seunghwak Aug 21, 2024
d040110
additional performance tuning
seunghwak Aug 22, 2024
ca816dd
add a utility function
seunghwak Aug 22, 2024
7712c38
fix build error
seunghwak Aug 22, 2024
31a5955
fix build error
seunghwak Aug 22, 2024
ac33784
bug fix
seunghwak Aug 23, 2024
6d8c7ef
perf experiment
seunghwak Aug 23, 2024
6bcdbe7
perf measurement code update
seunghwak Aug 23, 2024
3a950a5
rename [vertex_first, vertex_last) in fill|update_edge_src|dst_proper…
seunghwak Aug 23, 2024
d27a5e3
update fill|update_edge_minor_property to optionally use bitmap to br…
seunghwak Aug 24, 2024
97022f5
add missing includes
seunghwak Aug 24, 2024
ecf76f8
specialization for bool
seunghwak Aug 24, 2024
350f17e
add asynchronous copy_if
seunghwak Aug 27, 2024
93f726f
fix implicit synchronization in multi-stream execution
seunghwak Aug 27, 2024
d022c30
fix implicit synchronizations for multi-stream execution
seunghwak Aug 28, 2024
e53b3b8
delete debug prints
seunghwak Aug 28, 2024
b6e4f28
fix erroneous comments (numbering)
seunghwak Aug 29, 2024
be504cc
reduce memory footprint in graph creation
seunghwak Aug 30, 2024
3b151e0
undo temporary change for benchmarking
seunghwak Aug 30, 2024
ad0c879
update comments
seunghwak Aug 30, 2024
ce4ea93
cosmetic updates
seunghwak Aug 30, 2024
743ebf1
resolve merge conflicts
seunghwak Aug 30, 2024
9445027
update renumbering to use binning in more places
seunghwak Sep 3, 2024
bea1498
Merge branch 'branch-24.10' of https://github.com/rapidsai/cugraph in…
seunghwak Sep 3, 2024
70b1108
Merge branch 'upstream_pr4642' into enh_bfs_mg
seunghwak Sep 3, 2024
28641a6
update detail::extract_transform_v_frontier_e to use multiple CUDA st…
seunghwak Sep 3, 2024
05df778
exec_policy=>exec_policy_nosync
seunghwak Sep 3, 2024
5c4e3bd
performance-tweak detail::extract_transform_v_frontier_e
seunghwak Sep 4, 2024
dc44a7d
update comments
seunghwak Sep 4, 2024
ebcbfb7
improve stream concurrency
seunghwak Sep 4, 2024
3652c33
update copy_if_nosync to take a pointer to store the counter
seunghwak Sep 5, 2024
c689e35
temporary parameter setting for benchmarking
seunghwak Sep 9, 2024
f7b061b
bug fix
seunghwak Sep 9, 2024
20e1c74
add sun_nosync for multi stream execution
seunghwak Sep 9, 2024
9fa4fb4
pre-filter keys
seunghwak Sep 9, 2024
b6a1fb0
multi-stream execution
seunghwak Sep 9, 2024
3f71304
more performance logs
seunghwak Sep 9, 2024
3577699
update logging
seunghwak Sep 10, 2024
42c4d0b
use global comm to shuffle in compute_renumber_map (to avoid P2P buff…
seunghwak Sep 10, 2024
2557668
reduce small memory allocations
seunghwak Sep 11, 2024
0381f22
bug fix
seunghwak Sep 11, 2024
eb822da
temporarily store vertex IDs in 48 bit to cut peak memory usage
seunghwak Sep 12, 2024
a067f08
update v_list bitmap bcast
seunghwak Sep 13, 2024
6c9118e
undo a flag
seunghwak Sep 15, 2024
20721e6
peak memory usage
seunghwak Sep 19, 2024
9d002c5
use approximation in switching between topdown & bottomup
seunghwak Sep 23, 2024
9e3574e
update logging
seunghwak Sep 23, 2024
07749f4
peak memory usage
seunghwak Sep 25, 2024
4ddd0a1
improve logging
seunghwak Sep 25, 2024
3bb6602
NCCL bug workaround
seunghwak Sep 25, 2024
8be2a3f
temporary parameter tweaks for testing
seunghwak Sep 25, 2024
24 changes: 12 additions & 12 deletions cpp/include/cugraph/edge_partition_device_view.cuh
@@ -255,7 +255,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
rmm::device_uvector<edge_t> local_degrees(this->major_range_size(), stream);
if (dcs_nzd_vertices_) {
assert(major_hypersparse_first_);
thrust::transform(rmm::exec_policy(stream),
thrust::transform(rmm::exec_policy_nosync(stream),
thrust::make_counting_iterator(this->major_range_first()),
thrust::make_counting_iterator(this->major_range_last()),
local_degrees.begin(),
@@ -266,7 +266,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
major_hypersparse_first_.value_or(vertex_t{0})});
} else {
thrust::transform(
rmm::exec_policy(stream),
rmm::exec_policy_nosync(stream),
thrust::make_counting_iterator(this->major_range_first()),
thrust::make_counting_iterator(this->major_range_last()),
local_degrees.begin(),
@@ -284,7 +284,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
rmm::device_uvector<edge_t> local_degrees(thrust::distance(major_first, major_last), stream);
if (dcs_nzd_vertices_) {
assert(major_hypersparse_first_);
thrust::transform(rmm::exec_policy(stream),
thrust::transform(rmm::exec_policy_nosync(stream),
major_first,
major_last,
local_degrees.begin(),
@@ -295,7 +295,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
major_hypersparse_first_.value_or(vertex_t{0})});
} else {
thrust::transform(
rmm::exec_policy(stream),
rmm::exec_policy_nosync(stream),
major_first,
major_last,
local_degrees.begin(),
@@ -355,7 +355,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
if (dcs_nzd_vertices_) {
assert(major_hypersparse_first_);
thrust::transform(
rmm::exec_policy(stream),
rmm::exec_policy_nosync(stream),
thrust::make_counting_iterator(this->major_range_first()),
thrust::make_counting_iterator(this->major_range_last()),
local_degrees.begin(),
@@ -368,7 +368,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
mask_first});
} else {
thrust::transform(
rmm::exec_policy(stream),
rmm::exec_policy_nosync(stream),
thrust::make_counting_iterator(this->major_range_first()),
thrust::make_counting_iterator(this->major_range_last()),
local_degrees.begin(),
@@ -394,7 +394,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
if (dcs_nzd_vertices_) {
assert(major_hypersparse_first_);
thrust::transform(
rmm::exec_policy(stream),
rmm::exec_policy_nosync(stream),
major_first,
major_last,
local_degrees.begin(),
@@ -407,7 +407,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
mask_first});
} else {
thrust::transform(
rmm::exec_policy(stream),
rmm::exec_policy_nosync(stream),
major_first,
major_last,
local_degrees.begin(),
@@ -577,7 +577,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
__host__ rmm::device_uvector<edge_t> compute_local_degrees(rmm::cuda_stream_view stream) const
{
rmm::device_uvector<edge_t> local_degrees(this->major_range_size(), stream);
thrust::transform(rmm::exec_policy(stream),
thrust::transform(rmm::exec_policy_nosync(stream),
thrust::make_counting_iterator(this->major_range_first()),
thrust::make_counting_iterator(this->major_range_last()),
local_degrees.begin(),
@@ -595,7 +595,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
rmm::cuda_stream_view stream) const
{
rmm::device_uvector<edge_t> local_degrees(thrust::distance(major_first, major_last), stream);
thrust::transform(rmm::exec_policy(stream),
thrust::transform(rmm::exec_policy_nosync(stream),
major_first,
major_last,
local_degrees.begin(),
@@ -638,7 +638,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
{
rmm::device_uvector<edge_t> local_degrees(this->major_range_size(), stream);
thrust::transform(
rmm::exec_policy(stream),
rmm::exec_policy_nosync(stream),
thrust::make_counting_iterator(this->major_range_first()),
thrust::make_counting_iterator(this->major_range_last()),
local_degrees.begin(),
@@ -660,7 +660,7 @@ class edge_partition_device_view_t<vertex_t, edge_t, multi_gpu, std::enable_if_t
{
rmm::device_uvector<edge_t> local_degrees(thrust::distance(major_first, major_last), stream);
thrust::transform(
rmm::exec_policy(stream),
rmm::exec_policy_nosync(stream),
major_first,
major_last,
local_degrees.begin(),
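Every change in this file swaps `rmm::exec_policy` for `rmm::exec_policy_nosync`, which skips the implicit stream synchronization after each Thrust algorithm so that independent work on multiple streams can overlap. A minimal host-side analogue of the pattern (assumption: this is not the rmm API, only the synchronization idea — enqueue several independent operations, then wait once at a single explicit sync point):

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Each call returns a future immediately (the "nosync" enqueue) instead of
// blocking until the work finishes (the implicit-sync behavior).
std::future<long long> enqueue_sum(std::vector<int> const& v)
{
  return std::async(std::launch::async,
                    [&v] { return std::accumulate(v.begin(), v.end(), 0LL); });
}

long long overlapped_sums(std::vector<int> const& a, std::vector<int> const& b)
{
  auto fa = enqueue_sum(a);    // in flight
  auto fb = enqueue_sum(b);    // in flight, overlaps with fa
  return fa.get() + fb.get();  // single explicit sync point
}
```

With `exec_policy_nosync` the caller takes on the same responsibility: synchronize the stream before reading results produced by the enqueued algorithms.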
3 changes: 2 additions & 1 deletion cpp/include/cugraph/graph_functions.hpp
@@ -1114,7 +1114,8 @@ shuffle_external_vertex_value_pairs(raft::handle_t const& handle,
* @param edge_ids Optional list of edge ids
* @param edge_types Optional list of edge types
* @return Tuple of vectors storing edge sources, destinations, optional weights,
* optional edge ids, optional edge types mapped to this GPU.
* optional edge ids, optional edge types mapped to this GPU and a vector storing the
* number of edges received from each GPU.
*/
template <typename vertex_t, typename edge_t, typename weight_t, typename edge_type_t>
std::tuple<rmm::device_uvector<vertex_t>,
2 changes: 1 addition & 1 deletion cpp/include/cugraph/graph_view.hpp
@@ -243,7 +243,7 @@ namespace detail {
// use (key, value) pairs to store source/destination properties if (unique edge
// sources/destinations) over (V / major_comm_size|minor_comm_size) is smaller than the threshold
// value
double constexpr edge_partition_src_dst_property_values_kv_pair_fill_ratio_threshold = 0.1;
double constexpr edge_partition_src_dst_property_values_kv_pair_fill_ratio_threshold = 0.0; // FIXME: just for benchmarking

// FIXME: threshold values require tuning
// use the hypersparse format (currently, DCSR or DCSC) for the vertices with their degrees smaller
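The threshold above gates whether edge-partition source/destination property values are stored as (key, value) pairs; lowering it to 0.0 for benchmarking disables that representation entirely. A sketch of the decision (assumption: `use_kv_pairs` is a hypothetical helper for illustration — the real check lives in the property-fill code paths):

```cpp
#include <cassert>

// Upstream default; the diff temporarily forces this to 0.0 for benchmarking.
constexpr double kv_pair_fill_ratio_threshold = 0.1;

// Use (key, value) pairs when the number of unique edge sources/destinations
// divided by the per-GPU vertex-range size (V / comm_size) falls below the
// threshold, i.e. when the dense per-vertex array would be mostly unused.
bool use_kv_pairs(double num_unique, double num_vertices, int comm_size)
{
  double fill_ratio = num_unique / (num_vertices / comm_size);
  return fill_ratio < kv_pair_fill_ratio_threshold;
}
```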
24 changes: 24 additions & 0 deletions cpp/include/cugraph/partition_manager.hpp
@@ -71,6 +71,30 @@ class partition_manager {
: (major_comm_rank * minor_comm_size + minor_comm_rank);
}

#ifdef __CUDACC__
__host__ __device__
#endif
static int
compute_major_comm_rank_from_global_comm_rank(int major_comm_size,
int minor_comm_size,
int comm_rank)
{
return map_major_comm_to_gpu_row_comm ? comm_rank % major_comm_size
: comm_rank / minor_comm_size;
}

#ifdef __CUDACC__
__host__ __device__
#endif
static int
compute_minor_comm_rank_from_global_comm_rank(int major_comm_size,
int minor_comm_size,
int comm_rank)
{
return map_major_comm_to_gpu_row_comm ? comm_rank / major_comm_size
: comm_rank % minor_comm_size;
}

#ifdef __CUDACC__
__host__ __device__
#endif
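The two new functions invert the global-rank mapping with local arithmetic instead of communication. A standalone sketch of the `map_major_comm_to_gpu_row_comm == true` branch (assumption: in that layout a global rank decomposes as `comm_rank = minor_comm_rank * major_comm_size + major_comm_rank`, matching the modulo/division pair above):

```cpp
#include <cassert>

// major-comm rank: position within a GPU row
int major_comm_rank_from_global(int major_comm_size, int comm_rank)
{
  return comm_rank % major_comm_size;
}

// minor-comm rank: which GPU row the global rank falls in
int minor_comm_rank_from_global(int major_comm_size, int comm_rank)
{
  return comm_rank / major_comm_size;
}
```

Because each GPU can evaluate this for any peer's global rank, kernels and host code can derive communicator coordinates without exchanging offset vectors — the motivation behind the "replace offset vector communication with local computing" commit.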
73 changes: 47 additions & 26 deletions cpp/include/cugraph/utilities/dataframe_buffer.hpp
@@ -82,6 +82,53 @@ auto allocate_dataframe_buffer(size_t buffer_size, rmm::cuda_stream_view stream_
std::make_index_sequence<tuple_size>(), buffer_size, stream_view);
}

template <typename T>
struct dataframe_buffer_type {
using type = decltype(allocate_dataframe_buffer<T>(size_t{0}, rmm::cuda_stream_view{}));
};

template <typename T>
using dataframe_buffer_type_t = typename dataframe_buffer_type<T>::type;

template <typename T>
std::optional<dataframe_buffer_type_t<T>> try_allocate_dataframe_buffer(
size_t buffer_size, rmm::cuda_stream_view stream_view)
{
try {
return allocate_dataframe_buffer<T>(buffer_size, stream_view);
} catch (std::exception const& e) {
return std::nullopt;
}
}

template <typename T>
struct dataframe_buffer_iterator_type {
using type = typename rmm::device_uvector<T>::iterator;
};

template <typename... Ts>
struct dataframe_buffer_iterator_type<thrust::tuple<Ts...>> {
using type = thrust::zip_iterator<thrust::tuple<typename rmm::device_uvector<Ts>::iterator...>>;
};

template <typename T>
using dataframe_buffer_iterator_type_t = typename dataframe_buffer_iterator_type<T>::type;

template <typename T>
struct dataframe_buffer_const_iterator_type {
using type = typename rmm::device_uvector<T>::const_iterator;
};

template <typename... Ts>
struct dataframe_buffer_const_iterator_type<thrust::tuple<Ts...>> {
using type =
thrust::zip_iterator<thrust::tuple<typename rmm::device_uvector<Ts>::const_iterator...>>;
};

template <typename T>
using dataframe_buffer_const_iterator_type_t =
typename dataframe_buffer_const_iterator_type<T>::type;

template <typename BufferType>
void reserve_dataframe_buffer(BufferType& buffer,
size_t new_buffer_capacity,
@@ -206,30 +253,4 @@ auto get_dataframe_buffer_cend(BufferType& buffer)
std::make_index_sequence<std::tuple_size<BufferType>::value>(), buffer);
}

template <typename T>
struct dataframe_buffer_value_type {
using type = void;
};

template <typename T>
struct dataframe_buffer_value_type<rmm::device_uvector<T>> {
using type = T;
};

template <typename... Ts>
struct dataframe_buffer_value_type<std::tuple<rmm::device_uvector<Ts>...>> {
using type = thrust::tuple<Ts...>;
};

template <typename BufferType>
using dataframe_buffer_value_type_t = typename dataframe_buffer_value_type<BufferType>::type;

template <typename T>
struct dataframe_buffer_type {
using type = decltype(allocate_dataframe_buffer<T>(size_t{0}, rmm::cuda_stream_view{}));
};

template <typename T>
using dataframe_buffer_type_t = typename dataframe_buffer_type<T>::type;

} // namespace cugraph
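The new `dataframe_buffer_iterator_type` traits dispatch on scalar versus `thrust::tuple` element types: a scalar buffer yields its `device_uvector` iterator, while a tuple buffer yields a zip of per-column iterators. A host-only analogue of the same specialization pattern, with `std::vector` standing in for `rmm::device_uvector` and `std::tuple` for `thrust::tuple` (assumption: the names here are illustrative, not the cugraph API):

```cpp
#include <tuple>
#include <type_traits>
#include <vector>

template <typename T>
struct buffer_iterator_type {  // scalar case: one column, its iterator
  using type = typename std::vector<T>::iterator;
};

template <typename... Ts>
struct buffer_iterator_type<std::tuple<Ts...>> {  // tuple case: zip of column iterators
  using type = std::tuple<typename std::vector<Ts>::iterator...>;
};

template <typename T>
using buffer_iterator_type_t = typename buffer_iterator_type<T>::type;

static_assert(
    std::is_same_v<buffer_iterator_type_t<int>, std::vector<int>::iterator>);
static_assert(
    std::is_same_v<buffer_iterator_type_t<std::tuple<int, float>>,
                   std::tuple<std::vector<int>::iterator,
                              std::vector<float>::iterator>>);
```

This lets callers name a buffer's iterator type at compile time (e.g. in function signatures) without first materializing a buffer, which is what `detail::extract_transform_v_frontier_e` and the BFS kernels need when taking `key_first`/`key_last` iterator pairs.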