
Vulkan AS rebuild-on-replay: Copy AS input buffers on build #3426

Open · wants to merge 2 commits into base: v1.x
Conversation

cmannett85-arm
Collaborator

When vkCmdBuildAccelerationStructuresKHR is called, the input buffers used for the build are copied into host-accessible memory, so that when initial-state serialisation is triggered the data can be copied into the capture file.

To unify the behaviour of the different geometry types, there is only one readback buffer per AS; in the case of triangle data it contains the vertex, index, and transform data concatenated into a single contiguous block.
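To illustrate the single-readback-buffer idea, here is a minimal sketch (names and the alignment value are hypothetical, not RenderDoc's actual code) of how offsets for the concatenated vertex, index, and transform ranges could be computed, with each range aligned up so subsequent copies start at a legal offset:

```cpp
#include <cstdint>

// Illustrative only: layout of one contiguous readback block holding the
// vertex, index and transform data for a triangle-geometry AS build.
struct TriangleReadbackLayout
{
  uint64_t vertexOffset;
  uint64_t indexOffset;
  uint64_t transformOffset;
  uint64_t totalSize;
};

// Round `value` up to a multiple of `alignment` (power of two).
constexpr uint64_t AlignUp(uint64_t value, uint64_t alignment)
{
  return (value + alignment - 1) & ~(alignment - 1);
}

// Place the three ranges back-to-back, aligning each range's start.
// The 16-byte alignment here is an assumption for the sketch.
constexpr TriangleReadbackLayout ComputeLayout(uint64_t vertexSize, uint64_t indexSize,
                                               uint64_t transformSize, uint64_t alignment = 16)
{
  TriangleReadbackLayout l = {};
  l.vertexOffset = 0;
  l.indexOffset = AlignUp(vertexSize, alignment);
  l.transformOffset = AlignUp(l.indexOffset + indexSize, alignment);
  l.totalSize = l.transformOffset + transformSize;
  return l;
}
```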

The buffer data, along with the build arguments, is stored in a VkAccelerationStructureInfo object owned by the AS's VkResourceRecord. If this record is freed, then the copied buffer memory is also freed. If this destruction occurs during the active capture phase, the AS info is kept alive until the end of the active capture.

Serialisation of the initial state will follow in later patches.

Caveats:

  • vkCmdBuildAccelerationStructuresIndirectKHR is not handled
  • VkAccelerationStructureGeometryInstancesDataKHR::arrayOfPointers mode is not handled

Owner

@baldurk baldurk left a comment

As a general note since I can't put this on the specific line, remember to disable the indirect build physical device feature.

It's fine not to support indirect builds, as it seems few drivers do either, and I don't think anyone will use it given the feature is not supported in D3D12. But I think it should be technically possible, as it looks like we get bounds on the data size on the CPU? The main challenge will be doing the copy; we might have to fall back to a compute shader or something and do it inefficiently.

Files with review threads (all outdated, resolved):

  • renderdoc/driver/vulkan/wrappers/vk_misc_funcs.cpp
  • renderdoc/driver/vulkan/vk_acceleration_structure.h (2 threads)
  • renderdoc/driver/vulkan/vk_acceleration_structure.cpp (6 threads)
Additionally, the spec does not mandate that the compaction size is the same between capture and replay, so compaction is effectively disabled by always returning the full AS size when the compacted size is queried.
Owner

@baldurk baldurk left a comment

The input buffer copying looks good. I've made some notes about the query pool handling, though: there's one wrinkle I hadn't thought of which makes it more complex, but hopefully there are some ways it can be simplified so it's not too bad overall.

RDResult CopyInputBuffers(VkCommandBuffer commandBuffer,
                          const VkAccelerationStructureBuildGeometryInfoKHR &info,
                          const VkAccelerationStructureBuildRangeInfoKHR *buildRange,
                          CaptureState state);
Owner

This state seems redundant here - we don't do this during replay, and we don't need to differentiate between background and active capturing. From a quick look through, unless I'm missing something, this value is (ultimately) not used.

Same applies below to CopyAccelerationStructure.


void QueryPoolInfo::Reset(uint32_t firstQuery, uint32_t queryCount)
{
  m_Entries.removeIf([&](const auto &entry) {
Owner

Please avoid auto except where necessary for lambdas, or where std iterator types are extremely ugly; you can use Entry here.

Also please use explicit lambda capture rather than unconditional capture to be clear about what is stored in the lambda.

}
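A minimal illustration of both suggestions above - naming the concrete element type and listing lambda captures explicitly - sketched with std::vector standing in for rdcarray, and a hypothetical Entry type:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical entry type for the sketch: a run of queries starting at
// `firstQuery` covering `count` slots.
struct Entry
{
  uint32_t firstQuery;
  uint32_t count;
};

// Remove every entry overlapping [firstQuery, firstQuery + queryCount),
// using the named Entry type and an explicit capture list instead of [&]/auto.
void Reset(std::vector<Entry> &entries, uint32_t firstQuery, uint32_t queryCount)
{
  entries.erase(std::remove_if(entries.begin(), entries.end(),
                               [firstQuery, queryCount](const Entry &entry) {
                                 return entry.firstQuery < firstQuery + queryCount &&
                                        firstQuery < entry.firstQuery + entry.count;
                               }),
                entries.end());
}
```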

void QueryPoolInfo::Replace(uint32_t firstQuery, uint32_t queryCount,
                            const std::function<void(Entry)> &writeEntry) const
Owner

Is there a reason for this function to be separated out? It looks like it's only called from the other Replace() to apply any replacements, so it could be done inside that function. That should allow some simplification and ease of reading, since it took me a couple of passes to understand the code flow with the two functions having the same name.

void Replace(uint32_t firstQuery, uint32_t queryCount,
             const std::function<void(Entry)> &writeEntry) const;

rdcarray<Entry> m_Entries;
Owner

Thinking about this more generally: query pools are single-typed, so we know at creation time whether we're going to replace all of the results or none of them (all, when the query pool type is compacted size).

Do you have any empirical data or intuition on how large those query pools are? I would assume either 1, some small fixed count (UE5 seems to use 64, configurable), or maybe the application's expected max number of BLASs (some thousands?).

Unless that seems unreasonable or you're otherwise worried about having a huge number of possible queries with only a few actually being valid, I think this code could be simplified a good deal by a) only allocating this struct when needed for patching and b) pre-allocating all the entries up front. Then you don't have to worry about intersecting, resizing this array on the fly, etc. We just read all our values out of it.
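The pre-allocation idea above could be sketched like this (all names hypothetical; std::vector stands in for rdcarray, and a zero value stands in for "no replacement recorded"):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: allocate one replacement slot per query up front at
// query pool creation, so there is no intersecting or on-the-fly resizing
// later - we just read values out of it.
struct QueryPoolInfo
{
  explicit QueryPoolInfo(uint32_t queryCount) : replacements(queryCount, 0) {}

  // Record the AS size we want to report for a given query slot.
  void SetReplacement(uint32_t query, uint64_t size) { replacements[query] = size; }

  // Read the replacement for a query; 0 means "no replacement recorded".
  uint64_t GetReplacement(uint32_t query) const { return replacements[query]; }

  std::vector<uint64_t> replacements;  // one entry per query, fixed size
};
```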

for(uint32_t i = 0; i < accelerationStructureCount; ++i)
  sizes.push_back(GetRecord(pAccelerationStructures[i])->memSize);

qpInfo->Add(firstQuery, std::move(sizes));
Owner

Unfortunately, seeing this I've realised that there is an ordering/timeline constraint I hadn't considered with this data. Because the properties write goes to a query and the queries are resolved separately, these commands can be recorded and submitted out of order, so we can't update a single authoritative fake query pool on the CPU timeline and update from there.

There may be a clever solution to this to resolve it on the CPU by deferring query updates and keeping multiple versions around, but I think I'd say we should solve it on the GPU to keep things simpler:

  1. Create an internal buffer backing the query pool (i.e. what the driver probably does on most GPUs) of equal size to its full set of results. It should probably be readback memory so it's CPU-accessible, because of step 4 below. In theory, in future we could keep two buffers up to date in parallel, one for CPU reads and one for GPU reads, but I doubt that will be a problem.
  2. vkCmdWriteAccelerationStructuresPropertiesKHR writes the data via vkCmdUpdateBuffer into that backing buffer - again, effectively in parallel to what the driver is doing.
  3. In vkCmdCopyQueryPoolResults we do a plain vkCmdCopyBuffer from the backing buffer to the user's results buffer.
  4. In vkGetQueryPoolResults we CPU-read from the backing buffer.

It's a little bit uglier/more work than tracking purely on the CPU side but it keeps the timeline neatly in sync without having to mirror it with submission tracking and CPU-side versioning of data.
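The four steps above can be modelled on the CPU as a sketch - plain arrays stand in for the backing buffer and the user's results buffer, and the real implementation would use vkCmdUpdateBuffer / vkCmdCopyBuffer on the GPU timeline; all names here are illustrative:

```cpp
#include <cstdint>
#include <vector>

// CPU model of the proposed scheme: `backing` shadows the query pool,
// mirroring what the driver does internally.
struct ShadowQueryPool
{
  // Step 1: one slot per query, allocated up front, CPU-readable.
  explicit ShadowQueryPool(uint32_t queryCount) : backing(queryCount, 0) {}

  // Step 2: vkCmdWriteAccelerationStructuresPropertiesKHR also records the
  // value we want to report into the backing buffer (vkCmdUpdateBuffer).
  void WriteProperty(uint32_t query, uint64_t value) { backing[query] = value; }

  // Step 3: vkCmdCopyQueryPoolResults becomes a plain copy from the backing
  // buffer into the user's results buffer (vkCmdCopyBuffer).
  void CopyResults(uint32_t firstQuery, uint32_t queryCount, uint64_t *dst) const
  {
    for(uint32_t i = 0; i < queryCount; ++i)
      dst[i] = backing[firstQuery + i];
  }

  // Step 4: vkGetQueryPoolResults reads straight from the backing buffer.
  uint64_t GetResult(uint32_t query) const { return backing[query]; }

  std::vector<uint64_t> backing;
};
```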

@@ -1699,6 +1705,8 @@ void WrappedVulkan::vkResetQueryPool(VkDevice device, VkQueryPool queryPool, uin
  SERIALISE_TIME_CALL(
      ObjDisp(device)->ResetQueryPool(Unwrap(device), Unwrap(queryPool), firstQuery, queryCount));

  GetRecord(queryPool)->queryPoolInfo->Reset(firstQuery, queryCount);
Owner

This also misses handling of vkCmdResetQueryPool, but with the other comments I think handling reset may not be needed at all.


if(instanceInfo.arrayOfPointers)
  return RDResult(ResultCode::InternalError, "AS instance build arrayOfPointers unsupported");
Owner

FYI, there is a helper RETURN_ERROR_RESULT which can do this and also print the RDCERR for you, without the calling code needing to do that.
