Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(PR2) Adds the support for Flexible Data Placement(FDP) over NVMe into the Cachelib #277

Closed

Conversation

arungeorge83
Copy link

@arungeorge83 arungeorge83 commented Nov 15, 2023

This adds the device layer support for NVMe-FDP semantics and adds the RUH-awareness feature of NVMe-FDP in
the upper layers of Navy. This allows the BlockCache(large items) and BigHash(small items) of Navy to segregate their data streams in physical NAND media by using the FDP placement Identifiers.

With this changes, the Cachelib can reduce the Device Write Amplification (WAF) significantly even in high SSD utilization scenarios("nvmCacheSizeMB" above 50% of the SSD capacity) in most of the cachelib workloads.

This commit introduces a 'placementHandle' concept for data placement, which can be used by both BC and BH of Navy on
device write() calls, especially for FDP placements. The 'placementHandle' have to be allocated from the device.

io_uring_cmd interface(through nvme char device) is used to send FDP directives to Linux kernel, as sending it through the
conventional block interfaces is not suported yet. The user can select the NVMe block device (Namespace/partition) as usual
(Ex: "nvmCachePaths": ["/dev/nvme0n1p1"]), and the cachelib will pick the corresponding NVMe char device internally.

This commit adds a new config 'fdpMode' to enable FDP. The user needs to select the fdpMode along with iOUring I/O Engine options ("navyEnableIoUring": true, "navyQDepth": 1, "fdpMode": true).

This second PR consist of the code which can be integrated to the 'iouring async-io' support in the Navy.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 15, 2023
Copy link
Contributor

@jaesoo-fb jaesoo-fb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if build is OK with removing liburing package?

@@ -541,6 +543,8 @@ class NavyConfig {
// ============ Device settings =============
// Set the device block size, i.e., minimum unit of IO
void setBlockSize(uint64_t blockSize) noexcept { blockSize_ = blockSize; }
// Set the FDP data placment mode in device
void setFDPMode(bool enable) noexcept { fdp_ = enable; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. setFDPEnabled

cachelib/allocator/nvmcache/NavyConfig.h Outdated Show resolved Hide resolved
cachelib/allocator/nvmcache/NavyConfig.h Outdated Show resolved Hide resolved
Comment on lines 484 to 513
if(isFDPEnabled) {
f = openNvmeCharFile(path);
} else {
f = openCacheFile(path, fdSize, truncateFile);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can you support the partition? I.e., we want to use only [start_lba, start_lba + partition_size)

W.r.t. this, would it be possible to take the block device name and map it to char dev? i.e., "nvmen[p]" to "ngn"?

The rationale is that this will allow us to apply the size check and fallocate thing the same. i.e., we can do:

f = openCacheFile(path, ...);
if (isFDPEnabled) {
  f.close();
  f = openNvmeCharFile(getNvmeCharDevice(path), ...);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there were some formatting error. I meant the mapping: "nvmeXnY[pZ]" to "ngXnY".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, your requirement is as follows,
There are partitions like nvmeXnYp1, nvmeXnYp2 ... , and one of them is used in this cachelib instance.
(Ex: nvmCachePath = "/dev/nvmeXnYp2")
OR is it just one partition in that namespace, but it happened to be partition-formatted ?

The rationale is that this will allow us to apply the size check and fallocate thing the same
Is this steps used for device simulation scenarios?

Anyway the proposed approach is doable and making sense.

cachelib/navy/bighash/BigHash.cpp Outdated Show resolved Hide resolved
auto remainingSize = size;
auto maxWriteSize = (maxWriteSize_ == 0) ? remainingSize : maxWriteSize_;
maxWriteSize = (maxIOSize_ == 0) ? maxWriteSize : maxIOSize_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this after fix setting maxIOSize_.

#ifndef CACHELIB_IOURING_DISABLE
if (isFDPEnabled_) {
fdpNvme_ = std::make_shared<FdpNvme>(fvec_[0].fd());
setMaxIOSize(fdpNvme_->getMaxTfrSize());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix maxWriteSize_ too.

maxIOSize_ = fdpNvme_->getMaxTfrSize();
maxWriteSize_ = min(maxWriteSize_, maxIOSize_);

cachelib/navy/common/Device.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/Device.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/Device.cpp Outdated Show resolved Hide resolved
Comment on lines 875 to 877
AsyncIoContext::prepUringCmdAsyncIo(int fd, uint8_t opType, void* data,
uint32_t size, uint64_t offset, void* userdata, int handle) {
std::unique_ptr<folly::AsyncBaseOp> asyncOp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need to guard this code appropriately using CACHELIB_IOURING_DISABLE.

So, suggest to call prepUringCmdAsyncIo inside the prepAsyncIo with guarding this function with #ifndef CACHELIB_IOURING_DISABLE

} else {
fdpNvme_->prepWriteUringCmdSqe(sqe, fd, data, size, offset, handle);
}
asyncOp = std::move(iouringCmdOp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to have asyncOp separately? We can just return iouringCmdOp.

cachelib/navy/common/Device.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/Device.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/Device.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/Device.cpp Outdated Show resolved Hide resolved
Comment on lines 19 to 24
#include <linux/nvme_ioctl.h>
#include <sys/ioctl.h>
#include <errno.h>
#include <cstring>
#include <numeric>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. in alphabetical order?

Copy link
Contributor

@jaesoo-fb jaesoo-fb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if this compiles in supported platforms here https://github.com/facebook/CacheLib/actions?

Comment on lines 45 to 47
auto pid = getFdpPID(phndl);

return static_cast<int>(pid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to expose the placement id here? Why not just give out the phndl?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, possible.

cachelib/navy/common/FdpNvme.cpp Show resolved Hide resolved
cachelib/navy/common/FdpNvme.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/FdpNvme.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/FdpNvme.h Outdated Show resolved Hide resolved
cachelib/navy/common/FdpNvme.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/FdpNvme.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/FdpNvme.h Outdated Show resolved Hide resolved
cachelib/navy/common/FdpNvme.cpp Outdated Show resolved Hide resolved
cachelib/navy/common/FdpNvme.cpp Outdated Show resolved Hide resolved
@arungeorge83
Copy link
Author

Can you check if build is OK with removing liburing package?

FdpNvme.cpp is not protected for this case. I guess you want to bring them under -DCACHELIB_IOURING_DISABLE, right?

@arungeorge83
Copy link
Author

arungeorge83 commented Nov 16, 2023

Can you check if this compiles in supported platforms here https://github.com/facebook/CacheLib/actions?

As per the status,
'This workflow is awaiting approval from a maintainer in #277'
Could you check it and approve?
Or should I verify it locally?

@jaesoo-fb
Copy link
Contributor

As per the status, 'This workflow is awaiting approval from a maintainer in #277' Could you check it and approve? Or should I verify it locally?

Oh, my bad. I just started the workflows. I expect some of them to be failed, so please fix it as it appears.

Comment on lines 466 to 471
if (result) {
bytesRead_.add(readSize);
} else {
readIOErrors_.inc();
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. return early to reduce the indent depth.

if (!result) {
  ...
  return false;
}

bytesRead_.add(readSize);

: asyncBase_(std::move(asyncBase)),
id_(id),
qDepth_(capacity),
useIoUring_(useIoUring) {
useIoUring_(useIoUring),
fdpNvme_(fdpNvme) {
#ifdef CACHELIB_IOURING_DISABLE
// io_uring is not available on the system
XDCHECK(!useIoUring_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. XDCHECK(!useIoUring_ && !fdbNvme_)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you meant XDCHECK(!useIoUring_ || !fdbNvme_) ?
It is an error here, if either of them is enabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CACHELIB_IOURING_DISABLE, both of useIoUring_ and fdbNvme_ should be nullptr, right?

}
#endif

bool AsyncIoContext::submitIo(IOOp& op) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Could you move this back to original place, i.e., above prepAsyncIo?

Comment on lines 858 to 893
#ifndef CACHELIB_IOURING_DISABLE
std::unique_ptr<folly::AsyncBaseOp> AsyncIoContext::prepNvmeIo(IOOp& op) {
std::unique_ptr<folly::IoUringOp> iouringCmdOp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is more simpler to guard the body to minimize the use of CACHELIB_IOURING_DISABLE, i.e.,

std::unique_ptr<folly::AsyncBaseOp> AsyncIoContext::prepNvmeIo(IOOp& op) {
#ifndef CACHELIB_IOURING_DISABLE
...
#else 
  return nullptr;
#endif
}

Comment on lines 161 to 164
void setMaxWriteSize(uint32_t maxWriteSize) { maxWriteSize_ = maxWriteSize; }

void setMaxIOSize(uint32_t maxIOSize) { maxIOSize_ = maxIOSize; }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove those after moving FdpNvme construction to createDirectIoFileDevice?

Comment on lines 967 to 974
if (!!fdpNvme_) {
auto maxIOSize = fdpNvme_->getMaxTfrSize();
if (maxIOSize) {
setMaxIOSize(maxIOSize);
maxDeviceWriteSize = std::min<size_t>(maxDeviceWriteSize, maxIOSize);
setMaxWriteSize(maxDeviceWriteSize);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I feel like it would be better to move this logic to createDirectIoFileDevice and pass FdpNvme and maxIOSize as parameters instead of isFDPEnabled.

Also, could you move this if inside of try without if (!!fdpNvme_)?

Copy link
Author

@arungeorge83 arungeorge83 Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A query;
you meant to have something like this in the FileDevice creation, right?
FileDevice::FileDevice(...,
uint32_t maxDeviceWriteSize,
uint32_t maxIOSize,
std::shared_ptr fdpNvme,
.....)
: Device(...,
maxDeviceWriteSize,
maxIOSize)

maxIOSize would be set the Device initiation itself, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, something like that. Both maxIOSize_ and maxDeviceWriteSize_ would be set in the constructor initializer list.

} else {
throw std::bad_alloc();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this? Isn't this an assert?

Rather, I guess we need to raise some exception (invalid_argument?) if ruh_status->nruhsd is 0 to disable fdp?

if (!ruh_status->nruhsd) {
    throw std::invalid_argument("Failed to initialize FDP; nruhsd is 0");
}

size_t devPos = blockDevice.find_first_of("0123456789");
size_t pPos = blockDevice.find('p', devPos);

return "/dev/ng" + blockDevice.substr(devPos, pPos - devPos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is not enough to support the partition; we have to set the offset and the size too. Not sure whether there is some library with which we can get the start lba and size of partition from sysfs, i.e., /sys/block/nvme1n1/nvme1n1p2/start

Copy link
Author

@arungeorge83 arungeorge83 Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Sorry, I missed the 'offset'.
And the 'offset' need to be passed to the createDirectIoFileDevice(), right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it would be good to have all nvme specific code in Device.cpp or FdpNvme.h

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thought in the same line:
Is it ok if we keep all the nvme char device handling inside FdpNvme.cpp?

  1. That might require two sets of folly:file.
    one that is kept(file1) by FileDevice, which is of /dev/nvmeXnYpZ; the other (file2) kept by FdpNvme which is of /dev/ngXnY.
    And map is kept by FdpNvme, "file1" : "file2, offset, size".
    What do you say? Do you foresee any statistics collection related issues if we keep two sets of folly::file.

  2. The other way is to have all the logic in Device.cpp:createFdpFileDevice() which in turn makes use of different functions exported through FdpNvme.h, such as openNvmeCharFile(blkdev_path) (Returns folly::file, part_offset, size etc)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok if we keep all the nvme char device handling inside FdpNvme.cpp?

Sounds good to me and prefer option 1. However, I don't think you need a map. Instead, I think we can make FdpNvme represents an FDP NVMe block device which could be a partition or a whole namespace. The char dev fd and partition offset could be property of the FdpNvme.

W.r.t. this, I can see you are initializing FdpNvme for the first device only. So, I think we can restrict FdpNvme is only supported for single block device only for now.

Copy link
Author

@arungeorge83 arungeorge83 Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, sure.

So, I think we can restrict FdpNvme is only supported for single block device only for now.

Yes, I am adding explicit checks and exceptions to limit the FDP support for single device for now.

@facebook-github-bot
Copy link
Contributor

@jaesoo-fb has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@jaesoo-fb has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@arungeorge83 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jaesoo-fb has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@jaesoo-fb jaesoo-fb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Thanks for the work.

@@ -549,6 +551,8 @@ class NavyConfig {
// ============ Device settings =============
// Set the device block size, i.e., minimum unit of IO
void setBlockSize(uint64_t blockSize) noexcept { blockSize_ = blockSize; }
// Set the NVMe FDP Device data placement mode in the Cachelib
void setFDPMode(bool enable) noexcept { fdpEnabled_ = enable; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. setEnableFDP.

Could you also change other instances of FDPMode to EnableFDP?

@@ -227,6 +227,9 @@ struct CacheConfig : public JSONConfig {
// Navy will split it into multiple IOs.
uint32_t deviceMaxWriteSize{1024 * 1024};

// Enable the FDP Data placement mode in the device, if it is capable.
bool fdpMode{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. deviceEnableFDP

@@ -694,6 +698,9 @@ class NavyConfig {
// Whether to use write size (instead of parcel size) for Navy admission
// policy.
bool useEstimatedWriteSize_{false};
// Whether Navy support the NVMe FDP data placement(TP4146) directives or not.
// Reference: https://nvmexpress.org/nvmeflexible-data-placement-fdp-blog/
bool fdpEnabled_{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. enableFDP_

Comment on lines 1124 to 1125
#else
XDCHECK(!isFDPEnabled);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. not needed.

Comment on lines 212 to 215
// Some devices have this transfer size limit due to DMA size limitations.
// This limit is applicable for both writes and reads.
const uint32_t maxIOSize_{0};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Could you move this right above maxWriteSize_ to match with the initialization order in the constructor?

// as of now; and not supported through conventional block interfaces.
class FdpNvme {
public:
FdpNvme(const std::string& fileName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this explicit, i.e., explicit FdpNvme(...)

#include <sys/ioctl.h>

#include <cstring>
#include <numeric>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used?

@@ -788,6 +844,32 @@ bool AsyncIoContext::submitIo(IOOp& op) {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot add comment above. Could you remove req which is not used.

@@ -773,9 +829,8 @@
 }

 bool AsyncIoContext::submitIo(IOOp& op) {
-  IOReq& req = op.parent_;
-
   op.startTime_ = getSteadyClock();
+
   while (numOutstanding_ >= qDepth_) {
     if (qDepth_ > 1) {
       XLOG_EVERY_MS(ERR, 10000) << fmt::format(

@@ -44,6 +44,14 @@ class MockDevice : public Device {
MOCK_METHOD3(readImpl, bool(uint64_t, uint32_t, void*));
MOCK_METHOD3(writeImpl, bool(uint64_t, uint32_t, const void*));
MOCK_METHOD0(flushImpl, void());
MOCK_METHOD0(allocatePlacementHandle, int());

virtual bool writeImpl(uint64_t offset,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove virtual here and add override instead. i.e., bool writeImpl(...) override {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This results in compilation error (due to gtest restrictions on default argument - int handle - I guess).
How about adding a line ?
MOCK_METHOD4(writeImpl, bool(uint64_t, uint32_t, const void*, int));

size_t bPos = blockDevice.find_last_of("/") + 1;
size_t pPos = blockDevice.find('p', bPos);

XDCHECK(bPos < std::string::npos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this.

@arungeorge83
Copy link
Author

Overall looks good. Thanks for the work.

Could you apply below patch?

diff --git a/fbcode/cachelib/navy/common/TARGETS b/fbcode/cachelib/navy/common/TARGETS
--- a/fbcode/cachelib/navy/common/TARGETS
+++ b/fbcode/cachelib/navy/common/TARGETS
@@ -7,6 +7,7 @@
     srcs = [
         "Buffer.cpp",
         "Device.cpp",
+        "FdpNvme.cpp",
         "Hash.cpp",
         "SizeDistribution.cpp",
         "Types.cpp",
@@ -15,6 +16,7 @@
         "Buffer.h",
         "CompilerUtils.h",
         "Device.h",
+        "FdpNvme.h",
         "Hash.h",
         "NavyThread.h",
         "SizeDistribution.h",
@@ -27,7 +29,6 @@
         "//folly:function",
         "//folly:thread_local",
         "//folly/experimental/io:async_io",
-        "//folly/experimental/io:io_uring",
         "//folly/hash:checksum",
         "//folly/hash:hash",
         "//folly/io/async:event_base_manager",
@@ -41,6 +42,8 @@
         "//folly:file",
         "//folly:portability",
         "//folly:range",
+        "//folly/experimental/io:async_base",
+        "//folly/experimental/io:io_uring",
         "//folly/fibers:core_manager",
         "//folly/fibers:fiber_manager_map",
         "//folly/fibers:timed_mutex",
@@ -50,4 +53,7 @@
         "//folly/lang:bits",
         "//folly/logging:logging",
     ],
+    exported_external_deps = [
+        ("liburing", None, "uring"),
+    ],
 )

I could not find the file TARGETS. Is it something not present in open source version?

@jaesoo-fb
Copy link
Contributor

I could not find the file TARGETS. Is it something not present in open source version?

Yeah, this is for internal build. Please ignore.

}

bool isValidNvmeDevice(const std::string& fileName) {
return std::regex_match(fileName, std::regex("^/dev/nvme\\d+n\\d+(p\\d+)?$"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you replace this with the equivalent (re2::RE2::FullMatch?) from RE2? https://github.com/google/re2

We cannot use regex_match due to its vulnerability to segfault https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61582&fbclid=IwAR16Xwiru8QryYObTnoATM5WwCSTX9IRVorMwDxYq9E0sxrsBQTw-H1iKH4

@facebook-github-bot
Copy link
Contributor

@arungeorge83 has updated the pull request. You must reimport the pull request before landing.

…nto Cachelib

Summary:
This commit adds the device layer support for NVMe-FDP semantics and adds the RUH-awareness feature of NVMe-FDP in
the upper layers of Navy. This allows the BlockCache(large items) and BigHash(small items) of Navy to segregate their data streams
in physical NAND media by using the FDP placement Identifiers.

With this changes, the Cachelib can reduce the Device Write Amplification (WAF) significantly even in high SSD utilization scenarios
("nvmCacheSizeMB" above 50% of the SSD capacity) in most of the cachelib workloads.

This commit introduces a 'placementHandle' concept for data placement, which can be used by both BC and BH of Navy on
device write() calls, especially for FDP placements. The 'placementHandle' have to be allocated from the device.

io_uring_cmd interface(through nvme char device) is used to send FDP directives to Linux kernel, as sending it through the
conventional block interfaces is not suported yet. The user can select the NVMe block device (Namespace/partition) as usual
(Ex: "nvmCachePaths": ["/dev/nvme0n1p1"]), and the cachelib will pick the corresponding NVMe char device internally.

This commit adds a new config 'deviceEnableFDP' to enable FDP. The user needs to select this along with iOUring I/O Engine options.
("navyEnableIoUring": true, "navyQDepth": 1, "deviceEnableFDP": true).

Signed-off-by: Arun George <[email protected]>
Signed-off-by: Vikash Kumar <[email protected]>
@facebook-github-bot
Copy link
Contributor

@jaesoo-fb has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines +1030 to +1031
options.sqe128 = true;
options.cqe32 = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I cannot see the big_cqe is used anymore. Is this still needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, we can remove the line "options.cqe32 = true;" for now. We don't have the big_cqe use cases in the current code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arungeorge83 Actually, I would like to remove this option completely. So, we don't need sqe128, either, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaesoo-fb Actually both sqe128 and cqe32 are mandatory for NVMe passthrough commands. Else the kernel NVMe driver will reject those ios.
Reference: https://elixir.bootlin.com/linux/latest/source/drivers/nvme/host/ioctl.c#L742, Iceber/iouring-go#21

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also from Kanchan Joshi's words (Author of the IO passthru patch - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=456cba386e94f22fa1b1426303fdcac9e66b1417),
NVMe passthrough commands are of 72 bytes in size. This can not fit into regular SQE. Therefore big SQE (128 bytes) is required.
Also passthrough commands may return more than one result to userspace. Returning addtional result is not possible with regular CQE. Hence big CQE is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarification. Yeah, it looks clear here https://elixir.bootlin.com/linux/v6.7/source/drivers/nvme/host/ioctl.c#L742

Comment on lines +26 to +28
#ifndef CACHELIB_IOURING_DISABLE
#include <liburing.h>
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI. Actually, we should not include liburing.h directly here due to some internal dependency issues causing test failures on some targets. I will remove this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.
Then should we include <liburing.h> through <folly/experimental/io/IoUring.h> which include <liburing.h> in the file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arungeorge83 Yeah, right. It will include the liburing.h from the repo instead of that installed on the system.

FYI, I am taking care of minor changes including those required for the internal build, so you don't need to update the commit.

@facebook-github-bot
Copy link
Contributor

@jaesoo-fb merged this pull request in 009e89b.

@jaesoo-fb
Copy link
Contributor

Submitted with the fixup b5d70a5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants