The benchmark is invalid #3

Open
benlwalker opened this issue Aug 24, 2022 · 4 comments

Comments

@benlwalker

This benchmark has at least two fatal flaws.

  1. It uses more threads than CPU cores, which is not allowed with event-loop style software like SPDK or io_uring. You must always spawn 1 thread per core, onto which you can multiplex more work.
  2. It only issues I/O at queue depth 1 and then busy waits. Systems like SPDK (or io_uring) are not designed to be used this way. Submit many I/O at once, then poll for a batch of completions and submit another batch.

This is why you see extremely low performance from SPDK and no better performance with io_uring than a blocking read system call.
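
As a concrete illustration of the second point, below is a minimal sketch of the batched submit/reap pattern with io_uring (liburing). It is not code from the benchmark under discussion; the device path, queue depth, block size, and LBA range are arbitrary placeholder values, and completion error checking is omitted.

    /* Sketch of batched io_uring usage: prime a full queue depth of reads with one
       submit, then reap completions in batches and resubmit from the completion
       path. Checking of cqe->res is omitted for brevity. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdlib.h>

    #define QD  32        /* arbitrary queue depth for illustration */
    #define BLK 512

    int main(void)
    {
        struct io_uring ring;
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* placeholder device */
        if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
            return 1;

        /* Submit a full queue depth of reads with a single io_uring_submit(). */
        for (int i = 0; i < QD; i++) {
            void *buf = aligned_alloc(BLK, BLK);
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf, BLK, (__u64)i * BLK);
            io_uring_sqe_set_data(sqe, buf);
        }
        io_uring_submit(&ring);

        /* Reap completions in batches; each completed read is immediately replaced. */
        for (;;) {
            struct io_uring_cqe *cqes[QD];
            unsigned n = io_uring_peek_batch_cqe(&ring, cqes, QD);
            for (unsigned i = 0; i < n; i++) {
                void *buf = io_uring_cqe_get_data(cqes[i]);
                struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read(sqe, fd, buf, BLK, (__u64)(rand() % 1000000) * BLK);
                io_uring_sqe_set_data(sqe, buf);
            }
            io_uring_cq_advance(&ring, n);
            if (n)
                io_uring_submit(&ring);
        }
    }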

@lei-houjyu
Collaborator

Hi Ben,

Thanks for your comments! Please find my replies below:

> 1. It uses more threads than CPU cores, which is not allowed with event-loop style software like SPDK or io_uring. You must always spawn 1 thread per core, onto which you can multiplex more work.

Please note that the number of threads ranges from 1 to 24 in our experiments, which covers not only "more threads than CPU cores" but also fewer than and equal to the number of cores. If you are interested in the peak performance of SPDK, you can find it at the data point of 6 threads (the same as the number of CPU cores on our evaluation machine). We're also aware of SPDK's poor performance with more threads than CPU cores and discuss it in Sec. 6.2 of our paper (citation 54 also offers an excellent explanation).

> It only issues I/O at queue depth 1 and then busy waits. Systems like SPDK (or io_uring) are not designed to be used this way. Submit many I/O at once, then poll for a batch of completions and submit another batch.

We use a queue depth of 1 only in the closed-loop experiments comparing latency, because this setting gives the lowest latency. When comparing the max throughput, we did use a setting similar to the one you suggest, i.e., increasing the queue depth and submitting in batches (Figure 7).
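
For concreteness, a closed-loop, queue-depth-1 latency measurement against the raw SPDK NVMe API looks roughly like the sketch below. This is an illustration under assumed setup rather than the benchmark's actual code; the namespace, queue pair, DMA buffer, and LBA are assumed to be initialized by the caller, and the function name is hypothetical.

    /* Sketch of one closed-loop, QD-1 latency sample. Assumes ns and qpair were set
       up beforehand and buf was allocated with spdk_dma_zmalloc(); the return value
       of spdk_nvme_ns_cmd_read() is left unchecked only for brevity. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>
    #include "spdk/nvme.h"

    static void one_done(void *arg, const struct spdk_nvme_cpl *cpl)
    {
        *(volatile bool *)arg = true;          /* mark the single outstanding I/O done */
    }

    /* Issue one 1-block read and spin until it completes; return its latency in ns. */
    static uint64_t qd1_read_latency_ns(struct spdk_nvme_ns *ns,
                                        struct spdk_nvme_qpair *qpair,
                                        void *buf, uint64_t lba)
    {
        volatile bool done = false;
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, 1, one_done, (void *)&done, 0);
        while (!done)                                        /* closed loop: wait for this I/O */
            spdk_nvme_qpair_process_completions(qpair, 0);
        clock_gettime(CLOCK_MONOTONIC, &end);

        return (uint64_t)((end.tv_sec - start.tv_sec) * 1000000000LL
                          + (end.tv_nsec - start.tv_nsec));
    }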

Hope my response answers your question. We're happy to talk more. Thank you!

@benlwalker
Author

Full disclosure - I'm one of the creators of SPDK and designed many of these APIs, including SPDK's thread multiplexing system. I really am the authority on what is or isn't correct usage of SPDK. And you are doing it very wrong.

> We're also aware of SPDK's poor performance with more threads than CPU cores and discuss it in Sec. 6.2 of our paper (citation 54 also offers an excellent explanation).

This statement is horribly wrong. The problem is not at all that event-based systems like SPDK or io_uring don't work well when you oversubscribe the cores. The problem is that you have failed to understand how to use them correctly. These systems multiplex many events from different sources and tasks onto a single thread per core. BY DESIGN. Using more than one thread per core does not make any sense and can only be considered an invalid configuration. Your paper's discussion of these systems is deeply flawed.

> We use a queue depth of 1 only in the closed-loop experiments comparing latency, because this setting gives the lowest latency. When comparing the max throughput, we did use a setting similar to the one you suggest, i.e., increasing the queue depth and submitting in batches (Figure 7).

Even in the open-loop case, unless you are setting the rate parameter (which isn't mentioned in the paper), it's going to busy-poll for 16us after every submission for absolutely no reason. This roughly lines up with the ~100k IOPS per core you're seeing in the benchmark. If you were doing it right, you'd get more than 10 million per core. See the available benchmarks in the SPDK project for how to do it correctly.

Look - I don't dislike the general idea of a BPF resubmit function. I don't think it has much chance of being merged due to the layering violations, but I'm not opposed to the concept and think it has definite advantages for applications written to do synchronous IO from thread pools. But the use of both SPDK and io_uring here is just incorrect, making all of your comparisons invalid.

@lei-houjyu
Collaborator

Hi Ben,

We’re glad to be in contact with someone from the SPDK project, and appreciate your feedback.

> This statement is horribly wrong. The problem is not at all that event-based systems like SPDK or io_uring don't work well when you oversubscribe the cores. The problem is that you have failed to understand how to use them correctly. These systems multiplex many events from different sources and tasks onto a single thread per core. BY DESIGN. Using more than one thread per core does not make any sense and can only be considered an invalid configuration. Your paper's discussion of these systems is deeply flawed.

We agree that SPDK was not originally designed to run with multiple threads per core. And this is exactly what we are showing in the multi-threading experiment. Indeed, our experiment shows that to achieve its peak performance, SPDK should run with a single thread per core. This experiment is precisely what leads us to the conclusion in the paper: when there are fewer threads than cores, SPDK offers the best performance, and XRP is closer to it than read() and io_uring. Whereas once there is high CPU contention (i.e., multiple threads on a core), XRP scales better than the others.

We can certainly envision situations with high CPU contention where a thread with high CPU usage is contending with a thread that is using SPDK. For example, this can happen with non-cooperative applications that share the same core. Our multi-threading experiment is a proxy for such a high CPU contention scenario -- perhaps not a perfect proxy.

> Even in the open-loop case, unless you are setting the rate parameter (which isn't mentioned in the paper), it's going to busy-poll for 16us after every submission for absolutely no reason. This roughly lines up with the ~100k IOPS per core you're seeing in the benchmark. If you were doing it right, you'd get more than 10 million per core. See the available benchmarks in the SPDK project for how to do it correctly.

The Y-axis units in Figure 7 are actually the number of BPF-KV operations, not the total IOPS. Given an index depth of 6, one BPF-KV operation equals 7 I/Os (6 index lookups + 1 lookup to fetch the value). In Figure 7, with 6 threads, SPDK achieves 697k BPF-KV ops/s (116k per core), which translates to 4.9M IOPS (813k per core). Given that the max 512B-random-read throughput of the device we used is 5.0M IOPS, SPDK's throughput matches the device limit.

Regarding the rate parameter, we're not sure what you're referring to. We set the target rate at which KV requests are generated to the hardware limit.

@benlwalker
Author

> We agree that SPDK was not originally designed to run with multiple threads per core. And this is exactly what we are showing in the multi-threading experiment. Indeed, our experiment shows that to achieve its peak performance, SPDK should run with a single thread per core. This experiment is precisely what leads us to the conclusion in the paper: when there are fewer threads than cores, SPDK offers the best performance, and XRP is closer to it than read() and io_uring. Whereas once there is high CPU contention (i.e., multiple threads on a core), XRP scales better than the others.

> We can certainly envision situations with high CPU contention where a thread with high CPU usage is contending with a thread that is using SPDK. For example, this can happen with non-cooperative applications that share the same core. Our multi-threading experiment is a proxy for such a high CPU contention scenario -- perhaps not a perfect proxy.

I understand your line of thinking now, but there are a couple of mistakes still. First, your paper and your comments here still lead the reader to believe that running SPDK with more than one thread per core is a valid thing to do. It's not. Phrasing such as "not originally designed to run with multiple threads per core", where "originally" implies that the situation may have changed since it was originally written, is misleading.

Second, I understand your focus is on systems where the CPU resources are oversubscribed. Unfortunately, your proxy for a system that's oversubscribed is not only "not a perfect proxy" but not even a reasonable one. What you've done is take a 6-core system and run more than 6 SPDK threads on it that are busy waiting, so they're constantly contending with one another. To state it again, you'd never run more than 1 SPDK thread per core for any reason. What can happen, as you note, is that other applications on the system may steal time away from the SPDK threads. That's not the intended way to use SPDK, but that's something far more reasonable to explore. A benchmark that runs SPDK on 6 cores plus another application that simulates doing compute work on maybe 3 to 5 of the cores is what the proxy should have been.

Further, for the benchmark of XRP, you run more than 6 threads that are not busy waiting and instead nicely yield their timeslice to the kernel for interrupts to be processed. Again, what you should have done here was make 6 XRP threads and then have another application simulate compute work on some number of cores. This application would have interfered with the interrupt processing.
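
To make the suggested proxy concrete, a standalone compute burner pinned to a chosen subset of cores could look like the sketch below (a hypothetical helper using plain POSIX threads, not part of either codebase). Running it on 3 to 5 of the 6 cores alongside a fixed 6-thread SPDK or XRP configuration would create the kind of contention being described.

    /* Sketch of a "compute work" process pinned to specific cores, to contend with a
       one-thread-per-core I/O benchmark. Core numbers come from the command line; the
       spin loop is a stand-in for real application compute. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>

    static void *burn(void *arg)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)(long)arg, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  /* pin to one core */

        volatile unsigned long x = 0;
        for (;;)                      /* never sleeps, never yields its timeslice */
            x++;
        return NULL;
    }

    int main(int argc, char **argv)
    {
        /* Hypothetical usage: ./burner 3 4 5   (burn cores 3, 4, and 5) */
        for (int i = 1; i < argc; i++) {
            pthread_t t;
            pthread_create(&t, NULL, burn, (void *)(long)atoi(argv[i]));
        }
        pthread_exit(NULL);           /* keep the burner threads alive */
    }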

> The Y-axis units in Figure 7 are actually the number of BPF-KV operations, not the total IOPS. Given an index depth of 6, one BPF-KV operation equals 7 I/Os (6 index lookups + 1 lookup to fetch the value). In Figure 7, with 6 threads, SPDK achieves 697k BPF-KV ops/s (116k per core), which translates to 4.9M IOPS (813k per core). Given that the max 512B-random-read throughput of the device we used is 5.0M IOPS, SPDK's throughput matches the device limit.

I understand the math now, but it's ridiculous to scale up a busy-polling system like SPDK to more cores than it needs to saturate the hardware. SPDK should never be running more than 1 thread on a system only capable of 5M IOPS. You left off lower thread counts from Figure 7(a) - I'd highly recommend you go back and run those. SPDK should saturate that disk using just one thread if your benchmark is working correctly. And if you have a better workload proxy that burns 3-5 of the 6 cores doing other compute, you'd see that the actual optimal way to share the cores is to dedicate one core to SPDK and 5 to the other compute task.

> Regarding the rate parameter, we're not sure what you're referring to. We set the target rate at which KV requests are generated to the hardware limit.

Here's your benchmark's main loop per thread:

    for (size_t i = 0; i < r->op_count; i++) {
        // 1. busy polling until the new deadline
        while (early_than(&now, &deadline)) {
            if (i > r->finished) {
                spdk_nvme_qpair_process_completions(r->qpair, 0);
            }
            clock_gettime(CLOCK_REALTIME, &now);
        }
        add_nano_to_timespec(&deadline, gap);

        // 2. init the request
        key__t key = rand() % max_key;
        Request *req = init_request(global_ns, r->qpair, NULL, key, NULL, r);

        // 3. ensure queue is not overflowed
        while (i - r->finished >= io_queue_size) {
            spdk_nvme_qpair_process_completions(r->qpair, 0);
        }
        
        // 4. issue the request
        clock_gettime(CLOCK_REALTIME, &req->start);
        if ((rc = spdk_nvme_ns_cmd_read(req->ns, req->qpair, req->buff,
                                    0, 1, traverse_complete, req, 0)) != 0) {
            printf("starting read I/O failed: %s\n", strerror(rc));
            exit(1);
        }
        now = req->start;
    }

You should instead have each thread issue a full queue depth's worth of reads, then sit in a while loop calling spdk_nvme_qpair_process_completions(). When each operation completes, it should continue on to the next step from within the completion callback. There should be no sleeps, no rate limits, and no new command submissions within the while loop. It is an event-based system - everything should be driven by events.
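
A sketch of that structure, using the same SPDK calls that appear in the benchmark loop above, might look like the following. The queue depth, the ctx struct, and the random single-block reads are illustrative choices rather than SPDK requirements, and error handling is abbreviated.

    /* Sketch of the event-driven pattern: prime a full queue depth of reads, then
       only poll for completions; each completion callback immediately submits the
       next read (in BPF-KV this would be the next level of the index traversal). */
    #include <stdio.h>
    #include <stdlib.h>
    #include "spdk/env.h"
    #include "spdk/nvme.h"

    #define QUEUE_DEPTH 32            /* illustrative; tune to the device */

    struct ctx {
        struct spdk_nvme_ns    *ns;
        struct spdk_nvme_qpair *qpair;
        void                   *buf;      /* one block, spdk_dma_zmalloc()'d */
        uint64_t                max_lba;
    };

    static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
    {
        struct ctx *c = arg;
        if (spdk_nvme_cpl_is_error(cpl)) {
            fprintf(stderr, "read failed\n");
            exit(1);
        }
        /* Continue from the completion callback: submit the next read right away. */
        spdk_nvme_ns_cmd_read(c->ns, c->qpair, c->buf, (uint64_t)rand() % c->max_lba,
                              1, read_done, c, 0);
    }

    static void per_thread_loop(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair)
    {
        uint32_t block   = spdk_nvme_ns_get_sector_size(ns);
        uint64_t max_lba = spdk_nvme_ns_get_num_sectors(ns);

        /* 1. Prime the queue: issue a full queue depth of reads up front. */
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            struct ctx *c = calloc(1, sizeof(*c));
            c->ns = ns; c->qpair = qpair; c->max_lba = max_lba;
            c->buf = spdk_dma_zmalloc(block, block, NULL);
            spdk_nvme_ns_cmd_read(ns, qpair, c->buf, (uint64_t)rand() % max_lba, 1,
                                  read_done, c, 0);
        }

        /* 2. Event loop: no sleeps, no rate limiting, no submissions here;
              everything after the first batch is driven from read_done(). */
        for (;;)
            spdk_nvme_qpair_process_completions(qpair, 0);
    }

With that structure each thread keeps QUEUE_DEPTH reads in flight at all times, and the only work inside the while loop is reaping completions.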
