
number of io_uring instances and its impact on parallelism and concurrency #495

Open
kennthhz-zz opened this issue Dec 13, 2021 · 9 comments

Comments

@kennthhz-zz

Will more io_uring instances increase parallelism at the disk device level, or only increase concurrency? Is there any guideline on how many io_uring instances to create per device or per CPU? My understanding is that the number of instances only impacts concurrency, not throughput (by way of increased parallelism).

@axboe
Owner

axboe commented Dec 13, 2021

One io_uring instance can drive millions of requests, both submit and completion. It's not really a per-device or per-CPU thing; in general the recommendation is to avoid sharing a ring between threads if possible, since that would need serialization on the app side. Apart from that, you don't need multiple rings. For reference, the 13M IOPS/core numbers I generated were done using just 2 logical threads, with 1 ring per thread. Just 2 rings in total for that.

@kennthhz-zz
Author

That makes sense. However, the polling of the CQ should happen in a different thread than SQ submission, to avoid blocking the submission thread. Also, having 1 thread per vcore is better for cache locality. So wouldn't it be better to have 1 submission and 1 completion thread per vcore, and one ring per vcore? In essence, share nothing.

@axboe
Owner

axboe commented Dec 13, 2021

I suspect it depends on your use case. The way I wrote that test app, you'll generally run with QD X, and submit Y and reap Z requests per iteration, where Y and Z are smaller than X. It does mean that the device-seen queue depth can go as low as X - Z, but that's generally not a problem.

The submission side isn't blocked; it's just not running while we're reaping completions. If you share 2 threads on one vcore as well, then you do end up competing for CPU resources between the submitter and completer.

So I suspect the answer is "it depends" :-)

@kennthhz-zz
Author

Sorry, I meant to say that completion can be blocked if I use a single thread per vcore. So I can use io_uring_peek_batch_cqe instead of io_uring_wait_cqe. The challenge with using a single thread per vcore for both submission and completion is how to arrange the task queue (two task types: submit and reap_complete). I need to insert the reap_complete task after the submission task, but if it comes right after, the completions may not have finished by the time I execute reap_complete. So I need to insert another one sometime down the road. It can be done, but the app starts to play the role of a scheduler. And though the peek won't incur a syscall, if it comes up empty a lot it still costs CPU (like mindless polling). Does io_uring_wait_cqe internally do busy polling? I think not; it is based on notification, right? The notification is based on interrupts (in non-IOPOLL mode). So io_uring_wait_cqe will block the thread without wasting CPU (except for incurring a syscall initially)?

@axboe
Owner

axboe commented Dec 13, 2021

Why do you think it will be blocked? Waiting for events in the kernel doesn't block new submissions. Waiting doesn't do busy polling; if you're entering the kernel, that's considered the slow path of event reaping.

Checking for events doesn't have to be busy polling; it can be done as needed. It's just a single memory read to see if there are new events.

I guess you mean that waiting for events will block? That's of course true, and it's the same as the example I gave higher up where you have to accept that blocking (or io-polling) for completions means that you don't submit at that time. If that's a concern for you, then just use two threads and split your submit and complete between them.

@kennthhz-zz
Author

kennthhz-zz commented Dec 13, 2021

I think your final paragraph captures what I am doing. I have a single thread on a vcore (which also maps to a single ring), so my SQE tasks and CQE tasks are serialized onto one task queue; there is no blocking or sharing between threads/vcores. So if the CQE task is executing io_uring_wait_cqe, it will block. Anyway, this can be solved with a single thread by using the non-blocking peek API for completion and smartly positioning the completion task in the queue, OR by using 2 threads. Now the hard part is knowing which one is better. I would think that if IO throughput is large, the peek might win out, because it won't waste CPU and won't incur the overhead of one extra thread.

Also, each call to io_uring_wait_cqe incurs a syscall, right? While io_uring_peek_cqe doesn't.

@axboe
Owner

axboe commented Dec 13, 2021

Also, each call to io_uring_wait_cqe incurs a syscall, right? While io_uring_peek_cqe doesn't.

wait_cqe() will block if you ask for more events than are directly available at the time of checking. peek_cqe() will never enter the kernel; all it does is read the kernel CQ tail and see if we have an event available.

@kennthhz-zz
Author

kennthhz-zz commented Dec 14, 2021

Just to be clear, does io_uring_wait_cqe internally do a peek (which won't enter the kernel)? And if at least one CQE is available, will it then enter the kernel to dequeue? If none are ready, will it block the thread without entering the kernel (and without busy waiting, yielding the thread instead)? Is that true?
Also, if I am using peek and then marking the CQE as seen, can I do completion without ever entering kernel mode?

@axboe
Owner

axboe commented Dec 14, 2021

Yes, if an event is available and you ask for one, wait_cqe will not enter the kernel. You never need to enter the kernel to dequeue; that's simply updating the CQ head to mark it processed. If none are available, it'll enter the kernel and sleep until one arrives.

Also, if I am using peek and then marking the CQE as seen, can I do completion without ever entering kernel mode?

Correct
