My first guess is that limiting yourself to 5 simultaneous threads is what makes the difference. SDK 3 is built around async/await, which adds some overhead, so an individual request may be slightly slower. In exchange, throughput under load generally goes up. Try raising the limit from 5 to 10 or 20 and see what that does. The optimum number depends on a lot of factors. For example, if Couchbase Server is running on your local machine, the optimum will be much lower because there is no network latency.
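If it helps, here is a rough sketch of how you might sweep the limit and compare throughput. I don't know what your benchmark looks like, so everything here is an assumption: `ConcurrencySweep`, `RunAsync`, and `fakeOperation` are made-up names, and `Task.Delay(5)` is only a placeholder for whatever SDK call you are actually timing. It uses `SemaphoreSlim` to cap the number of operations in flight rather than dedicated threads.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrencySweep
{
    // Runs `totalOps` invocations of `operation` with at most
    // `maxConcurrency` of them in flight at once, and returns the elapsed time.
    public static async Task<TimeSpan> RunAsync(
        Func<Task> operation, int totalOps, int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var stopwatch = Stopwatch.StartNew();

        var tasks = Enumerable.Range(0, totalOps).Select(async _ =>
        {
            await gate.WaitAsync();        // wait for a free slot
            try { await operation(); }
            finally { gate.Release(); }    // always give the slot back
        }).ToList();

        await Task.WhenAll(tasks);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static async Task Main()
    {
        // ASSUMPTION: placeholder for the real call being benchmarked.
        Func<Task> fakeOperation = () => Task.Delay(5);
        const int totalOps = 1000;

        foreach (var limit in new[] { 5, 10, 20 })
        {
            var elapsed = await RunAsync(fakeOperation, totalOps, limit);
            Console.WriteLine(
                $"{limit} concurrent: {totalOps / elapsed.TotalSeconds:F0} ops/sec");
        }
    }
}
```

Swap the placeholder for your real operation and run each limit a few times, since a single run can be noisy.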
Another thing, obvious as it is, still managed to bite me when I was benchmarking the SDK early on: make sure you're compiling in Release mode AND that you don't attach the debugger when you run.
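One way to avoid getting caught by this is to have the benchmark check for itself at startup. This is only a sketch: `BenchmarkGuard` is a made-up helper, and it assumes your project defines the `DEBUG` symbol for Debug builds, which the default project templates do.

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkGuard
{
    // True when compiled with the DEBUG symbol defined (the default Debug configuration).
    public static bool IsDebugBuild()
    {
#if DEBUG
        return true;
#else
        return false;
#endif
    }

    // Prints a warning for each condition that makes timings unreliable.
    public static void WarnIfUnreliable()
    {
        if (IsDebugBuild())
            Console.Error.WriteLine("Warning: Debug build, timings will be off.");

        if (Debugger.IsAttached)
            Console.Error.WriteLine("Warning: debugger attached, timings will be off.");
    }
}
```

Call `BenchmarkGuard.WarnIfUnreliable()` first thing in your benchmark. From the command line, `dotnet run -c Release` covers both conditions. In Visual Studio, switch the configuration to Release and use Start Without Debugging (Ctrl+F5).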