Kaldaien here: Have I mentioned I hate juggling accounts? I never have access to them when most of these sites force me to change my password constantly. So I really do prefer it if you keep any discussion with me to the Steam forums.
In any case, allow me to try and address some of the things discussed so far as best I can.
Let me start by pointing out that your understanding of SMT leaves a little bit to be desired. I am not criticizing you, just letting you know that if you view physical and logical cores as having the same set of scheduling challenges, then you have already killed your software's performance. You would not be alone though; many games perform worse with SMT enabled because the developers do not fully appreciate how it works.
Here is an idealized workload if I magically had 16 full processor cores that could retire completed instructions in parallel without needing to borrow resources from the paired logical core:
Call this (i), and don't ever do this in the real world:
Code:
A   B   C   D   E   F   AA  BB  CC  DD  EE  FF  AAA BBB CCC DDD
0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
That does not actually work for two reasons:
- The queue these threads are getting their data from has to acquire/release locks constantly
- The logical processor pair {0,1} does not contain two fully independent execution pipelines.
Let's try something smarter now….
(ii): This is an SMT-aware design. Instead of 16 workers we have 8, and we allow the scheduler to move each thread between the two logical processors of an SMT-friendly pair whenever permissible.
Code:
CPU {0,1} share data, instruction cache, and have the same locality to near/far memory.
CPU {2,3} same deal
CPU {4,5} …
CPU {6,7} …
CPU {8,9} …
CPU {10,11} …
CPU {12,13} …
CPU {14,15} …
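To make (ii) concrete, here is a minimal sketch of the pairing and the resulting worker count. It assumes SMT siblings are numbered adjacently ({0,1}, {2,3}, ...) as in the listing above; some operating systems enumerate siblings differently, so a real implementation would query the CPU topology instead. The function names `smt_pairs` and `worker_count` are made up for illustration.

```python
def smt_pairs(logical_cpus, smt_ways=2):
    """Group logical CPU indices into sibling sets.

    ASSUMPTION: siblings are numbered adjacently, as in the listing above.
    """
    return [
        tuple(range(first, min(first + smt_ways, logical_cpus)))
        for first in range(0, logical_cpus, smt_ways)
    ]


def worker_count(logical_cpus, smt_ways=2):
    """One worker per sibling set instead of one per logical CPU."""
    return max(1, logical_cpus // smt_ways)


if __name__ == "__main__":
    for pair in smt_pairs(16):
        print("worker may run on logical CPUs", pair)
    print("workers:", worker_count(16))
```

Each worker would then be allowed to run on either logical CPU in its pair, which is the "move threads between two SMT-friendly cores" part; how you express that affinity is platform-specific and left out here.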
We're now moving things in a more sensible direction by reducing the number of parallel queues, but ideally we need to pipeline these things to make up for the various sources of contention caused by going parallel.
(iii): A 4-thread pool with a 4-job-deep queue, locking illustrated:
Code:
*
* D
* C D
* B C D
A B C D
A B C * ….
A B * -
A * - - (Queue 0 has 4 jobs fetched and has started on the first job, the lock has been released and queue 1 is now fetching the next 4 jobs)
* - - - (Queue 0 is fetching 4 jobs, holds a lock, and 1, 2, 3 must wait)
0 1 2 3
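The idea in (iii) can be sketched roughly like this: each worker takes the lock once, grabs up to 4 jobs, releases the lock, and only then does the work. This is a toy illustration under my own assumptions, not how Unity's job system is implemented; `BatchQueue`, `run`, and the squaring "job" are hypothetical stand-ins.

```python
import threading
from collections import deque

BATCH = 4  # jobs fetched per lock acquisition, as in diagram (iii)


class BatchQueue:
    """Shared job queue; workers fetch several jobs per lock acquisition."""

    def __init__(self, jobs):
        self._jobs = deque(jobs)
        self._lock = threading.Lock()
        self.lock_acquisitions = 0  # counted only to show the contention saved

    def fetch_batch(self):
        with self._lock:  # the other workers wait here (the '-' in the diagram)
            self.lock_acquisitions += 1
            batch = []
            while self._jobs and len(batch) < BATCH:
                batch.append(self._jobs.popleft())
            return batch


def worker(queue, results, results_lock):
    while True:
        batch = queue.fetch_batch()
        if not batch:
            return  # queue drained
        done = [job * job for job in batch]  # stand-in work, done outside the lock
        with results_lock:
            results.extend(done)


def run(jobs, workers=4):
    queue = BatchQueue(jobs)
    results, results_lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(queue, results, results_lock))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results), queue.lock_acquisitions


if __name__ == "__main__":
    results, acquisitions = run(range(32))
    print(len(results), "jobs done with", acquisitions, "lock acquisitions")
```

With 32 jobs and 4 workers the queue lock is taken 12 times (8 full batches plus one empty fetch per worker on the way out), versus at least 32 times if every job were fetched individually.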
This is a more detailed discussion than the topic ever deserves here, but I was completely taken aback when it was suggested I do not understand multi-threaded execution. Unity doesn't understand; I understand just fine.
Creating a very wide and shallow distribution of jobs in a threadpool is not ideal in an SMT-based system and only serves to hurt the performance of the other tasks the game needs to continue doing.
You can generally improve throughput of the system as a whole if you do not create a massive threadpool that pre-empts more important tasks (thread priority is missing from most of Unity).
Latency to retire the same number of jobs increases, but you will not interrupt the threads that are constantly buffering audio or delivering graphics commands to the driver.
- That problem of the render thread being interrupted and starved of CPU time is emphatically why performance is in the toilet in this game and the driver can't keep my GPU load above 25%.
You can solve this any number of ways, and if I had a better working overview of Unity's actual jobs I might instead opt to tweak the priority scheduling rather than constricting parallel work queues.
I did adequate profiling with my own custom tools to come to the conclusion that fewer threads in the pool is simple and effective and that Unity reacts to being told there are fewer CPU cores in a way favorable to the discussion above.