[ HINT (Solution) !!! ]
Right after typing my previous post, I was reading about this on the web and had an idea: decrease the buffer size of my audio device (StudioLive 32). I had read PreSonus saying that multicore performance was improved when they introduced Dropout Protection and low-latency input monitoring (v3.5), and that it separates the processing into different process blocks, or something like that.

When I got the results I mentioned before, I had the SLive at 2048 samples and maximum Dropout Protection. But I noticed that the Processing tab in the SOne Options does not enable multiple block sizes when it is set up like that. So I thought, hey, let's try putting the SLive at 128 samples. And voilà, the thread that was previously at 80% usage was a little lower, and all the other cores were a little higher. Already no more clicks and dropouts.

So I thought, hey, maybe if I set the hardware's block size as big as I can without disabling the multiple block sizes of Dropout Protection, it will work even better! So now my SLive is at 1024, and SOne at maximum Dropout Protection says 2048 samples process block size, and I can see different monitoring latencies below that. The result: half the threads now run higher, moving all over the place within 30%~50% usage, the other half sits at about 30% usage, and the thread that was previously at 80% is most often a bit higher than the others but never over 60%.
Since I observed that I start to get a few clicks and pops when the highest-usage thread gets to about 80%, I can say I now have about 20% of spare room to add even more before realtime playback gets glitchy. I am running that same huge session even with the Softube Tape Mix FX active (which is VERY HEAVY on the CPU). BEAUTIFUL! Everything is online and running in realtime with zero glitches.
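For anyone who wants to check their own numbers, here is a tiny sketch of that spare-room estimate. The 80% glitch threshold is only what I observed on my machine, and the function name and the per-thread readings are made up for illustration.

```python
def spare_room_pct(per_thread_usage_pct, glitch_threshold_pct=80.0):
    """Return how far the busiest thread is from the glitch threshold.

    Assumption: glitches start when the single busiest thread reaches
    the threshold (about 80% in my case), so only the maximum matters.
    """
    busiest = max(per_thread_usage_pct)
    return glitch_threshold_pct - busiest


# Hypothetical readings shaped like the ones I described:
# half the threads around 30%, half between 30% and 50%, one peaking at 60%.
readings = [30.0, 30.0, 30.0, 30.0, 35.0, 45.0, 50.0, 60.0]
print(spare_room_pct(readings))
```

The point is that the average load does not matter here, only the hottest thread does.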
SET YOUR AUDIO DEVICE BLOCK SIZE LOWER THAN 2048 AND DROPOUT PROTECTION TO MAXIMUM.
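To get a feel for what these block sizes mean in time, here is a rough sketch that converts a block size in samples to the duration of one block in milliseconds. The 48 kHz sample rate is an assumption (I did not state my session's rate above), and the result covers only one buffer, not the full round-trip latency that Studio One reports.

```python
def block_duration_ms(block_size_samples, sample_rate_hz=48_000):
    """Duration of one audio block in milliseconds.

    Assumption: 48 kHz sample rate by default; pass your own rate.
    This is one buffer's worth of time, not total monitoring latency.
    """
    return 1000.0 * block_size_samples / sample_rate_hz


for size in (128, 2048):
    print(f"{size} samples = {block_duration_ms(size):.2f} ms per block")
```

A bigger block gives the CPU more time per callback, which is why the large process block helps with heavy sessions.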
(You should see different monitoring latencies for standard and low latency in the table at Options > Audio Setup > Processing. It looks to me like setting the device's block size to the same number as the process block size causes Studio One to not use the multicore optimization from Studio One v3.5. Maybe they could fix that so it still uses multiple processing blocks even when the two blocks are the same size? Or perhaps set the Process Block Size to 4096? Or maybe it is just that the audio device's buffer is still single-core and only the "virtual" internal processing buffer has been multicore optimized? In that case, is it possible to also optimize the audio device's buffer so we can get multicore efficiency even when not using Dropout Protection to have a bigger processing block size than the device's?)
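Written as a rule of thumb, my guess looks like the sketch below. This only encodes what I observed on my setup, not documented Studio One behavior, and the function name is hypothetical.

```python
def multiple_block_sizes_expected(device_block, process_block):
    """My observed rule of thumb, not an official Studio One rule.

    Assumption: the separate low-latency and standard processing paths
    (and the better multicore spread) only show up when the device block
    size is strictly smaller than the process block size.
    """
    return device_block < process_block


# Settings I tried, with the process block at 2048 (maximum Dropout Protection):
for device_block in (2048, 128):
    ok = multiple_block_sizes_expected(device_block, 2048)
    print(f"device {device_block} / process 2048 -> multiple block sizes: {ok}")
```

If that guess is right, the practical setting is the largest device block size that is still below the process block size.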