Most workloads fall into one of two types:
- Poorly scalable across cores: almost everything runs in one thread
- Highly scalable across cores: almost everything can be spread over many threads
The number of workloads in between, which can keep for example 4 or 8 cores busy but no more, is simply too small.
For poorly scalable workloads, you want a few really fast cores. How many depends mainly on how many non-scalable workloads you want to run at the same time. For a game that may be 3 to 6, rarely more.
Then you have highly scalable workloads. You can, of course, make those faster by cramming as many large, fast cores as possible onto a chip. But this is not the most efficient way, because large cores take up disproportionately more die area to extract the last bits of performance. And that area is scarce on a chip: every mm² costs money.
If we look at the SPEC2017 benchmarks for Alder Lake, we see that the P-core scores 8.14 for integer and 14.16 for floating point. The E-core scores 5.25 for integer (65% of the P-core's performance) and 7.66 for floating point (54%). So roughly two thirds and half of the performance respectively, depending of course on the workload.
However, the E-core takes up a lot less space. I can't find the exact sizes at the moment, but let's say 4 E-cores fit in the same area as 1 P-core. Together, these 4 cores deliver (with perfect scaling) 2.6x the integer or 2.2x the floating-point performance in the same area as 1 P-core!
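To make the arithmetic easy to check, here is a small Python sketch using the SPEC2017 scores quoted above. The 4-to-1 area ratio is the same guess as in the text, not a measured die size, and the function names are just made up for this illustration. It also assumes perfect scaling across the E-cores.

```python
# SPEC2017 scores quoted in the post for Alder Lake.
P_CORE = {"int": 8.14, "fp": 14.16}
E_CORE = {"int": 5.25, "fp": 7.66}

# Assumption from the post, not a measured value:
# 4 E-cores fit in the die area of 1 P-core.
E_CORES_PER_P_AREA = 4


def relative_performance(kind):
    """Performance of one E-core as a fraction of one P-core."""
    return E_CORE[kind] / P_CORE[kind]


def throughput_per_p_area(kind, e_cores=E_CORES_PER_P_AREA):
    """Combined E-core throughput in the area of one P-core,
    relative to that P-core, assuming perfect scaling."""
    return e_cores * relative_performance(kind)


if __name__ == "__main__":
    for kind in ("int", "fp"):
        share = relative_performance(kind)
        total = throughput_per_p_area(kind)
        print(f"{kind}: one E-core = {share:.1%} of a P-core, "
              f"{E_CORES_PER_P_AREA} E-cores = {total:.2f}x a P-core")
```

If the real ratio turns out to be 3 E-cores per P-core instead of 4, passing `e_cores=3` still gives more throughput than one P-core in the same area for both integer and floating point, so the argument does not hinge on the exact guess.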
So in the future I expect a lot of chips with 4 to 8 performance cores for single-thread performance, and tens of efficiency cores for multi-threaded performance. And not only does Intel seem to be aware of this: AMD is also said to be working on the same idea with its Zen 4 Dense cores.