**Describe the bug**
After reading a recent [Phoronix benchmark](ht…tps://www.phoronix.com/scan.php?page=article&item=hp-devone-linux&num=3) (a bit down the page), I decided to investigate why Arch Linux was so much slower (10-20x) in zstd performance. It turned out that something is wrong with some of the build systems included with zstd!
When zstd is built with the CMake or Meson build systems, compression throughput scales negatively with the number of threads, whereas when it is built with the Makefile in the top-level directory, it scales positively.
**To Reproduce**
Steps to reproduce the behavior:
1. Build with the build system you want to investigate. One of:
* Plain `make`
* `mkdir build && cd build && cmake ../zstd-1.5.2/build/cmake/ && make`
* `meson setup builddir && cd builddir && ninja`
2. Test the resulting binary on a large file. I used the [FreeBSD image](https://download.freebsd.org/releases/amd64/amd64/ISO-IMAGES/13.1/FreeBSD-13.1-RELEASE-amd64-memstick.img), as this is what the Phoronix Test Suite used, albeit an older version that I can't find. I can, however, reproduce the same issue with the linked file. Use the following pair of benchmark commands and compare scaling:
* `path/to/zstd -T1 -b4 path/to/FreeBSD-13.1-RELEASE-amd64-memstick.img`
* `path/to/zstd -T6 -b4 path/to/FreeBSD-13.1-RELEASE-amd64-memstick.img` (adjust -T6 based on the number of cores you have)
*Note!* I see the same pattern at other compression levels such as 6 and 8, not just 4, so the level does not appear to matter as long as it is the same in both runs.
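To make the comparison less tedious, the two benchmark commands above can be wrapped in a small helper. This is only a sketch: `bench_scaling` is a hypothetical name, and it assumes nothing beyond the `-T` and `-b` options already used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of zstd): run zstd's built-in benchmark (-b)
# at one compression level for several thread counts (-T), so that binaries
# from different build systems can be compared on the same input.
# Usage: bench_scaling ZSTD_BINARY INPUT_FILE LEVEL THREADS...
bench_scaling() {
    zstd_bin=$1
    input=$2
    level=$3
    shift 3
    for threads in "$@"; do
        # Print the command line first so each result block is labelled.
        echo "== $zstd_bin -T$threads -b$level $input"
        "$zstd_bin" "-T$threads" "-b$level" "$input"
    done
}
```

For example, `bench_scaling ./zstd path/to/FreeBSD-13.1-RELEASE-amd64-memstick.img 4 1 6` runs the same pair of commands as in step 2 (adjust the thread counts to your core count).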
**Expected behavior**
I expect that all build systems should result in binaries with roughly the same behaviour. Performance and scaling should be similar.
**Actual results**
The output below has been abbreviated for clarity: repeated command lines have been elided, showing only the output. Three runs were performed for each combination of program and flags. As can be seen, the results are relatively consistent run-to-run (at least consistent enough given the huge discrepancies).
1. CMake
```console
$ programs/zstd -T1 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1108.5 MB/s, 4999.1 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1152.9 MB/s, 5006.7 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1102.1 MB/s, 4978.5 MB/s
$ programs/zstd -T6 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 717.0 MB/s, 4940.0 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 759.3 MB/s, 4893.5 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 697.5 MB/s, 4869.6 MB/s
```
2. Meson
```console
$ programs/zstd -T1 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1097.0 MB/s, 5029.3 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1098.2 MB/s, 4970.2 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1117.8 MB/s, 4952.6 MB/s
$ programs/zstd -T6 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 735.0 MB/s, 4982.8 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 758.6 MB/s, 4966.9 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 727.9 MB/s, 4949.7 MB/s
```
3. Makefile
```console
$ ./zstd -T1 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1118.2 MB/s, 4971.0 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1105.4 MB/s, 4931.3 MB/s
4#md64-memstick.img :1172165120 -> 781156418 (x1.501), 1150.6 MB/s, 4930.1 MB/s
$ ./zstd -T6 -b4 ~/Downloads/FreeBSD-13.1-RELEASE-amd64-memstick.img
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 3518.0 MB/s, 4898.2 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 3486.3 MB/s, 4917.0 MB/s
4#md64-memstick.img :1172165120 -> 781637623 (x1.500), 3528.1 MB/s, 4900.8 MB/s
```
**Analysis of results**
For CMake and Meson, compression speed goes down between 1 thread and 6 threads: from ~1100 MB/s to ~700 MB/s.
For plain make, compression speed goes up between 1 thread and 6 threads: from ~1100 MB/s to ~3500 MB/s.
Decompression speed (the second value), however, does not vary significantly across the experiments.
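The scaling factors can be worked out from the benchmark output with a little awk. The sketch below is an assumption-laden helper, not part of zstd: `mean_compress_speed` and `speedup` are hypothetical names, and the parsing assumes the output layout shown above, where the first number followed by `MB/s` is the compression speed.

```shell
#!/bin/sh
# Hypothetical helpers for summarising `zstd -b` output lines.

# Read benchmark lines on stdin and print the mean compression speed,
# taken as the first field that is immediately followed by "MB/s".
mean_compress_speed() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($(i + 1) ~ /^MB\/s/) { sum += $i; n++; break }
        }
    }
    END { if (n > 0) printf "%.1f\n", sum / n }'
}

# speedup SINGLE_MBPS THREADED_MBPS: print threaded / single-threaded ratio.
speedup() {
    awk -v base="$1" -v threaded="$2" 'BEGIN { printf "%.2f\n", threaded / base }'
}
```

With the rounded figures above, `speedup 1100 700` prints `0.64` for the CMake and Meson builds, while `speedup 1100 3500` prints `3.18` for the Makefile build.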
**Desktop (please complete the following information):**
- OS: Arch Linux
- Version 1.5.2 (upstream tarball)
- Compiler: GCC 12.1.0
- Flags: Defaults for each build system. I also tested with some basic flags such as -O2, but that did not affect the overall behaviour.
- Other relevant hardware specs: AMD Ryzen 5 5600X 6-Core Processor
- Build system: Multiple ones, that is the whole point of this bug