I finally decided to look into GPU acceleration after playing with whisper.cpp and realizing that OpenCL was still Actually Useful(tm). (Seriously, someone should've nudged me a while ago. Maybe I'm stubborner than I think I am... ;) )
This would involve a bit of refactoring, but if it gets a 2x performance boost it'd be worth it:
- The multithreaded demodcache should be torn out (this would generally be a win)
- Data needs to be kept GPU-side as much as possible
- There would probably have to be a wrapper so that non-OpenCL envs would still run. (Apple has deprecated it so it will probably go away sooner or later on Mac - there will probably be something to run there by the time it does...)
I'm still in the testing phase. On my main test platform (Dell T3600 w/6-core Sandy Bridge and a GeForce 3060 12GB) pyvkfft is 150% faster at the standard blocksize (64K samples), and ~15x faster at 1MB. So this will probably shift the bottleneck to the TBC even further unless things can be kept on the GPU side most of the time.
I'm also going to look at a secondary test potato^Wplatform, a Mele Quieter3C which has a Celeron N5105 and its integrated GPU. The latter does pyvkfft benchmarks at about 4-5% the speed of the 3060, but since the CPU does not support AVX(2) it might still be faster than running the FFTs on the CPU. (By the way, the new Nxxx series does have AVX2 and would only lag behind a Haswell i5 because it only has one memory channel. Not bad.)
At a later point, I'm planning on getting my hands on an RK3588 board - if OpenCL is running there with the free drivers I'll try that too, but its Cortex-A76 cores have enough SIMD that it might not help.