Master issue for various performance bottlenecks that could be improved.
Memory bandwidth/use between threads
As identified by several people, a fair bit of time is spent shuffling data to and from the demod threads and concatenating it afterwards. Just removing the completely unused data in the shared recarray in #796 gave a notable improvement in performance, but there is more that could be improved.
- demod_raw is only used in one spot, in the dropout detect function, to check where the data exceeds a threshold. That check could just as well be done in the demod threads themselves, storing only the boolean array of where the threshold is exceeded, which should be much smaller.
- demod_burst would likely be fine to store as 32-bit instead of 64-bit float, since the values stay in a range where floating point precision is high anyhow.
- Ideally we should use shared memory for the result data if possible, to avoid copying between threads, as noted by limer and putnam on irc/discord (they indicated they may submit a PR for this when they are back home).
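To make the three ideas above concrete, here is a minimal sketch using plain NumPy and the standard library's `multiprocessing.shared_memory`. It is not the ld-decode code: the function names, the array length and the threshold value are all made up for illustration, and the arrays are random stand-ins for demod_raw and demod_burst.

```python
import numpy as np
from multiprocessing import shared_memory


def threshold_mask(demod_raw, threshold):
    """Hypothetical helper: do the threshold check in the worker and
    return only the boolean result instead of the float64 data."""
    return demod_raw > threshold


def roundtrip_via_shared_memory(arr):
    """Write arr into a shared block, attach to it by name the way a
    second process would, and read it back."""
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    try:
        producer_view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
        producer_view[:] = arr  # the only copy: worker result -> shared block

        # A consumer only needs the name, shape and dtype to attach.
        consumer = shared_memory.SharedMemory(name=shm.name)
        consumer_view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=consumer.buf)
        result = consumer_view.copy()  # copied here only so we can verify it

        # Views must be released before the blocks can be closed.
        del consumer_view
        consumer.close()
        del producer_view
        return result
    finally:
        shm.close()
        shm.unlink()


rng = np.random.default_rng(0)
demod_raw = rng.normal(size=32768)    # stand-in data, float64
demod_burst = rng.normal(size=32768)  # stand-in data, float64

mask = threshold_mask(demod_raw, 2.0)  # 1 byte per sample instead of 8
packed = np.packbits(mask)             # 1 bit per sample, if size matters more

burst32 = demod_burst.astype(np.float32)  # half the size of float64
max_abs_error = float(np.max(np.abs(burst32 - demod_burst)))

shared_copy = roundtrip_via_shared_memory(burst32)
```

Whether the float32 error is acceptable depends on the actual value range of demod_burst, so `max_abs_error` would need to be checked on real data rather than on random numbers like here.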
FFT
The real-input rfft functions should be used rather than fft wherever both the input and the result are real, i.e. where we don't need the imaginary part (which is only needed for the hilbert/demod function afaik). They are going to be faster, and the fft filters only need to be stored for about half as many frequency bins.
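As a rough illustration of the swap, the sketch below applies the same frequency-domain low-pass filter once with `fft`/`ifft` and once with `rfft`/`irfft`, using NumPy's FFT module. The filter itself and the function names are invented for the example and are not taken from ld-decode.

```python
import numpy as np


def lowpass_fft(signal, cutoff):
    """Full complex FFT; cutoff is in cycles per sample (0 to 0.5)."""
    spectrum = np.fft.fft(signal)
    freqs = np.abs(np.fft.fftfreq(len(signal)))
    spectrum[freqs > cutoff] = 0
    # The imaginary part is only rounding noise for a real input.
    return np.fft.ifft(spectrum).real


def lowpass_rfft(signal, cutoff):
    """Real-input FFT: only the non-negative frequencies are computed."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal))
    spectrum[freqs > cutoff] = 0
    # Pass n explicitly so odd-length inputs round-trip correctly.
    return np.fft.irfft(spectrum, n=len(signal))


rng = np.random.default_rng(1)
signal = rng.normal(size=4096)

full_bins = len(np.fft.fft(signal))   # one bin per sample
half_bins = len(np.fft.rfft(signal))  # n // 2 + 1 bins

filtered_full = lowpass_fft(signal, 0.1)
filtered_real = lowpass_rfft(signal, 0.1)
```

Both versions give the same output up to floating point rounding, while the rfft version keeps roughly half as many bins, which is also what a precomputed filter would have to cover.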
We're using pyfft rather than numpy's fft for speed improvements as of now. It has a bunch of settings/caching one could maybe play around with to improve things. It's currently not used on Windows, as it seems to conflict with using Thread instead of Process (Process doesn't work on Windows with the current code).
numba/native code optimization
Some of the tbc/sync stuff could benefit a ton from using numba (or alternatively cython or similar), as a lot of logic is being done in loops, which are slow in Python: dropout_detect_demod, refine_linelocs_pilot and refine_linelocs_hsync in particular, but probably more. (I've partially implemented the last one in cython in vhs-decode.)
Any run involving EFM will have a fair bit of extra startup time, as it uses numba classes whose compilation can't be cached, so they have to be recompiled on every run. If we start using cython or similar in ld-decode, it might be worth using that for this purpose instead.
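For a sense of what this kind of change looks like, here is a small sketch of a loop-heavy routine compiled with numba's `njit`. The function is a made-up stand-in (it finds runs of samples above a threshold), not one of the ld-decode functions named above, and the fallback decorator is only there so the example still runs where numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs without numba (just slower).
    def njit(func):
        return func


@njit
def find_runs_above(data, threshold):
    """Return (starts, ends) of consecutive runs where data > threshold.
    ends is exclusive. Written as an explicit loop on purpose, since this
    is the style that is slow in plain Python and fast once compiled."""
    starts = np.empty(len(data), dtype=np.int64)
    ends = np.empty(len(data), dtype=np.int64)
    count = 0
    in_run = False
    for i in range(len(data)):
        above = data[i] > threshold
        if above and not in_run:
            starts[count] = i
            in_run = True
        elif not above and in_run:
            ends[count] = i
            count += 1
            in_run = False
    if in_run:  # run continues to the end of the data
        ends[count] = len(data)
        count += 1
    return starts[:count], ends[:count]


samples = np.array([0.0, 5.0, 6.0, 0.0, 0.0, 7.0, 0.0, 8.0])
run_starts, run_ends = find_runs_above(samples, 4.0)
```

The first call pays the compilation cost, so the gain only shows up on functions that are called many times or on large arrays.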
JSON
I don't know if this has a large performance hit in practice, but as of now we are rewriting the whole json on every update rather than appending to the file, and the file can get pretty large on long runs. It might be worth looking into whether it's feasible to just append to the file and modify only the needed parts at the start instead.
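One possible way to append without rewriting, sketched under the assumption that the per-field records live in a single array written last in the file (the key name `"fields"` and the sample metadata are placeholders here, not a statement about the real file layout). Each append seeks back over the closing brackets, writes the new record, and writes the brackets again, so the file stays valid JSON after every call. Updating values at the start of the file is not covered by this sketch.

```python
import json
import os
import tempfile

# Closing of the record array and of the top-level object.
TAIL = b"\n]}"


def create_file(path, metadata):
    """Write the metadata followed by an empty, already closed record array."""
    head = json.dumps(metadata)[:-1]  # drop the closing "}"
    separator = ", " if metadata else ""
    with open(path, "wb") as f:
        f.write((head + separator + '"fields": [').encode("utf-8"))
        f.write(TAIL)


def append_field(path, field):
    """Append one record in place by overwriting the closing brackets."""
    with open(path, "r+b") as f:
        f.seek(-len(TAIL), os.SEEK_END)
        tail_pos = f.tell()
        # If the byte before the tail is "[", the array is still empty.
        f.seek(tail_pos - 1)
        is_empty = f.read(1) == b"["
        f.seek(tail_pos)
        prefix = b"\n" if is_empty else b",\n"
        f.write(prefix + json.dumps(field).encode("utf-8") + TAIL)


with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "example.json")
    create_file(json_path, {"note": "placeholder metadata"})
    append_field(json_path, {"seqNo": 1})
    append_field(json_path, {"seqNo": 2})
    with open(json_path, "r", encoding="utf-8") as f:
        loaded = json.load(f)
```

The cost of each append is then proportional to the size of the new record rather than to the size of the whole file.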