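Background: I was running a CNN model frame by frame over a video sequence and noticed that the CPU was fully saturated.
Debugging showed that the problem came from torchvision's transforms.ToTensor.
After some searching, I found that PyTorch's multithreaded CPU computation uses all available cores by default, so CPU usage spikes to 100%. This slows down other applications considerably, so the thread count needs to be limited.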
Solutions:
1) torch.set_num_threads(1)
Manually limit the number of threads torch uses.
2) Set environment variables
export OMP_NUM_THREADS=1 or export MKL_NUM_THREADS=1
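Both options can also be applied from inside a Python script. The following is a minimal sketch, not a definitive recipe: the helper name limit_math_threads is made up for illustration, and it assumes the environment variables are set before numpy or torch is imported, since the thread pools typically read them at load time. The torch call is guarded so the snippet also runs where PyTorch is not installed.

```python
import os


def limit_math_threads(n: int = 1) -> None:
    """Cap OpenMP/MKL thread pools (hypothetical helper, for illustration).

    Call this before importing numpy or torch; libraries that are already
    loaded may ignore later changes to these variables.
    """
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n)


limit_math_threads(1)

try:
    import torch  # imported only after the environment variables are set

    # Option 1 from the list above: cap torch's intra-op thread pool.
    torch.set_num_threads(1)
    print("torch threads:", torch.get_num_threads())
except ImportError:
    print("torch not installed; only the environment variables were set")
```

Setting the variables in the shell with export has the same effect and does not depend on import order inside the script.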
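That said, running a computation on multiple threads should in theory make it faster. Whether it actually does is something you have to measure yourself.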
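One way to run that measurement is to launch the same workload in a child process several times with different OMP_NUM_THREADS values and compare wall-clock time. This is only a sketch: time_with_threads is a hypothetical helper, and the workload below is a pure-Python placeholder that does not use OpenMP at all, so replace it with your own inference script to get meaningful numbers.

```python
import os
import subprocess
import sys
import time


def time_with_threads(cmd, num_threads):
    """Run cmd in a child process with OMP_NUM_THREADS set; return seconds."""
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


# Placeholder workload (assumption): swap in e.g. [sys.executable, "infer.py"].
workload = [sys.executable, "-c", "sum(i * i for i in range(10**5))"]

for n in (1, 2, 4):
    elapsed = time_with_threads(workload, n)
    print(f"OMP_NUM_THREADS={n}: {elapsed:.3f} s")
```

Running each setting in a fresh process matters because the thread count is usually fixed when the numeric library is first loaded.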
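About OpenMP
OpenMP (Open Multi-Processing) is an API for shared-memory multithreading; it starts multiple threads inside a single process. The default thread count is the number of threads the CPU can actually run at once: usually the number of cores, or a multiple of it when the CPU supports simultaneous multithreading (SMT). A quad-core CPU without hyper-threading therefore gets 4 OpenMP threads by default.
If you use Python's multiprocessing module on that quad-core CPU and start 4 processes, each OpenMP call still starts 4 threads, so you end up with 16 threads on 4 cores. This works, but the task-switching overhead makes it inefficient.
Setting OMP_NUM_THREADS=1 turns off OpenMP multithreading, so each Python process runs single-threaded.
The original explanation, from the Stack Overflow answer in [2]: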
OpenMP does multi-threading within a process, and the default number of threads is typically the number that the CPU can actually run simultaneously. (This is generally the number of CPU cores, or a multiple of that number if the CPU has an SMT feature such as Intel’s Hyper-Threading.) So if you have, for example, a quad-core non-hyperthreaded CPU, OpenMP will want to run 4 threads by default.
When you use Python’s multiprocessing module, your program starts multiple Python processes which can run simultaneously. You can control the number of processes, but often you’ll want it to be the number of CPU cores/threads, e.g. returned by multiprocessing.cpu_count().
So, what happens on that quad-core CPU if you run a multiprocessing program that runs 4 Python processes, and each calls an OpenMP function that runs 4 threads? You end up running 16 threads on 4 cores. That’ll work, but not at peak efficiency, since each core will have to spend some time switching between tasks.
Setting OMP_NUM_THREADS=1 basically turns off the OpenMP multi-threading, so each of your Python processes remains single-threaded.
Make sure you’re starting enough Python processes if you do this, though! If you have 4 CPU cores and you only run 2 single-threaded Python processes, you’ll have 2 cores utilized and the other 2 sitting idle. (In this case you might want to set OMP_NUM_THREADS=2.)
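The arithmetic in the passage above can be written out as a short sketch. The function names are hypothetical, and splitting the cores evenly across processes is a simple assumed heuristic, not a rule from OpenMP itself.

```python
import os


def total_threads(num_processes, omp_threads):
    """Threads running at once when every process starts omp_threads threads."""
    return num_processes * omp_threads


def threads_per_process(num_cores, num_processes):
    """Split the cores evenly across processes, with at least 1 thread each."""
    return max(1, num_cores // num_processes)


print(total_threads(4, 4))        # 16 threads competing for 4 cores
print(threads_per_process(4, 4))  # 1: one single-threaded process per core
print(threads_per_process(4, 2))  # 2: two processes, two threads each

# On the current machine, for a planned pool of 2 worker processes:
cores = os.cpu_count() or 1
print("suggested OMP_NUM_THREADS:", threads_per_process(cores, 2))
```

The suggested value would then be exported as OMP_NUM_THREADS before the worker processes are started.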
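As a test, I ran a multi-process program that includes numpy operations, where the numpy matrix computations use multithreaded acceleration.
I compared runs with and without a limit on the OMP thread count. Without the limit, CPU usage was very high and the total run time was much longer than with the limit. So in some scenarios, limiting the OMP thread count actually speeds things up considerably. High CPU usage does not necessarily mean faster execution!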
[Figure: CPU usage without OMP_NUM_THREADS set]
[Figure: CPU usage with export OMP_NUM_THREADS=1]
References:
[1] https://jdhao.github.io/2020/07/06/pytorch_set_num_threads/
[2] https://stackoverflow.com/questions/43894608/use-of-omp-num-threads-1-for-python-multiprocessing
[3] https://stackoverflow.com/questions/30791550/limit-number-of-threads-in-numpy