I benchmark my models using https://github.com/ofsoundof/NTIRE2022_ESR, and I get different runtimes on the valid and test datasets, e.g., 29s on valid versus 19s on test.
I eventually found that this is caused by the first several runs of the model: the first few runs on the valid dataset may take 60s, 45s, 40s, etc., while the remaining runs on the valid dataset and all runs on the test dataset are normal (around 19s). These slow initial runs inflate the average runtime on the valid dataset and, in turn, the overall runtime.
Have you accounted for this difference in the evaluation, or do we have to take it into account when designing our models?
Besides, it seems that warming up with several forward passes of the model can reduce the difference.
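For reference, this is roughly the kind of warm-up I mean; a minimal sketch only, assuming a PyTorch model on CUDA, with the model, input, and iteration counts as placeholders rather than the repo's actual timing script:

```python
import torch

def timed_runs(model, x, n_warmup=10, n_timed=100):
    """Run a few untimed warm-up forwards, then report the average timed forward (ms)."""
    model.eval().cuda()
    x = x.cuda()
    with torch.no_grad():
        # Warm-up forwards absorb one-time costs (CUDA context init, cuDNN autotuning,
        # memory allocation) that otherwise make the first runs look much slower.
        for _ in range(n_warmup):
            model(x)
        torch.cuda.synchronize()

        # Time with CUDA events so the GPU work itself is measured, not just kernel launches.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_timed):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n_timed  # average milliseconds per forward
```

With warm-up iterations like this, the timings I get on valid and test become much closer.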
Hi,
we use multiple runs for GPU warm-up and also report the average runtime per image over multiple inference passes. We will update the GitHub repo shortly; then you can test our inference procedure. In case something is off, please feel free to open an issue and we'll have a look.
Posted by: eduardzCVLJMU @ Feb. 14, 2023, 9:48 a.m.
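For illustration, the procedure described in the reply above (warm-up runs followed by averaging the per-image runtime over several inference passes) could look roughly like the following sketch. This is not the challenge's actual evaluation code; the model, dataloader, and pass counts are placeholders:

```python
import torch

@torch.no_grad()
def average_runtime_per_image(model, dataloader, n_warmup_passes=2, n_timed_passes=3):
    """Warm up the GPU, then return the average runtime per image in milliseconds."""
    model.eval().cuda()
    # Warm-up passes over the data; their timings are discarded.
    for _ in range(n_warmup_passes):
        for lr in dataloader:
            model(lr.cuda())
    torch.cuda.synchronize()

    # Timed passes: accumulate GPU time and image count, then report the mean.
    total_ms, n_images = 0.0, 0
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(n_timed_passes):
        for lr in dataloader:
            lr = lr.cuda()
            start.record()
            model(lr)
            end.record()
            torch.cuda.synchronize()
            total_ms += start.elapsed_time(end)
            n_images += lr.size(0)
    return total_ms / n_images
```

Averaging over several full passes after warm-up keeps the slow initial iterations out of the reported number, which matches the behavior described in the question.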