I benchmark my models using https://github.com/ofsoundof/NTIRE2022_ESR, and I get different runtimes on the valid and test datasets, e.g., 29 s on valid but 19 s on test.
Eventually, I found that this is caused by the first several runs of the model. The first several runs on the valid dataset may take 60 s, 45 s, 40 s, etc., while the remaining runs on the valid dataset and all runs on the test dataset are normal (around 19 s). These slow initial runs inflate the average runtime on the valid dataset, and therefore the overall runtime.
Have you considered this difference in the evaluation? Or do we have to take it into account when designing our models?
Besides, it seems that warming up with several forward passes of the model can reduce the difference.
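For reference, here is a minimal sketch of the kind of warm-up I mean, assuming a PyTorch model on a CUDA device; `model`, `input_size`, and `n_iters` are placeholders of my own, not part of the challenge's actual evaluation code:

```python
import torch

def warm_up(model, input_size=(1, 3, 256, 256), n_iters=5, device="cuda"):
    # Run a few dummy forward passes so CUDA kernels are loaded/cached
    # and the GPU clocks settle before the timed runs begin.
    model = model.eval().to(device)
    dummy = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(n_iters):
            model(dummy)
    # Wait for all queued kernels to finish before timing starts.
    torch.cuda.synchronize()
```

Calling something like `warm_up(model)` once before the benchmark loop makes the first timed runs behave like the later ones in my experiments.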
Thanks a lot for the findings. We will consider that in the final phase.
Posted by: ofsoundof @ Feb. 14, 2023, 11:48 a.m.