Trojan Detection Challenge - Evasive Trojans Track Forum


> Out of CUDA memory

Hello everyone.

May I ask for your help with the following bug, which occurred after my submission?
I am confused about how this error can happen, since I only submitted models, not code.

Traceback (most recent call last):
File "/tmp/codalab/tmptceN7l/run/program/evaluation.py", line 540, in <module>
result, attack_success_rates = check_specifications(save_dir, attack_specifications, data_root=os.path.join(input_dir, 'ref', 'evasive_trojans', 'data'))
File "/tmp/codalab/tmptceN7l/run/program/evaluation.py", line 256, in check_specifications
model.cuda().eval()
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 680, in cuda
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 570, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 570, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 593, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 680, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Posted by: tdteach @ Oct. 5, 2022, 3:17 p.m.

Hello,

I just restarted the compute worker that evaluates submissions. Could you try uploading your submission again?

Thanks,
Mantas (TDC co-organizer)

Posted by: mmazeika @ Oct. 5, 2022, 7:07 p.m.

After one successful submission, the error occurred again on my latest submission.
Would you mind suggesting a proper strategy to avoid this error happening again?

Traceback (most recent call last):
File "/tmp/codalab/tmpp4pGkM/run/program/evaluation.py", line 540, in <module>
result, attack_success_rates = check_specifications(save_dir, attack_specifications, data_root=os.path.join(input_dir, 'ref', 'evasive_trojans', 'data'))
File "/tmp/codalab/tmpp4pGkM/run/program/evaluation.py", line 256, in check_specifications
model.cuda().eval()
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 680, in cuda
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 570, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 570, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 593, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 680, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Posted by: tdteach @ Oct. 6, 2022, 5:43 a.m.
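As a general aside on the strategy question: when an evaluation script calls `model.cuda()` on many checkpoints in sequence, one common way to keep peak GPU memory bounded is to hold only one model on the GPU at a time and release memory before loading the next. A minimal sketch of that pattern (the tiny `nn.Linear` models and the `evaluate_sequentially` helper are illustrative, not taken from the competition's `evaluation.py`):

```python
import gc
import torch
import torch.nn as nn

def evaluate_sequentially(models, batch):
    """Evaluate models one at a time so only one occupies the GPU at once."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    outputs = []
    for model in models:
        model = model.to(device).eval()
        with torch.no_grad():
            outputs.append(model(batch.to(device)).cpu())
        # Move the model off the GPU and release cached blocks before the
        # next checkpoint loads, so peak usage stays near a single model.
        model.to("cpu")
        gc.collect()
        if device.type == "cuda":
            torch.cuda.empty_cache()
    return outputs

# Tiny illustrative models; real checkpoints would be loaded from disk.
models = [nn.Linear(8, 2) for _ in range(3)]
outs = evaluate_sequentially(models, torch.zeros(4, 8))
print([tuple(o.shape) for o in outs])  # [(4, 2), (4, 2), (4, 2)]
```

Note that this pattern cannot help if the OOM comes from another process occupying the same GPU (as turned out to be the case in this thread); it only bounds the memory the evaluation script itself uses.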

This error has wasted my three submission chances for the day.
Could you please look into this ERROR seriously? It may be caused by your memory allocation strategy.

Posted by: tdteach @ Oct. 6, 2022, 2:06 p.m.

Hello,

Apologies for not seeing your response yesterday. I am looking into this now. Would you be able to let me know whether the submission that worked is identical to the ones that failed?

All the best,
Mantas (TDC co-organizer)

Posted by: mmazeika @ Oct. 6, 2022, 6:40 p.m.

(this is likely an issue on our end; I'm just asking to gather more information)

Posted by: mmazeika @ Oct. 6, 2022, 6:58 p.m.

Yes, they are the same.

Posted by: tdteach @ Oct. 6, 2022, 7:47 p.m.

Actually, I have submitted 4 versions across 7 submissions, in this order: v1 (accepted), v2 (failed), v2 (accepted), v3 (failed), v4 (failed), v3 (failed), v4 (accepted).

Posted by: tdteach @ Oct. 6, 2022, 7:49 p.m.

OK, thank you for the information.

I think I fixed the problem. It turns out there were two compute workers running, which were both accepting and processing jobs. One of them was using unauthorized resources, which caused submissions to fail sporadically. I thought I had removed the second compute worker, but it looks like I forgot to do so. Thank you for bringing this to my attention. Hopefully this fixes the problem, but let me know if it happens again.

All the best,
Mantas (TDC co-organizer)

Posted by: mmazeika @ Oct. 6, 2022, 8:23 p.m.

Thanks. The evaluation server worked well on my latest submission.

Posted by: tdteach @ Oct. 7, 2022, 12:13 a.m.