Something seems to be a bit off with the calculation of the percentage of GPU memory currently used by a running torch instance. We need to understand why.
I checked. The values reported by nvidia-smi are not converted to percentages in any way.
The values we store in the log file correspond exactly to the values read out by that application.
For reference, this is the command we run at every epoch to log GPU utilisation:
@dkhalil: could you please ssh into one of the machines where your jobs are currently running and check what the output of that command is there?
The first column should be the total memory available and the second the memory used, both in MB. Dividing the second column by the first and multiplying by 100 should match the value of utilization.memory.
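For illustration, here is a minimal sketch of how that comparison could be scripted, assuming the query uses nvidia-smi's standard `--query-gpu` fields `memory.total`, `memory.used` and `utilization.memory` (the exact query in our logging code may differ):

```python
import subprocess

# Query total memory (MiB), used memory (MiB) and utilization.memory (%) per GPU.
# These are standard nvidia-smi query fields; this is only an illustrative check,
# not necessarily the exact command our logging code runs.
out = subprocess.check_output(
    [
        "nvidia-smi",
        "--query-gpu=memory.total,memory.used,utilization.memory",
        "--format=csv,noheader,nounits",
    ],
    text=True,
)

for line in out.strip().splitlines():
    total_mib, used_mib, util_mem = (float(x) for x in line.split(","))
    pct_used = used_mib / total_mib * 100
    print(f"used/total = {pct_used:.1f}%  vs  utilization.memory = {util_mem:.0f}%")
```

If the two numbers diverge on a busy GPU, that would indicate utilization.memory is not simply used divided by total.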
I just ran a simple job and got the following results:
Apparently they are using a different formula to calculate the percentage.
I found this when looking at the help for the command: "Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product."
You are right. utilization.memory measures the fraction of time the memory was being read or written, not how much of it is in use. To measure memory consumption, from now on let's stick with dividing memory used by total memory available.
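For reference, a minimal sketch of computing that percentage from inside a running torch instance, assuming `torch.cuda.mem_get_info` is available (PyTorch 1.10+); this is only an illustration, not necessarily how our logging code will do it:

```python
import torch

def gpu_memory_used_percent(device: int = 0) -> float:
    """Percentage of the device's total memory currently in use,
    as seen by the driver (used = total - free)."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    used_bytes = total_bytes - free_bytes
    return used_bytes / total_bytes * 100

if torch.cuda.is_available():
    print(f"GPU memory used: {gpu_memory_used_percent(0):.1f}%")
```

This reports driver-level usage for the whole device, so it includes memory held by other processes as well as the CUDA caching allocator, which matches what nvidia-smi's memory.used shows.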