Logits embedding validation gives NaN loss
According to https://www.tensorflow.org/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits?hl=en:

> labels: Tensor of shape [d_0, d_1, ..., d_{r-1}] (where r is rank of labels and result) and dtype int32 or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this op is run on CPU, and return NaN for corresponding loss and gradient rows on GPU.
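For context on why the CPU and GPU paths behave differently, the loss is just the negative log-softmax probability of the label's class, so an out-of-range label has no probability to select. Here is a pure-numpy sketch (not TF's actual implementation) where plain indexing fails the same way the CPU kernel does:

```python
import numpy as np

def sparse_xent(logits, label):
    # numerically stable log-softmax, then pick the label's log-probability
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

logits = np.array([1.0, 2.0, 3.0])  # num_classes = 3

print(sparse_xent(logits, 2))   # valid label -> finite positive loss

try:
    sparse_xent(logits, 5)      # label outside [0, 3)
except IndexError as e:
    print("out of range:", e)   # mirrors the CPU exception
```

The GPU kernel skips this bounds check for speed and instead emits NaN for the offending rows, which matches the two behaviors I am seeing.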
I am getting NaN losses on GPU and exceptions on CPU when using the `Logits` estimator with `embedding_validation=True`. This happens when I run `bob tf eval` with ReplayMobile. It happens rarely, so I don't know what is going on. Here is one error that I get on CPU:
```
InvalidArgumentError (see above for traceback): Received a label value of 13 which is outside the valid range of [0, 12). Label values: 10 1
0 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 1
0 10 10 10 10 10 10 10 10 10 10 10 10 13 10 10 10 10 10 10 13 10 13 10 10 10 10 10 10 10 10 10 10 13 13 10 13 13 10 13 10 10 10 10 13 10 13 1
0 13 13 10 13 10 10 10 10 13 10 13 10 13 10 10 13 10 13 10 13 13 13 13 13 10 10 10 13 13 13 10 13 13 10 10 13 13 10 10 13 13 13 13 13 10 13 1
0 10 13 13 13 13 13 13 10 13 10 13 10 10 13 13 13 13 10 10 13 13 13 10 13 10 13 13 13 13 10 10 13 13 13 13 13 10 10 10 13 13 10 13 13 10 10 1
3 13 13 13 13 13 13 13 13 10 13 13 6 13 13 13 13 10 13 6 13 13 6 13 13 6 6 13 6 13 13 13 13 13 13 13 13 13 6 10 10 13 13 13 6 13 10 13 13 6 1
3 10 6 13 6 13 13 6 10 13 13 10 6 6 13 6 10 13 6 6 6 6 6 6 13 13 6 6 6 6 6 6 6 6 10 13 13 13 6 10 6 13 13 13 13 13 13 6 6 13 6 13 13 13 12 12
6 6 12 13 6 13 6 13 6 12 6 6 13 13 6 12 6 12 6 6 10 12 6 10 12 12 12 6 12 13 6 6 6 6 12 12 6 12 6 12 10 13 6 12 12 10 12 12 12 6 12 12 6 13
12 12 12 13 6 12 12 6 12 12 12 13 12 6 12 12 6 12 12 12 12 13 6 12 13 13 13 12 12 12 12 12 12 12 12 13 12 6 12 6 12 13 12 10 12 12 12 12 12 1
2 6 12 12 13 12 12 12 12 13 12 12 13 13 12 6 12 12 12 12 12 6 13 12 6 12 12 12 12 10 12 13 13 12 6 12 12 6 12 12 12 12 12 12 6 12 13 12 12 6
12 12 12 12 12 12 12 12 13 12 13 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 13 12 12 12 13 12 12 6 12 12 12 12 12 12 12 12 12 12 12 1
2 12 13 12 12 12 12 12 12 12 12 12 6 12 12 12 12
	 [[node Bio_loss/sparse_softmax_cross_entropy_loss/xentropy/xentropy (defined at deep/src/bob.learn.tensorflow/bob/learn/tensorflow/loss/epsc.py:10) = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Logits/Bio/BiasAdd, IteratorGetNext:2)]]
```
I construct my labels like this:
```python
files = database.all_files(groups=groups, flat=True)
CLIENT_IDS = sorted(set(str(f.client_id) for f in files))
CLIENT_IDS = dict(zip(CLIENT_IDS, range(len(CLIENT_IDS))))
load_data = load(load_data, context=context,
                 entry_point_group='bob', attribute_name='load_data')


def reader(f):
    key = str(f.make_path("", "")).encode('utf-8')
    label = CLIENT_IDS[str(f.client_id)]
```
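One failure mode consistent with the error above: if the `groups` used to build `CLIENT_IDS` differ between training and evaluation, the same construction can assign labels beyond the `num_classes` the model was trained with. A minimal sketch with hypothetical client ids (the real values come from the database):

```python
# Hypothetical client ids -- illustration only.
train_clients = ['c01', 'c03', 'c07']                # 3 classes at training time
eval_clients = ['c01', 'c03', 'c07', 'c09', 'c13']   # 5 clients at eval time

train_map = dict(zip(sorted(train_clients), range(len(train_clients))))
eval_map = dict(zip(sorted(eval_clients), range(len(eval_clients))))

# The model's output layer was sized with num_classes = len(train_map) = 3.
# Labels produced from eval_map for the extra clients fall outside [0, 3),
# the same out-of-range pattern as in the error above.
num_classes = len(train_map)
out_of_range = [v for v in eval_map.values() if v >= num_classes]
print(out_of_range)  # [3, 4]
```

I have not confirmed this is what happens in my runs, but it would explain why the failure is rare (it only bites on batches containing the extra clients).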
So I am not sure what is going on. I suspect I am hitting a corner case in https://gitlab.idiap.ch/bob/bob.learn.tensorflow/blob/c7a4d9f78adbcb9b6ec3c22a0ece375e6a271468/bob/learn/tensorflow/utils/util.py#L192

Any ideas are welcome.