bob.learn.tensorflow

Closed
Opened Feb 26, 2019 by Amir MOHAMMADI@amohammadi

Logits embedding validation gives NaN loss

According to https://www.tensorflow.org/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits?hl=en:

    labels: Tensor of shape [d_0, d_1, ..., d_{r-1}] (where r is rank of labels and result) and dtype int32 or int64.
    Each entry in labels must be an index in [0, num_classes).
    Other values will raise an exception when this op is run on CPU, and return NaN for corresponding loss and gradient rows on GPU.
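To make the quoted requirement concrete, here is a minimal sketch in plain Python (the helper `find_invalid_labels` is mine, not a TensorFlow API) of the invariant the op enforces: every label must be a valid index into the class dimension of the logits.

```python
def find_invalid_labels(labels, num_classes):
    """Return the label values that violate the op's requirement
    that every label is an index in [0, num_classes)."""
    return [l for l in labels if not (0 <= l < num_classes)]

# With 12 logits outputs, a label value of 13 is out of range:
print(find_invalid_labels([10, 13, 6], num_classes=12))  # [13]
```

For labels like these, the CPU kernel raises `InvalidArgumentError` while the GPU kernel silently emits NaN for the offending rows, which matches the symptoms below.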

I am getting NaN losses on GPU and exceptions on CPU when using the Logits estimator with embedding_validation=True. It happens when I run `bob tf eval` with ReplayMobile, and only rarely, so I don't know what is going on. Here is one error that I get on CPU:

    InvalidArgumentError (see above for traceback): Received a label value of 13 which is outside the valid range of [0, 12).  Label values: 10 10
    10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
    10 10 10 10 10 10 10 10 10 10 10 10 13 10 10 10 10 10 10 13 10 13 10 10 10 10 10 10 10 10 10 10 13 13 10 13 13 10 13 10 10 10 10 13 10 13 10
    13 13 10 13 10 10 10 10 13 10 13 10 13 10 10 13 10 13 10 13 13 13 13 13 10 10 10 13 13 13 10 13 13 10 10 13 13 10 10 13 13 13 13 13 10 13 10
    10 13 13 13 13 13 13 10 13 10 13 10 10 13 13 13 13 10 10 13 13 13 10 13 10 13 13 13 13 10 10 13 13 13 13 13 10 10 10 13 13 10 13 13 10 10 13
    13 13 13 13 13 13 13 13 10 13 13 6 13 13 13 13 10 13 6 13 13 6 13 13 6 6 13 6 13 13 13 13 13 13 13 13 13 6 10 10 13 13 13 6 13 10 13 13 6 13
    10 6 13 6 13 13 6 10 13 13 10 6 6 13 6 10 13 6 6 6 6 6 6 13 13 6 6 6 6 6 6 6 6 10 13 13 13 6 10 6 13 13 13 13 13 13 6 6 13 6 13 13 13 12 12
    6 6 12 13 6 13 6 13 6 12 6 6 13 13 6 12 6 12 6 6 10 12 6 10 12 12 12 6 12 13 6 6 6 6 12 12 6 12 6 12 10 13 6 12 12 10 12 12 12 6 12 12 6 13
    12 12 12 13 6 12 12 6 12 12 12 13 12 6 12 12 6 12 12 12 12 13 6 12 13 13 13 12 12 12 12 12 12 12 12 13 12 6 12 6 12 13 12 10 12 12 12 12 12 12
    6 12 12 13 12 12 12 12 13 12 12 13 13 12 6 12 12 12 12 12 6 13 12 6 12 12 12 12 10 12 13 13 12 6 12 12 6 12 12 12 12 12 12 6 12 13 12 12 6
    12 12 12 12 12 12 12 12 13 12 13 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 13 12 12 12 13 12 12 6 12 12 12 12 12 12 12 12 12 12 12 12
    12 13 12 12 12 12 12 12 12 12 12 6 12 12 12 12
         [[node Bio_loss/sparse_softmax_cross_entropy_loss/xentropy/xentropy (defined at deep/src/bob.learn.tensorflow/bob/learn/tensorflow/loss/epsc.py:10)  = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Logits/Bio/BiasAdd, IteratorGetNext:2)]]
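One detail worth noting in the dump above: the reported valid range is [0, 12), yet the listed label values include many 12s as well as 13s, so at least two distinct labels are out of range, not one stray value. That points at the label mapping producing more classes than the logits layer has outputs, rather than a single corrupted label. A quick check (plain Python, values copied from the dump):

```python
# distinct label values appearing in the error dump above
logged_labels = {6, 10, 12, 13}
num_classes = 12  # the op reported a valid range of [0, 12)

invalid = sorted(l for l in logged_labels if not (0 <= l < num_classes))
print(invalid)  # [12, 13] -> two out-of-range classes, not one
```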

and I construct my labels like this:

    # list all files for the requested groups
    files = database.all_files(groups=groups, flat=True)

    # map each client_id to a contiguous integer label in [0, n_clients)
    CLIENT_IDS = sorted(set(str(f.client_id) for f in files))
    CLIENT_IDS = dict(zip(CLIENT_IDS, range(len(CLIENT_IDS))))

    load_data = load(load_data, context=context,
                     entry_point_group='bob', attribute_name='load_data')

    def reader(f):
        key = str(f.make_path("", "")).encode('utf-8')
        label = CLIENT_IDS[str(f.client_id)]

So I am not sure what is going on. I suspect I am hitting a corner case in https://gitlab.idiap.ch/bob/bob.learn.tensorflow/blob/c7a4d9f78adbcb9b6ec3c22a0ece375e6a271468/bob/learn/tensorflow/utils/util.py#L192
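To narrow this down, one option is to fail fast in the input pipeline instead of inside the loss. This is a debugging sketch of mine, not part of bob.learn.tensorflow; it assumes the CLIENT_IDS dict built above and validates each label the reader produces:

```python
def check_label(client_ids, label):
    """Raise early if a label produced by the reader falls outside
    [0, num_classes), where num_classes = len(client_ids)."""
    num_classes = len(client_ids)
    if not (0 <= label < num_classes):
        raise ValueError(
            "label %d outside valid range [0, %d)" % (label, num_classes)
        )
    return label
```

Calling `check_label(CLIENT_IDS, label)` inside `reader` would turn the rare GPU NaNs into an immediate, reproducible exception, and the offending file's client_id could be logged alongside.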

Any ideas are welcome.

Edited Feb 26, 2019 by Amir MOHAMMADI
Reference: bob/bob.learn.tensorflow#76