# bob.learn.tensorflow issues
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues
2021-02-10T07:59:14Z

## Issue #86: Callback vanilla biometrics
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/86
Tiago de Freitas Pereira, 2021-02-10T07:59:14Z

It would be nice to have a callback that triggers vanilla-biometrics.

## Issue #85: Integrate dask with ``bob keras fit`` for multi worker strategy setup
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/85
Amir MOHAMMADI, 2020-10-08T16:21:21Z

Dask can be used to set up a cluster for tensorflow: https://gitlab.idiap.ch/bob/bob.tf_experimental/
We should do this automatically in our train script.

## Issue #84: Allow for strategies in ``bob keras fit`` script
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/84
Amir MOHAMMADI, 2020-10-08T16:19:49Z

When fitting models under a distributed strategy, the model needs to be created and compiled under the strategy scope: https://www.tensorflow.org/tutorials/distribute/keras
The ``bob keras fit`` script should do this scoping automatically.
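The linked tutorial boils down to creating and compiling the model inside `strategy.scope()`. A minimal sketch of what the script would need to do (the model architecture and optimizer here are placeholders, not what the script actually builds):

```python
import tensorflow as tf

# Any distribution strategy works the same way; MirroredStrategy is the
# single-machine, multi-device one (it falls back to one device on CPU).
strategy = tf.distribute.MirroredStrategy()

# Model creation AND compilation must both happen inside the scope so
# that variables are created as distributed variables.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The script would only need to wrap the user-provided model/optimizer construction in this scope before calling `model.fit`.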
## Issue #80: Keras gotchas
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/80
Amir MOHAMMADI, 2019-10-07T21:42:02Z

Using Keras with estimators (without using `tf.keras.estimator.model_to_estimator`) is really weird. I am opening an issue here to keep track of the gotchas.
Look at this guide: https://www.tensorflow.org/beta/guide/migration_guide#using_a_custom_model_fn
which explains what you should do, but it does not cover everything.
* Keras variables do not go to variable stores. To use `tf.train.init_from_checkpoint` with Keras variables, you need to explicitly pass the list of variables to the function. Something like this:
```python
assignment_map = {v.name.split(":")[0]: v for v in model.variables}
tf.train.init_from_checkpoint(
    ckpt_dir_or_file=model_folder, assignment_map=assignment_map
)
```
* Keras layers (especially batch norm) do not update `tf.GraphKeys.UPDATE_OPS` collections. Hence you have to add those manually:
```python
# Add batch norm updates to the graph
for update_op in model.get_updates_for(inputs) + model.get_updates_for(None):
    tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, update_op)
```
* Keras layers' variables go to the global trainable variables collection (weirdly enough, since you cannot use `init_from_checkpoint` on them). Doing something like:
```python
for layer in model.layers:
    layer.trainable = False
```
will not remove those from that list. To use `tf.contrib.layers.optimize_loss` with Keras layers, you have to do something like:
```python
tf.contrib.layers.optimize_loss(
    ...
    variables=model.trainable_variables
)
```
Otherwise, you will be training all layers.
* In Keras Models, `model.variables` and `model.trainable_variables` are different. So you would handle L2 loss like this:
```python
# Add L2 losses to the graph
regularization_loss = 0.0
l2 = tf.keras.regularizers.l2(weight_decay)
for variable in model.trainable_variables:
    regularization_loss += l2(variable)
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, regularization_loss)
```
or you do something like this:
```python
# Get both the unconditional losses (the None part)
# and the input-conditional losses (the features part).
reg_losses = model.get_losses_for(None) + model.get_losses_for(features)
```
* You have to name every layer/model explicitly, otherwise you end up with different names depending on how many unnamed Keras layers have been created before (Keras auto-numbers them: `dense`, `dense_1`, ...).
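A small illustration of the naming gotcha (the layer names here are arbitrary):

```python
import tensorflow as tf

# Unnamed layers get auto-numbered, so their names (and hence their
# variables' names) depend on how many layers were created earlier
# in the same process.
auto1 = tf.keras.layers.Dense(4)
auto2 = tf.keras.layers.Dense(4)

# An explicitly named layer is stable across runs and process states.
named = tf.keras.layers.Dense(4, name="embedding_head")
```

Since `init_from_checkpoint` matches variables by name, unstable auto-generated names silently break checkpoint restoration.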
## Issue #79: Follow-up from "A lot of new features"
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/79
Amir MOHAMMADI, 2019-04-26T08:46:47Z

The following discussion from !75 should be addressed:
- [ ] @amohammadi started a [discussion](https://gitlab.idiap.ch/bob/bob.learn.tensorflow/merge_requests/75#note_41878):
> @tiago.pereira I don't think putting this `os.environ['KMP_DUPLICATE_LIB_OK']='True'` here is a good idea. Maybe we should update our bob-devel?
@tiago.pereira let's remove this when things are fixed upstream.

## Issue #76: Logits embedding validation gives NaN loss
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/76
Amir MOHAMMADI, 2019-03-11T12:59:11Z

According to https://www.tensorflow.org/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits?hl=en :
```
labels: Tensor of shape [d_0, d_1, ..., d_{r-1}] (where r is rank of labels and result) and dtype int32 or int64.
Each entry in labels must be an index in [0, num_classes).
Other values will raise an exception when this op is run on CPU, and return NaN for corresponding loss and gradient rows on GPU.
```
and I am getting NaNs as loss on GPU and exceptions in CPU mode when using the `Logits` estimator with `embedding_validation=True`.
This happens when I run `bob tf eval` with ReplayMobile. It happens rarely, so I don't know what is going on. Here is one error that I get on CPU:
```
InvalidArgumentError (see above for traceback): Received a label value of 13 which is outside the valid range of [0, 12). Label values: 10 1
0 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 1
0 10 10 10 10 10 10 10 10 10 10 10 10 13 10 10 10 10 10 10 13 10 13 10 10 10 10 10 10 10 10 10 10 13 13 10 13 13 10 13 10 10 10 10 13 10 13 1
0 13 13 10 13 10 10 10 10 13 10 13 10 13 10 10 13 10 13 10 13 13 13 13 13 10 10 10 13 13 13 10 13 13 10 10 13 13 10 10 13 13 13 13 13 10 13 1
0 10 13 13 13 13 13 13 10 13 10 13 10 10 13 13 13 13 10 10 13 13 13 10 13 10 13 13 13 13 10 10 13 13 13 13 13 10 10 10 13 13 10 13 13 10 10 1
3 13 13 13 13 13 13 13 13 10 13 13 6 13 13 13 13 10 13 6 13 13 6 13 13 6 6 13 6 13 13 13 13 13 13 13 13 13 6 10 10 13 13 13 6 13 10 13 13 6 1
3 10 6 13 6 13 13 6 10 13 13 10 6 6 13 6 10 13 6 6 6 6 6 6 13 13 6 6 6 6 6 6 6 6 10 13 13 13 6 10 6 13 13 13 13 13 13 6 6 13 6 13 13 13 12 12
6 6 12 13 6 13 6 13 6 12 6 6 13 13 6 12 6 12 6 6 10 12 6 10 12 12 12 6 12 13 6 6 6 6 12 12 6 12 6 12 10 13 6 12 12 10 12 12 12 6 12 12 6 13
12 12 12 13 6 12 12 6 12 12 12 13 12 6 12 12 6 12 12 12 12 13 6 12 13 13 13 12 12 12 12 12 12 12 12 13 12 6 12 6 12 13 12 10 12 12 12 12 12 1
2 6 12 12 13 12 12 12 12 13 12 12 13 13 12 6 12 12 12 12 12 6 13 12 6 12 12 12 12 10 12 13 13 12 6 12 12 6 12 12 12 12 12 12 6 12 13 12 12 6
12 12 12 12 12 12 12 12 13 12 13 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 13 12 12 12 13 12 12 6 12 12 12 12 12 12 12 12 12 12 12 1
2 12 13 12 12 12 12 12 12 12 12 12 6 12 12 12 12
 [[node Bio_loss/sparse_softmax_cross_entropy_loss/xentropy/xentropy (defined at deep/src/bob.learn.tensorflow/bob/learn/tensorflow/loss/epsc.py:10) = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Logits/Bio/BiasAdd, IteratorGetNext:2)]]
```
and I construct my labels like this:
```python
files = database.all_files(groups=groups, flat=True)
CLIENT_IDS = sorted(set(str(f.client_id) for f in files))
CLIENT_IDS = dict(zip(CLIENT_IDS, range(len(CLIENT_IDS))))
load_data = load(load_data, context=context,
                 entry_point_group='bob', attribute_name='load_data')
def reader(f):
    key = str(f.make_path("", "")).encode('utf-8')
    label = CLIENT_IDS[str(f.client_id)]
```
so I am not sure what is going on. I suspect I am hitting a corner case in https://gitlab.idiap.ch/bob/bob.learn.tensorflow/blob/c7a4d9f78adbcb9b6ec3c22a0ece375e6a271468/bob/learn/tensorflow/utils/util.py#L192
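One way to corner the bug could be a sanity check on the labels at dataset-construction time, so the bad value surfaces with a clear message before the loss op sees it. A sketch (the function and its wiring are hypothetical, not part of the package):

```python
import numpy as np

def check_labels(labels, num_classes):
    """Raise early if any label falls outside [0, num_classes)."""
    labels = np.asarray(labels)
    bad = np.unique(labels[(labels < 0) | (labels >= num_classes)])
    if bad.size:
        raise ValueError(
            f"Found labels {bad.tolist()} outside [0, {num_classes})"
        )
    return labels
```

For example, `check_labels([10, 13, 6], 12)` raises, matching the `[0, 12)` error above, which would pin down whether the labels are wrong before or after they enter the input pipeline.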
Any ideas are welcome.

## Issue #73: Create a utility click command that describes the checkpoint file
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/73
Tiago de Freitas Pereira, 2019-01-25T16:24:41Z

Basically, wrap this:
`from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file`
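Such a command might look like this (the command name and option set are invented; it only wraps the helper named above):

```python
import click


@click.command(name="describe-checkpoint")
@click.argument("checkpoint_path")
@click.option("--tensor-name", default="", help="Inspect a single tensor only.")
@click.option("--all-tensors", is_flag=True, help="Also print tensor values.")
def describe_checkpoint(checkpoint_path, tensor_name, all_tensors):
    """Print the tensors stored in a TensorFlow checkpoint file."""
    # Imported lazily so the CLI stays fast when TF is not needed.
    from tensorflow.python.tools.inspect_checkpoint import (
        print_tensors_in_checkpoint_file,
    )

    print_tensors_in_checkpoint_file(
        checkpoint_path, tensor_name=tensor_name, all_tensors=all_tensors
    )
```

Registered under the `bob tf` group, this would give something like `bob tf describe-checkpoint /path/to/model` for free.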
## Issue #69: Follow-up from "Several changes"
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/69
Tiago de Freitas Pereira, 2018-11-02T07:57:49Z

The following discussion from !68 should be addressed:
- [ ] @tiago.pereira started a [discussion](https://gitlab.idiap.ch/bob/bob.learn.tensorflow/merge_requests/68#note_36057): (+2 comments)
> Implement the new mechanism of moving averages in the Logits, Triplet and Siamese estimators

## Issue #59: Random seed is ignored in input_fn
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/59
Amir MOHAMMADI, 2018-07-10T11:45:58Z

As a workaround, we set the seed manually to 0 in image augmentation.
The following discussion from !57 should be addressed:
- [ ] @tiago.pereira started a [discussion](https://gitlab.idiap.ch/bob/bob.learn.tensorflow/merge_requests/57#note_32794): (+1 comment)
> Well, I don't know what is going on here.
> It seems that the pseudo-random number generator from the model_fn and the input_fn are different.
>
> Shall we open another issue for this and move on?
>
> I will try to isolate the code and see if this is our problem or a TF problem

## Issue #56: train_and_evaluate returns "Cache lockfile already exists" after the eval part
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/56
Saeed SARFJOO, 2018-06-20T10:11:04Z

Using the train_and_evaluate method from the estimator API together with a filesystem cache policy, I get an error because the evaluation starts before all of the cache has been written to the filesystem. So when training runs a second time, after the evaluation, it fails because it finds the lock file. This is mostly a problem with large datasets, where after 100 steps or 600 seconds of throttle_secs the first epoch is not done yet.
A similar error is reported in https://github.com/tensorflow/tensorflow/issues/18266

## Issue #49: Provide a way to train reproducible networks
https://gitlab.idiap.ch/bob/bob.learn.tensorflow/-/issues/49
Amir MOHAMMADI, 2017-12-07T10:53:29Z

Currently we have the `bob/learn/tensorflow/utils/reproducible.py` module, but it is not working at all! We also need to figure out how to resume the state of the dataset when training is killed.
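For the resume-the-dataset part, newer TensorFlow versions can checkpoint the state of a `tf.data` iterator directly via `tf.train.Checkpoint`; a sketch (the dataset and paths are illustrative):

```python
import os
import tempfile

import tensorflow as tf

# The position of a tf.data iterator can itself be checkpointed, which
# is one way to resume a dataset mid-epoch after training is killed.
dataset = tf.data.Dataset.range(10)
iterator = iter(dataset)

next(iterator)  # consume element 0

ckpt = tf.train.Checkpoint(iterator=iterator)
path = ckpt.save(os.path.join(tempfile.mkdtemp(), "ckpt"))

next(iterator)  # consume element 1
ckpt.restore(path)  # rewind the iterator to the saved position
```

After the restore, the next element drawn is 1 again, i.e. the iterator resumes exactly where the checkpoint was taken. In a training loop, the iterator would be saved alongside the model variables in the same `tf.train.Checkpoint`.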