Releases · bob / bob.pipelines

v4.0.1

Evidence collection

v4.0.1-evidences-276.json 512e55cb

Collected 8 months ago

Release notes

!109: Add a way to retrieve protocol definition files

Removes bob.extension's get_file().
Be more lenient with the dependencies version pinning.

v4.0.0

Evidence collection

v4.0.0-evidences-199.json 03c22680

Collected 1 year ago

Release notes

!99: remove sampleloaders and prepare for bob.bio.base!300

Needed for bob.bio.base!300
!102: Ci refactoring

Refactoring of the CI process.

Linked to Issue #42
!104: Bob extension replacement

Part of the replacement of bob.extension with exposed and auto-intersphinx. Related to #43.
!103: Add protocols as classmethod for FileListDatabase

Allows inheriting classes to retrieve a default protocols definition file and list protocols.
!105: [utils.py] changed return type in check_parameters_for_validity to ensure that a list is returned

Closes #44
!106: Update deprecated dask-jobqueue names

Parameters of dask-jobqueue's classes (Job and JobQueueCluster) will change name soon. This MR follows those changes (job_extra becomes job_extra_directives and env_extra becomes job_script_prologue).

A config option (jobqueue.sge.job-extra) became invalid due to the name changes and returned None, which was not handled correctly: the submit commands failed silently and the scheduler kept waiting for the jobs. The option is now renamed.
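
The rename described in !106 can be illustrated with a small helper that maps the old keyword names to the new ones before they are passed on. This is only a sketch: the helper name `rename_jobqueue_kwargs` is hypothetical and is not part of bob.pipelines or dask-jobqueue. Only the two old/new name pairs come from the notes above.

```python
# Old -> new keyword names, as listed in !106.
RENAMED_KWARGS = {
    "job_extra": "job_extra_directives",
    "env_extra": "job_script_prologue",
}


def rename_jobqueue_kwargs(kwargs):
    """Return a copy of ``kwargs`` with the old names replaced (hypothetical helper)."""
    renamed = {}
    for name, value in kwargs.items():
        new_name = RENAMED_KWARGS.get(name, name)
        if new_name in renamed:
            # The same option was given under both its old and new name.
            raise ValueError(f"option {new_name!r} was given twice")
        renamed[new_name] = value
    return renamed
```

Raising on a duplicated option, instead of silently keeping one value, is a design choice of this sketch; it avoids the kind of silent failure mentioned above.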
!108: Change UserDefaults calls to match last implementation

Changed UserDefaults calls to match the last implementation done in the package exposed.
!109: Add a way to retrieve protocol definition files

Removes bob.extension's get_file().
!110: Modifying rc file name to bobrc.toml
!111: [pyproject.toml] Changing documentation link to master/sphinx
!112: Replace clapp by clapper.
!113: meta [entry-points]: Revert dask.client group name

Switch back to dask.client instead of bob.pipelines.dask.client for the dask Client entry-points group name in pyproject.toml.

Fixes #46.
!114: meta [readme]: Switch the README.rst to markdown

Renames README.rst to README.md to be supported by the release script.
!115: meta(deps): add bob as dependency in new structure

Adapt to the new structure of bob with bob/bob on top.

v3.0.3

Evidence collection

v3.0.3-evidences-149.json 0a78d33d

Collected 2 years ago

Release notes

!101 Pin numpy on the minor version: Prevents increment of numpy minor version over the bob.devtools defined pin.

v3.0.2

Evidence collection

v3.0.2-evidences-121.json 98399c5f

Collected 2 years ago

Release notes

!97 pipeline wrappers tweaks: 1. SampleWrapper can now choose the type of output (Sample vs DelayedSample). 2. SampleWrapper makes sure there are no invalid samples when calling fit. 3. DaskWrapper avoids calling fit multiple times.
!100 Fix the doctest of xarray failing on python 3.8
!98 DelayedSample tweak: 1. Make kwargs take precedence over parents' delayed_attributes. This change is made to follow more closely the implementation of the Sample class. 2. Make sure an attribute is not present in both delayed_attributes and kwargs of the __init__ function, which would not be semantically sound.
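
The two rules listed in !98 can be sketched with plain dictionaries. This is not the bob.pipelines implementation: the function name `resolve_attributes` and its return value are assumptions made for illustration. Only the precedence rule and the "not in both" check come from the note above.

```python
def resolve_attributes(parent_delayed, delayed_attributes=None, **kwargs):
    """Hypothetical helper applying the two rules from !98.

    Returns ``(delayed, eager)``: the delayed attributes to keep and the
    attributes given directly as keyword arguments.
    """
    own_delayed = dict(delayed_attributes or {})

    # Rule 2: an attribute may not be given both ways.
    clash = set(own_delayed) & set(kwargs)
    if clash:
        raise ValueError(
            f"{sorted(clash)} given in both delayed_attributes and kwargs"
        )

    # Rule 1: kwargs take precedence over the parent's delayed attributes.
    # A new dict is built, so the parent's dict is never modified.
    delayed = {k: v for k, v in parent_delayed.items() if k not in kwargs}
    delayed.update(own_delayed)
    return delayed, dict(kwargs)
```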

v3.0.1

Evidence collection

v3.0.1-evidences-94.json 0c3c3034

Collected 2 years ago

Release notes

!96 Fix Dask documentation: This MR fixes the issues with the Dask documentation

v3.0.0

Evidence collection

v3.0.0-evidences-72.json 27023333

Collected 2 years ago

Release notes

!79 Fixing compatibility issues with dask_jobqueue=0.7.2: Closes #37. Unfortunately we can't test this on the CI (there's no SGE there).
!78 pin dask versions more strictly: Fixes #40
!77 Resolve "local-parallel queue is not setup well": Closes #38
!80 Implemented a force mechanism: Created a force option for the CheckpointWrapper. Related to bob.bio.base#173.
!81 Fix get_bob_tags to return default tags: When passing None as estimator to get_bob_tags, returns the default tags.
!84 Created a function checking if a Scikit learn pipeline is wrapped: Created the function is_estimator_wrapped
!70 Handle estimator tags in wrapper classes: Allows setting some parameters of the SampleWrapper and CheckpointWrapper via estimator tags. bob.bio.base#143
!83 [dask] Convert dask bags to arrays more efficiently: Most inefficiencies were coming from the fact that we were creating a dask array with each sample as a separate chunk.
!82 breaking: checkpoint the inner estimator only
!85 Prevent a reference invalidation when wrapped with sample and checkpoint: Prevents creating a new estimator when loading a sample-wrapped estimator with CheckpointWrapper (continuation of !82, which prevented the creation of a new estimator right "below"). This now checks if the estimator is wrapped with SampleWrapper and updates the estimator at that level. Fixes bob.bio.gmm#30.
!86 Fix fit extra parameters: Allowed extra fit parameters to be non-array (e.g. str). Added a tag to prevent stacking of the input array of the fit method if it expects partitioned data.
!87 Add a non-adaptive io-big queue
!88 Add support for fitting estimators on dask bags: The estimators that can handle dask bags should set the bob_fit_supports_dask_bag as True. This commit also includes
* Adds a new tag: bob_fit_supports_dask_bag
* Adds a new tag: bob_checkpoint_features for when you want to always avoid checkpointing features for a specific estimator.
* Expose dask_tags, get_bob_tags in the main API
* The SampleWrapper was modified to support bob_fit_supports_dask_bag
* The CheckpointWrapper now loads estimators without losing references correctly.
!89 Load checkpointed estimators inside the scheduler: Also adds resilience to loading checkpointed samples
!90 replace is_estimator_stateless with estimator_requires_fit: The previous code actually checked whether an estimator requires fit, while the function was named is_estimator_stateless.
!92 better logging overall
!91 Many API changes: Expose utils API in the root API. Fix the docs API. Remove unused transformers. Fix SGE GPU submissions.
!93 Add documentation for CSV databases
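
The tag behaviour mentioned in !81 and !88 (default tags returned when the estimator is None, individual estimators overriding them) can be sketched as follows. This is an illustration only: `get_tags_with_defaults` is a hypothetical name, the default values shown are assumptions, and reading tags through a `_more_tags()` method is likewise an assumption of this sketch. Only the two tag names come from the notes above.

```python
# Tag names come from !88; the default values here are assumptions.
DEFAULT_TAGS = {
    "bob_fit_supports_dask_bag": False,
    "bob_checkpoint_features": True,
}


def get_tags_with_defaults(estimator=None):
    """Return the default tags, updated with the estimator's own (hypothetical)."""
    tags = dict(DEFAULT_TAGS)  # copy, so the defaults are never modified
    if estimator is None:
        return tags
    more_tags = getattr(estimator, "_more_tags", None)
    if callable(more_tags):
        for name, value in more_tags().items():
            if name in tags:
                tags[name] = value
    return tags


class BagEstimator:
    """Toy estimator declaring that its fit method accepts dask bags."""

    def _more_tags(self):
        return {"bob_fit_supports_dask_bag": True}
```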

v2.0.0

Release notes

!63 Implemented a mechanism in the Checkpoint wrapper that asserts if data was properly written to disk. Closes #31
!66 Handled failed processing (Failure to Acquire) in the wrappers: Fixes #32
!67 Some minor updates on the checkpoint wrapper and SGE
!68 Fix parent's delayed_attributes modified by child: A DelayedSample child's delayed_attributes is no longer referencing the parent's delayed_attributes.
!69 [SampleSet] Do not load delayed attributes by not copying them over
!71 [CheckpointWrapper] Use atomic writing when saving features
!72 + breakdown_SampleSet: As described in #33, a function is added which takes as input a SampleSet with N samples and outputs N SampleSets with 1 Sample each. ping @tiago.pereira Closes #33
!73 Remove samples_to_hdf5 methods: These methods were not used anywhere.
!74 Add a DelayedSample.from_sample classmethod: This method can be used to transparently create new DelayedSamples from either Samples or DelayedSamples without loading delayed attributes and data
!64 Fix delayed attributes: Delayed attributes are no longer loaded when _copy_attributes is called to create a DelayedSample.
!75 Add worker Time To Live limitation: Hello, I have regularly been annoyed by Dask runs that hang indefinitely because some workers get disconnected from the scheduler. In this case, the scheduler assumes the worker must still be doing its job, so it doesn't reassign the task, leading to a completely blocked run that needs to be interrupted by hand. This typically happens on very heavy experiments, e.g. on IJBC or FRGC. From what I understand, this can be handled using the worker_ttl parameter of the scheduler, which puts a limit on how long a worker can be unseen by the scheduler before it is killed and its task is reassigned. It is None by default. I have been working for a while on a local branch where I set the default to 60s, and it helped quite a lot. I am proposing to merge this change, but I wanted to know what you think of it. My main concern is that it might be hiding some underlying issue (why do the workers actually disconnect?), so I am not 100% sure it's a good change to make. ping @tiago.pereira @amohammadi
!76 [docs] update docs to match new API of xarray: Fixes #35 Disabled testing Sphinx docs on mac builds.
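
The atomic writing mentioned in !71 is commonly done by writing to a temporary file in the target directory and then renaming it over the final path. The sketch below shows that general technique with the standard library only. It is not the CheckpointWrapper code, and the function name `atomic_write_bytes` is hypothetical.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Write ``data`` to ``path`` so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the same directory, so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern an interrupted job leaves either the previous file or no file at all, never a truncated checkpoint.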

v1.0.0

Release notes

!37 Revert "For some reason, the class information is not passed in the sample wrapper": This reverts merge request !36
!38 [sge] In dask some subclassed classes need a config name. Fixes #20
!40 Add dask-client configurations as resources: Fixes #19 Removes the sge-demanding configuration as all nodes at Idiap have a fast connection now. Depends on bob.bio.base!201
!39 [dask][sge] Added the variables idle_timeout and allowed_failures as part of our .bobrc and added better defaults
!41 Added a GPU queue that defaults to short_gpu
!43 Allow setting specific attributes of sample: Specify the sample attribute to assign the output of an estimator to, instead of 'data' in SampleWrapper. Specify the attribute of sample to save and load in CheckpointWrapper.
!44 Fix sphinx warnings
!45 Multiple Changes:
* When checkpointing, checkpoint all steps in a pipeline
* Better names in dask graph for FunctionTransformer
* [xarray] Allow for multi argument transformers
* SampleBatch in public API
!46 move vstack_features to bob.io.base
!48 Improvements on CheckpointWrapper: Added the optional argument hash_fn in the CheckpointWrapper class. Once this is set, sample.key generates a hash code and this hash code is used to compose the final path where sample will be checkpointed. This is optional and generic enough for our purposes. This hash function can be shipped in the database interface. Closes #25
!47 Multiple changes:
* [DelayedSample] Allow for arbitrary delayed attributes
* [SampleBatch] Allow other attributes than data
Fixes #26 #24
!49 [DelayedSample] Fix issues when an attribute was set
!50 [DelayedSample(Set)] make load and delayed_attributes private: This removes the need for a lot of guessing in downstream packages as they can start removing all keys that start with _ when access of the sample's attribute is needed.
!51 [dask][sge] Multiqueue updates: In this merge request I:
- Simplified the way multi-queue is set in our scripts
- Updated our Dask documentation

Example: setting the fit method to run on q_short_gpu

```python
pipeline = mario.wrap(
    ["sample", "checkpoint", "dask"],
    pipeline,
    model_path=model_path,
    fit_tag="q_short_gpu",
)
```

You have to explicitly set the list of resource tags available.

```python
pipeline.fit_transform(...).compute(
    scheduler=dask_client, resources=cluster.get_sge_resources()
)
```

!53 Updates: Implemented two updates in this MR:
- Removed the random behavior on the hash_string function (I had some problems in large-scale tests).
- Implemented the DelayedSampleSetCached. I need this behavior to speed up the score computation.
!52 [CheckpointWrapper] Allow custom save and load functions through estimator tags
!54 Fixed multiqueue: Hi @amohammadi @ydayer, I'm fixing here the issue raised with the multiqueue. I was wrongly setting all tasks to run in a particular resource restriction. Now the problem is fixed. To get it running, you have to wrap your pipeline in the same way as before and fetch the resources like this:

```python
pipeline = bob.pipelines.wrap(
    ["sample", "checkpoint", "dask"],
    pipeline,
    model_path="./",
    transform_extra_arguments=(("metadata", "metadata"),),
    fit_tag="q_short_gpu",
)

from bob.pipelines.distributed.sge import get_resource_requirements

resources = get_resource_requirements(pipeline)
pipeline.fit_transform(X_as_sample).compute(
    scheduler=client, resources=resources
)
```

!56 Two new features:
- Moved dask_get_partition_size from bob.bio.base to bob.pipelines
- Updated the target duration of a task to 10s, making the scale-up very aggressive
!58 Moved the CSVBaseSampleLoader from bob.bio.base to bob.pipelines. This is a general function
!55 Moved VALID_DASK_CLIENT_STRINGS to bob.pipelines
!59 Dask client names
!60 CSVSampleLoaders as transformers: Made CSVSampleLoaders scikit-learn transformers. This is a good idea indeed. I made two classes: CSVToSampleLoader, which converts one line to one sample, and AnnotationsLoader, which aggregates from CSVToSampleLoader to read annotations using bob.db.base.read_anno.... This is delayed. I'm already porting this to bob.bio.base. The code is way cleaner. ping @amohammadi @ydayer Closes #30
!61 Fixed modules: config files from here were not available after conda install bob.pipelines
!62 Implement a new simple generic csv-based database interface: Depends on bob.extension!126
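
The hash_fn option described in !48 can be pictured as follows: the sample key is passed through a user-supplied function and the result becomes part of the checkpoint path. This is a rough sketch only: the names `two_level_hash` and `checkpoint_path`, the ".h5" extension and the two-level directory layout are all assumptions made for illustration, not the CheckpointWrapper implementation. A digest from hashlib is used because it is the same on every run, in line with the removal of random behavior mentioned in !53.

```python
import hashlib
import os


def two_level_hash(key):
    """Hypothetical hash_fn: maps a sample key to a two-level sub-directory."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return os.path.join(digest[:2], digest[2:4])


def checkpoint_path(features_dir, key, extension=".h5", hash_fn=None):
    """Compose the path where a sample would be checkpointed (hypothetical)."""
    if hash_fn is None:
        return os.path.join(features_dir, key + extension)
    # The hash code is inserted between the base directory and the key.
    return os.path.join(features_dir, hash_fn(key), key + extension)
```

Spreading files over hashed sub-directories keeps any single directory from growing too large in large-scale experiments.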