bob.pipelines, commit c1bb19f5
Authored 2 years ago by Yannick DAYER
doc [sample]: Added info on estimator and pipeline
Parent: 6992c938
Showing 4 changed files with 188 additions and 3 deletions:

- doc/estimator.rst (+63, −0)
- doc/index.rst (+2, −0)
- doc/pipeline.rst (+66, −0)
- doc/sample.rst (+57, −3)
doc/estimator.rst (new file, +63, −0)
.. _bob.pipelines.estimator:

Scikit-learn Estimators
=======================
This short example shows the difference between an estimator that takes a single
parameter and one that additionally takes metadata for each sample.
Example of a custom Estimator
-----------------------------
Let us make an Estimator that takes batches of arrays as input and applies a simple
function to each of them:
.. doctest:: custom_estimator

    >>> import numpy as np
    >>> from sklearn.base import BaseEstimator
    >>> class OffsetTransformer(BaseEstimator):
    ...     """Demo Estimator to add an offset to arrays."""
    ...     def __init__(self, offset=1):
    ...         self.offset = offset
    ...
    ...     def transform(self, arrays: list[np.ndarray]) -> list[np.ndarray]:
    ...         """Add the offset to each array."""
    ...         return [arr + self.offset for arr in arrays]
    >>> transformer = OffsetTransformer(offset=2)
    >>> transformer.transform([np.array([1, 2]), np.array([2, 3])])
    [array([3, 4]), array([4, 5])]
.. note::

    The ``transform`` method accepts a series of data samples (it works on
    batches). If you work with 2D numpy arrays, your ``transform`` needs to
    handle 3D arrays (or you need to loop over the first dimension to handle
    each sample individually).
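As a minimal sketch (plain NumPy, independent of bob.pipelines) of the broadcasting alternative to looping, the function name and shapes below are illustrative only:

```python
import numpy as np

def transform(batch: np.ndarray, offset: float = 1.0) -> np.ndarray:
    """Add a constant offset to every 2D sample in a 3D batch."""
    # Broadcasting applies the offset across the leading (sample) dimension,
    # so no explicit loop over individual samples is needed.
    return batch + offset

batch = np.stack([np.zeros((2, 2)), np.ones((2, 2))])  # shape (2, 2, 2)
result = transform(batch, offset=2.0)
```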
Now let's edit the ``OffsetTransformer`` so that each sample can be offset by a
different value. This is typical when preprocessing data with annotations, where
the annotations differ from one sample to the next.
Here, we want to pass one offset for each sample given to ``transform``:
.. doctest:: custom_estimator

    >>> class OffsetTransformerPerSample(BaseEstimator):
    ...     """Demo Estimator to add a different offset to each sample array."""
    ...     def transform(self, arrays: list[np.ndarray], offsets: list[float]) -> list[np.ndarray]:
    ...         """Add its offset to each array."""
    ...         return [arr + o for arr, o in zip(arrays, offsets)]
    >>> transformer_2 = OffsetTransformerPerSample()
We need to pass two series to ``transform``, one containing the sample data and
one containing the offsets:
.. doctest:: custom_estimator

    >>> transformer_2.transform(arrays=[np.array([3, 4, 5]), np.array([1, 2, 3])], offsets=[1, 2])
    [array([4, 5, 6]), array([3, 4, 5])]
On the next page, we will see how the second estimator (whose ``transform``
takes multiple parameters) can cause problems with the pipeline API.
doc/index.rst (+2, −0)
@@ -31,6 +31,8 @@ User Guide

 .. toctree::
    :maxdepth: 2

+   estimator
+   pipeline
    sample
    checkpoint
    dask
doc/pipeline.rst (new file, +66, −0)
.. _bob.pipelines.pipeline:

Making pipelines
================
We rely on the :ref:`scikit-learn pipeline API<scikit-learn:pipeline>` to run our
pipelines.
A pipeline is essentially a series of :py:class:`sklearn.base.BaseEstimator`
objects. When data is fed to the pipeline, the first Estimator receives and
processes it, its output is fed to the second Estimator, and so on until the
end of the steps. The result of the last Estimator is the result of the pipeline.
To make a pipeline, you can call :py:func:`sklearn.pipeline.make_pipeline` or directly
instantiate a Pipeline object with :py:class:`sklearn.pipeline.Pipeline`.
Below is a quick example of how to make a pipeline out of two Estimators:
.. doctest:: pipeline

    >>> import numpy
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.decomposition import PCA
    >>> pipeline = make_pipeline(StandardScaler(), PCA())
    >>> pipeline = pipeline.fit([[0, 0], [0, 0], [2, 2]])
    >>> pipeline.transform([[0, 1], [2, 1]])
    array([[-0.25,  0.75],
           [ 1.25, -0.75]])
The issue with scikit-learn pipelines
-------------------------------------
The previous example works fine as long as we only have one parameter to give to the
``transform`` or ``fit`` method of all the Estimators in the pipeline.
If we want to pass some metadata to the Estimators in a pipeline (like the
``offsets`` parameter from the previous page), we cannot include the Estimator
in a pipeline as-is.
This example takes the transformer with an ``offsets`` parameter defined on the
previous page and tries to make a pipeline out of it:
.. testsetup:: custom_estimator

    from sklearn.base import BaseEstimator
    import numpy as np

    class OffsetTransformerPerSample(BaseEstimator):
        """Demo Estimator to add a different offset to each sample array."""
        def transform(self, arrays: list[np.ndarray], offsets: list[float]) -> list[np.ndarray]:
            """Add its offset to each array."""
            return [arr + o for arr, o in zip(arrays, offsets)]

    transformer_2 = OffsetTransformerPerSample()
.. doctest:: custom_estimator

    >>> from sklearn.pipeline import make_pipeline
    >>> pipeline = make_pipeline(transformer_2)
    >>> pipeline.transform([np.array([3, 4, 5]), np.array([1, 2, 3])], [1, 2])
    Traceback (most recent call last):
        ...
    TypeError: _transform() takes 2 positional arguments but 3 were given
In order to include such an Estimator in a pipeline, we must add some logic to redirect
data and metadata through the pipeline. This is the goal of the :any:`Sample` and
:any:`SampleWrapper` presented in the next page.
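To illustrate that redirection logic, here is a minimal, self-contained sketch; ``MiniSample`` and ``MiniSampleWrapper`` are hypothetical stand-ins for the idea, not the actual :any:`Sample` and :any:`SampleWrapper` API:

```python
import numpy as np

class MiniSample:
    """Bundles one data item with arbitrary metadata attributes."""
    def __init__(self, data, **metadata):
        self.data = data
        for key, value in metadata.items():
            setattr(self, key, value)

class MiniSampleWrapper:
    """Pulls a metadata attribute out of each sample and forwards it to the
    wrapped estimator as the second ``transform`` argument."""
    def __init__(self, estimator, extra_attribute):
        self.estimator = estimator
        self.extra_attribute = extra_attribute  # e.g. "offset"

    def transform(self, samples):
        data = [s.data for s in samples]
        extras = [getattr(s, self.extra_attribute) for s in samples]
        results = self.estimator.transform(data, extras)
        # Re-wrap the results, keeping the metadata with each sample.
        return [MiniSample(r, **{self.extra_attribute: e})
                for r, e in zip(results, extras)]

class OffsetTransformerPerSample:
    def transform(self, arrays, offsets):
        return [arr + o for arr, o in zip(arrays, offsets)]

wrapped = MiniSampleWrapper(OffsetTransformerPerSample(), "offset")
samples = [MiniSample(np.array([3, 4, 5]), offset=1),
           MiniSample(np.array([1, 2, 3]), offset=2)]
out = wrapped.transform(samples)
```

The wrapper, not the pipeline, is now responsible for splitting each sample into data and metadata, so the inner estimator's multi-argument ``transform`` never conflicts with the single-argument pipeline API.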
doc/sample.rst (+57, −3)
@@ -6,7 +6,7 @@ Samples, a way to enhance scikit pipelines with metadata

 Some tasks in pattern recognition demand the use of metadata to support some
 processing (e.g. face cropping, audio segmentation).
 To support scikit-learn based estimators with such requirements, this package
 provides two mechanisms:

-1. Wraps input data in a layer called :any:`Sample` that allows you to append
+1. Wraps input data in a container called :any:`Sample` that allows you to append
    some metadata to your original input data.
 2. A wrapper class (:any:`SampleWrapper`) that interplays between :any:`Sample`
    and your estimator.
@@ -87,8 +87,8 @@ While this transformer works well by itself, it can't be used by

     TypeError: _transform() takes 2 positional arguments but 3 were given

 To approach this issue, :any:`SampleWrapper` can be used. This class wraps
-other estimators and accepts as input samples and passes the data with
+other estimators and accepts as input :any:`Sample` objects and passes the data with
 metadata inside samples to the wrapped estimator:

 .. doctest::
@@ -215,3 +215,57 @@ transform each sample inside and returns the same SampleSets with new data.

     DelayedSample(offset=array([1]))
     >>> transformed_sample_sets[0].samples[1].data
     array([1., 1.])
Using Tags
----------

If an estimator always needs the same elements from the :any:`Sample` objects,
you can define which attributes it takes by setting tags at the class level,
instead of requiring the user to define this at instantiation time:
.. testsetup::

    from typing import Any
.. doctest::

    >>> class TaggedTransformer(BaseEstimator):
    ...     """Transforms samples with minimal user configuration."""
    ...     def transform(self, X: np.ndarray, offsets_kwarg: np.ndarray) -> np.ndarray:
    ...         """Adds an offset to each sample."""
    ...         return np.array(X) + np.array(offsets_kwarg)
    ...
    ...     def _more_tags(self) -> dict[str, Any]:
    ...         return {
    ...             "requires_fit": False,  # sklearn Estimator tag
    ...             "bob_input": "data",  # Optional (data is the default)
    ...             "bob_transform_extra_input": [("offsets_kwarg", "offset")],
    ...             "bob_output": "data",  # Optional (data is the default)
    ...         }
With these tags defined, when wrapping the transformer with :any:`SampleWrapper` you
don't need to specify any parameters (unless you want to override the tags):
.. doctest::

    >>> my_transformer = TaggedTransformer()
    >>> my_wrapped_transformer = bob.pipelines.SampleWrapper(my_transformer)
    >>> pipe = make_pipeline(my_wrapped_transformer)
    >>> result_samples = pipe.transform(samples)
    >>> np.array([s.data for s in result_samples])
    array([[0., 0.],
           [1., 1.],
           [2., 2.]])
The tags related to the :any:`SampleWrapper` are the following:
- ``bob_input``: Name of an attribute in :any:`Sample` that will be passed as the
  first parameter of the ``transform`` or ``fit`` method.
- ``bob_transform_extra_input``: The additional :any:`Sample` attributes to pass as
parameters to ``transform``. The format is a list of tuples, each containing the
parameter name in ``transform`` and the corresponding :any:`Sample` attribute to
load.
- ``bob_fit_extra_input``: Additional :any:`Sample` attributes to pass to the fit method.
- ``bob_output``: The :any:`Sample` attribute that will receive the result of
``transform``.
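To make the tag mechanism concrete, here is a self-contained sketch of how a wrapper could consume these tags; ``transform_samples`` and the plain ``Sample`` class below are illustrative stand-ins, not the real :any:`SampleWrapper` implementation:

```python
import numpy as np

class Sample:
    """Minimal stand-in: holds arbitrary attributes such as data and offset."""
    def __init__(self, **attributes):
        self.__dict__.update(attributes)

class TaggedTransformer:
    def transform(self, X, offsets_kwarg):
        # Offsets are (1,)-shaped arrays so they broadcast per sample row.
        return np.array(X) + np.array(offsets_kwarg)

    def _more_tags(self):
        return {
            "bob_input": "data",
            "bob_transform_extra_input": [("offsets_kwarg", "offset")],
            "bob_output": "data",
        }

def transform_samples(estimator, samples):
    """Read the bob_* tags to decide which Sample attributes to pass and
    where to store the result."""
    tags = estimator._more_tags()
    X = [getattr(s, tags["bob_input"]) for s in samples]
    kwargs = {param: [getattr(s, attr) for s in samples]
              for param, attr in tags["bob_transform_extra_input"]}
    results = estimator.transform(X, **kwargs)
    # Store each result under the bob_output attribute, keeping other metadata.
    return [Sample(**{**s.__dict__, tags["bob_output"]: r})
            for s, r in zip(samples, results)]

samples = [Sample(data=np.zeros(2), offset=np.array([0.0])),
           Sample(data=np.zeros(2), offset=np.array([1.0]))]
out = transform_samples(TaggedTransformer(), samples)
```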