Commit 9b4716e5 authored by Amir MOHAMMADI

Merge branch 'datasets-docs' into 'master'

Add documentation for CSV databases

See merge request !93
parents 2ff94990 307b3c37
# isort: skip_file
from . import distributed # noqa: F401
from . import transformers # noqa: F401
from . import utils # noqa: F401
@@ -33,6 +34,7 @@ from .wrappers import ( # noqa: F401
is_instance_nested,
is_pipeline_wrapped,
)
from .datasets import FileListToSamples, CSVToSamples, FileListDatabase
def __appropriate__(*args):
@@ -66,6 +68,9 @@ __appropriate__(
CheckpointWrapper,
DaskWrapper,
ToDaskBag,
FileListToSamples,
CSVToSamples,
FileListDatabase,
)
# gets sphinx autodoc done right - don't remove it
@@ -961,6 +961,9 @@ def wrap(bases, estimator=None, **kwargs):
If ``estimator`` is a pipeline, the estimators in that pipeline are wrapped.
The default behavior of wrappers can be customized through the tags; see
:any:`bob.pipelines.get_bob_tags` for more information.
Parameters
----------
bases : list
.. _bob.pipelines.csv_database:
File List Databases (CSV)
=========================
We saw in :ref:`bob.pipelines.sample` how using samples can improve the
workflow of our machine learning experiments. However, we did not discuss how to
create the samples in the first place.
In all reproducible machine learning experiments, each database comes with one
or several protocols that define exactly which files should be used for
training, development, and testing. These protocols can be defined in ``.csv``
files where each row represents a sample. Using ``.csv`` files to define the
protocols of a database is advantageous because such files are easy to create
and read, and they can be imported and used in many different libraries.
Here, we provide :any:`bob.pipelines.FileListDatabase`, which can be used to read
``.csv`` files and generate :py:class:`bob.pipelines.Sample` objects. The format is
extremely simple. You must put all the protocol files in a folder with the
following structure::
   dataset_protocols_path/<protocol>/<group>.csv
where each subfolder corresponds to one *protocol* and each file contains the
samples of one *group* or *set* (e.g. the training set). The protocol names are
the folder names, and the group names are the file names (without the ``.csv``
extension).
.. note::

   Instead of pointing to a folder, you can also point to a compressed tarball
   that contains the protocol files.
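For instance, a call like the following (the archive name here is purely
illustrative) would behave the same way as pointing to the extracted folder:

.. code-block:: python

   import bob.pipelines as mario

   # Hypothetical tarball whose members follow the
   # <protocol>/<group>.csv layout described above.
   database = mario.FileListDatabase(
       "dataset_protocols.tar.gz",
       protocol="default",
   )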
The ``.csv`` files must have the following structure::
   attribute_1,attribute_2,...,attribute_n
   sample_1_attribute_1,sample_1_attribute_2,...,sample_1_attribute_n
   sample_2_attribute_1,sample_2_attribute_2,...,sample_2_attribute_n
   ...
   sample_n_attribute_1,sample_n_attribute_2,...,sample_n_attribute_n
Each row contains exactly **one** sample (e.g. one image) and each column
represents one attribute of the samples (e.g. the path to the data or other
metadata).
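If you need to generate such protocol files yourself, a minimal sketch along
these lines (the attribute names and paths are made up for illustration) writes
a valid layout with the standard :mod:`csv` module:

.. code-block:: python

   import csv
   import os

   root = "my_dataset_protocols"  # hypothetical destination folder
   protocol = "default"
   rows = {
       "train": [("images/001.png", "subject_1"), ("images/002.png", "subject_2")],
       "test": [("images/101.png", "subject_1")],
   }

   os.makedirs(os.path.join(root, protocol), exist_ok=True)
   for group, samples in rows.items():
       with open(os.path.join(root, protocol, f"{group}.csv"), "w", newline="") as f:
           writer = csv.writer(f)
           writer.writerow(["path", "subject"])  # attribute names go in the header row
           writer.writerows(samples)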
An Example
----------
Below is an example of creating the iris database. The ``.csv`` files
distributed with this package have the following structure::
   iris_database/
       default/
           train.csv
           test.csv
As you can see, there is only one protocol called ``default`` and two groups,
``train`` and ``test``. Moreover, the ``.csv`` files have the following format::
   sepal_length,sepal_width,petal_length,petal_width,target
   5.1,3.5,1.4,0.2,Iris-setosa
   4.9,3,1.4,0.2,Iris-setosa
   ...
.. doctest:: csv_iris_database
>>> import pkg_resources
>>> import bob.pipelines as mario
>>> dataset_protocols_path = pkg_resources.resource_filename(
... 'bob.pipelines', 'tests/data/iris_database')
>>> database = mario.FileListDatabase(
... dataset_protocols_path,
... protocol="default",
... )
>>> database.samples(groups="train")
[Sample(data=None, sepal_length='5.1', sepal_width='3.5', petal_length='1.4', petal_width='0.2', target='Iris-setosa'), Sample(...)]
>>> database.samples(groups="test")
[Sample(data=None, sepal_length='5', sepal_width='3', petal_length='1.6', petal_width='0.2', target='Iris-setosa'), Sample(...)]
As you can see, all attributes are loaded as strings. Moreover, we may want to
*transform* our samples before using them.
Transforming Samples
--------------------
:any:`bob.pipelines.FileListDatabase` accepts a transformer that will be applied
to all samples:
.. doctest:: csv_iris_database
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> def prepare_data(sample):
... return np.array(
... [sample.sepal_length, sample.sepal_width,
... sample.petal_length, sample.petal_width],
... dtype=float
... )
>>> def prepare_iris_samples(samples):
... return [mario.Sample(prepare_data(sample), parent=sample) for sample in samples]
>>> database = mario.FileListDatabase(
... dataset_protocols_path,
... protocol="default",
... transformer=FunctionTransformer(prepare_iris_samples),
... )
>>> database.samples(groups="train")
[Sample(data=array([5.1, 3.5, 1.4, 0.2]), sepal_length='5.1', sepal_width='3.5', petal_length='1.4', petal_width='0.2', target='Iris-setosa'), Sample(...)]
.. note::

   The ``transformer`` used in the ``FileListDatabase`` will not be fitted, and
   you should not perform any computationally heavy processing on the samples in
   it. Do only the minimal processing needed to make the samples ready for
   experiments; most of the time, you just load the data from disk in this
   transformer and return delayed samples.
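For example, if each sample had a hypothetical ``path`` attribute pointing to an
image on disk (not the case in the iris example above), such a transformer could
return :any:`bob.pipelines.DelayedSample` objects so that the actual reading
only happens when ``sample.data`` is accessed. A rough sketch:

.. code-block:: python

   import functools

   import bob.pipelines as mario
   import imageio.v3 as iio  # any image reader works; this choice is an assumption

   def load_image(path):
       # Disk I/O happens only when ``sample.data`` is first accessed.
       return iio.imread(path)

   def delayed_image_samples(samples):
       return [
           mario.DelayedSample(functools.partial(load_image, s.path), parent=s)
           for s in samples
       ]

   # Used the same way as the transformer above:
   # database = mario.FileListDatabase(
   #     dataset_protocols_path,
   #     protocol="default",
   #     transformer=FunctionTransformer(delayed_image_samples),
   # )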
Now our samples are ready to be used and we can run a simple experiment with
them.
Running An Experiment
---------------------
Here, we want to train a Linear Discriminant Analysis (LDA) classifier on the
data. Before that, we want to normalize the range of our data and convert the
``target`` labels to integers.
.. doctest:: csv_iris_database
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> from sklearn.preprocessing import StandardScaler, LabelEncoder
>>> from sklearn.pipeline import Pipeline
>>> scaler = StandardScaler()
>>> encoder = LabelEncoder()
>>> lda = LinearDiscriminantAnalysis()
>>> scaler = mario.wrap(["sample"], scaler)
>>> encoder = mario.wrap(["sample"], encoder, input_attribute="target", output_attribute="y")
>>> lda = mario.wrap(["sample"], lda, fit_extra_arguments=[("y", "y")])
>>> pipeline = Pipeline([('scaler', scaler), ('encoder', encoder), ('lda', lda)])
>>> pipeline.fit(database.samples(groups="train"))
Pipeline(...)
>>> encoder.estimator.classes_
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']...)
>>> predictions = pipeline.predict(database.samples(groups="test"))
>>> predictions[0].data, predictions[0].target, predictions[0].y
(0, 'Iris-setosa', 0)
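To turn these per-sample predictions into an overall score, you could, for
instance, compare them against the encoded targets (a small sketch;
``accuracy_score`` comes from scikit-learn and the attribute names follow the
wrappers configured above):

.. code-block:: python

   from sklearn.metrics import accuracy_score

   y_true = [sample.y for sample in predictions]     # encoded targets from the "encoder" step
   y_pred = [sample.data for sample in predictions]  # LDA predictions carried in sample.data
   print("test accuracy:", accuracy_score(y_true, y_pred))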
doc/img/checkpoint.png (27 KiB)
doc/img/dask.png (17.8 KiB)
doc/img/metadata.png (276 KiB)
@@ -8,34 +8,22 @@
Easily boost your :doc:`Scikit Learn Pipelines <modules/generated/sklearn.pipeline.Pipeline>` with powerful features, such as:
.. figure:: img/dask.png
   :width: 40%
   :align: center

   Scale them with Dask

.. figure:: img/metadata.png
   :width: 40%
   :align: center

   Wrap datapoints with metadata and pass them to the `estimator.fit` and `estimator.transform` methods

.. figure:: img/checkpoint.png
   :width: 40%
   :align: center

   Checkpoint datapoints after each step of your pipeline
* Scaling experiments on dask_.
* Wrapping data-points with metadata and passing them to the `estimator.fit` and `estimator.transform` methods.
* Checkpointing data-points after each step of your pipeline.
* Expressing database protocols as CSV files and using them easily.
.. warning::

   Before investigating what this package is capable of, check the scikit-learn
   :ref:`user guide <scikit-learn:pipeline>`. Several :ref:`tutorials
   <scikit-learn:tutorial_menu>` are available online.

.. warning::

   If you want to implement your own scikit-learn estimator, please check out
   this :doc:`link <scikit-learn:developers/develop>`.
User Guide
==========
@@ -46,5 +34,8 @@ User Guide
sample
checkpoint
dask
datasets
xarray
py_api
.. include:: links.rst
@@ -30,9 +30,9 @@ Wrapper's API
Database's API
--------------
.. autosummary::
bob.pipelines.datasets.FileListDatabase
bob.pipelines.datasets.FileListToSamples
bob.pipelines.datasets.CSVToSamples
bob.pipelines.FileListDatabase
bob.pipelines.FileListToSamples
bob.pipelines.CSVToSamples
Transformers' API
-----------------
@@ -77,7 +77,3 @@ Transformers
xarray Wrapper
==============
.. automodule:: bob.pipelines.xarray
Filelist Datasets
=================
.. automodule:: bob.pipelines.datasets
@@ -3,6 +3,11 @@
Efficient pipelines with dask and xarray
========================================
.. note::

   This section of the API is not used by the ``bob.bio`` and ``bob.pad``
   packages. If you are only interested in learning those packages, you can skip
   this page.
In this guide, we will see an alternative to the methods discussed before for
sample-based processing, checkpointing, and dask integration. We will cover the
same concepts, but in a more efficient way.
@@ -12,7 +17,7 @@ In this guide we are interested in several things:
#. Sample-based processing and carrying the sample metadata over the
pipeline.
#. Checkpointing: we may want to save intermediate steps.
#. Lazy operations and graph optimizations. We'll define all operations using
dask and we will benefit from lazy operations and graph optimizations.
#. Failed sample handling: we may want to drop some samples in the pipeline if
we fail to process them.
@@ -23,7 +28,7 @@ This guide builds upon `scikit-learn`_, `dask`_, and `xarray`_. If you are
not familiar with those libraries, you may want to get familiar with those
libraries first.
First, let's run our example classification problem without using the tools here
to get familiar with our examples. We are going to do a Scaler+PCA+LDA example
on the iris dataset:
@@ -54,13 +59,13 @@ on the iris dataset:
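A minimal sketch of such a plain scikit-learn run, assuming explicit
``fit``/``transform`` calls rather than a single ``Pipeline``, could look like
this:

.. code-block:: python

   from sklearn.datasets import load_iris
   from sklearn.decomposition import PCA
   from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
   from sklearn.preprocessing import StandardScaler

   iris = load_iris()

   scaler = StandardScaler()
   pca = PCA(n_components=3)
   lda = LinearDiscriminantAnalysis()

   # Fit step by step: iris.data passes through scaler.transform and
   # pca.transform before reaching the LDA.
   features = pca.fit_transform(scaler.fit_transform(iris.data))
   lda.fit(features, iris.target)

   # At prediction time the same two transforms are applied again.
   decisions = lda.decision_function(pca.transform(scaler.transform(iris.data)))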
As you can see here, the example ran fine. The ``iris.data`` was transformed
twice using ``scaler.transform`` and ``pca.transform`` but that's ok and we
could have avoided that at the cost of complexity and more memory usage.
Let's go through this example again and increase its complexity
as we progress.
Sample-based processing
-----------------------
First, let's look at how we can turn this into a sample-based pipeline. We need
to convert our dataset to a list of samples first:
.. doctest::
@@ -80,7 +85,7 @@ to convert our dataset to a list of samples first:
You may be already familiar with our sample concept. If not, please read more on
:ref:`bob.pipelines.sample`. Now, to optimize our process, we will represent our
samples in an :any:`xarray.Dataset` using :any:`dask.array.Array` objects:
.. doctest::
@@ -146,10 +151,10 @@ dictionary instead.
DatasetPipeline(...)
The dictionaries are used to construct :any:`Block` objects. You can check out
that class to see what options are possible.
Now let's fit our pipeline with our xarray dataset. Ideally, we want this fit
step to be postponed until we call :any:`dask.compute` on our results. But this
does not happen here, which we will explain later.
@@ -157,7 +162,7 @@ results. But this does not happen here which we will explain later.
>>> _ = pipeline.fit(dataset)
Now let's call ``decision_function`` on our pipeline. What will be
returned is a new dataset with the ``data`` variable changed to the
output of ``lda.decision_function``.
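A rough sketch of that call, reusing the ``pipeline`` and ``dataset`` objects
built above and materializing the lazy result with xarray's ``compute``:

.. code-block:: python

   result = pipeline.decision_function(dataset)

   # ``result`` is still backed by dask arrays; computing it materializes
   # the values of the new ``data`` variable.
   result = result.compute()
   print(result.data)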