Commit 9b4716e5 authored by Amir MOHAMMADI

Merge branch 'datasets-docs' into 'master'

Add documentation for CSV databases

See merge request !93
parents 2ff94990 307b3c37
# isort: skip_file
from . import distributed # noqa: F401
from . import transformers # noqa: F401
from . import utils # noqa: F401
@@ -33,6 +34,7 @@ from .wrappers import ( # noqa: F401
is_instance_nested,
is_pipeline_wrapped,
)
from .datasets import FileListToSamples, CSVToSamples, FileListDatabase
def __appropriate__(*args):
@@ -66,6 +68,9 @@ __appropriate__(
CheckpointWrapper,
DaskWrapper,
ToDaskBag,
FileListToSamples,
CSVToSamples,
FileListDatabase,
)
# gets sphinx autodoc done right - don't remove it
@@ -961,6 +961,9 @@ def wrap(bases, estimator=None, **kwargs):
If ``estimator`` is a pipeline, the estimators in that pipeline are wrapped.
The default behavior of wrappers can be customized through the tags; see
:any:`bob.pipelines.get_bob_tags` for more information.
Parameters
----------
bases : list
.. _bob.pipelines.csv_database:
File List Databases (CSV)
=========================
We saw in :ref:`bob.pipelines.sample` how using samples can improve the
workflow of our machine learning experiments. However, we did not discuss how to
create the samples in the first place.
In all reproducible machine learning experiments, each database comes with one
or several protocols that define exactly which files should be used for
training, development, and testing. These protocols can be defined in ``.csv``
files where each row represents a sample. Using ``.csv`` files to define the
protocols of a database is advantageous because such files are easy to create
and read, and they can be imported and used in many different libraries.
Here, we provide :any:`bob.pipelines.FileListDatabase`, which can be used to read
``.csv`` files and generate :py:class:`bob.pipelines.Sample` objects. The format is
extremely simple. You must put all the protocol files in a folder with the
following structure::
   dataset_protocols_path/<protocol>/<group>.csv
where each subfolder corresponds to one *protocol* and each file contains the
samples of one *group* or *set* (e.g. the training set). The protocol names are
the folder names, and the group names are the file names (without the ``.csv``
extension).
.. note::

   Instead of pointing to a folder, you can also point to a compressed tarball
   that contains the protocol files.
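For instance, a call like the following (the archive name here is purely
illustrative) would behave the same way as pointing to the extracted folder:

.. code-block:: python

   import bob.pipelines as mario

   # Hypothetical tarball whose members follow the
   # <protocol>/<group>.csv layout described above.
   database = mario.FileListDatabase(
       "dataset_protocols.tar.gz",
       protocol="default",
   )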
The ``.csv`` files must have the following structure::
   attribute_1,attribute_2,...,attribute_n
   sample_1_attribute_1,sample_1_attribute_2,...,sample_1_attribute_n
   sample_2_attribute_1,sample_2_attribute_2,...,sample_2_attribute_n
   ...
   sample_n_attribute_1,sample_n_attribute_2,...,sample_n_attribute_n
Each row contains exactly **one** sample (e.g. one image) and each column
represents one attribute of the samples (e.g. the path to the data or other
metadata).
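If you need to generate such protocol files yourself, a minimal sketch along
these lines (the attribute names and paths are made up for illustration) writes
a valid layout with the standard :mod:`csv` module:

.. code-block:: python

   import csv
   import os

   root = "my_dataset_protocols"  # hypothetical destination folder
   protocol = "default"
   rows = {
       "train": [("images/001.png", "subject_1"), ("images/002.png", "subject_2")],
       "test": [("images/101.png", "subject_1")],
   }

   os.makedirs(os.path.join(root, protocol), exist_ok=True)
   for group, samples in rows.items():
       with open(os.path.join(root, protocol, f"{group}.csv"), "w", newline="") as f:
           writer = csv.writer(f)
           writer.writerow(["path", "subject"])  # attribute names go in the header row
           writer.writerows(samples)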
An Example
----------
Below is an example of creating the iris database. The ``.csv`` files
distributed with this package have the following structure::
   iris_database/
       default/
           train.csv
           test.csv
As you can see, there is only one protocol called ``default`` and two groups,
``train`` and ``test``. Moreover, the ``.csv`` files have the following format::
   sepal_length,sepal_width,petal_length,petal_width,target
   5.1,3.5,1.4,0.2,Iris-setosa
   4.9,3,1.4,0.2,Iris-setosa
   ...
.. doctest:: csv_iris_database
>>> import pkg_resources
>>> import bob.pipelines as mario
>>> dataset_protocols_path = pkg_resources.resource_filename(
... 'bob.pipelines', 'tests/data/iris_database')
>>> database = mario.FileListDatabase(
... dataset_protocols_path,
... protocol="default",
... )
>>> database.samples(groups="train")
[Sample(data=None, sepal_length='5.1', sepal_width='3.5', petal_length='1.4', petal_width='0.2', target='Iris-setosa'), Sample(...)]
>>> database.samples(groups="test")
[Sample(data=None, sepal_length='5', sepal_width='3', petal_length='1.6', petal_width='0.2', target='Iris-setosa'), Sample(...)]
As you can see, all attributes are loaded as strings. Moreover, we may want to
*transform* our samples before using them.
Transforming Samples
--------------------
:any:`bob.pipelines.FileListDatabase` accepts a transformer that will be applied
to all samples:
.. doctest:: csv_iris_database
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> def prepare_data(sample):
... return np.array(
... [sample.sepal_length, sample.sepal_width,
... sample.petal_length, sample.petal_width],
... dtype=float
... )
>>> def prepare_iris_samples(samples):
... return [mario.Sample(prepare_data(sample), parent=sample) for sample in samples]
>>> database = mario.FileListDatabase(
... dataset_protocols_path,
... protocol="default",
... transformer=FunctionTransformer(prepare_iris_samples),
... )
>>> database.samples(groups="train")
[Sample(data=array([5.1, 3.5, 1.4, 0.2]), sepal_length='5.1', sepal_width='3.5', petal_length='1.4', petal_width='0.2', target='Iris-setosa'), Sample(...)]
.. note::

   The ``transformer`` used in the ``FileListDatabase`` will not be fitted, and
   you should not perform any computationally heavy processing on the samples in
   it. Do only the minimal processing needed to make the samples ready for
   experiments; most of the time, you just load the data from disk in this
   transformer and return delayed samples.
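For example, if each sample had a hypothetical ``path`` attribute pointing to an
image on disk (not the case in the iris example above), such a transformer could
return :any:`bob.pipelines.DelayedSample` objects so that the actual reading
only happens when ``sample.data`` is accessed. A rough sketch:

.. code-block:: python

   import functools

   import bob.pipelines as mario
   import imageio.v3 as iio  # any image reader works; this choice is an assumption

   def load_image(path):
       # Disk I/O happens only when ``sample.data`` is first accessed.
       return iio.imread(path)

   def delayed_image_samples(samples):
       return [
           mario.DelayedSample(functools.partial(load_image, s.path), parent=s)
           for s in samples
       ]

   # Used the same way as the transformer above:
   # database = mario.FileListDatabase(
   #     dataset_protocols_path,
   #     protocol="default",
   #     transformer=FunctionTransformer(delayed_image_samples),
   # )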
Now our samples are ready to be used and we can run a simple experiment with
them.
Running An Experiment
---------------------
Here, we want to train a Linear Discriminant Analysis (LDA) classifier on the
data. Before that, we want to normalize the range of our data and convert the
``target`` labels to integers.
.. doctest:: csv_iris_database
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> from sklearn.preprocessing import StandardScaler, LabelEncoder
>>> from sklearn.pipeline import Pipeline
>>> scaler = StandardScaler()
>>> encoder = LabelEncoder()
>>> lda = LinearDiscriminantAnalysis()
>>> scaler = mario.wrap(["sample"], scaler)
>>> encoder = mario.wrap(["sample"], encoder, input_attribute="target", output_attribute="y")
>>> lda = mario.wrap(["sample"], lda, fit_extra_arguments=[("y", "y")])
>>> pipeline = Pipeline([('scaler', scaler), ('encoder', encoder), ('lda', lda)])
>>> pipeline.fit(database.samples(groups="train"))
Pipeline(...)
>>> encoder.estimator.classes_
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']...)
>>> predictions = pipeline.predict(database.samples(groups="test"))
>>> predictions[0].data, predictions[0].target, predictions[0].y
(0, 'Iris-setosa', 0)
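To turn these per-sample predictions into an overall score, you could, for
instance, compare them against the encoded targets (a small sketch;
``accuracy_score`` comes from scikit-learn and the attribute names follow the
wrappers configured above):

.. code-block:: python

   from sklearn.metrics import accuracy_score

   y_true = [sample.y for sample in predictions]     # encoded targets from the "encoder" step
   y_pred = [sample.data for sample in predictions]  # LDA predictions carried in sample.data
   print("test accuracy:", accuracy_score(y_true, y_pred))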
doc/img/checkpoint.png (27 KiB)
doc/img/dask.png (17.8 KiB)
doc/img/metadata.png (276 KiB)
@@ -8,34 +8,22 @@
Easily boost your :doc:`Scikit Learn Pipelines <modules/generated/sklearn.pipeline.Pipeline>` with powerful features, such as:
.. figure:: img/dask.png
   :width: 40%
   :align: center

   Scale them with Dask

.. figure:: img/metadata.png
   :width: 40%
   :align: center

   Wrap datapoints with metadata and pass them to the `estimator.fit` and `estimator.transform` methods

.. figure:: img/checkpoint.png
   :width: 40%
   :align: center

   Checkpoint datapoints after each step of your pipeline
* Scaling experiments on dask_.
* Wrapping data-points with metadata and passing them to the `estimator.fit` and `estimator.transform` methods.
* Checkpointing data-points after each step of your pipeline.
* Expressing database protocols as CSV files and using them easily.
.. warning::

   Before investigating what this package is capable of, check the scikit-learn
   :ref:`user guide <scikit-learn:pipeline>`. Several :ref:`tutorials
   <scikit-learn:tutorial_menu>` are available online.

.. warning::

   If you want to implement your own scikit-learn estimator, please check out
   this :doc:`link <scikit-learn:developers/develop>`.
User Guide
==========
@@ -46,5 +34,8 @@ User Guide
sample
checkpoint
dask
datasets
xarray
py_api
.. include:: links.rst
@@ -30,9 +30,9 @@ Wrapper's API
Database's API
--------------
.. autosummary::
bob.pipelines.datasets.FileListDatabase
bob.pipelines.datasets.FileListToSamples
bob.pipelines.datasets.CSVToSamples
bob.pipelines.FileListDatabase
bob.pipelines.FileListToSamples
bob.pipelines.CSVToSamples
Transformers' API
-----------------
@@ -77,7 +77,3 @@ Transformers
xarray Wrapper
==============
.. automodule:: bob.pipelines.xarray
Filelist Datasets
=================
.. automodule:: bob.pipelines.datasets
@@ -3,6 +3,11 @@
Efficient pipelines with dask and xarray
========================================
.. note::

   This section of the API is not used by the ``bob.bio`` and ``bob.pad``
   packages. If you are only interested in learning those packages, you can skip
   this page.
In this guide, we will see an alternative to the methods discussed before for
sample-based processing, checkpointing, and dask integration. We will cover the
same concepts, but in a more efficient way.
@@ -12,7 +17,7 @@ In this guide we are interested in several things:
#. Sample-based processing and carrying the sample metadata over the
pipeline.
#. Checkpointing: we may want to save intermediate steps.
#. Lazy operations and graph optimizations. We'll define all operations using
dask and we will benefit from lazy operations and graph optimizations.
#. Failed sample handling: we may want to drop some samples in the pipeline if
we fail to process them.
@@ -23,7 +28,7 @@ This guide builds upon `scikit-learn`_, `dask`_, and `xarray`_. If you are
not familiar with those libraries, you may want to get familiar with those
libraries first.
First, let's run our example classification problem without using the tools here
to get familiar with our examples. We are going to do a Scaler+PCA+LDA example
on the iris dataset:
@@ -54,13 +59,13 @@ on the iris dataset:
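A minimal sketch of such a plain scikit-learn run, assuming explicit
``fit``/``transform`` calls rather than a single ``Pipeline``, could look like
this:

.. code-block:: python

   from sklearn.datasets import load_iris
   from sklearn.decomposition import PCA
   from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
   from sklearn.preprocessing import StandardScaler

   iris = load_iris()

   scaler = StandardScaler()
   pca = PCA(n_components=3)
   lda = LinearDiscriminantAnalysis()

   # Fit step by step: iris.data passes through scaler.transform and
   # pca.transform before reaching the LDA.
   features = pca.fit_transform(scaler.fit_transform(iris.data))
   lda.fit(features, iris.target)

   # At prediction time the same two transforms are applied again.
   decisions = lda.decision_function(pca.transform(scaler.transform(iris.data)))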
As you can see here, the example ran fine. The ``iris.data`` was transformed
twice using ``scaler.transform`` and ``pca.transform`` but that's ok and we
could have avoided that at the cost of complexity and more memory usage.
Let's go through this example again and increase its complexity
as we progress.
Sample-based processing
-----------------------
First, let's look at how we can turn this into a sample-based pipeline. We need
to convert our dataset to a list of samples first:
.. doctest::
@@ -80,7 +85,7 @@ to convert our dataset to a list of samples first:
You may be already familiar with our sample concept. If not, please read more on
:ref:`bob.pipelines.sample`. Now, to optimize our process, we will represent our
samples in an :any:`xarray.Dataset` using :any:`dask.array.Array` objects:
.. doctest::
@@ -146,10 +151,10 @@ dictionary instead.
DatasetPipeline(...)
The dictionaries are used to construct :any:`Block` objects. You can check out
that class to see what options are possible.
Now let's fit our pipeline with our xarray dataset. Ideally, we want this fit
step to be postponed until we call :any:`dask.compute` on our results. But this
does not happen here, which we will explain later.
@@ -157,7 +162,7 @@ results. But this does not happen here which we will explain later.
>>> _ = pipeline.fit(dataset)
Now let's call ``decision_function`` on our pipeline. What will be
returned is a new dataset with the ``data`` variable changed to the
output of ``lda.decision_function``.
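A rough sketch of that call, reusing the ``pipeline`` and ``dataset`` objects
built above and materializing the lazy result with xarray's ``compute``:

.. code-block:: python

   result = pipeline.decision_function(dataset)

   # ``result`` is still backed by dask arrays; computing it materializes
   # the values of the new ``data`` variable.
   result = result.compute()
   print(result.data)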