Commit 4403f266 in bob.pipelines
Authored 3 years ago by Amir MOHAMMADI
Merge branch 'pin-dask' into 'master'

[docs] update docs to match new API of xarray

Closes #35

See merge request !76
Parents: c8761c34, c84ceb0d
1 merge request: !76 [docs] update docs to match new API of xarray
Pipeline #54328 passed 3 years ago (stages: build, deploy)
Showing 3 changed files with 46 additions and 37 deletions:
- bob/pipelines/xarray.py: 13 additions, 2 deletions
- conda/meta.yaml: 2 additions, 2 deletions
- doc/xarray.rst: 31 additions, 33 deletions
bob/pipelines/xarray.py (+13, −2)
@@ -7,6 +7,7 @@ from functools import partial
 import cloudpickle
 import dask
+import h5py
 import numpy as np
 import xarray as xr
@@ -14,14 +15,24 @@ from sklearn.base import BaseEstimator
 from sklearn.pipeline import _name_estimators
 from sklearn.utils.metaestimators import _BaseComposition
-from bob.io.base import load, save
 from .sample import SAMPLE_DATA_ATTRS, _ReprMixin
 from .utils import is_estimator_stateless

 logger = logging.getLogger(__name__)

+def save(data, path):
+    array = np.require(data, requirements=("C_CONTIGUOUS", "ALIGNED"))
+    with h5py.File(path, "w") as f:
+        f.create_dataset("array", data=array)

+def load(path):
+    with h5py.File(path, "r") as f:
+        data = np.array(f["array"])
+    return data

 def _load_fn_to_xarray(load_fn, meta=None):
     if meta is None:
         meta = np.array(load_fn())
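The save and load helpers added in this hunk store a single array per HDF5 file: save forces the array to be C-contiguous and aligned and writes it to a dataset named "array", and load reads that dataset back as a numpy array. A small usage sketch, assuming the helpers stay importable from bob.pipelines.xarray (the file name here is illustrative, not part of the commit):

    import numpy as np

    from bob.pipelines.xarray import load, save  # assumption: helpers remain public module attributes

    # Hypothetical round-trip through the HDF5 layout introduced above.
    data = np.arange(12, dtype="float64").reshape(3, 4)
    save(data, "features.hdf5")        # writes one C-contiguous, aligned dataset named "array"
    restored = load("features.hdf5")   # reads the "array" dataset back as a numpy array
    assert np.array_equal(restored, data)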
conda/meta.yaml (+2, −2)
@@ -56,8 +56,8 @@ test:
     - {{ name }}
   commands:
     - pytest --verbose --cov {{ name }} --cov-report term-missing --cov-report html:{{ project_dir }}/sphinx/coverage --cov-report xml:{{ project_dir }}/coverage.xml --pyargs {{ name }}
-    - sphinx-build -aEW {{ project_dir }}/doc {{ project_dir }}/sphinx
-    - sphinx-build -aEb doctest {{ project_dir }}/doc sphinx
+    - sphinx-build -aEW {{ project_dir }}/doc {{ project_dir }}/sphinx  # [linux]
+    - sphinx-build -aEb doctest {{ project_dir }}/doc sphinx  # [linux]
     - conda inspect linkages -p $PREFIX {{ name }}  # [not win]
     - conda inspect objects -p $PREFIX {{ name }}  # [osx]
   requires:
doc/xarray.rst (+31, −33)
@@ -87,11 +87,11 @@ samples in an :any:`xarray.Dataset` using :any:`dask.array.Array`’s:
    >>> dataset = mario.xr.samples_to_dataset(samples, npartitions=3)
    >>> dataset # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (dim_0: 4, sample: 150)
-   Dimensions without coordinates: dim_0, sample
+   Dimensions:  (sample: 150, dim_0: 4)
+   Dimensions without coordinates: sample, dim_0
    Data variables:
-       target   (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
-       data     (sample, dim_0) float64 dask.array<chunksize=(50, 4), meta=np.ndarray>
+       target   (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
+       data     (sample, dim_0) float64 dask.array<chunksize=(50, 4), meta=np.ndarray>

 You can see here that our ``samples`` were converted to a dataset of dask
 arrays. The dataset is made of two *dimensions*: ``sample`` and ``dim_0``. We
@@ -114,13 +114,12 @@ about ``data`` in our samples:
    >>> dataset = mario.xr.samples_to_dataset(samples, npartitions=3, meta=meta)
    >>> dataset # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (feature: 4, sample: 150)
-   Dimensions without coordinates: feature, sample
+   Dimensions:  (sample: 150, feature: 4)
+   Dimensions without coordinates: sample, feature
    Data variables:
        target   (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
        data     (sample, feature) float64 dask.array<chunksize=(50, 4), meta=np.ndarray>

 Now, we want to build a pipeline that instead of numpy arrays, processes this
 dataset instead. We can do that with our :any:`DatasetPipeline`. A dataset
 pipeline is made of scikit-learn estimators but instead of working on numpy
@@ -167,13 +166,12 @@ output of ``lda.decision_function``.
    >>> ds = pipeline.decision_function(dataset)
    >>> ds # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (c: 3, sample: 150)
-   Dimensions without coordinates: c, sample
+   Dimensions:  (sample: 150, c: 3)
+   Dimensions without coordinates: sample, c
    Data variables:
        target   (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
        data     (sample, c) float64 dask.array<chunksize=(50, 3), meta=np.ndarray>

 To get the results as numpy arrays you can call ``.compute()`` on xarray
 or dask objects:
@@ -181,8 +179,8 @@ or dask objects:
    >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (c: 3, sample: 150)
-   Dimensions without coordinates: c, sample
+   Dimensions:  (sample: 150, c: 3)
+   Dimensions without coordinates: sample, c
    Data variables:
        target   (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
        data     (sample, c) float64 28.42 -15.84 -59.68 20.69 ... -57.81 3.79 6.92
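This part of the guide is about lazy evaluation: until ``.compute()`` is called, the data variables of the pipeline output stay dask arrays. A tiny standalone illustration with plain xarray and dask, independent of the pipeline objects in the guide:

    import dask.array as da
    import xarray as xr

    # A dask-backed DataArray shaped like the guide's decision scores (150 samples, 3 columns).
    lazy = xr.DataArray(da.ones((150, 3), chunks=(50, 3)), dims=("sample", "c"))
    print(type(lazy.data))      # dask.array.core.Array: nothing has been computed yet
    concrete = lazy.compute()   # runs the dask graph
    print(type(concrete.data))  # numpy.ndarray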
@@ -220,8 +218,8 @@ For new and unknown dimension sizes use `np.nan`.
    >>> ds = pipeline.fit(dataset).decision_function(dataset)
    >>> ds # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (class: 3, sample: 150)
-   Dimensions without coordinates: class, sample
+   Dimensions:  (sample: 150, class: 3)
+   Dimensions without coordinates: sample, class
    Data variables:
        target   (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
        data     (sample, class) float64 dask.array<chunksize=(50, 3), meta=np.ndarray>
@@ -234,8 +232,8 @@ This time nothing was computed. We can get the results by calling
    >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (class: 3, sample: 150)
-   Dimensions without coordinates: class, sample
+   Dimensions:  (sample: 150, class: 3)
+   Dimensions without coordinates: sample, class
    Data variables:
        target   (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
        data     (sample, class) float64 28.42 -15.84 -59.68 ... -57.81 3.79 6.92
@@ -272,8 +270,8 @@ features. Let's add the ``key`` metadata to our dataset first:
    >>> dataset = mario.xr.samples_to_dataset(samples, npartitions=3, meta=meta)
    >>> dataset # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (feature: 4, sample: 150)
-   Dimensions without coordinates: feature, sample
+   Dimensions:  (sample: 150, feature: 4)
+   Dimensions without coordinates: sample, feature
    Data variables:
        target   (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
        key      (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
@@ -312,12 +310,12 @@ features:
    >>> ds = pipeline.fit(dataset).decision_function(dataset)
    >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (class: 3, sample: 150)
-   Dimensions without coordinates: class, sample
+   Dimensions:  (sample: 150, class: 3)
+   Dimensions without coordinates: sample, class
    Data variables:
-       target   (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
-       key      (sample) int64 0 1 2 3 4 5 6 7 ... 142 143 144 145 146 147 148 149
-       data     (sample, class) float64 28.42 -15.84 -59.68 ... -57.81 3.79 6.92
+       target   (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
+       key      (sample) int64 0 1 2 3 4 5 6 7 ... 142 143 144 145 146 147 148 149
+       data     (sample, class) float64 28.42 -15.84 -59.68 ... -57.81 3.79 6.92

 Now if you repeat the operations, the checkpoints will be used:
@@ -326,12 +324,12 @@ Now if you repeat the operations, the checkpoints will be used:
    >>> ds = pipeline.fit(dataset).decision_function(dataset)
    >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (class: 3, sample: 150)
-   Dimensions without coordinates: class, sample
+   Dimensions:  (sample: 150, class: 3)
+   Dimensions without coordinates: sample, class
    Data variables:
-       target   (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
-       key      (sample) int64 0 1 2 3 4 5 6 7 ... 142 143 144 145 146 147 148 149
-       data     (sample, class) float64 28.42 -15.84 -59.68 ... -57.81 3.79 6.92
+       target   (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
+       key      (sample) int64 0 1 2 3 4 5 6 7 ... 142 143 144 145 146 147 148 149
+       data     (sample, class) float64 28.42 -15.84 -59.68 ... -57.81 3.79 6.92

    >>> ds.data.data.visualize(format="svg") # doctest: +SKIP
@@ -386,12 +384,12 @@ Now in our pipeline, we want to drop ``nan`` samples after PCA transformations:
    >>> ds = pipeline.fit(dataset).decision_function(dataset)
    >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
    <xarray.Dataset>
-   Dimensions:  (class: 3, sample: 75)
-   Dimensions without coordinates: class, sample
+   Dimensions:  (sample: 75, class: 3)
+   Dimensions without coordinates: sample, class
    Data variables:
-       target   (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
-       key      (sample) int64 1 3 5 7 9 11 13 15 ... 137 139 141 143 145 147 149
-       data     (sample, class) float64 21.74 -13.45 -54.81 ... -58.76 4.178 8.07
+       target   (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
+       key      (sample) int64 1 3 5 7 9 11 13 15 ... 137 139 141 143 145 147 149
+       data     (sample, class) float64 21.74 -13.45 -54.81 ... -58.76 4.178 8.07

 You can see that we have 75 samples now instead of 150 samples. The
 ``dataset_map`` option is generic. You can apply any operation in this function.
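Every hunk in doc/xarray.rst re-records the same detail: the xarray release targeted here lists a Dataset's dimensions in the order they were introduced rather than alphabetically, so "(class: 3, sample: 150)" becomes "(sample: 150, class: 3)" throughout. A standalone sketch of the repr the updated doctests expect, using plain xarray and dask rather than the bob.pipelines wrappers (shapes mirror the 150-sample example in the guide):

    import dask.array as da
    import numpy as np
    import xarray as xr

    # "sample" is introduced first (by target), then "dim_0" (by data).
    target = da.from_array(np.zeros(150, dtype="int64"), chunks=(50,))
    data = da.from_array(np.random.rand(150, 4), chunks=(50, 4))
    ds = xr.Dataset({"target": ("sample", target), "data": (("sample", "dim_0"), data)})

    # Newer xarray keeps insertion order, matching the added doctest lines:
    #   Dimensions:  (sample: 150, dim_0: 4)
    # Older releases sorted dimensions alphabetically, matching the removed lines:
    #   Dimensions:  (dim_0: 4, sample: 150)
    print(ds)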