From bed15697d70f3dc236fbc01f9dbaba8298a64288 Mon Sep 17 00:00:00 2001
From: Yannick DAYER <yannick.dayer@idiap.ch>
Date: Fri, 7 Jun 2024 15:53:22 +0200
Subject: [PATCH] doc(doctest): fix output of xarray showing sizes.

---
 doc/xarray.rst | 68 +++++++++++++++++++++++++-------------------------
 1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/doc/xarray.rst b/doc/xarray.rst
index 8baf2a1..9d7d4d3 100644
--- a/doc/xarray.rst
+++ b/doc/xarray.rst
@@ -91,12 +91,12 @@ samples in an :any:`xarray.Dataset` using :any:`dask.array.Array`'s:

     >>> dataset = bob.pipelines.xr.samples_to_dataset(samples, npartitions=3)
     >>> dataset # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 6kB
     Dimensions: (sample: 150, dim_0: 4)
     Dimensions without coordinates: sample, dim_0
     Data variables:
-        target (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
-        data (sample, dim_0) float64 dask.array<chunksize=(50, 4), meta=np.ndarray>
+        target (sample) int64 1kB dask.array<chunksize=(50,), meta=np.ndarray>
+        data (sample, dim_0) float64 5kB dask.array<chunksize=(50, 4), meta=np.ndarray>

 You can see here that our ``samples`` were converted to a dataset of dask
 arrays. The dataset is made of two *dimensions*: ``sample`` and ``dim_0``. We
@@ -118,12 +118,12 @@ about ``data`` in our samples:

     >>> meta = xr.DataArray(samples[0].data, dims=("feature"))
     >>> dataset = bob.pipelines.xr.samples_to_dataset(samples, npartitions=3, meta=meta)
     >>> dataset # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 6kB
     Dimensions: (sample: 150, feature: 4)
     Dimensions without coordinates: sample, feature
     Data variables:
-        target (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
-        data (sample, feature) float64 dask.array<chunksize=(50, 4), meta=np.ndarray>
+        target (sample) int64 1kB dask.array<chunksize=(50,), meta=np.ndarray>
+        data (sample, feature) float64 5kB dask.array<chunksize=(50, 4), meta=np.ndarray>

 Now, we want to build a pipeline that instead of numpy arrays, processes this
 dataset instead. We can do that with our :any:`DatasetPipeline`. A dataset
@@ -170,12 +170,12 @@ output of ``lda.decision_function``.

     >>> ds = pipeline.decision_function(dataset)
     >>> ds # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 5kB
     Dimensions: (sample: 150, c: 3)
     Dimensions without coordinates: sample, c
     Data variables:
-        target (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
-        data (sample, c) float64 dask.array<chunksize=(50, 3), meta=np.ndarray>
+        target (sample) int64 1kB dask.array<chunksize=(50,), meta=np.ndarray>
+        data (sample, c) float64 4kB dask.array<chunksize=(50, 3), meta=np.ndarray>

 To get the results as numpy arrays you can call ``.compute()`` on xarray
 or dask objects:
@@ -183,12 +183,12 @@ or dask objects:

 .. doctest::

     >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 5kB
     Dimensions: (sample: 150, c: 3)
     Dimensions without coordinates: sample, c
     Data variables:
-        target (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
-        data (sample, c) float64 28.42 -15.84 -59.68 20.69 ... -57.81 3.79 6.92
+        target (sample) int64 1kB 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2
+        data (sample, c) float64 4kB 28.42 -15.84 -59.68 ... -57.81 3.79 6.92


 Our operations were not lazy here (you can't see in the docs that it was not
@@ -222,12 +222,12 @@ For new and unknown dimension sizes use `np.nan`.

     >>> ds = pipeline.fit(dataset).decision_function(dataset)
     >>> ds # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 5kB
     Dimensions: (sample: 150, class: 3)
     Dimensions without coordinates: sample, class
     Data variables:
-        target (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
-        data (sample, class) float64 dask.array<chunksize=(50, 3), meta=np.ndarray>
+        target (sample) int64 1kB dask.array<chunksize=(50,), meta=np.ndarray>
+        data (sample, class) float64 4kB dask.array<chunksize=(50, 3), meta=np.ndarray>


 This time nothing was computed. We can get the results by calling
@@ -236,12 +236,12 @@ This time nothing was computed. We can get the results by calling

 .. doctest::

     >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 5kB
     Dimensions: (sample: 150, class: 3)
     Dimensions without coordinates: sample, class
     Data variables:
-        target (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
-        data (sample, class) float64 28.42 -15.84 -59.68 ... -57.81 3.79 6.92
+        target (sample) int64 1kB 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2
+        data (sample, class) float64 4kB 28.42 -15.84 -59.68 ... 3.79 6.92

     >>> ds.data.data.visualize(format="svg") # doctest: +SKIP

 In the visualization of the dask graph below, you can see that dask is only
@@ -274,13 +274,13 @@ features. Let's add the ``key`` metadata to our dataset first:

     >>> meta = xr.DataArray(samples[0].data, dims=("feature"))
     >>> dataset = bob.pipelines.xr.samples_to_dataset(samples, npartitions=3, meta=meta)
     >>> dataset # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 7kB
     Dimensions: (sample: 150, feature: 4)
     Dimensions without coordinates: sample, feature
     Data variables:
-        target (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
-        key (sample) int64 dask.array<chunksize=(50,), meta=np.ndarray>
-        data (sample, feature) float64 dask.array<chunksize=(50, 4), meta=np.ndarray>
+        target (sample) int64 1kB dask.array<chunksize=(50,), meta=np.ndarray>
+        key (sample) int64 1kB dask.array<chunksize=(50,), meta=np.ndarray>
+        data (sample, feature) float64 5kB dask.array<chunksize=(50, 4), meta=np.ndarray>

 .. testsetup::
@@ -314,13 +314,13 @@ features:

     >>> ds = pipeline.fit(dataset).decision_function(dataset)
     >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 6kB
     Dimensions: (sample: 150, class: 3)
     Dimensions without coordinates: sample, class
     Data variables:
-        target (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
-        key (sample) int64 0 1 2 3 4 5 6 7 ... 142 143 144 145 146 147 148 149
-        data (sample, class) float64 28.42 -15.84 -59.68 ... -57.81 3.79 6.92
+        target (sample) int64 1kB 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2
+        key (sample) int64 1kB 0 1 2 3 4 5 6 7 ... 143 144 145 146 147 148 149
+        data (sample, class) float64 4kB 28.42 -15.84 -59.68 ... 3.79 6.92

 Now if you repeat the operations, the checkpoints will be used:
@@ -328,13 +328,13 @@ Now if you repeat the operations, the checkpoints will be used:

     >>> ds = pipeline.fit(dataset).decision_function(dataset)
     >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 6kB
     Dimensions: (sample: 150, class: 3)
     Dimensions without coordinates: sample, class
     Data variables:
-        target (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
-        key (sample) int64 0 1 2 3 4 5 6 7 ... 142 143 144 145 146 147 148 149
-        data (sample, class) float64 28.42 -15.84 -59.68 ... -57.81 3.79 6.92
+        target (sample) int64 1kB 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2
+        key (sample) int64 1kB 0 1 2 3 4 5 6 7 ... 143 144 145 146 147 148 149
+        data (sample, class) float64 4kB 28.42 -15.84 -59.68 ... 3.79 6.92

     >>> ds.data.data.visualize(format="svg") # doctest: +SKIP
@@ -388,13 +388,13 @@ Now in our pipeline, we want to drop ``nan`` samples after PCA transformations:
     ... )
     >>> ds = pipeline.fit(dataset).decision_function(dataset)
     >>> ds.compute() # doctest: +NORMALIZE_WHITESPACE
-    <xarray.Dataset>
+    <xarray.Dataset> Size: 3kB
     Dimensions: (sample: 75, class: 3)
     Dimensions without coordinates: sample, class
     Data variables:
-        target (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2
-        key (sample) int64 1 3 5 7 9 11 13 15 ... 137 139 141 143 145 147 149
-        data (sample, class) float64 21.74 -13.45 -54.81 ... -58.76 4.178 8.07
+        target (sample) int64 600B 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2
+        key (sample) int64 600B 1 3 5 7 9 11 13 ... 137 139 141 143 145 147 149
+        data (sample, class) float64 2kB 21.74 -13.45 -54.81 ... 4.178 8.07

 You can see that we have 75 samples now instead of 150 samples. The
 ``dataset_map`` option is generic. You can apply any operation in this function.
--
GitLab
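
The ``Size:`` annotations in the updated doctest outputs are printed by newer
xarray releases, which prepend the in-memory size of a ``Dataset`` and of each
of its variables to the text repr; older releases printed no sizes, which is
why every expected output in ``doc/xarray.rst`` had to change. The sketch below
(not part of the patch) reproduces those sizes without bob.pipelines, assuming
only the shapes the doctests use: 150 samples split into 3 partitions, 4
features, ``int64`` targets and ``float64`` data; the variable names are
illustrative::

    # Minimal sketch: reproduce the "<xarray.Dataset> Size: 6kB" header that
    # the updated doctests expect. Requires a recent xarray plus dask.
    import dask.array as da
    import numpy as np
    import xarray as xr

    # Same shapes and chunking as the doctests: 150 samples in 3 partitions.
    target = da.zeros(150, dtype=np.int64, chunks=(50,))         # 150 * 8 B = 1200 B -> "1kB"
    data = da.zeros((150, 4), dtype=np.float64, chunks=(50, 4))  # 150 * 4 * 8 B = 4800 B -> "5kB"

    ds = xr.Dataset(
        {"target": (("sample",), target), "data": (("sample", "dim_0"), data)}
    )
    print(ds)  # on recent xarray the first line reads "<xarray.Dataset> Size: 6kB"

The same arithmetic explains the last hunk: after ``dataset_map`` drops half of
the samples, the 75 remaining ``int64`` values occupy 600 B, the ``float64``
scores occupy 75 * 3 * 8 B = 1800 B ("2kB"), and the dataset header shrinks to
``Size: 3kB``.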