medai / software / mednet / Commits / 313d962d

Commit 313d962d, authored 1 year ago by André Anjos

    [data.split] Make splits to be lazy-loadable (closes #27)

Parent: 1120d9df

Showing 1 changed file: src/ptbench/data/split.py (+57 additions, -56 deletions)
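The change replaces eager split loading in ``__init__`` with a ``functools.cached_property``, so constructing a split object no longer touches the disk. A minimal sketch of the pattern, with illustrative names only (this is not the mednet code itself):

    import functools
    import json
    import pathlib


    class LazySplit:
        """Illustrative only: defers file parsing until first access."""

        def __init__(self, path: str | pathlib.Path):
            # Construction is cheap: no I/O happens here.
            self._path = pathlib.Path(path)

        @functools.cached_property
        def _datasets(self) -> dict:
            # Runs once, on first attribute access; the result is then
            # stored on the instance and reused on subsequent accesses.
            with self._path.open() as f:
                return json.load(f)

        def __getitem__(self, key):
            return self._datasets[key]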
@@ -3,6 +3,7 @@
 # SPDX-License-Identifier: GPL-3.0-or-later

 import csv
+import functools
 import importlib.abc
 import json
 import logging
@@ -26,7 +27,7 @@ class JSONDatabaseSplit(DatabaseSplit):
     .. code-block:: json

        {
-           "subset1": [
+           "dataset1": [
                [
                    "sample1-data1",
                    "sample1-data2",
@@ -38,7 +39,7 @@ class JSONDatabaseSplit(DatabaseSplit):
                    "sample2-data3",
                ],
-           "subset2": [
+           "dataset2": [
                [
                    "sample42-data1",
                    "sample42-data2",
@@ -47,14 +48,16 @@ class JSONDatabaseSplit(DatabaseSplit):
            ]
        }

-    Your database split many contain any number of subsets (dictionary
-    keys).  For simplicity, we recommend all sample entries are formatted
-    similarly so that raw-data-loading is simplified. Use the function
+    Your database split many contain any number of (raw) datasets (dictionary
+    keys). For simplicity, we recommend all sample entries are formatted
+    similarly so that raw-data-loading is simplified. Use the function
     :py:func:`check_database_split_loading` to test raw data loading and fine
     tune the dataset split, or its loading.

-    Objects of this class behave like a dictionary in which keys are subset
-    names in the split, and values represent samples data and meta-data.
+    Objects of this class behave like a dictionary in which keys are dataset
+    names in the split, and values represent samples data and meta-data. The
+    actual JSON file descriptors are loaded on demand using
+    a py:func:`functools.cached_property`.

     Parameters
@@ -69,21 +72,20 @@ class JSONDatabaseSplit(DatabaseSplit):
         if isinstance(path, str):
             path = pathlib.Path(path)
         self._path = path
-        self._subsets = self._load_split_from_disk()

-    def _load_split_from_disk(self) -> DatabaseSplit:
-        """Loads all subsets in a split from its file system representation.
+    @functools.cached_property
+    def _datasets(self) -> DatabaseSplit:
+        """Datasets in a split.

-        This method will load JSON information for the current split and return
-        all subsets of the given split after converting each entry through the
-        loader function.
+        The first call to this (cached) property will trigger full JSON file
+        loading from disk. Subsequent calls will be cached.

         Returns
         -------

-        subsets : dict
-            A dictionary mapping subset names to lists of JSON objects
+        datasets : dict
+            A dictionary mapping dataset names to lists of JSON objects
         """

         if str(self._path).endswith(".bz2"):
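Because ``functools.cached_property`` stores its value in the instance ``__dict__``, the cache can also be discarded to force a reload if the file changes on disk. A hedged aside (the commit itself does not do this; file and dataset names below are placeholders):

    split = JSONDatabaseSplit("split.json")  # no disk access yet
    _ = split["train"]     # first access parses and caches the JSON file
    del split._datasets    # drop the cache; the next access reloads from disk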
@@ -95,16 +97,16 @@ class JSONDatabaseSplit(DatabaseSplit):
             return json.load(f)

     def __getitem__(self, key: str) -> typing.Sequence[typing.Any]:
-        """Accesses subset ``key`` from this split."""
-        return self._subsets[key]
+        """Accesses dataset ``key`` from this split."""
+        return self._datasets[key]

     def __iter__(self):
-        """Iterates over the subsets."""
-        return iter(self._subsets)
+        """Iterates over the datasets."""
+        return iter(self._datasets)

     def __len__(self) -> int:
-        """How many subsets we currently have."""
-        return len(self._subsets)
+        """How many datasets we currently have."""
+        return len(self._datasets)


 class CSVDatabaseSplit(DatabaseSplit):
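With ``__getitem__``, ``__iter__`` and ``__len__`` in place, a split behaves like a read-only mapping from dataset names to sample lists. A hypothetical usage sketch (file and key names are made up):

    split = JSONDatabaseSplit("my-split.json")
    print(len(split))          # number of datasets; triggers the single JSON load
    for name in split:         # iterate over dataset names
        samples = split[name]  # the list of samples for that dataset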
@@ -112,7 +114,7 @@ class CSVDatabaseSplit(DatabaseSplit):
     CSV format.

     To create a new database split, you need to provide one or more CSV
-    formatted files, each representing a subset of this split, containing the
+    formatted files, each representing a dataset of this split, containing the
     sample data (one per row). Example:

     Inside the directory ``my-split/``, one can file files ``train.csv``,
@@ -125,11 +127,11 @@ class CSVDatabaseSplit(DatabaseSplit):
        sample2-value1,sample2-value2,sample2-value3
        ...

-    Each file in the provided directory defines the subset name on the split.
-    So, the file ``train.csv`` will contain the data from the ``train`` subset,
+    Each file in the provided directory defines the dataset name on the split.
+    So, the file ``train.csv`` will contain the data from the ``train`` dataset,
     and so on.

-    Objects of this class behave like a dictionary in which keys are subset
+    Objects of this class behave like a dictionary in which keys are dataset
     names in the split, and values represent samples data and meta-data.
@@ -138,7 +140,7 @@ class CSVDatabaseSplit(DatabaseSplit):
     directory
         Absolute path to a directory containing the database split layed down
-        as a set of CSV files, one per subset.
+        as a set of CSV files, one per dataset.
     """

     def __init__(
@@ -150,53 +152,52 @@ class CSVDatabaseSplit(DatabaseSplit):
             directory.is_dir()
         ), f"`{str(directory)}` is not a valid directory"
         self._directory = directory
-        self._subsets = self._load_split_from_disk()

-    def _load_split_from_disk(self) -> DatabaseSplit:
-        """Loads all subsets in a split from its file system representation.
+    @functools.cached_property
+    def _datasets(self) -> DatabaseSplit:
+        """Datasets in a split.

-        This method will load CSV information for the current split and return all
-        subsets of the given split after converting each entry through the
-        loader function.
+        The first call to this (cached) property will trigger all CSV file
+        loading from disk. Subsequent calls will be cached.

         Returns
         -------

-        subsets : dict
-            A dictionary mapping subset names to lists of JSON objects
+        datasets : dict
+            A dictionary mapping dataset names to lists of JSON objects
         """

-        retval: DatabaseSplit = {}
-        for subset in self._directory.iterdir():
-            if str(subset).endswith(".csv.bz2"):
-                logger.debug(f"Loading database split from {subset}...")
-                with __import__("bz2").open(subset) as f:
+        retval: dict[str, typing.Sequence[typing.Any]] = {}
+        for dataset in self._directory.iterdir():
+            if str(dataset).endswith(".csv.bz2"):
+                logger.debug(f"Loading database split from {dataset}...")
+                with __import__("bz2").open(dataset) as f:
                     reader = csv.reader(f)
-                    retval[subset.name[: -len(".csv.bz2")]] = [
+                    retval[dataset.name[: -len(".csv.bz2")]] = [
                         k for k in reader
                     ]
-            elif str(subset).endswith(".csv"):
-                with subset.open() as f:
+            elif str(dataset).endswith(".csv"):
+                with dataset.open() as f:
                     reader = csv.reader(f)
-                    retval[subset.name[: -len(".csv")]] = [k for k in reader]
+                    retval[dataset.name[: -len(".csv")]] = [k for k in reader]
             else:
                 logger.debug(
-                    f"Ignoring file {subset} in CSVDatabaseSplit readout"
+                    f"Ignoring file {dataset} in CSVDatabaseSplit readout"
                )
         return retval

     def __getitem__(self, key: str) -> typing.Sequence[typing.Any]:
-        """Accesses subset ``key`` from this split."""
-        return self._subsets[key]
+        """Accesses dataset ``key`` from this split."""
+        return self._datasets[key]

     def __iter__(self):
-        """Iterates over the subsets."""
-        return iter(self._subsets)
+        """Iterates over the datasets."""
+        return iter(self._datasets)

     def __len__(self) -> int:
-        """How many subsets we currently have."""
-        return len(self._subsets)
+        """How many datasets we currently have."""
+        return len(self._datasets)


 def check_database_split_loading(
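Note how each dataset name is derived by stripping the file suffix from the file name; the same slicing trick, shown standalone:

    import pathlib

    path = pathlib.Path("my-split/train.csv.bz2")
    name = path.name[: -len(".csv.bz2")]  # "train.csv.bz2"[:-8] -> "train"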
@@ -204,7 +205,7 @@ def check_database_split_loading(
     loader: RawDataLoader,
     limit: int = 0,
 ) -> int:
-    """For each subset in the split, check if all data can be correctly loaded
+    """For each dataset in the split, check if all data can be correctly loaded
     using the provided loader function.

     This function will return the number of errors loading samples, and will
@@ -216,14 +217,14 @@ def check_database_split_loading(
     database_split
         A mapping that, contains the database split. Each key represents the
-        name of a subset in the split. Each value is a (potentially complex)
+        name of a dataset in the split. Each value is a (potentially complex)
         object that represents a single sample.

     loader
         A loader object that knows how to handle full-samples or just labels.

     limit
-        Maximum number of samples to check (in each split/subset
+        Maximum number of samples to check (in each split/dataset
         combination) in this dataset. If set to zero, then check
         everything.
@@ -235,10 +236,10 @@ def check_database_split_loading(
         Number of errors found
     """

     logger.info(
-        "Checking if can load all samples in all subsets of this split..."
+        "Checking if can load all samples in all datasets of this split..."
     )
     errors = 0
-    for subset, samples in database_split.items():
+    for dataset, samples in database_split.items():
         samples = samples if not limit else samples[:limit]
         for pos, sample in enumerate(samples):
             try:
@@ -246,7 +247,7 @@ def check_database_split_loading(
                 assert isinstance(data, torch.Tensor)
             except Exception as e:
                 logger.info(
-                    f"Found error loading entry {pos} in subset `{subset}`: {e}"
+                    f"Found error loading entry {pos} in dataset `{dataset}`: {e}"
                 )
                 errors += 1
     return errors
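A hedged sketch of how the checker might be driven from user code (the loader object and split path are placeholders, not part of this commit):

    import pathlib

    split = CSVDatabaseSplit(pathlib.Path("my-split"))
    # Check at most 10 samples per dataset; a limit of 0 checks everything.
    errors = check_database_split_loading(split, loader=my_raw_data_loader, limit=10)
    assert errors == 0, f"{errors} samples failed to load"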