grain.sources module
APIs for reading data from various file formats.
List of Members
- class grain.sources.RandomAccessDataSource(*args, **kwargs)
Interface for datasets where storage supports efficient random access.
This Protocol defines the contract for any custom data source injected into the PyGrain pipeline. Implementations do not need to inherit from this class directly; they only need to implement the required structural methods (__len__ and __getitem__).
Notes:
Checkpointing: When the source is used with DataLoader, __repr__ must also be implemented to support checkpointing.
Multiprocessing: When the source is used with multiprocessing, the instance must be fully picklable.
Example
Implementing a minimal, checkpoint-safe custom data source:
from grain.sources import RandomAccessDataSource


class MyInMemorySource:

    def __init__(self, data: list):
        self._data = data

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int):
        return self._data[index]

    def __repr__(self) -> str:
        # Required for PyGrain checkpointing with DataLoader
        return f"MyInMemorySource(size={len(self)})"


source = MyInMemorySource(["a", "b", "c"])
# source satisfies the RandomAccessDataSource protocol.
assert isinstance(source, RandomAccessDataSource)
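The multiprocessing note above can be checked before a source is handed to worker processes. The following is a minimal sketch using only the standard pickle module; is_picklable and LambdaSource are hypothetical names introduced for illustration, not part of grain, and the check assumes that a pickle round trip is a reasonable proxy for what multiprocessing needs.

```python
import pickle


class MyInMemorySource:
    """Hypothetical source mirroring the example above."""

    def __init__(self, data: list):
        self._data = data

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int):
        return self._data[index]

    def __repr__(self) -> str:
        return f"MyInMemorySource(size={len(self)})"


class LambdaSource(MyInMemorySource):
    """Holds a lambda attribute, which the standard pickle module cannot serialize."""

    def __init__(self, data: list):
        super().__init__(data)
        self._transform = lambda x: x


def is_picklable(source) -> bool:
    """Returns True if `source` survives a pickle round trip with the same length."""
    try:
        clone = pickle.loads(pickle.dumps(source))
    except Exception:
        return False
    return len(clone) == len(source)


picklable = is_picklable(MyInMemorySource(["a", "b", "c"]))
not_picklable = is_picklable(LambdaSource(["a", "b", "c"]))
```

Attributes such as lambdas, open file handles, or locks are common reasons for a source to fail this kind of check; creating them lazily on first access is one way to keep the instance picklable.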
- __getitem__(index)
Returns the value for the given index.
This method must be thread-safe and deterministic.
Note that a number of sources take SupportsIndex instead of int for index. Such sources will still support int index and pass the isinstance check with this protocol, but all new source implementations should use int directly.
- Parameters:
index (int) – An integer in [0, len(self)-1].
- Returns:
The corresponding record. File data sources often return the raw bytes but records can be any Python object.
- Return type:
T
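To illustrate the SupportsIndex remark above, here is a small sketch of a source that accepts any SupportsIndex value and normalizes it to int with operator.index. The names LegacyStyleSource and BoxedIndex are hypothetical, and rejecting indices outside [0, len(self)-1] (including negative ones) is an assumption made here to match the documented range, not confirmed library behavior.

```python
import operator
from typing import SupportsIndex


class LegacyStyleSource:
    """Hypothetical source whose __getitem__ accepts any SupportsIndex."""

    def __init__(self, records: list):
        # Records are only read after construction, so concurrent
        # __getitem__ calls do not mutate shared state.
        self._records = list(records)

    def __len__(self) -> int:
        return len(self._records)

    def __getitem__(self, index: SupportsIndex):
        # operator.index() converts any object implementing __index__ to int.
        i = operator.index(index)
        if not 0 <= i < len(self):
            raise IndexError(f"index {i} is outside [0, {len(self) - 1}]")
        return self._records[i]


class BoxedIndex:
    """Hypothetical integer-like object that implements __index__."""

    def __init__(self, value: int):
        self._value = value

    def __index__(self) -> int:
        return self._value


legacy = LegacyStyleSource([b"a", b"b", b"c"])
from_int = legacy[1]
from_boxed = legacy[BoxedIndex(2)]
```

Because the lookup depends only on the index and on data fixed at construction time, repeated calls with the same index return the same record, which is the determinism the method contract asks for.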
- class grain.sources.ArrayRecordDataSource(*args, **kwargs)
Data source for ArrayRecord files.
- Parameters:
paths (array_record.python.array_record_data_source.PathLikeOrFileInstruction | Sequence[array_record.python.array_record_data_source.PathLikeOrFileInstruction])
reader_options (dict[str, str] | None)
- __init__(paths, reader_options=None)
Creates a new ArrayRecordDataSource object.
See array_record.ArrayRecordDataSource for more details.
- Parameters:
paths (array_record.python.array_record_data_source.PathLikeOrFileInstruction | Sequence[array_record.python.array_record_data_source.PathLikeOrFileInstruction]) – A single path/FileInstruction or list of paths/FileInstructions.
reader_options (dict[str, str] | None) – A dict[str, str] passed when creating a reader. For example, {"index_storage_option": "in_memory"} keeps the reader indices in memory, whereas {"index_storage_option": "offloaded"} stores the indices on disk to reduce memory usage.
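The constructor parameters above might be used as in the following sketch. The helper names make_reader_options and open_array_record, as well as the file path in the final comment, are hypothetical; only the ArrayRecordDataSource(paths, reader_options=None) signature and the two index_storage_option values come from the description above.

```python
def make_reader_options(offload_index: bool) -> dict:
    """Builds a reader_options dict using the two values described above."""
    mode = "offloaded" if offload_index else "in_memory"
    return {"index_storage_option": mode}


def open_array_record(paths, offload_index: bool = False):
    """Opens ArrayRecord files; needs grain installed and files that exist."""
    # Imported lazily so make_reader_options stays usable without grain.
    from grain.sources import ArrayRecordDataSource

    return ArrayRecordDataSource(
        paths, reader_options=make_reader_options(offload_index)
    )


in_memory_options = make_reader_options(offload_index=False)
offloaded_options = make_reader_options(offload_index=True)

# Hypothetical path; replace it with a real ArrayRecord file before calling.
# source = open_array_record("/tmp/data.array_record", offload_index=True)
```

Offloading the index trades some lookup speed for lower memory use, which may matter when many files are opened in each worker process.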
- class grain.sources.InMemoryDataSource
Simple in-memory data source for sequences that is sharable among multiple processes.
Note
This constrains storable values to the int, float, bool, str (less than 10M bytes each), bytes (less than 10M bytes each), and None built-in data types. It also differs notably from the built-in list type in that these lists cannot change their overall length (i.e. no append, insert, etc.).
- Parameters:
elements (Sequence[Any] | None)
name (str | None)
Creates a new InMemoryDataSource object.
- Parameters:
elements (Sequence[Any] | None) – The elements for the sharable list.
name (str | None) – The name of the datasource.
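The storable-value constraints noted above can be checked up front, before elements are handed to the data source. This is a sketch only and not the library's own validation: find_unsupported is a hypothetical helper, "10M bytes" is assumed to mean 10,000,000 bytes, and str size is assumed to be measured after UTF-8 encoding.

```python
from typing import Any, Sequence

# Assumption: "10M bytes" is read as 10,000,000 bytes.
MAX_BYTES = 10_000_000
# Exact built-in types only; subclasses are treated as unsupported here.
ALLOWED_TYPES = (int, float, bool, str, bytes, type(None))


def find_unsupported(elements: Sequence[Any]) -> list:
    """Returns the positions of elements that violate the documented constraints."""
    bad = []
    for i, value in enumerate(elements):
        if type(value) not in ALLOWED_TYPES:
            bad.append(i)
        elif isinstance(value, str) and len(value.encode("utf-8")) >= MAX_BYTES:
            # Assumption: str size is measured in UTF-8 encoded bytes.
            bad.append(i)
        elif isinstance(value, bytes) and len(value) >= MAX_BYTES:
            bad.append(i)
    return bad


all_supported = find_unsupported([1, 2.5, True, "x", b"y", None])
some_unsupported = find_unsupported([1, [2], {"a": 1}, "ok"])
```

Nested containers such as lists or dicts would need to be serialized to bytes or str first if they are to be stored under these constraints.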