TorchData 0.5.0 Release Notes
- Highlights
- Backwards Incompatible Change
- Deprecations
- New Features
- Improvements
- Bug Fixes
- Performance
- Documentation
- Future Plans
- Beta Usage Note
Highlights
We are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.
TorchData 0.5.0 updates are focused on consolidating the `DataLoader2` and `ReadingService` APIs and on benchmarking. Highlights include:
- Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure; a detailed tutorial can be found here (see the sketch after this list)
- AWS S3 benchmarking results
- Consolidated API for `DataLoader2` and provided a few `ReadingService`s, with detailed documentation now available here
- Provided more comprehensive `DataPipe` operations, e.g., `random_split`, `repeat`, `set_length`, and `prefetch`
- Provided pre-compiled torchdata binaries for arm64 Apple Silicon
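As a hedged illustration of the new cloud-storage support, the sketch below lists and opens objects from an S3 bucket through the fsspec-backed DataPipes; the bucket path is hypothetical, it assumes `fsspec` plus the matching filesystem package (`s3fs` for S3) are installed, and exact functional names may differ across versions.

```python
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical bucket path; requires `fsspec` plus the matching filesystem (`s3fs` for S3).
dp = IterableWrapper(["s3://my-bucket/my-dataset/"]).list_files_by_fsspec()
dp = dp.open_files_by_fsspec(mode="rb")
for path, stream in dp:
    print(path)  # each item is a (path, stream) pair
```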
Backwards Incompatible Change
DataPipe
Changed the returned value of `MapDataPipe.shuffle` to an `IterDataPipe` (pytorch/pytorch#83202). An `IterDataPipe` is used to preserve data order.
`MapDataPipe.shuffle`

0.4.1:
```python
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

0.5.0:
```python
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```
`on_disk_cache` no longer accepts generator functions for the `filepath_fn` argument (#810)
`on_disk_cache`

0.4.1:
```python
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
```

0.5.0:
```python
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
# AssertionError
```
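As a minimal migration sketch, a plain function that returns the path(s) instead of yielding them still works; that a list of paths is accepted here is an assumption based on the multi-file usage above.

```python
>>> def filepath_fn(url):
...     return [url + f"/{i}" for i in range(3)]  # return paths rather than yield them
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_fn)
```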
DataLoader2
Imposed the single iterator constraint on `DataLoader2` (#700)
`DataLoader2` with a single iterator

0.4.1:
```python
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2
```

0.5.0:
```python
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises exception, since it1 is no longer valid
```
Deep copy `DataPipe` during `DataLoader2` initialization or restoration (#786, #833)
Previously, if a `DataPipe` was passed to multiple `DataLoader2`s, its state could be altered by any of them. In some cases, that could raise an exception due to the single iterator constraint; in other cases, behavior could change due to the adapters (e.g., shuffling) of another `DataLoader2`.
Deep copy `DataPipe` during `DataLoader2` constructor

0.4.1:
```python
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
# RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...
```

0.5.0:
```python
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
0 0
1 1
2 2
3 3
4 4
```
Deprecations
DataPipe
Deprecated the `traverse` function and the `only_datapipe` argument (pytorch/pytorch#85667). Please use `traverse_dps`, whose behavior is the same as `traverse` with `only_datapipe=True` (#793).
`DataPipe` traverse function

0.4.1:
```python
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
```

0.5.0:
```python
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
```
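For reference, a minimal sketch of the replacement call, assuming `traverse_dps` lives alongside `traverse` in `torch.utils.data.graph`:

```python
>>> dp_graph = torch.utils.data.graph.traverse_dps(datapipe)  # same behavior as traverse(datapipe, only_datapipe=True)
```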
New Features
DataPipe
- Added AIStore DataPipe (#545, #667)
- Added support for `IterDataPipe` to trace DataFrames operations (pytorch/pytorch#71931)
- Added support for `DataFrameMakerIterDataPipe` to accept `dtype_generator` to solve unserializable `dtype` (#537)
- Added graph snapshotting by counting number of successful yields for `IterDataPipe` (pytorch/pytorch#79479, pytorch/pytorch#79657)
- Implemented `drop` operation for `IterDataPipe` to drop column(s) (#725)
- Implemented `FullSyncIterDataPipe` to synchronize distributed shards (#713)
- Implemented `slice` and `flatten` operations for `IterDataPipe` (#730)
- Implemented `repeat` operation for `IterDataPipe` (#748)
- Added `LengthSetterIterDataPipe` (#747)
- Added `RandomSplitter` (without buffer) (#724)
- Added `padden_tokens` to `max_token_bucketize` to bucketize samples based on total padded token length (#789)
- Implemented thread-based `PrefetcherIterDataPipe` (#770, #818, #826, #842); a sketch chaining several of these new operations follows this list
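Below is a minimal, hedged sketch chaining a few of the operations above; the toy data is made up, and exact keyword names may differ slightly across versions.

```python
from torchdata.datapipes.iter import IterableWrapper

# Toy source of (id, value) rows
dp = IterableWrapper([(i, i * 10) for i in range(6)])
dp = dp.drop(0)          # drop the first "column" of each row (#725)
dp = dp.repeat(2)        # yield each element twice in a row (#748)
dp = dp.set_length(12)   # declare a length without exhausting the pipe (#747)
dp = dp.prefetch(4)      # prefetch elements in a background thread (#770)

# RandomSplitter (#724): split a pipe without buffering the whole dataset
train, valid = IterableWrapper(range(10)).random_split(
    total_length=10, weights={"train": 0.8, "valid": 0.2}, seed=0
)
```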
DataLoader2
- Added `CacheTimeout` `Adapter` to redefine cache timeout of the `DataPipe` graph (#571)
- Added `DistributedReadingService` to support uneven data sharding (#727)
- Added `PrototypeMultiProcessingReadingService` (see the sketch after this list)
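As a hedged sketch of the consolidated API, the example below wires a `DataPipe` to `DataLoader2` through `PrototypeMultiProcessingReadingService`; constructor arguments beyond `num_workers` are omitted and may vary by version.

```python
from torchdata.dataloader2 import DataLoader2, PrototypeMultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

# Shuffle and shard so each worker sees a distinct slice of the data
dp = IterableWrapper(range(10)).shuffle().sharding_filter()
rs = PrototypeMultiProcessingReadingService(num_workers=2)
dl = DataLoader2(dp, reading_service=rs)
for x in dl:
    print(x)
dl.shutdown()  # release worker processes
```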
Releng
- Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)
Improvements
DataPipe
- Fixed error message coming from the single iterator constraint (pytorch/pytorch#79547)
- Enabled profiler record context in `__next__` for `IterDataPipe` (pytorch/pytorch#79757)
- Raised warning for unpicklable local function (pytorch/pytorch#80232, #547)
- Cleaned up opened streams on a best-effort basis (#560, pytorch/pytorch#78952)
- Used streaming reading mode for unseekable streams in `TarArchiveLoader` (#653)
- Improved GDrive 'content-disposition' error message (#654)
- Added `as_tuple` argument for `CSVParserIterDataPipe` to convert output from list to tuple (#646; see the sketch after this list)
- Raised an error when `HTTPReader` gets a 404 response (#160) (#569)
- Added default no-op behavior for `flatmap` (#749)
- Added support to validate `input_col` with the provided map function for `DataPipe` (pytorch/pytorch#80267, #755, pytorch/pytorch#84279)
- Made `ShufflerIterDataPipe` support snapshotting (#83535)
- Unified the implementations of `in_batch_shuffle` and `shuffle` for `IterDataPipe` (#745)
- Made `IterDataPipe.to_map_datapipe` load data lazily (#765)
- Added `kwargs` to open files for `FSSpecFileLister` and `FSSpecSaver` (#804)
- Added missing functional name for `FileLister` (#86497)
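A minimal sketch of the new `as_tuple` flag, assuming a local `data.csv` file exists; `parse_csv` is the functional form of `CSVParserIterDataPipe`:

```python
from torchdata.datapipes.iter import FileOpener, IterableWrapper

dp = FileOpener(IterableWrapper(["data.csv"]), mode="rt")
rows = dp.parse_csv(as_tuple=True)  # rows are yielded as tuples instead of lists
for row in rows:
    print(row)
```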
DataLoader
- Controlled shuffle option to all `DataPipes` with the `set_shuffle` API (pytorch/pytorch#83741); see the sketch after this list
- Made distributed process group lazily initialized & share seed via the process group (pytorch/pytorch#85279)
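A hedged sketch of the shuffle-control API; it assumes `set_shuffle` (and a companion `set_seed`) are exposed as methods on the shuffled `DataPipe` and return the pipe itself.

```python
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10)).shuffle()
dp = dp.set_shuffle(False)  # keep the Shuffler in the graph but disable shuffling
dp = dp.set_seed(42)        # assumed companion setter for the shuffle seed
print(list(dp))             # with shuffling disabled, yields 0..9 in order
```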
DataLoader2
- Improved graph traverse function
  - Added support for unhashable `DataPipe` (pytorch/pytorch#80509, #559)
  - Added support for all Python collection objects (pytorch/pytorch#84079, #773)
- Ensured `finalize` and `finalize_iteration` are called during shutdown or exception (#846)
Releng
- Enabled conda release to support GLIBC_2.27 (#859)
Bug Fixes
DataPipe
- Fixed error for static typing (#572, #645, #651, pytorch/pytorch#81275, #758)
- Fixed `fork` and `unzip` operations for the case of a single child (pytorch/pytorch#81502)
- Corrected the type of exception raised by `ShufflerMapDataPipe` (pytorch/pytorch#82666)
- Fixed buffer overflow for `unzip` when `columns_to_skip` is specified (#658)
- Fixed `TarArchiveLoader` to skip `open` for opened TarFile stream (#679)
- Fixed mishandling of exception message in `IterDataPipe` (pytorch/pytorch#84676)
- Fixed interface generation in setup.py (#87081)
Performance
DataLoader2
- Added benchmarking for `DataLoader2`
Documentation
DataPipe
- Added examples for data loading with `DataPipe`
- Improved docstrings for `DataPipe`
  - `DataPipe` converters (#710)
  - S3 `DataPipe` (#784)
  - `FileOpenerIterDataPipe` (pytorch/pytorch#81407)
  - `buffer_size` for `MaxTokenBucketizer` (#834)
  - `Prefetcher` (#835)
- Added tutorial to load from cloud storage providers, including AWS S3, Google Cloud Platform, and Azure Blob Storage (#812, #836)
- Improved tutorial
- Simplified long type names for online docs (#838)
DataLoader2
- Improved docstring for `DataLoader2` (#581, #817)
- Added training examples using `DataLoader2`, `ReadingService`, and `DataPipe` (#563, #664, #670, #787)
Releng
- Added contribution guide for third-party library (#663)
Future Plans
We will continue benchmarking over datasets on local disk and cloud storage using TorchData. We will also keep making `DataLoader2` and the related `ReadingService`s more stable, and provide more features such as snapshotting the data pipeline and restoring it from a serialized state. Stay tuned, and we welcome any feedback.
Beta Usage Note
This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue. We'd love to hear thoughts and feedback.