TensorFlow Datasets: The Bad Parts

Random Access

from typing import Any, List

import torch.utils.data

class RandomAccessDataset(torch.utils.data.Dataset):
    def __init__(self, data: List) -> None:
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Any:
        return self.data[index]
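
For context, a minimal usage sketch (the data values and batch size are arbitrary, not from the original): because the dataset supports random access, PyTorch's DataLoader can shuffle and batch it without any extra buffering.

from torch.utils.data import DataLoader

dataset = RandomAccessDataset(data=list(range(10)))

# shuffle=True re-permutes the indices every epoch; no shuffle buffer is
# needed because any element can be fetched directly by index.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for batch in loader:
    print(batch)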

Sequential Access

from typing import Iterator, List

def sequential_dataset(data: List) -> Iterator:
    for item in data:
        yield item

import itertools

import tensorflow as tf

def gen():
    for i in itertools.count(1):
        yield (i, [1] * i)

dataset = tf.data.Dataset.from_generator(
    gen,
    (tf.int64, tf.int64),
    (tf.TensorShape([]), tf.TensorShape([None])))

Sequential Access in TensorFlow Datasets

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])

for element in dataset:
    print(element)
>>> tf.Tensor(1, shape=(), dtype=int32)
>>> tf.Tensor(2, shape=(), dtype=int32)
>>> tf.Tensor(3, shape=(), dtype=int32)

doubled = dataset.map(lambda x: x * 2)
for element in doubled:
    print(element)
>>> tf.Tensor(2, shape=(), dtype=int32)
>>> tf.Tensor(4, shape=(), dtype=int32)
>>> tf.Tensor(6, shape=(), dtype=int32)

dataset[0]
>>> TypeError: 'TensorSliceDataset' object does not support indexing

list(dataset.as_numpy_iterator())
>>> [1, 2, 3]

Data Shuffling
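
With a purely sequential API, shuffling has to go through a bounded in-memory buffer. A minimal sketch of tf.data's buffer-based shuffle (the dataset contents, buffer size, and seed are arbitrary):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Elements are only mixed within a sliding window of `buffer_size` items,
# so the result is only an approximate shuffle unless the buffer holds
# the entire dataset.
shuffled = dataset.shuffle(buffer_size=4, seed=42)

print(list(shuffled.as_numpy_iterator()))

A uniform shuffle therefore requires a buffer as large as the dataset itself, which is exactly what sequential access makes expensive for large datasets.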

Data Sharding

  1. You need to split your dataset into more files than there are workers in your distributed training job (a sketch of this file-based approach follows this list). If you have a large dataset stored in a small number of files, you’re out of luck. Moreover, any size imbalance between those files will produce stragglers, hurting training performance.
  2. More likely, you might not realize any of this! A lot of real-world data loading code just converts a Python generator into a TensorFlow Dataset using Dataset.from_generator(). This appears to work fine at small scale, but quickly runs into serious performance problems as your dataset grows.
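
A hedged sketch of the file-based sharding described in point 1; the file pattern, NUM_WORKERS, and WORKER_INDEX are placeholders that would come from your own storage layout and distributed training setup:

import tensorflow as tf

NUM_WORKERS = 4    # total workers in the training job (placeholder)
WORKER_INDEX = 0   # this worker's rank (placeholder)

# File-based sharding only balances well if there are many more files
# than workers and the files are roughly equal in size.
files = tf.data.Dataset.list_files("/data/train-*.tfrecord", shuffle=False)
dataset = files.shard(num_shards=NUM_WORKERS, index=WORKER_INDEX)
dataset = dataset.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)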

Saving and Restoring Iterator State
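
tf.data's answer here is to checkpoint the live iterator object with tf.train.Checkpoint. A hedged sketch (the dataset and checkpoint path are placeholders):

import tensorflow as tf

dataset = tf.data.Dataset.range(100)
iterator = iter(dataset)

# The iterator's current position is part of the checkpointed state.
ckpt = tf.train.Checkpoint(iterator=iterator)
manager = tf.train.CheckpointManager(ckpt, "/tmp/iterator_ckpt", max_to_keep=1)

print(next(iterator).numpy())  # 0
print(next(iterator).numpy())  # 1
manager.save()

print(next(iterator).numpy())  # 2
ckpt.restore(manager.latest_checkpoint)
print(next(iterator).numpy())  # 2 again: the iterator is rolled back

Note that iterators whose pipelines depend on external Python state, such as those built with Dataset.from_generator(), generally cannot be checkpointed this way.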

Solution

from typing import Iterator

dataset = RandomAccessDataset(data=[1, 2, 3])

def sequential_access_dataset() -> Iterator:
    for index in range(len(dataset)):
        yield dataset[index]
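
To connect this back to the problems above, here is a hedged sketch (the function name and parameters are illustrative, not from the original) of how shuffling, sharding, and checkpointing all reduce to index arithmetic once random access is available:

import random
from typing import Iterator

def shuffled_sharded_access(dataset, num_workers: int, worker_index: int,
                            seed: int = 0, start: int = 0) -> Iterator:
    # Shuffling: permute the indices up front; no shuffle buffer required.
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)

    # Sharding: each worker takes every num_workers-th index, so shards
    # stay balanced regardless of how the data is laid out in files.
    indices = indices[worker_index::num_workers]

    # Saving/restoring iterator state: the state is just (seed, position).
    for position in range(start, len(indices)):
        yield dataset[indices[position]]

Resuming after a failure only needs the seed and the last position reached, rather than a serialized iterator object.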
