Public datasets fuel the machine learning research rocket (h/t Andrew Ng), but it’s still too difficult to simply get those datasets into your machine learning pipeline. Every researcher goes through the pain of writing one-off scripts to download and prepare every dataset they work with, which all have different source formats and complexities. Not anymore.
Today, we’re pleased to introduce TensorFlow Datasets (GitHub), which exposes public research datasets as
[tf.data.Datasets](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) and as NumPy arrays. It does all the grungy work of fetching the source data and preparing it into a common format on disk, and it uses the
[tf.data API](https://www.tensorflow.org/guide/datasets) to build high-performance input pipelines, which are TensorFlow 2.0-ready and can be used with
tf.keras models. We’re launching with 29 popular research datasets such as MNIST, Street View House Numbers, the 1 Billion Word Language Model Benchmark, and the Large Movie Review Dataset, and will add more in the months to come; we hope that you join in and add a dataset yourself.

    # Install: pip install tensorflow-datasets
    import tensorflow as tf
    import tensorflow_datasets as tfds

    mnist_data = tfds.load("mnist")
    mnist_train, mnist_test = mnist_data["train"], mnist_data["test"]
    assert isinstance(mnist_train, tf.data.Dataset)

Try tfds out in a Colab notebook.
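Every dataset is exposed as a DatasetBuilder, which knows how to download the source data and prepare it on disk (download_and_prepare), how to load it from disk (as_dataset), and how to describe the dataset, such as its features and splits (info).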
You can directly instantiate any of the DatasetBuilders or fetch them by string with tfds.builder:

    import numpy as np
    import tensorflow as tf
    import tensorflow_datasets as tfds

    # Fetch the dataset directly
    mnist = tfds.image.MNIST()
    # or by string name
    mnist = tfds.builder('mnist')

    # Describe the dataset with DatasetInfo
    assert mnist.info.features['image'].shape == (28, 28, 1)
    assert mnist.info.features['label'].num_classes == 10
    assert mnist.info.splits['train'].num_examples == 60000

    # Download the data, prepare it, and write it to disk
    mnist.download_and_prepare()

    # Load data from disk as tf.data.Datasets
    datasets = mnist.as_dataset()
    train_dataset, test_dataset = datasets['train'], datasets['test']
    assert isinstance(train_dataset, tf.data.Dataset)

    # And convert the Dataset to NumPy arrays if you'd like
    for example in tfds.as_numpy(train_dataset):
      image, label = example['image'], example['label']
      # np.ndarray is the array type; np.array is only the constructor function
      assert isinstance(image, np.ndarray)

as_dataset() accepts a batch_size argument, which will give you batches of examples instead of one example at a time. For small datasets that fit in memory, you can pass batch_size=-1 to get the entire dataset at once as a tf.Tensor. tf.data.Datasets can easily be converted to iterables of NumPy arrays using tfds.as_numpy().
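To make the batch_size convention concrete, here is a minimal pure-Python sketch. It is not how as_dataset() is implemented: `toy_batches` is a hypothetical helper, and keeping a short final batch is an assumption of the sketch rather than a documented guarantee.

```python
from typing import Iterator, List, Sequence


def toy_batches(examples: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Hypothetical helper: yield lists of examples, batch_size at a time.

    batch_size=-1 yields the whole dataset as one batch, mirroring the
    convention described above. Keeping a short final batch is an
    assumption of this sketch.
    """
    if batch_size == -1:
        yield list(examples)
        return
    if batch_size < 1:
        raise ValueError("batch_size must be -1 or a positive integer")
    for start in range(0, len(examples), batch_size):
        yield list(examples[start:start + batch_size])


# Five made-up examples standing in for dataset records.
examples = [{"image": i, "label": i % 2} for i in range(5)]
small = list(toy_batches(examples, batch_size=2))
everything = list(toy_batches(examples, batch_size=-1))
```

With five examples, `small` holds three batches and `everything` holds a single batch containing all five, which is only practical when the dataset fits in memory.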
As a convenience, you can do all the above with [tfds.load](https://www.tensorflow.org/datasets/api_docs/python/tfds/load), which fetches the DatasetBuilder by name, calls download_and_prepare(), and calls as_dataset().

    import tensorflow as tf
    import tensorflow_datasets as tfds

    datasets = tfds.load("mnist")
    train_dataset, test_dataset = datasets["train"], datasets["test"]
    assert isinstance(train_dataset, tf.data.Dataset)

You can also easily get the
[DatasetInfo](https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetInfo) object from
tfds.load by passing
with_info=True. See the API documentation for all the options.
Every dataset is versioned (
builder.info.version) so that you can rest assured that the data doesn’t change underneath you and that results are reproducible. For now, we guarantee that if the data changes, the version will be incremented.
Note that while we do guarantee the data values and splits are identical given the same version, we do not currently guarantee the ordering of records for the same version.
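Because record order is not guaranteed within a version, any check that two runs saw the same data should ignore ordering. Below is a minimal sketch of such a check using only the standard library. `dataset_fingerprint` is a hypothetical helper, not part of tensorflow-datasets, and it assumes each record can be serialized to JSON.

```python
import hashlib
import json
from typing import Iterable, Mapping


def dataset_fingerprint(records: Iterable[Mapping]) -> str:
    """Hypothetical helper: hash a collection of records, ignoring their order."""
    # Hash each record on its own, with keys sorted so dict ordering is irrelevant.
    record_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    # Sorting the per-record hashes makes the final digest order-insensitive
    # while still counting duplicate records.
    return hashlib.sha256("".join(record_hashes).encode("utf-8")).hexdigest()


# Made-up records for illustration.
run_a = [{"text": "great movie", "label": 1}, {"text": "dull", "label": 0}]
run_b = list(reversed(run_a))  # same records, different order
changed = [{"text": "great movie", "label": 0}, {"text": "dull", "label": 0}]

assert dataset_fingerprint(run_a) == dataset_fingerprint(run_b)
assert dataset_fingerprint(run_a) != dataset_fingerprint(changed)
```

Recording a fingerprint like this alongside the dataset version is one way to confirm that the values are unchanged even when the iteration order differs.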
Datasets with different variants are configured with named BuilderConfigs. For example, the Large Movie Review Dataset (
[tfds.text.IMDBReviews](https://www.tensorflow.org/datasets/datasets#imdb_reviews)) could have different encodings for the input text (for example, plain text, or a character encoding, or a subword encoding). The built-in configurations are listed with the dataset documentation and can be addressed by string, or you can pass in your own configuration.

    import tensorflow_datasets as tfds

    # See the built-in configs
    configs = tfds.text.IMDBReviews.builder_configs
    assert "bytes" in configs

    # Address a built-in config with tfds.builder
    imdb = tfds.builder("imdb_reviews/bytes")
    # or when constructing the builder directly
    imdb = tfds.text.IMDBReviews(config="bytes")
    # or use your own custom configuration
    my_encoder = tfds.features.text.ByteTextEncoder(additional_tokens=['hello'])
    my_config = tfds.text.IMDBReviewsConfig(
        name="my_config",
        version="1.0.0",
        text_encoder_config=tfds.features.text.TextEncoderConfig(encoder=my_encoder),
    )
    imdb = tfds.text.IMDBReviews(config=my_config)

See the section on dataset configuration in our documentation on adding a dataset.
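The idea of addressing a configuration either by string or by passing your own object can be sketched in plain Python. This is an illustration only and does not reflect how tensorflow-datasets resolves configs internally: `ToyConfig`, `BUILT_IN_CONFIGS`, `split_name`, `resolve_config`, and the version strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for a BuilderConfig."""
    name: str
    version: str
    encoding: str


# Hypothetical built-in configurations for one dataset.
BUILT_IN_CONFIGS: Dict[str, ToyConfig] = {
    c.name: c
    for c in (
        ToyConfig(name="plain_text", version="0.0.1", encoding="text"),
        ToyConfig(name="bytes", version="0.0.1", encoding="bytes"),
    )
}


def split_name(full_name: str) -> Tuple[str, Optional[str]]:
    """Split "dataset/config" into its parts; the config part is optional."""
    dataset, _, config = full_name.partition("/")
    return dataset, (config or None)


def resolve_config(config: Union[str, ToyConfig]) -> ToyConfig:
    """Accept either a built-in config name or a user-supplied config object."""
    if isinstance(config, ToyConfig):
        return config
    if config not in BUILT_IN_CONFIGS:
        raise ValueError(
            f"Unknown config {config!r}; choose from {sorted(BUILT_IN_CONFIGS)}"
        )
    return BUILT_IN_CONFIGS[config]


dataset, config_name = split_name("imdb_reviews/bytes")
by_name = resolve_config(config_name)
custom = resolve_config(ToyConfig(name="my_config", version="1.0.0", encoding="bytes"))
```

Giving every config a name and a version keeps a string such as "imdb_reviews/bytes" unambiguous, while the object form leaves room for variants the dataset author did not anticipate.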
Text datasets can often be painful to work with because of different encodings and vocabulary files.
tensorflow-datasets makes it much easier. It’s shipping with many text tasks and includes three kinds of TextEncoders, all of which support Unicode:
[ByteTextEncoder](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/ByteTextEncoder) for byte/character-level encodings
[TokenTextEncoder](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/TokenTextEncoder) for word-level encodings based on a vocabulary file
[SubwordTextEncoder](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder) for subword-level encodings (and the ability to construct the subword vocabulary tuned to a particular text corpus) with a byte-level fallback so that it’s fully invertible. For example, “hello world” could get split into [“he”, “llo”, “ ”, “wor”, “ld”] and then integer-encoded. Subwords are a happy medium between word-level and byte-level encodings and are popular in some natural language research projects.
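The subword idea with a byte-level fallback can be sketched in a few lines of plain Python. This is a toy greedy longest-match encoder written for illustration; it is not the algorithm or the id layout used by SubwordTextEncoder, and `ToySubwordEncoder` is a hypothetical name.

```python
from typing import List, Sequence


class ToySubwordEncoder:
    """Toy greedy subword encoder with a byte-level fallback (illustration only)."""

    def __init__(self, subwords: Sequence[str]):
        # Drop empty pieces and put longer pieces first so greedy matching prefers them.
        self._subwords = sorted({s for s in subwords if s}, key=lambda s: (-len(s), s))
        self._subword_to_id = {s: i for i, s in enumerate(self._subwords)}
        # Ids past the subword range represent raw bytes 0..255.
        self._byte_offset = len(self._subwords)

    @property
    def vocab_size(self) -> int:
        return self._byte_offset + 256

    def encode(self, text: str) -> List[int]:
        ids: List[int] = []
        pos = 0
        while pos < len(text):
            match = next((s for s in self._subwords if text.startswith(s, pos)), None)
            if match is not None:
                ids.append(self._subword_to_id[match])
                pos += len(match)
            else:
                # Fallback: emit the UTF-8 bytes of one character.
                ids.extend(self._byte_offset + b for b in text[pos].encode("utf-8"))
                pos += 1
        return ids

    def decode(self, ids: Sequence[int]) -> str:
        out = bytearray()
        for i in ids:
            if i < self._byte_offset:
                out.extend(self._subwords[i].encode("utf-8"))
            else:
                out.append(i - self._byte_offset)
        return out.decode("utf-8")


encoder = ToySubwordEncoder(["he", "llo", " ", "wor", "ld"])
ids = encoder.encode("hello world")
assert encoder.decode(ids) == "hello world"

# Characters outside the subword vocabulary fall back to bytes, so the
# round trip still recovers the original text.
assert encoder.decode(encoder.encode("héllo wörld!")) == "héllo wörld!"
```

The byte fallback is what makes the encoding invertible: any string can be encoded, even if none of its characters appear in the subword vocabulary.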
The encoders, along with their vocabulary sizes, can be accessed through the DatasetInfo:

    import tensorflow_datasets as tfds

    imdb = tfds.builder("imdb_reviews/subwords8k")

    # Get the TextEncoder from DatasetInfo
    encoder = imdb.info.features["text"].encoder
    assert isinstance(encoder, tfds.features.text.SubwordTextEncoder)

    # Encode, decode
    ids = encoder.encode("Hello world")
    assert encoder.decode(ids) == "Hello world"

    # Get the vocabulary size
    vocab_size = encoder.vocab_size

Both TensorFlow and TensorFlow Datasets will be working to improve text support even further in the future.
Our documentation site is the best place to start using
tensorflow-datasets. Here are some additional pointers for getting started:
We expect to be adding datasets in the coming months, and we hope that the community will join in. Open a GitHub Issue to request a dataset, vote on which datasets should be added next, discuss implementation, or ask for help. And Pull Requests are very welcome! Add a popular dataset to contribute to the community, or if you have your own data, contribute it to TFDS to make your data famous!
Now that data is easy, happy modeling!
TensorFlow Datasets was a team effort. Our core developers are Etienne Pot,
Afroz Mohiuddin, Pierre Ruyssen, Marcin Michalski, and Ryan Sepassi. We’d
also like to thank Jiri Simsa for his help with tf.data, and Martin Wicke
for his support of the project. Thanks all!
We’d like to thank Stefan Webb of Oxford for allowing us to use the
tensorflow-datasets PyPI name. Thanks Stefan!
We’d also like to thank Lukasz Kaiser and the Tensor2Tensor project for inspiring and guiding tensorflow/datasets. Thanks Lukasz! T2T will be migrating to tensorflow/datasets soon.