Datasets for Machine Learning

August 15th 2016 by Tine

An important step in machine learning is creating or finding suitable data for training and testing an algorithm. Working with a good dataset helps you avoid or notice errors in your algorithm and improves the results of your application. Because creating your own dataset is very time-consuming in most cases, in this article I will present some useful datasets for text classification and image classification problems.

Text classification

In the following sections you will find datasets that can be used for common text classification tasks such as spam detection, sentiment analysis and the classification of documents by subject.

• Spam - Non Spam

The task of spam filtering is very common in text classification. Therefore, a lot of datasets can be found for this purpose.

SMS Spam Corpus The SMS Spam Corpus consists of text messages belonging to one of two classes: each message is labeled as either spam or ham. The set can be downloaded in a big version (1002 ham, 322 spam) or a small version (1002 ham, 82 spam).
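Both versions of the corpus are imbalanced, so plain accuracy can look better than it is. The following minimal sketch computes the accuracy of a classifier that always predicts the most frequent class. The function name is made up for illustration, and reading the small version as 1002 ham and 82 spam messages is my assumption.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("class_counts must contain at least one example")
    return max(class_counts.values()) / total


# Class counts of the two corpus versions (small version: assumed 1002 ham).
big = {"ham": 1002, "spam": 322}
small = {"ham": 1002, "spam": 82}

print(f"big:   {majority_baseline_accuracy(big):.3f}")
print(f"small: {majority_baseline_accuracy(small):.3f}")
```

A spam filter only starts to be useful once it clearly beats these numbers, which is roughly 76% for the big and 92% for the small version.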

Enron Dataset If you want to have a look at spam filtering in emails instead, you might be interested in the Enron dataset, which provides a collection of thousands of mails, classified as spam or ham. It can be downloaded in a raw or preprocessed version.

Other datasets for spam classification in mails that might be interesting for you are the SpamAssassin public mail corpus, the TREC Public Spam Corpus and the Spambase Data Set.

• Sentiment Analysis

Another task that can be solved by machine learning is sentiment analysis of texts. An example of this task is finding out whether a text states a positive or negative opinion about a certain subject.

Twitter Sentiment Analysis Training Corpus In case you’re interested in tweet sentiment classification, the Twitter Sentiment Analysis Training Corpus might be the dataset you’re looking for. It consists of more than 1 million tweets in a .csv file. Each element is labeled as either positive (1) or negative (0).
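Reading such a labeled .csv file could look roughly like the sketch below. The column names `Sentiment` and `SentimentText` are assumptions on my side, so check the header of the file you downloaded and adjust them. The two sample rows are made up and only stand in for the real data.

```python
import csv
import io


def load_labeled_tweets(fileobj, text_col="SentimentText", label_col="Sentiment"):
    """Return a list of (text, label) pairs; 1 is positive, 0 is negative."""
    pairs = []
    for row in csv.DictReader(fileobj):
        label = int(row[label_col])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        pairs.append((row[text_col].strip(), label))
    return pairs


# Made-up sample in place of the real file (column names are assumed).
sample = io.StringIO(
    "ItemID,Sentiment,SentimentText\n"
    "1,0,is so sad for my friend\n"
    '2,1,"what a great day, loving it"\n'
)

tweets = load_labeled_tweets(sample)
positive = sum(label for _, label in tweets)
print(f"{len(tweets)} tweets, {positive} positive")
```

For the real file you would pass something like `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.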

Movie Review Data More complex texts can be found in the Movie Review Data, which provides a collection of 1,000 positive and 1,000 negative movie comments. The comments are available as unprocessed .html files and as processed texts. Part of this dataset is also a collection of sentences labeled as subjective or objective.

A list of more useful datasets for sentiment classification was put together in this blog post by Kavita Ganesan.

• Classification by subject

Classifying documents by their subject is a complex problem. Depending on the kinds of documents you want to work with, you will need a dataset that matches that exact case. A frequently studied case is the classification of newspaper articles.

20 Newsgroups The 20 Newsgroups dataset contains around 20,000 documents which are almost evenly distributed over 20 categories. The data is split into a train and test set. Some of the newsgroups are closely related, while others have nothing to do with each other. The groups in the dataset are the following:

Organization of the 20 newsgroups dataset, Source: http://qwone.com/~jason/20Newsgroups/
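If you want to work with the group names in code, a small sketch like the following might help. The 20 names are the newsgroup names of the dataset. Grouping them by their first name component is only my rough stand-in for "closely related" and does not exactly match the organization shown on the dataset page.

```python
from collections import defaultdict

NEWSGROUPS = [
    "alt.atheism",
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "comp.windows.x",
    "misc.forsale",
    "rec.autos",
    "rec.motorcycles",
    "rec.sport.baseball",
    "rec.sport.hockey",
    "sci.crypt",
    "sci.electronics",
    "sci.med",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
]


def group_by_top_level(names):
    """Group newsgroup names by the part before the first dot."""
    grouped = defaultdict(list)
    for name in names:
        grouped[name.split(".", 1)[0]].append(name)
    return dict(grouped)


groups = group_by_top_level(NEWSGROUPS)
for prefix, members in sorted(groups.items()):
    print(f"{prefix}: {len(members)}")
```

Telling two comp.* groups apart is usually harder than telling a comp.* group from a rec.* group, which makes such a grouping handy when you look at a confusion matrix.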

Reuters-21578 A dataset that is often used for evaluating text classification algorithms is the Reuters-21578 dataset. It consists of texts that appeared in the Reuters newswire in 1987 and was put together by Reuters Ltd. staff. Often only subsets of this dataset are used as the documents are not evenly distributed over the categories. In many cases only the 10 or 90 categories with the most documents are used.
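Restricting a dataset to its most frequent categories can be sketched as follows. The helper names and the toy documents are made up for illustration. The sketch also assumes that every document has exactly one label, which is not true for all documents in the original Reuters-21578 collection.

```python
from collections import Counter


def top_k_categories(doc_labels, k):
    """Return the k categories with the most documents, most frequent first."""
    counts = Counter(doc_labels)
    return [category for category, _ in counts.most_common(k)]


def filter_to_categories(docs, doc_labels, keep):
    """Keep only the (document, label) pairs whose label is in keep."""
    keep = set(keep)
    return [(doc, label) for doc, label in zip(docs, doc_labels) if label in keep]


# Toy data standing in for the real documents and their categories.
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
labels = ["earn", "earn", "acq", "earn", "acq", "cocoa"]

top = top_k_categories(labels, 2)
subset = filter_to_categories(docs, labels, top)
print(top, len(subset))
```

With k set to 10 or 90 on the real labels, this corresponds to the subsets mentioned above.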

A very helpful collection of single-labeled text datasets is provided on Ana Cardoso Cachopo’s Homepage. Not only will you find an overview of useful data, but also human-readable and preprocessed versions of the datasets, which might save you a lot of time and trouble.

Image classification

In the following sections we will introduce some datasets that you might find useful if you want to use machine learning for image classification. The listed datasets range from simple handwritten numbers to images of complex objects and might be useful for getting started with image classification or testing your algorithm.

• Numbers and Letters

MNIST The MNIST dataset is a commonly used set for getting started with image classification. It contains 70,000 labeled grayscale images of handwritten digits from 0 to 9, each 28x28 pixels in size, split into a training set of 60,000 and a test set of 10,000 images. The set can be downloaded from Yann LeCun’s website in the IDX file format. If you want to work with the data as images in the png format, you can find a converted version here.

Excerpt of the MNIST dataset
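The IDX format is simple enough to read by hand. As far as I understand it, a file starts with two zero bytes, one byte for the data type (0x08 for unsigned bytes) and one byte for the number of dimensions, followed by one big-endian 32-bit integer per dimension and then the raw data. The sketch below only handles unsigned bytes, and the function name is made up.

```python
import struct


def parse_idx(data):
    """Parse IDX bytes with unsigned-byte data into (dims, flat list of values)."""
    zero, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0:
        raise ValueError("not an IDX file: the first two bytes must be zero")
    if dtype_code != 0x08:
        raise ValueError("this sketch only handles unsigned-byte data (0x08)")
    header_end = 4 + 4 * ndim
    dims = struct.unpack(f">{ndim}I", data[4:header_end])
    count = 1
    for dim in dims:
        count *= dim
    payload = data[header_end:]
    if len(payload) != count:
        raise ValueError("payload size does not match the header")
    return dims, list(payload)


# Synthetic file: two 2x2 "images" instead of the real 28x28 digits.
raw = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
dims, values = parse_idx(raw)
first_image = values[: dims[1] * dims[2]]
print(dims, first_image)
```

The downloadable files are gzip-compressed, so for the real data you would read the bytes with something like `gzip.open(path, "rb").read()` before parsing them.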

Chars74K Another task that can be solved by machine learning is character recognition. For this purpose the Chars74K dataset can be used for training and testing. It contains more than 74,000 images of letters and numbers which are categorized into 64 different classes. The characters are handwritten, obtained from natural images or taken from computer fonts. Due to the larger number of classes and the fact that the data is available as color images, this dataset is a lot more complex than the MNIST set.

Excerpt of the Chars74K dataset, Source: http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/Samples/english.png

• Faces

Frontal Face Images The dataset for Frontal Face Images was created for evaluating applications for frontal face recognition in images. It contains images of humans and information about the location of their faces in the pictures given by the x and y coordinates. You can download the set here.

Examples from the Frontal Faces dataset

Labeled Faces in the Wild A commonly used set for face recognition is the Labeled Faces in the Wild dataset. It holds more than 13,000 images of faces that were collected from the web. Many of the people in the set are represented by more than one picture, which is useful for evaluating face recognition.

Examples from the Labeled Faces in the Wild dataset

• Animals

Oxford-IIIT Pet Dataset If you are looking for an extensive cats-and-dogs dataset, you might want to check out the Oxford-IIIT pet dataset. It covers 37 categories of different cat and dog breeds with roughly 200 images per category. Unlike a lot of other datasets, the pictures included are not the same size. The cool thing about this dataset is that not only the images are provided, but also information about the position of the animal’s face and about the fore- and background of the image (see image below).

Examples from the Oxford-IIIT pet dataset, Source: http://www.robots.ox.ac.uk/~vgg/data/pets/

KTH-ANIMALS In case you are looking for a more general animal dataset, the KTH-ANIMALS dataset might be worth a look. It can be downloaded here and provides images for 19 different classes. Each class is represented by around 100 pictures of different sizes. As in the Oxford-IIIT pet dataset, there is also information provided about the fore- and background.

Overview of the KTH-Animals dataset, Source: http://www.csc.kth.se/~att/Site/Animals.html

• Various objects

CIFAR-10 and CIFAR-100 For more advanced image classification applications, you might be interested in the CIFAR sets. These sets contain coloured images with a size of 32x32 pixels and can be downloaded from Alex Krizhevsky’s website. The CIFAR-10 dataset consists of 60,000 images, equally distributed over 10 categories. In case you are looking for a more complex set with more categories, you can use the CIFAR-100 dataset, which provides pictures from 100 classes grouped into 20 superclasses.

Excerpt of the CIFAR-10 dataset, Source: https://www.cs.toronto.edu/~kriz/cifar.html

Both CIFAR sets can be downloaded in a version for Python, a version for MATLAB or as a binary version. If you prefer to work with the data as png images, you can use this tool to convert the dataset.
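In the Python version, as I read the description on the dataset page, every image is stored as one flat row of 3072 values: first the 1024 red values, then the 1024 green values and then the 1024 blue values, each channel in row-major order. A minimal sketch for turning such a row into pixels could look like this. The function name is made up, and the `size` parameter only exists so that the example can use a tiny image.

```python
def row_to_pixels(row, size=32):
    """Convert a flat channel-by-channel row into size x size (r, g, b) tuples."""
    plane = size * size
    if len(row) != 3 * plane:
        raise ValueError(f"expected {3 * plane} values, got {len(row)}")
    red = row[:plane]
    green = row[plane:2 * plane]
    blue = row[2 * plane:]
    return [
        [(red[y * size + x], green[y * size + x], blue[y * size + x])
         for x in range(size)]
        for y in range(size)
    ]


# Tiny 2x2 example: red plane 1-4, green plane 5-8, blue plane 9-12.
tiny_row = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
pixels = row_to_pixels(tiny_row, size=2)
print(pixels[0][0], pixels[1][1])
```

For a real CIFAR row you would call the function with the default size of 32 and get a 32x32 grid of colour pixels.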

STL-10 The images provided in the CIFAR datasets are very small, so if you want to work with higher resolution pictures, the STL-10 dataset could be interesting for you. The dataset contains labeled pictures of 10 classes and is similar to the CIFAR-10 dataset, but the images have the size of 96x96 pixels. There are also fewer labeled examples per class, but the set has a large collection of unlabeled images that can be used for unsupervised training.

Excerpt of the STL-10 dataset, Source: https://cs.stanford.edu/~acoates/stl10/images.png
