
Note: a newer version (0.4.2) of this package is available.

mldr.datasets

RUMDR - R Ultimate Multilabel Dataset Repository

Installation

Use install.packages to install mldr.datasets and its dependencies:

install.packages("mldr.datasets")

Alternatively, you can install it via install_github from the devtools package.

devtools::install_github("fcharte/mldr.datasets")

You can also clone the repository by entering git clone https://github.com/fcharte/mldr.datasets.git at your command line (assuming git is installed on your system) or with your favourite git GUI. This way all the datasets will be immediately available on your system. However, take into account that more than 600 MB will be needed to store the full repository.

Very large datasets, those larger than 100 MB, are stored in Git Large File Storage (Git LFS). So, before cloning the repository you will need to install this git extension and initialize it by entering git lfs install at the command line (older Git LFS releases used git lfs init). Without Git LFS, the aforementioned datasets will appear in your local copy of the repository as files containing a link instead of real data. This step is not needed to install the package through standard methods and access the datasets as explained below.

Usage and examples

This package provides a large collection of multilabel datasets along with the functions needed to export them to several formats and to obtain bibliographic information. Some of the datasets are integrated into the package, while others are externally available. To open a list with all the datasets integrated into the package use the following commands:

library(mldr.datasets)
data(package = "mldr.datasets")
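If the dataset names are needed programmatically rather than in a viewer, a minimal sketch along these lines may help. It relies on the results matrix that base R's data() returns; bundled_datasets is a hypothetical helper name and is not part of mldr.datasets.

```r
# Hypothetical helper (not part of mldr.datasets): return the names of the
# datasets bundled with an installed package as a character vector.
bundled_datasets <- function(pkg) {
  listing <- data(package = pkg)           # object of class "packageIQR"
  sort(unique(listing$results[, "Item"]))  # the "Item" column holds the names
}

# Only query mldr.datasets when it is actually installed.
if (requireNamespace("mldr.datasets", quietly = TRUE)) {
  print(bundled_datasets("mldr.datasets"))
}
```

The same helper works for any installed package, for example bundled_datasets("datasets").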

Once the package has been loaded, any of the datasets can be queried as shown below:

birds$measures  # Obtain a list of characterization measures
flags$labels    # Retrieve information about the labels
emotions$attributes # All info about the attributes in the dataset
scene$labelsets # List of labelsets and their frequencies
cat(toBibtex(ng20)) # Print the BibTeX entry for the dataset
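Because every dataset exposes its measures in the same place, several datasets can be compared side by side. The following is only a sketch: compare_measures is a hypothetical helper, and the field names num.instances, num.labels and cardinality are assumed to be present in the measures list, so check names(birds$measures) first.

```r
# Hypothetical helper: collect selected characterization measures from
# several datasets into one data frame (one row per dataset).
compare_measures <- function(datasets, fields) {
  rows <- lapply(datasets, function(d) as.data.frame(d$measures[fields]))
  do.call(rbind, rows)
}

if (requireNamespace("mldr.datasets", quietly = TRUE)) {
  library(mldr.datasets)
  # Field names below are assumptions; adjust them to names(birds$measures).
  print(compare_measures(
    list(birds = birds, emotions = emotions),
    c("num.instances", "num.labels", "cardinality")
  ))
}
```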

The external datasets are automatically downloaded from GitHub the first time they are needed, then saved locally. To obtain a list of externally available datasets use the following commands:

library(mldr.datasets)
mldrs()

The external datasets are not immediately available. To load any of them, enter its name followed by empty parentheses, as shown below:

bibtex()  # This will load the bibtex dataset, downloading it if it is not locally available
bibtex$labels
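The download-once, then reuse-locally behaviour can be pictured with a small generic sketch. This is not the package's actual code: load_cached, its fetch argument and the .rds cache file are all hypothetical, chosen only to illustrate the pattern.

```r
# Hypothetical helper: return the object called `name` from `cache_dir`,
# calling `fetch()` to obtain it only when no cached copy exists yet.
load_cached <- function(name, fetch, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (!file.exists(path)) {
    saveRDS(fetch(), path)  # first use: fetch the data and store it
  }
  readRDS(path)             # every use: read the local copy
}

# Stand-in for a slow download; a real fetch would retrieve a file instead.
slow_fetch <- function() {
  message("fetching...")
  data.frame(x = 1:3)
}

demo_dir <- tempfile("cache")
dir.create(demo_dir)
first  <- load_cached("demo", slow_fetch, demo_dir)  # prints "fetching..."
second <- load_cached("demo", slow_fetch, demo_dir)  # served from the cache
```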

The toBibtex S3 method returns bibliographic information about the dataset, if it is available. This can be printed with cat or copied to the clipboard to include it in your article.
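As an alternative to the clipboard, the entries can be written straight to a .bib file. The sketch below assumes toBibtex returns character data for each dataset; write_bib is a hypothetical helper and the output file name is arbitrary.

```r
# Hypothetical helper: write one or more BibTeX entries to a .bib file
# and return the file path invisibly.
write_bib <- function(entries, path) {
  writeLines(unlist(entries), con = path)
  invisible(path)
}

if (requireNamespace("mldr.datasets", quietly = TRUE)) {
  library(mldr.datasets)
  refs <- lapply(list(birds, emotions), toBibtex)
  write_bib(refs, file.path(tempdir(), "datasets.bib"))
}
```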

For more examples and detailed explanations of the available functions, please refer to the documentation.

License

This software is distributed under the following terms:

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

The datasets distributed within this software and inside this repository are the property of their respective authors. You can find authorship and citation information inside the additional-data folder.

Version: 0.3.15
License: LGPL (>= 3) | file LICENSE
Maintainer: David Charte
Last Published: January 16th, 2016
Monthly Downloads: 2,173

Functions in mldr.datasets (0.3.15)

eurlexsm_tra: List with 10 folds of the training data from the EUR-Lex subject matters dataset
birds: Dataset with sounds produced by birds and the species they belong to
corel16k007: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
corel16k005: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
stackex_cooking: Dataset from the Stack Exchange cooking forum
rcv1sub1: Dataset from the Reuters corpus (subset 1)
yahoo_reference: Dataset generated from the Yahoo! web site index (reference category)
corel16k001: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
corel16k006: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
eurlexdc_tra: List with 10 folds of the training data from the EUR-Lex directory codes dataset
corel5k: Dataset with data from the Corel image collection
bookmarks: Dataset with data from web bookmarks and their categories
toBibtex.mldr: BibTeX entry associated with an mldr object
ohsumed: Dataset generated from a subset of the Medline database
yahoo_arts: Dataset generated from the Yahoo! web site index (arts category)
corel16k010: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
yahoo_health: Dataset generated from the Yahoo! web site index (health category)
yahoo_society: Dataset generated from the Yahoo! web site index (society category)
corel16k008: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
mediamill: Dataset with features extracted from video sequences and semantic concepts assigned as labels
mldrs: Obtain and show a list of additional datasets available to download
nuswide_VLAD: Dataset obtained from the NUS-WIDE database with cVLAD+ representation
eurlexdc_test: List with 10 folds of the test data from the EUR-Lex directory codes dataset
rcv1sub4: Dataset from the Reuters corpus (subset 4)
tmc2007: Dataset from airplane failure reports
write.mldr: Export an mldr object or set of mldr objects to different file formats
yahoo_education: Dataset generated from the Yahoo! web site index (education category)
corel16k009: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
eurlexev_test: List with 10 folds of the test data from the EUR-Lex EUROVOC descriptors dataset
stackex_coffee: Dataset from the Stack Exchange coffee forum
yahoo_computers: Dataset generated from the Yahoo! web site index (computers category)
reutersk500: Dataset from the Reuters Corpus with the 500 most relevant features selected
delicious: Dataset generated from the del.icio.us site bookmarks
corel16k003: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
yahoo_social: Dataset generated from the Yahoo! web site index (social category)
langlog: Dataset with data from the Language forum discussion
rcv1sub5: Dataset from the Reuters corpus (subset 5)
stackex_philosophy: Dataset from the Stack Exchange philosophy forum
cal500: Dataset with music data along with labels for emotions, instruments, genres, etc.
medical: Dataset generated from medical reports
eurlexev_tra: List with 10 folds of the training data from the EUR-Lex EUROVOC descriptors dataset
random.kfolds: Partition an mldr object into k folds
nuswide_BoW: Dataset obtained from the NUS-WIDE database with BoW representation
emotions: Dataset with features extracted from music tracks and the emotions they produce
stackex_cs: Dataset from the Stack Exchange computer science forum
stackex_chess: Dataset from the Stack Exchange chess forum
yahoo_entertainment: Dataset generated from the Yahoo! web site index (entertainment category)
yahoo_business: Dataset generated from the Yahoo! web site index (business category)
yeast: Dataset with protein profiles and their categories
check_n_load.mldr: Check if an mldr object is locally available and download it if needed
bibtex: Dataset with BibTeX entries
stratified.kfolds: Partition an mldr object into k folds
genbase: Dataset with gene data and their functional expression
eurlexsm_test: List with 10 folds of the test data from the EUR-Lex subject matters dataset
enron: Dataset with email messages and the folders where the users stored them
corel16k002: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
corel16k004: Datasets with data from the Corel image collection. There are 10 subsets in corel16k
rcv1sub3: Dataset from the Reuters corpus (subset 3)
tmc2007_500: Dataset from airplane failure reports (500 most relevant features extracted)
slashdot: Dataset generated from slashdot.org site entries
yahoo_science: Dataset generated from the Yahoo! web site index (science category)
ng20: Dataset with news messages and the news groups they belong to
imdb: Dataset generated from the IMDB film database
flags: Dataset with features corresponding to world flags
scene: Dataset from images with different natural scenes
stackex_chemistry: Dataset from the Stack Exchange chemistry forum
yahoo_recreation: Dataset generated from the Yahoo! web site index (recreation category)
rcv1sub2: Dataset from the Reuters corpus (subset 2)
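
The partitioning functions listed above can be tried with a sketch like the following. It rests on several assumptions that should be checked against the package documentation: that random.kfolds accepts a k argument, that each returned fold is a list with train and test elements, that the mldr package must be loaded for partitioning to work, and that measures contains num.instances. fold_sizes is a hypothetical helper.

```r
# Hypothetical helper: tabulate how many instances the train and test
# partitions of each fold contain. Assumes every fold is a list with
# `train` and `test` elements that carry `measures$num.instances`.
fold_sizes <- function(folds) {
  t(sapply(folds, function(f) {
    c(train = f$train$measures$num.instances,
      test  = f$test$measures$num.instances)
  }))
}

if (requireNamespace("mldr.datasets", quietly = TRUE) &&
    requireNamespace("mldr", quietly = TRUE)) {
  library(mldr)           # assumed to be needed to subset mldr objects
  library(mldr.datasets)
  folds <- random.kfolds(emotions, k = 5)  # argument name k is an assumption
  print(fold_sizes(folds))
}
```

stratified.kfolds could be swapped in for random.kfolds in the same way, assuming it shares the same arguments.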