Datasets

CAL500

cbar.datasets.fetch_cal500(data_home=None, download_if_missing=True, codebook_size=512)

Loader for the CAL500 dataset [1].

This dataset consists of 502 western pop songs, performed by 499 unique artists. Each song is tagged by at least three people using a standard survey and a fixed tag vocabulary of 174 musical concepts.

Warning

This utility downloads a ~1GB file to your home directory. This might take a few minutes, depending on your bandwidth.

Parameters:
  • data_home (optional) – Specify a download and cache folder for the datasets. By default (None) all data is stored in subfolders of ~/cbar_data.
  • download_if_missing (bool, optional) – If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site. Defaults to True.
  • codebook_size (int, optional) – The codebook size. Defaults to 512.
Returns:

  • X (pd.DataFrame, shape = [502, codebook_size]) – Each row corresponds to a preprocessed song, represented as a sparse codebook vector.
  • Y (pd.DataFrame, shape = [502, 174]) – Tags associated with each song in binary indicator format.

Notes

The CAL500 dataset is downloaded from UCSD’s Computer Audition Laboratory’s datasets page.

The raw dataset consists of about 10,000 39-dimensional feature vectors per minute of audio content. The feature vectors were created by:

  1. Sliding a half-overlapping short-time window of 12 milliseconds over each song’s waveform data.
  2. Extracting the 13 mel-frequency cepstral coefficients.
  3. Appending the instantaneous first-order and second-order derivatives.
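As a sketch of steps 2–3, the 13 MFCCs per frame can be augmented with their first- and second-order derivatives. The derivative estimator below (np.gradient over synthetic MFCCs) is an assumption for illustration; the actual pipeline may use a different approximation:

```python
import numpy as np

# Synthetic stand-in for the 13 MFCCs over T frames (step 2 output).
rng = np.random.default_rng(0)
T = 100
mfcc = rng.standard_normal((13, T))

# Step 3: append instantaneous first- and second-order derivatives.
# np.gradient approximates the derivative along the time axis.
delta1 = np.gradient(mfcc, axis=1)
delta2 = np.gradient(delta1, axis=1)

frames = np.vstack([mfcc, delta1, delta2])
print(frames.shape)  # (39, 100) -- d = 39 dimensions per frame
```

Stacking the three 13-dimensional blocks yields the 39-dimensional frame vectors described above.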

Each song is represented by exactly 10,000 randomly subsampled, real-valued feature vectors as a bag-of-frames \(\mathcal{X} = \{\vec{x}_1, \ldots, \vec{x}_T\} \in \mathbb{R}^{d \times T}\), where \(d = 39\) and \(T = 10000\).

The bag-of-frames features for each song are further preprocessed into one k-dimensional feature vector with the following procedure:

  1. Encode feature vectors as code vectors. Each feature vector \(\vec{x}_t \in \mathbb{R}^d\) is encoded as a code vector \(\vec{c}_t \in \mathbb{R}^k\) according to a pre-defined codebook \(C \in \mathbb{R}^{d \times k}\). The intermediate representation for the encoded audio file is \(\mathcal{X}_{enc} \in \mathbb{R}^{k \times T}\).
  2. Pool code vectors into one compact vector. The encoded frame vectors are pooled together into a single compact vector. An audio file \(x\) can now be represented as a single k-dimensional vector \(\vec{x} \in \mathbb{R}^k\).

Specifically, the k-means clustering algorithm is used to cluster all audio files’ frames into codebook_size clusters in step 1. The resulting cluster centers correspond to the codewords in the codebook. Accordingly, the encoding step consists of assigning each frame vector to its closest cluster center.
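The two-step procedure can be sketched with scikit-learn's KMeans on synthetic frames: hard assignment of each frame to its nearest codeword, followed by count pooling. This is a minimal illustration, not the library's own implementation (see cbar.preprocess.quantize_mfccs() for that), and the frame data here is random:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, T, k = 39, 1000, 16                 # feature dim, frames, codebook size
frames = rng.standard_normal((T, d))   # bag-of-frames for one song

# Codebook: cluster the frames; the cluster centers are the codewords.
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frames)

# Step 1 (encoding): assign each frame to its closest cluster center.
codes = codebook.predict(frames)       # shape (T,)

# Step 2 (pooling): pool the hard assignments into one k-dimensional
# count vector representing the whole song.
x = np.bincount(codes, minlength=k)
print(x.shape, x.sum())                # (16,) 1000
```

In the real datasets the codebook is fit on frames from all songs, so every song is encoded against the same shared codewords.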

References

[1] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet, Semantic Annotation and Retrieval of Music and Sound Effects. IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 467-476, Feb. 2008.

CAL10k

cbar.datasets.fetch_cal10k(data_home=None, download_if_missing=True, codebook_size=512, fold=1)

Loader for the CAL10k dataset [2].

This dataset consists of 10,870 western pop songs, performed by 4,597 unique artists. Each song is weakly annotated with 2 to 25 tags from a tag vocabulary of 153 genre tags and 475 acoustic tags, but only tags associated with at least 30 songs are kept in the final tag vocabulary.

The CAL10k dataset has predefined train-test splits for a 5-fold cross-validation.

Warning

This utility downloads a ~2GB file to your home directory. This might take a few minutes, depending on your bandwidth.

Parameters:
  • data_home (optional) – Specify a download and cache folder for the datasets. By default (None) all data is stored in subfolders of ~/cbar_data.
  • download_if_missing (bool, optional) – If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site. Defaults to True.
  • codebook_size (int, optional) – The codebook size. Defaults to 512.
  • fold (int, \(\in \{1, ..., 5\}\)) – The specific train-test-split to load. Defaults to 1.
Returns:

  • X_train (array-like, shape = [n_train_samples, codebook_size]) – Training set songs. Each row corresponds to a preprocessed song, represented as a sparse codebook vector.
  • X_test (array-like, shape = [n_test_samples, codebook_size]) – Test set songs. Each row corresponds to a preprocessed song, represented as a sparse codebook vector.
  • Y_train (array-like, shape = [n_train_samples, 581]) – Training set tags associated with each training set song in binary indicator format.
  • Y_test (array-like, shape = [n_test_samples, 581]) – Test set tags associated with each test set song in binary indicator format.

Notes

The CAL10k dataset is downloaded from UCSD’s Computer Audition Laboratory’s datasets page. The annotations are the “corrected” annotations from [3], downloaded from the CAL10k corrected metadata page.

The raw dataset consists of the 13 mel-frequency cepstral coefficients for each frame of each song. The acoustic data is preprocessed similar to the acoustic data in the CAL500 dataset (see notes in cbar.datasets.fetch_cal500()).

References

[2] D. Tingle, Y. E. Kim, and D. Turnbull, Exploring automatic music annotation with acoustically-objective tags. in Proceedings of the International Conference on Multimedia Information Retrieval, 2010, pp. 55-62.
[3] Y. Vaizman, B. McFee, and G. Lanckriet, Codebook-based audio feature representation for music information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1483-1493, 2014.

Freesound

cbar.datasets.load_freesound(codebook_size, data_home=None, **kwargs)

Loader for the Freesound dataset [4].

Warning

You need to download the Freesound dataset from Kaggle and unpack it into your home directory or the directory specified as data_home for this loader to work.

This dataset consists of 227,085 sounds, each at most 30 seconds long. The sounds’ original tags were provided by the users who uploaded them to Freesound; these more than 50,000 original tags were preprocessed into a final tag vocabulary of 3,466 tags, with which the sounds are annotated.

Parameters:
  • codebook_size (int, 512, 1024, 2048, 4096) – The codebook size. The dataset is pre-encoded with codebook sizes of 512, 1024, 2048, and 4096. If you want to experiment with other codebook sizes, you need to download the original MFCCs, append the first-order and second-order derivatives, and quantize the resulting frame vectors with the desired codebook_size using cbar.preprocess.quantize_mfccs().
  • data_home (optional) – Specify a home folder for the Freesound datasets. By default (None) the files are expected to be in ~/cbar_data/freesound/, where cbar_data is the data_home directory.
Returns:

  • X (pd.DataFrame, shape = [227085, codebook_size]) – Each row corresponds to a preprocessed sound, represented as a sparse codebook vector.
  • Y (pd.DataFrame, shape = [227085,]) – Tags associated with each sound provided as a list of strings. Use sklearn.preprocessing.MultiLabelBinarizer() to transform tags into binary indicator format.
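Converting the returned tag lists into the binary indicator format used by the other loaders works as suggested above. A minimal sketch with hypothetical tag lists standing in for Y:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists, one list of strings per sound, as returned in Y.
Y = [["field-recording", "birds"], ["synth", "loop"], ["birds"]]

mlb = MultiLabelBinarizer()
Y_bin = mlb.fit_transform(Y)  # shape (n_sounds, n_distinct_tags)

print(mlb.classes_)  # ['birds' 'field-recording' 'loop' 'synth']
print(Y_bin.shape)   # (3, 4)
```

Each column of Y_bin corresponds to one tag in mlb.classes_ (sorted alphabetically), with a 1 wherever the sound carries that tag.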

References

[4] F. Font, G. Roma, and X. Serra, Freesound technical demo. in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 411-412.