DAT264x: Identifying Accents in Spectrograms of Speech
Hosted By Microsoft


Problem Description

About the Data

Your goal is to predict the accent of the speaker from spectrograms of speech samples. A spectrogram is a visual representation of the various frequencies of sound as they vary with time. These spectrograms were generated from audio samples in the Mozilla Common Voice dataset. Each speech clip was sampled at 22,050 Hz, and contains an accent from one of the following three countries: Canada, India, and England. For more information on spectrograms, see the home page.

This is what the data directory looks like for the data you are given:

├── train
│   ├── 10000.png
│   ├── 10001.png
│   ├── 10002.png
│   ├── 10003.png
│   ├── ...
│   ├── 14498.png
│   └── 14499.png
├── test
│   ├── 20000.png
│   ├── 20001.png
│   ├── 20002.png
│   ├── 20003.png
│   ├── ...
│   ├── 25375.png
│   └── 25376.png
├── submission_format.csv
└── train_labels.csv
  • One folder of images is labeled train. You have the true labels for these images in train_labels.csv.
  • One folder of images is labeled test. You don't know the true labels for these.

Your job is to:

  1. Train a model using the images in train and the labels in train_labels.csv.
  2. Predict accent labels for the images in test for which you don't know the true accents.
  3. Output your predictions in a format that matches submission_format.csv exactly.
  4. Upload your predictions to this competition in order to get an accuracy score.
  5. Export your grading token (click the "Export Score for EdX" tab) and paste it into the assignment grader on edX to get your course grade.

For each observation, you are given a spectrogram of speech audio. The files are named with the convention {file_id}.png. The file_id in the filename matches the file_id column in train_labels.csv for the training data and in submission_format.csv for the test data.
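The naming convention above can be turned into a list of (image path, label) pairs. The following is a minimal sketch using only the standard library; the helper name load_train_labels is hypothetical, and it assumes that train_labels.csv has the columns file_id and accent (the accent column name is taken from the submission header and is not confirmed for the training labels).

```python
import csv
from pathlib import Path


def load_train_labels(labels_csv, image_dir="train"):
    """Return a list of (image_path, accent) pairs in file order.

    Assumes the CSV has a header row with "file_id" and "accent" columns.
    """
    pairs = []
    with open(labels_csv, newline="") as f:
        for row in csv.DictReader(f):
            file_id = int(row["file_id"])
            accent = int(row["accent"])
            # Files follow the {file_id}.png convention inside image_dir.
            pairs.append((Path(image_dir) / f"{file_id}.png", accent))
    return pairs
```

Called from the data directory as load_train_labels('train_labels.csv'), this would pair each training label with a path such as train/10000.png.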

The spectrograms have been scaled to 128x173 pixels. Each image has only one channel, so a properly loaded image has the shape (128, 173). Using scikit-image, loading a file looks like:

from skimage import io
import matplotlib.pyplot as plt

# load the spectrogram as a single-channel image
my_image = io.imread('10000.png', as_gray=True)

# look at the image
plt.imshow(my_image, cmap='gray')
plt.show()

Target Variable

The accent labels correspond to the following countries:

  • 0: Canada
  • 1: India
  • 2: England

Submission Format

The format for the submission file is CSV with a header row (file_id, accent). Each row contains a file ID (an integer) and the accent value (an integer) separated by a comma. The file ID corresponds to the filename of the images in the test/ folder. The accent value is the target label, one of {0, 1, 2}. Note that your accent labels must be integers.

Note: Except for the actual prediction values in the accent column, your submission must exactly match the submission_format.csv file provided, including the order of the rows.
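One way to satisfy the exact-match requirement is to copy submission_format.csv and overwrite only the accent column, so that the header and row order are inherited from the provided file. The sketch below is one possible approach, not an official tool: write_submission is a hypothetical helper, predictions is assumed to be a dict mapping each integer file ID to a label, and the "\n" line terminator is an assumption about the provided file.

```python
import csv


def write_submission(format_csv, predictions, out_csv):
    """Copy the format file, replacing only the accent values.

    predictions maps file_id (int) -> predicted label in {0, 1, 2}.
    """
    with open(format_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)  # keeps the original row order

    for row in rows:
        label = int(predictions[int(row["file_id"])])  # labels must be integers
        if label not in (0, 1, 2):
            raise ValueError(f"unexpected accent label: {label!r}")
        row["accent"] = label

    with open(out_csv, "w", newline="") as f:
        # The "\n" terminator is an assumption; match the provided file if it differs.
        writer = csv.DictWriter(f, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
```

A missing file ID raises a KeyError rather than silently writing an incomplete file.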

For example, if you predicted that every speech sample had an Indian accent, you'd predict a value of 1 in every row:

file_id accent
20000 1
20001 1
20002 1
20003 1
20004 1

The .csv file that you submit would look like:

file_id,accent
20000,1
20001,1
20002,1
20003,1
20004,1


Performance Metric

To measure your model's performance, we'll use the classification rate (often referred to simply as "accuracy"), which is the fraction of predictions that are correct. This is a quantity in |$[0,1]$| where a higher value is better, and is given by the following formula:

$$ \mathrm{classification\ rate}( \hat y, y) = \frac{ N_\mathrm{correct} }{ N_\mathrm{predictions} } $$

In this case, |$N_\mathrm{predictions}=5,377$| because you are being asked to classify 5,377 rows in the test set.
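The formula can be computed directly. Because the true test labels are not available to you, a function like the one below is mainly useful on a held-out portion of the training data; the function name and that usage are illustrative assumptions, not part of the competition tooling.

```python
def classification_rate(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one prediction is required")
    # N_correct / N_predictions
    n_correct = sum(int(p == t) for p, t in zip(y_pred, y_true))
    return n_correct / len(y_true)


# Three of the four predictions below match the true labels.
print(classification_rate([1, 1, 2, 0], [1, 0, 2, 0]))  # 0.75
```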

Hint for How to Approach this Problem

One of the most common example problems for teaching introductory AI is the MNIST handwritten digits classification task. It is widely used as the "hello world" of deep learning problems for beginners. See, for example, this one from Keras.

In the MNIST problem you are trying to classify a set of 28x28 black and white images of handwritten digits into 10 label classes (values 0 through 9 which represent their respective numerals).

In this capstone challenge you are trying to classify a set of 128x173 grayscale images into 3 label classes (values 0 through 2 which represent the different accents).

Sample spectrograms by accent label


See the similarity? This problem is a bit different but the concepts carry over beautifully, so if you're not sure where to start, we recommend finding a good MNIST tutorial explained with the same modeling package that you will be using. Once you gain a deep understanding of that problem, it should be clear how to approach this modeling task.
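The parallel with MNIST extends to input preparation: most MNIST tutorials stack the images into one array, scale pixel values, add a channel axis, and one-hot encode the labels. The NumPy sketch below adapts those steps to 128x173 images and 3 classes. The helper name prepare_inputs is hypothetical, the division by 255 assumes 8-bit integer pixels, and whether your modeling package wants a trailing channel axis or one-hot labels is something to check against its documentation.

```python
import numpy as np

NUM_CLASSES = 3           # 0: Canada, 1: India, 2: England
IMAGE_SHAPE = (128, 173)  # single-channel spectrogram


def prepare_inputs(images, labels):
    """Stack images into (N, 128, 173, 1) floats and one-hot encode labels."""
    x = np.stack(images)  # (N, 128, 173)
    if x.shape[1:] != IMAGE_SHAPE:
        raise ValueError(f"expected images of shape {IMAGE_SHAPE}, got {x.shape[1:]}")
    if np.issubdtype(x.dtype, np.integer):
        # Assumes 8-bit pixels; float inputs are left on their existing scale.
        x = x.astype("float32") / 255.0
    else:
        x = x.astype("float32")
    x = x[..., np.newaxis]  # add the channel axis
    y = np.eye(NUM_CLASSES, dtype="float32")[np.asarray(labels, dtype=int)]
    return x, y
```

The only structural differences from the MNIST case are the image shape (128x173 instead of 28x28) and the number of classes (3 instead of 10).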


Mozilla. Mozilla Common Voice dataset. https://voice.mozilla.org/en/datasets