Libri-adhoc40 is a synchronized speech corpus. It was collected by replaying Librispeech data through loudspeakers and recording it with ad-hoc microphone arrays of 40 strongly synchronized distributed nodes in a real office environment. In addition, to provide an evaluation target for speech front-end processing and other applications, the replayed Librispeech data was also recorded in an anechoic chamber.
The Libri-adhoc40 dataset is built on the ‘train-clean-100’, ‘dev-clean’, and ‘test-clean’ corpora of Librispeech, which together contain about 110 hours of US English speech from 331 speakers. In total, Libri-adhoc40 contains 4510 hours of data, with 110 hours per channel (40 office-room microphones plus one anechoic-chamber channel).
An overview of Libri-adhoc40 is listed in the following table:
subset | recording environment | duration per channel | number of speakers | number of channels | loudspeaker positions | playback corpus in Librispeech |
---|---|---|---|---|---|---|
training data | office room | 100h | 251 | 40 | 9 | train-clean-100 |
dev data | office room | 5h | 40 | 40 | 4 | dev-clean |
test data | office room | 5h | 40 | 40 | 4 | test-clean |
ground-truth clean data | anechoic chamber | 110h | 331 | 1 | 1 | train-clean-100, dev-clean, test-clean |
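As a quick consistency check, the 4510-hour total can be recomputed from the per-channel durations and channel counts in the overview table. The snippet below is only an illustrative sketch; the names `SUBSETS` and `total_hours` are made up for this example.

```python
# Rows of the overview table: (subset, hours per channel, number of channels).
SUBSETS = [
    ("training data", 100, 40),
    ("dev data", 5, 40),
    ("test data", 5, 40),
    ("ground-truth clean data", 110, 1),
]


def total_hours(rows):
    """Sum (hours per channel) x (number of channels) over the given subsets."""
    return sum(hours * channels for _, hours, channels in rows)


office_hours = total_hours(SUBSETS[:3])  # 110 h on each of the 40 microphones
clean_hours = total_hours(SUBSETS[3:])   # 110 h on the single anechoic channel
print(office_hours, clean_hours, office_hours + clean_hours)  # 4400 110 4510
```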
For each utterance in the ‘train-clean-100’, ‘dev-clean’, and ‘test-clean’ corpora, we replayed it through a loudspeaker in both the office room and the anechoic chamber. Note that the positions of the 40 microphones in the office room during the collection of the training data differ from those used for the development and test data.
Suppose the utterance '229-130880-0017' was replayed, where the ID '229-130880-0017' means that speaker '229' read sentence '0017' of chapter '130880'. The naming rule can be described as follows:
We obtain 41 channels of data in total for each utterance: 40 from the office room and one from the anechoic chamber. The recordings are organized first by loudspeaker position and speaker, then by chapter, and finally by the original sentence number. Specifically, each utterance recorded in the office room is named by appending a suffix with the microphone number to the original ID (e.g. '174-84280-0010'). For utterances recorded in the anechoic chamber, the suffix 'anechoic' is appended instead.
In the Librispeech corpus, the relative path of utterance '229-130880-0017' is:
.\train-clean-100\229\130880\229-130880-0017.flac
In the Libri-adhoc40 corpus, the relative paths of the recordings of '229-130880-0017' have the following form:
.\adhoc40-train\pos #\229\130880\229-130880-0017-ch-1.wav
.\adhoc40-train\pos #\229\130880\229-130880-0017-ch-2.wav
.\adhoc40-train\pos #\229\130880\229-130880-0017-ch-3.wav
...
.\adhoc40-train\pos #\229\130880\229-130880-0017-ch-40.wav
.\adhoc40-train\pos #\229\130880\229-130880-0017-anechoic.wav
Here 'pos #' indicates the position of the loudspeaker. See below for more detailed descriptions.
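The naming rule can be sketched as a small helper that expands one Librispeech utterance ID into its 41 Libri-adhoc40 file paths. This is a minimal illustration, not part of the dataset tooling: `recording_paths` is a hypothetical name, and because only the placeholder 'pos #' is given here, the actual name of the loudspeaker-position directory (written as 'pos 1' below) is an assumption that should be checked against the downloaded data.

```python
from pathlib import PureWindowsPath

NUM_CHANNELS = 40  # microphones in the office room


def recording_paths(utt_id, subset_dir, pos_dir):
    """Return (office-room channel paths, anechoic path) for one utterance.

    utt_id     : Librispeech ID 'speaker-chapter-sentence', e.g. '229-130880-0017'
    subset_dir : corpus subdirectory, e.g. 'adhoc40-train'
    pos_dir    : loudspeaker-position directory; its exact spelling is an
                 assumption, since only the placeholder 'pos #' is documented
    """
    speaker, chapter, _sentence = utt_id.split("-")
    folder = PureWindowsPath(subset_dir, pos_dir, speaker, chapter)
    channels = [folder / f"{utt_id}-ch-{n}.wav" for n in range(1, NUM_CHANNELS + 1)]
    anechoic = folder / f"{utt_id}-anechoic.wav"
    return channels, anechoic


channels, anechoic = recording_paths("229-130880-0017", "adhoc40-train", "pos 1")
print(channels[0])
print(channels[-1])
print(anechoic)
```

`PureWindowsPath` is used only so that the backslash-separated form shown above is reproduced on any operating system; on Linux or macOS the same components can be joined with `pathlib.Path` instead.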
The floor plan of the office room is shown below.
The red dot indicates the origin of the reference axes. The blue dots indicate the positions of the microphones, whose coordinates are listed in the upper-left corner. The positions and orientations of the loudspeaker are marked by loudspeaker icons. The term ‘pos’ is short for position. The term ‘mic’ is short for microphone.
The height of the room is 4.2 m. Because the room is large and the floor is laid with smooth tiles, the room is highly reverberant, with a T60 of around 900 ms. Because the room is far from noisy environments, the recorded speech has little additive noise. A directional loudspeaker and 40 omnidirectional microphones of the same type were placed in the room. The sampling rate is 16 kHz.
The specific coordinates of these 40 microphones (for training data) are shown in the tables below:
mic | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x(m) | 9.1 | 8.3 | 9.1 | 8.3 | 9.1 | 8.3 | 9.1 | 8.3 | 7.5 | 6.7 | 7.5 | 6.7 | 7.5 | 6.7 | 7.5 | 6.7 | 5.9 | 5.1 | 5.9 | 5.1 |
y(m) | 5.2 | 6.0 | 3.6 | 4.4 | 2 | 2.8 | 0.4 | 1.2 | 5.2 | 6.0 | 3.6 | 4.4 | 2 | 2.8 | 0.4 | 1.2 | 5.2 | 6.0 | 3.6 | 4.4 |
z(m) | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 |
mic | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x(m) | 5.9 | 5.1 | 5.9 | 5.1 | 4.3 | 3.5 | 4.3 | 3.5 | 4.3 | 3.5 | 4.3 | 3.5 | 2.7 | 1.9 | 2.7 | 1.9 | 2.7 | 1.9 | 2.7 | 1.9 |
y(m) | 2 | 2.8 | 0.4 | 1.2 | 5.2 | 6 | 3.6 | 4.4 | 2 | 2.8 | 0.4 | 1.2 | 5.2 | 6 | 3.6 | 4.4 | 2 | 2.8 | 0.4 | 1.2 |
z(m) | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 |
The specific coordinates of the loudspeaker (for training data) are shown in the table below:
loudspeaker position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
x(m) | 2.7 | 2.7 | 2.7 | 4.3 | 5.9 | 8.3 | 8.3 | 8.3 | 5.1 |
y(m) | 4.4 | 2.8 | 1.2 | 1.2 | 1.2 | 2.0 | 3.6 | 5.2 | 3.6 |
z(m) | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
Note that the loudspeaker at ‘pos 9’ was used in two opposite orientations. We refer to the loudspeaker facing upward as 'pos 9u' and to the other one as 'pos 9d'.
The relationships between the positions of the loudspeaker and the identities of the speakers can be found here. The whole training set is saved under the subdirectory .\adhoc40-train\
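The coordinates in the training tables above can be used to estimate the direct-path distance and propagation delay from each loudspeaker position to each microphone. The following is a minimal sketch under stated assumptions: the speed of sound (343 m/s) is an assumed value that is not given in this description, and the names `training_mics`, `TRAIN_LOUDSPEAKERS`, and `direct_path` are introduced only for illustration. The microphone grid is regenerated from the regular pattern visible in the tables rather than typed out entry by entry.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not given in the dataset description
SAMPLE_RATE = 16000     # Hz, as stated for the recordings


def training_mics():
    """Rebuild the 40 training-microphone coordinates listed in the tables."""
    # The y pattern repeats for every block of 8 microphones.
    ys = [5.2, 6.0, 3.6, 4.4, 2.0, 2.8, 0.4, 1.2]
    mics = {}
    for block in range(5):
        # Within a block, odd-numbered mics use the first x, even-numbered the second.
        xs = (round(9.1 - 1.6 * block, 1), round(8.3 - 1.6 * block, 1))
        for k, y in enumerate(ys):
            mics[8 * block + k + 1] = (xs[k % 2], y, 0.9)
    return mics


# Loudspeaker coordinates for the training data (pos 9u and 9d share pos 9).
TRAIN_LOUDSPEAKERS = {
    1: (2.7, 4.4, 0.95), 2: (2.7, 2.8, 0.95), 3: (2.7, 1.2, 0.95),
    4: (4.3, 1.2, 0.95), 5: (5.9, 1.2, 0.95), 6: (8.3, 2.0, 0.95),
    7: (8.3, 3.6, 0.95), 8: (8.3, 5.2, 0.95), 9: (5.1, 3.6, 0.95),
}


def direct_path(src, mic):
    """Return (distance in metres, delay in samples) for the direct path."""
    dist = math.dist(src, mic)
    return dist, dist / SPEED_OF_SOUND * SAMPLE_RATE


mics = training_mics()
for mic_id in (19, 1, 40):
    dist, delay = direct_path(TRAIN_LOUDSPEAKERS[9], mics[mic_id])
    print(f"pos 9 -> mic {mic_id}: {dist:.2f} m, {delay:.1f} samples")
```

These values describe only the direct path. With a T60 of around 900 ms, the recorded signals are dominated by reflections, so the delays should be read as geometric reference values rather than measured ones.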
The floor plan of the office room and the positions of the loudspeaker and microphones for the development and test data are shown below.
The red dot indicates the origin of the reference axes. The blue dots indicate the positions of the microphones, whose coordinates are listed in the upper-left corner. The positions and orientations of the loudspeaker are marked by loudspeaker icons. The term ‘pos’ is short for position. The term ‘mic’ is short for microphone.
Pos 1 to 4 were selected to replay the 'test-clean' corpus, and pos 5 to 8 were selected to replay the 'dev-clean' corpus.
The specific coordinates of these 40 microphones (for development and test data) are shown in the tables below:
mic | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x(m) | 8.3 | 8.3 | 8.3 | 8.3 | 8.3 | 8.3 | 8.3 | 8.3 | 6.7 | 6.7 | 6.7 | 6.7 | 6.7 | 6.7 | 6.7 | 6.7 | 5.1 | 5.1 | 5.1 | 5.1 |
y(m) | 6 | 5.2 | 4.4 | 3.6 | 2.8 | 2 | 1.2 | 0.4 | 6 | 5.2 | 4.4 | 3.6 | 2.8 | 2 | 1.2 | 0.4 | 6 | 5.2 | 4.4 | 3.6 |
z(m) | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 |
mic | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x(m) | 5.1 | 5.1 | 5.1 | 5.1 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 | 1.9 | 1.9 | 1.9 | 1.9 | 1.9 | 1.9 | 1.9 | 1.9 |
y(m) | 2.8 | 2 | 1.2 | 0.4 | 6 | 5.2 | 4.4 | 3.6 | 2.8 | 2 | 1.2 | 0.4 | 6 | 5.2 | 4.4 | 3.6 | 2.8 | 2 | 1.2 | 0.4 |
z(m) | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 |
The specific coordinates of the loudspeaker (for development and test data) are shown in the table below:
loudspeaker position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
data category | test set | test set | test set | test set | dev set | dev set | dev set | dev set |
x(m) | 2.7 | 4.3 | 5.9 | 7.5 | 2.7 | 4.3 | 5.9 | 7.5 |
y(m) | 1.2 | 1.2 | 1.2 | 1.2 | 5.2 | 5.2 | 5.2 | 5.2 |
z(m) | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
The relationships between the positions of the loudspeaker and the identities of the speakers can be found here. The development set and test set are saved under the subdirectories .\adhoc40-dev\ and .\adhoc40-test\, respectively.
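For the development and test data, the microphones form a regular 5 × 8 grid, and the loudspeaker position determines which subset a recording belongs to. The sketch below regenerates the grid from the tables above and maps a position number to its subdirectory; `devtest_mics`, `DEVTEST_LOUDSPEAKERS`, and `subset_for_position` are hypothetical names used only for this example.

```python
def devtest_mics():
    """Rebuild the 40 dev/test microphone coordinates listed in the tables."""
    mics = {}
    for col in range(5):                      # x decreases by 1.6 m per column
        x = round(8.3 - 1.6 * col, 1)
        for row in range(8):                  # y decreases by 0.8 m per row
            y = round(6.0 - 0.8 * row, 1)
            mics[8 * col + row + 1] = (x, y, 0.9)
    return mics


# Loudspeaker coordinates for the development and test data.
DEVTEST_LOUDSPEAKERS = {
    1: (2.7, 1.2, 0.95), 2: (4.3, 1.2, 0.95), 3: (5.9, 1.2, 0.95), 4: (7.5, 1.2, 0.95),
    5: (2.7, 5.2, 0.95), 6: (4.3, 5.2, 0.95), 7: (5.9, 5.2, 0.95), 8: (7.5, 5.2, 0.95),
}


def subset_for_position(pos):
    """Map a loudspeaker position to its subdirectory (pos 1-4: test, 5-8: dev)."""
    if pos not in DEVTEST_LOUDSPEAKERS:
        raise ValueError(f"unknown dev/test loudspeaker position: {pos}")
    return "adhoc40-test" if pos <= 4 else "adhoc40-dev"


mics = devtest_mics()
print(mics[1], mics[21], mics[40])
print(subset_for_position(3), subset_for_position(5))
```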
The net interior dimensions of the anechoic chamber are 11.8 × 4.2 × 3.8 m after the installation of sound-absorbing materials.
We replayed the clean speech of Librispeech (the ‘train-clean-100’, ‘dev-clean’, and ‘test-clean’ corpora) in the anechoic chamber to provide the ground-truth clean speech of Libri-adhoc40. The distance between the loudspeaker and the recording device is 40 cm. The sound volume of the loudspeaker was set to the same level as in the office room.
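When the anechoic recordings are used as training or evaluation targets, each office-room channel file has to be paired with its clean reference. Following the naming rule described earlier, where the anechoic file sits in the same directory as the 40 channel files, the pairing can be sketched as below. The helper name `anechoic_reference` and the example directory name 'pos 1' are assumptions made for illustration.

```python
import re
from pathlib import PureWindowsPath

# Matches the channel suffix at the end of a file stem, e.g. '-ch-7'.
_CH_SUFFIX = re.compile(r"-ch-(\d+)$")


def anechoic_reference(channel_path):
    """Return (path of the anechoic reference, channel number) for a channel file."""
    path = PureWindowsPath(channel_path)
    match = _CH_SUFFIX.search(path.stem)
    if match is None:
        raise ValueError(f"not an office-room channel file: {channel_path}")
    channel = int(match.group(1))
    # Swap the '-ch-N' suffix for '-anechoic' and keep directory and extension.
    clean_stem = _CH_SUFFIX.sub("-anechoic", path.stem)
    return path.with_name(clean_stem + path.suffix), channel


ref, ch = anechoic_reference(
    r".\adhoc40-train\pos 1\229\130880\229-130880-0017-ch-7.wav"
)
print(ch, ref)
```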
The test data of Libri-adhoc40 can be downloaded at https://www.dropbox.com/s/3ph407rvr8bhg0e/adhoc40-test.rar?dl=0
The dev data of Libri-adhoc40 can be downloaded at https://www.dropbox.com/sh/xozyvr1bbybh3fi/AABLUwZxbKlJcpPgwfq-o4Mra?dl=0
The rest will be available soon.
Libri-adhoc40: A dataset collected from synchronized ad-hoc microphone arrays
Scaling sparsemax based channel selection for speech recognition with ad-hoc microphone arrays
Attention-based multi-channel speaker verification with ad-hoc microphone arrays
Deep ad-hoc beamforming based on speaker extraction for target-dependent speech separation
Multi-Channel Far-Field Speaker Verification with Large-Scale Ad-hoc Microphone Arrays