A dataset collected from synchronized ad-hoc microphone arrays

ISmallFish/Libri-adhoc40

Libri-adhoc40

Libri-adhoc40 is a synchronized speech corpus. It contains Librispeech data replayed through loudspeakers and recorded by ad-hoc microphone arrays of 40 strongly synchronized distributed nodes in a real office environment. To provide an evaluation target for speech front-end processing and other applications, the replayed Librispeech data was also recorded in an anechoic chamber.

Description of the dataset

The Libri-adhoc40 dataset is built on the ‘train-clean-100’, ‘dev-clean’, and ‘test-clean’ corpora of Librispeech, which together contain about 110 hours of US English speech from 331 speakers. In total, Libri-adhoc40 contains 4510 hours of data, with 110 hours of data per microphone.

An overview of Libri-adhoc40 is listed in the following table:

| subset | recording environment | duration per channel | spkr nums | ch nums | loudspeaker positions | playback corpus in Librispeech |
| --- | --- | --- | --- | --- | --- | --- |
| training data | office room | 100h | 251 | 40 | 9 | train-clean-100 |
| dev data | office room | 5h | 40 | 40 | 4 | dev-clean |
| test data | office room | 5h | 40 | 40 | 4 | test-clean |
| ground-truth clean data | anechoic chamber | 110h | 331 | 1 | 1 | train-clean-100, dev-clean, test-clean |
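The 4510-hour total follows from the per-channel durations in the overview table. The sketch below reproduces that arithmetic; the constant and function names are illustrative only, and the durations are the rounded figures from the table, not measured values.

```python
# Nominal per-channel durations in hours, taken from the overview table.
HOURS_PER_CHANNEL = {"train": 100, "dev": 5, "test": 5}

OFFICE_CHANNELS = 40    # microphones in the office room
ANECHOIC_CHANNELS = 1   # single ground-truth recording in the anechoic chamber


def total_hours(hours_per_channel=HOURS_PER_CHANNEL):
    """Return the nominal corpus size in hours across all 41 channels."""
    per_channel = sum(hours_per_channel.values())  # 110 h per channel
    return per_channel * (OFFICE_CHANNELS + ANECHOIC_CHANNELS)


print(total_hours())
```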

Each utterance in the ‘train-clean-100’, ‘dev-clean’, and ‘test-clean’ corpora was replayed through a loudspeaker in both the office room and the anechoic chamber. The positions of the 40 microphones used for the training data differ from those used for the development and test data.

Suppose that the sentence numbered '229-130880-0017' was replayed, where '229-130880-0017' means that speaker '229' read sentence '0017' of chapter '130880'. The naming rule is as follows:

Each sentence yields 41 channels of data in total, since it was recorded in both the office room and the anechoic chamber. The recordings are organized first by loudspeaker position and speaker, then by chapter, and finally by the original sentence number. Each utterance recorded in the office room is named by appending a suffix with the microphone number to the original number (e.g. '174-84280-0010'). Each utterance recorded in the anechoic chamber has the suffix 'anechoic' appended instead.

In the Librispeech corpus, the relative path of sentence '229-130880-0017' is:

.\train-clean-100\229\130880\229-130880-0017.flac

In the Libri-adhoc40 corpus, the relative paths of the recordings of sentence '229-130880-0017' have the following forms:

.\adhoc40-train\pos #\229\130880\229-130880-0017-ch-1.wav
.\adhoc40-train\pos #\229\130880\229-130880-0017-ch-2.wav
.\adhoc40-train\pos #\229\130880\229-130880-0017-ch-3.wav
                         ...
.\adhoc40-train\pos #\229\130880\229-130880-0017-ch-40.wav
.\adhoc40-train\pos #\229\130880\229-130880-0017-anechoic.wav

Here, pos # indicates the position of the loudspeaker. The positions are described in more detail in the sections below.
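As a rough illustration of the naming rule, the sketch below builds the 41 relative paths for one utterance. The function names are hypothetical and not part of any released tooling. The position folder is passed in by the caller because the listing above only shows it as the placeholder `pos #`.

```python
from pathlib import PureWindowsPath


def split_utterance_id(utt_id):
    """Split an ID such as '229-130880-0017' into speaker, chapter, sentence."""
    speaker, chapter, sentence = utt_id.split("-")
    return speaker, chapter, sentence


def recording_paths(utt_id, pos_dir, root="adhoc40-train", n_channels=40):
    """Return the relative paths of all recordings of one utterance.

    pos_dir is the loudspeaker-position folder; its exact spelling on disk
    is an assumption the caller has to supply.
    """
    speaker, chapter, _ = split_utterance_id(utt_id)
    folder = PureWindowsPath(root, pos_dir, speaker, chapter)
    names = [f"{utt_id}-ch-{ch}.wav" for ch in range(1, n_channels + 1)]
    names.append(f"{utt_id}-anechoic.wav")  # ground-truth clean recording
    return [folder / name for name in names]


# "pos #" is the placeholder used in the listing above, not a real folder name.
paths = recording_paths("229-130880-0017", pos_dir="pos #")
print(len(paths))
print(paths[0])
print(paths[-1])
```

Swapping `PureWindowsPath` for `PurePosixPath` gives forward-slash paths on other systems.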

Training data

The floor plan of the office room is shown below.

The red dot indicates the origin of the reference axes. The blue dots indicate the positions of the microphones, whose coordinates are listed in the upper-left corner. The positions and orientations of the loudspeaker are marked by loudspeaker icons. The term ‘pos’ is short for position. The term ‘mic’ is short for microphone.

The height of the room is 4.2 m. Because the room is large and the floor is laid with smooth tiles, the room is highly reverberant, with a T60 of around 900 ms. Because the room is far from noisy environments, the recorded speech has little additive noise. A directional loudspeaker and 40 omnidirectional microphones of the same type were placed in the room. The sampling rate is 16 kHz.

The specific coordinates of these 40 microphones (for training data) are shown in the tables below:

mic 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
x(m) 9.1 8.3 9.1 8.3 9.1 8.3 9.1 8.3 7.5 6.7 7.5 6.7 7.5 6.7 7.5 6.7 5.9 5.1 5.9 5.1
y(m) 5.2 6.0 3.6 4.4 2 2.8 0.4 1.2 5.2 6.0 3.6 4.4 2 2.8 0.4 1.2 5.2 6.0 3.6 4.4
z(m) 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
mic 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
x(m) 5.9 5.1 5.9 5.1 4.3 3.5 4.3 3.5 4.3 3.5 4.3 3.5 2.7 1.9 2.7 1.9 2.7 1.9 2.7 1.9
y(m) 2 2.8 0.4 1.2 5.2 6 3.6 4.4 2 2.8 0.4 1.2 5.2 6 3.6 4.4 2 2.8 0.4 1.2
z(m) 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9

The specific coordinates of the loudspeaker (for training data) are shown in the table below:

loudspeaker position 1 2 3 4 5 6 7 8 9
x(m) 2.7 2.7 2.7 4.3 5.9 8.3 8.3 8.3 5.1
y(m) 4.4 2.8 1.2 1.2 1.2 2.0 3.6 5.2 3.6
z(m) 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95

Note that the loudspeaker at ‘pos 9’ has 2 opposite orientations. We refer to the loudspeaker facing upward as pos 9u, and the other one as pos 9d.
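The coordinates in the training-data tables give the distance from each loudspeaker position to each microphone, which can help when selecting or weighting channels. The sketch below uses hypothetical variable and function names. Pos 9u and 9d share the same coordinates, so they appear under the single key 9.

```python
import math

# Training-layout microphone coordinates in metres, copied from the tables above.
MIC_X = ([9.1, 8.3] * 4 + [7.5, 6.7] * 4 + [5.9, 5.1] * 4
         + [4.3, 3.5] * 4 + [2.7, 1.9] * 4)
MIC_Y = [5.2, 6.0, 3.6, 4.4, 2.0, 2.8, 0.4, 1.2] * 5
MIC_Z = 0.9
MICS = {i + 1: (x, y, MIC_Z) for i, (x, y) in enumerate(zip(MIC_X, MIC_Y))}

# Training-layout loudspeaker coordinates in metres (pos 9u and 9d coincide).
LOUDSPEAKERS = {
    1: (2.7, 4.4, 0.95), 2: (2.7, 2.8, 0.95), 3: (2.7, 1.2, 0.95),
    4: (4.3, 1.2, 0.95), 5: (5.9, 1.2, 0.95), 6: (8.3, 2.0, 0.95),
    7: (8.3, 3.6, 0.95), 8: (8.3, 5.2, 0.95), 9: (5.1, 3.6, 0.95),
}


def distances_from(pos):
    """Map each microphone number to its Euclidean distance (m) from pos."""
    source = LOUDSPEAKERS[pos]
    return {mic: math.dist(source, xyz) for mic, xyz in MICS.items()}


d = distances_from(1)
print(f"nearest: {min(d.values()):.2f} m, farthest: {max(d.values()):.2f} m")
```

These are straight-line distances only. They ignore the loudspeaker orientation and the room's reverberation.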

The relationships between the loudspeaker positions and the speaker identities can be found here. The whole training set is saved under the subdirectory named .\adhoc40-train\

Development and test data

The floor plan of the office room and the positions of the loudspeaker and microphones are shown below.

The red dot indicates the origin of the reference axes. The blue dots indicate the positions of the microphones, whose coordinates are listed in the upper-left corner. The positions and orientations of the loudspeaker are marked by loudspeaker icons. The term ‘pos’ is short for position. The term ‘mic’ is short for microphone.

Pos 1 to 4 were selected to replay 'test-clean' corpus and pos 5 to 8 were selected to replay 'dev-clean' corpus.

The specific coordinates of these 40 microphones (for development and test data) are shown in the tables below:

mic 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
x(m) 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.7 5.1 5.1 5.1 5.1
y(m) 6 5.2 4.4 3.6 2.8 2 1.2 0.4 6 5.2 4.4 3.6 2.8 2 1.2 0.4 6 5.2 4.4 3.6
z(m) 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
mic 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
x(m) 5.1 5.1 5.1 5.1 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9
y(m) 2.8 2 1.2 0.4 6 5.2 4.4 3.6 2.8 2 1.2 0.4 6 5.2 4.4 3.6 2.8 2 1.2 0.4
z(m) 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
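Read column by column, the development and test coordinates above form a regular grid of 5 x-values by 8 y-values. The sketch below regenerates the table from that pattern. It is an observation about the listed values with hypothetical names, and the table remains the authoritative source.

```python
# Grid values in metres, read off the development/test tables above.
X_COLUMNS = [8.3, 6.7, 5.1, 3.5, 1.9]                  # one x per block of 8 mics
Y_ROWS = [6.0, 5.2, 4.4, 3.6, 2.8, 2.0, 1.2, 0.4]      # repeated within each block
MIC_HEIGHT = 0.9


def dev_test_mic_positions():
    """Return {mic number: (x, y, z)} for the development/test layout."""
    positions = {}
    mic = 1
    for x in X_COLUMNS:
        for y in Y_ROWS:
            positions[mic] = (x, y, MIC_HEIGHT)
            mic += 1
    return positions


positions = dev_test_mic_positions()
print(positions[1], positions[40])
```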

The specific coordinates of the loudspeaker (for development and test data) are shown in the table below:

loudspeaker position 1 2 3 4 5 6 7 8
data category test test test test dev dev dev dev
x(m) 2.7 4.3 5.9 7.5 2.7 4.3 5.9 7.5
y(m) 1.2 1.2 1.2 1.2 5.2 5.2 5.2 5.2
z(m) 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95

The relationships between the positions of loudspeaker and the identities of speakers can be found here. The development set and test set were saved under the subdirectories named .\adhoc40-dev\ and .\adhoc40-test\ respectively.
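The split between loudspeaker positions and subsets can be written as a small lookup. This is a minimal sketch with hypothetical names. It only encodes what is stated above: pos 1 to 4 belong to the test set and pos 5 to 8 to the development set.

```python
# Development/test loudspeaker coordinates in metres, from the table above.
DEV_TEST_LOUDSPEAKERS = {
    1: (2.7, 1.2, 0.95), 2: (4.3, 1.2, 0.95),
    3: (5.9, 1.2, 0.95), 4: (7.5, 1.2, 0.95),
    5: (2.7, 5.2, 0.95), 6: (4.3, 5.2, 0.95),
    7: (5.9, 5.2, 0.95), 8: (7.5, 5.2, 0.95),
}


def subdirectory_for_position(pos):
    """Return the top-level subdirectory that holds recordings for pos."""
    if 1 <= pos <= 4:
        return "adhoc40-test"   # pos 1 to 4 replayed 'test-clean'
    if 5 <= pos <= 8:
        return "adhoc40-dev"    # pos 5 to 8 replayed 'dev-clean'
    raise ValueError(f"unknown dev/test loudspeaker position: {pos}")


for pos in sorted(DEV_TEST_LOUDSPEAKERS):
    print(pos, subdirectory_for_position(pos), DEV_TEST_LOUDSPEAKERS[pos])
```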

Ground-truth clean speech

After the installation of sound-absorbing materials, the net space of the anechoic chamber measures 11.8×4.2×3.8 m.

We replayed the clean speech of Librispeech (the ‘train-clean-100’, ‘dev-clean’, and ‘test-clean’ corpora) in the anechoic chamber to provide the ground-truth clean speech of Libri-adhoc40. The distance between the loudspeaker and the recording device is 40 cm. The sound volume of the loudspeaker was set to the same level as in the office room.
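The 40 cm spacing implies a small acoustic propagation delay in the ground-truth recordings, which may matter when aligning them with other signals. The sketch below estimates it under two assumptions that are not stated for the anechoic chamber: a speed of sound of 343 m/s and the same 16 kHz sampling rate as in the office room.

```python
SPEED_OF_SOUND = 343.0   # m/s, assumed (roughly room temperature)
SAMPLE_RATE = 16_000     # Hz, assumed equal to the office-room sampling rate
DISTANCE = 0.40          # m, loudspeaker to recording device (from the text)


def propagation_delay(distance_m, speed=SPEED_OF_SOUND, sample_rate=SAMPLE_RATE):
    """Return the acoustic delay as (seconds, samples) for a given distance."""
    seconds = distance_m / speed
    return seconds, seconds * sample_rate


delay_s, delay_samples = propagation_delay(DISTANCE)
print(f"{delay_s * 1e3:.2f} ms, {delay_samples:.1f} samples")
```

This covers the acoustic path only. Any latency in the playback and recording chain is not modeled.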

Download Link

The test data of Libri-adhoc40 is available for download at https://www.dropbox.com/s/3ph407rvr8bhg0e/adhoc40-test.rar?dl=0

The dev data of Libri-adhoc40 is available for download at https://www.dropbox.com/sh/xozyvr1bbybh3fi/AABLUwZxbKlJcpPgwfq-o4Mra?dl=0

The rest will be available soon.

Reference

Libri-adhoc40: A dataset collected from synchronized ad-hoc microphone arrays

Scaling sparsemax based channel selection for speech recognition with ad-hoc microphone arrays

Attention-based multi-channel speaker verification with ad-hoc microphone arrays

Deep ad-hoc beamforming based on speaker extraction for target-dependent speech separation

Multi-Channel Far-Field Speaker Verification with Large-Scale Ad-hoc Microphone Arrays
