A DBSCAN-Based Automation for Speech Onset Detection
Praditor is a speech onset detector that automatically finds the boundaries between silence and sound.
Praditor works for both single-onset and multi-onset audio files without any language limitation. It generates output as PointTiers in .TextGrid format.
- Onset/Offset Detection
- Silence Detection
Praditor also allows users to adjust parameters in the Dashboard to get better performance.
We have prepared test_audio.wav for you to give it a try.
Praditor is written and maintained by Tony, Liu Zhengyuan from the Centre for Cognitive and Brain Sciences, University of Macau.
If you have any questions about how to use Praditor or about the details of its algorithm, or if you want me to help you write additional scripts (e.g., to export audio files or Excel tables), feel free to contact me at [email protected] or [email protected].
File -> Read files... -> Select your target audio file
For onset/offset...
- Run: Apply the Praditor algorithm to the current audio
- Prev/Next: Go to the previous/next audio
- Read: Read time points from the current audio's .TextGrid results
- Clear: Clear the time points that are being displayed (without changing the .TextGrid)
- Onset/Offset: Show/Hide onsets/offsets
For parameters...
- Current/Default: Display the default parameters or the parameters for the current file
- Save: Save the displayed parameters as Current/Default
- Reset: Reset the displayed parameters to the values you last saved
On the menu...
- File > Read files... > Select an audio file
- Help > Parameters > Show quick instructions on how our parameters work
In case you want to zoom in/out
- Wheel ↑/Wheel ↓ to zoom-in/zoom-out in timeline
- Ctrl+Wheel ↑/Wheel ↓ to zoom-in/zoom-out (for Windows users)
- Command+Wheel ↑/Wheel ↓ to zoom-in/zoom-out (for Mac users)
The audio signal is first band-pass filtered to remove high- and low-frequency noise. Then, it is downsampled with a max-pooling strategy (i.e., the signal is cut into pieces and each piece is represented by its maximum value).
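The max-pooling step can be sketched as follows. This is a minimal illustration, not Praditor's actual code: the function name `max_pool_downsample` and the parameter `piece_len` are hypothetical, and taking the maximum of the absolute amplitude (rather than the signed value) is an assumption.

```python
def max_pool_downsample(signal, piece_len):
    """Cut `signal` into consecutive pieces of `piece_len` samples and
    represent each piece by its largest absolute amplitude.

    Assumption: absolute values are used; the last piece may be shorter.
    """
    if piece_len < 1:
        raise ValueError("piece_len must be at least 1")
    return [
        max(abs(x) for x in signal[i:i + piece_len])
        for i in range(0, len(signal), piece_len)
    ]


# Five samples pooled in pieces of two samples give three frames.
frames = max_pool_downsample([0.1, -0.5, 0.2, 0.3, -0.1], piece_len=2)
print(frames)
```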
Praditor runs DBSCAN on two-dimensional points, so how do we transform a 1-D audio signal into a 2-D array? Every two consecutive pieces (frames) are grouped into a point. The point has two dimensions: the previous frame and the next frame. Praditor then applies DBSCAN clustering to this point array. Noise points usually gather around (0, 0) because of their relatively small amplitudes.
At this point, the noise areas have been found, which means we have roughly pinpointed the probable locations of onsets (i.e., the target areas).
From here on, we no longer use the original amplitudes but their first derivatives. First-derivative thresholding is a common technique in other signal processing areas (e.g., ECG). It keeps the trend but removes the noisy ("spiky") part, which helps to improve performance.
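The 1-D to 2-D transformation can be sketched as below. The function name `to_points` is hypothetical, and pairing overlapping neighbours (frame i with frame i+1) is an assumption; the text does not say whether the pairs overlap.

```python
def to_points(frames):
    """Turn a 1-D list of frames into 2-D points (previous frame, next frame).

    Assumption: pairs overlap, so n frames give n - 1 points.
    """
    return list(zip(frames[:-1], frames[1:]))


# Quiet frames land near (0, 0); loud frames land far from the origin.
points = to_points([0.01, 0.02, 0.01, 0.80, 0.90])
print(points)
```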
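As a rough illustration, the absolute first derivative of a sampled signal can be approximated by the absolute difference between neighbouring samples. The function name `abs_first_derivative` is hypothetical, and a plain sample-to-sample difference is an assumption about how the derivative is computed.

```python
def abs_first_derivative(signal):
    """Absolute difference between each sample and the one before it.

    The result is one element shorter than the input.
    """
    return [abs(b - a) for a, b in zip(signal, signal[1:])]


# A flat stretch yields 0; a sudden jump yields a large value.
print(abs_first_derivative([0, 0, 0, 5, 4]))
```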
For every target area, we apply the following procedure:
- Set up a noise reference. Its mean absolute first derivative serves as the baseline.
- Set up a starting frame as the onset candidate (beginning with the frame right after the noise reference).
- Scan from the starting frame. We use kernel smoothing to decide whether the current frame (or, more precisely, the current kernel/window) is valid or invalid.
- When we gather enough valid frames, the exact frame/time point we stop is the answer we want. Otherwise, we move on to the next starting frame.
Before downsampling and clustering, a band-pass filter is applied to the original signal. The idea is that we do not need all the frequencies: the very high and very low frequency bands can be contaminated by noise.
What we need is the middle part, which has high contrast between silence and sound.
Note that LowPass must not exceed the highest valid frequency (half of the sample rate; see the Nyquist theorem).
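The Nyquist constraint can be checked before building the filter. The sketch below is only an illustration: the function name `check_band` and the parameter name `high_pass` (the lower cutoff of the band) are assumptions, and returning cutoffs normalised by the Nyquist frequency is just one common convention.

```python
def check_band(high_pass, low_pass, sample_rate):
    """Validate band-pass cutoffs (in Hz) against the Nyquist frequency.

    Returns both cutoffs as fractions of the Nyquist frequency.
    """
    nyquist = sample_rate / 2.0
    if not 0 < high_pass < low_pass:
        raise ValueError("need 0 < high_pass < low_pass")
    if low_pass > nyquist:
        raise ValueError(
            f"low_pass ({low_pass} Hz) exceeds the Nyquist frequency ({nyquist} Hz)"
        )
    return high_pass / nyquist, low_pass / nyquist


# At a 16 kHz sample rate the highest valid frequency is 8 kHz.
print(check_band(100, 4000, 16000))
```

Some filter-design routines require the upper cutoff to be strictly below the Nyquist frequency, so leaving a small margin is a safe choice.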
DBSCAN clustering requires two parameters: EPS and MinPt. DBSCAN visits every point, takes it as the center of a circle with radius EPS, and counts how many points fall within that circle. If the count reaches MinPt, the point is considered valid.
Praditor lets users adjust EPS%. Since audio files differ in amplitude level and silence-sound contrast, Praditor sets EPS = (the current audio's largest amplitude) * EPS%.
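The neighbour-counting step and the EPS scaling can be sketched as below. This is not a full DBSCAN (it omits cluster expansion) and it is not Praditor's implementation. The function name `dense_points` is hypothetical, the largest amplitude is taken from the points themselves, and each point counts itself as a neighbour; all three are assumptions.

```python
import math


def dense_points(points, eps_ratio, min_pt):
    """Flag every 2-D point that has at least `min_pt` points within EPS.

    EPS = largest amplitude * eps_ratio, where eps_ratio is EPS% as a fraction.
    Assumption: a point counts itself among its neighbours.
    """
    max_amp = max(max(abs(x), abs(y)) for x, y in points)
    eps = max_amp * eps_ratio
    flags = []
    for p in points:
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        flags.append(neighbours >= min_pt)
    return eps, flags


# Three quiet points near the origin are dense; the loud point is isolated.
pts = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (1.0, 0.9)]
eps, flags = dense_points(pts, eps_ratio=0.05, min_pt=3)
print(eps, flags)
```

Because EPS scales with the largest amplitude, the same EPS% adapts to both quiet and loud recordings.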
After Praditor has confirmed the target areas, the original amplitudes are transformed into absolute first derivatives. For each target area, Praditor sets up a Reference Area, whose mean value serves as the baseline for later thresholding.
The length of this reference area is determined by RefLen. If you want to capture very short silences, it is better to turn RefLen down a little as well.
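A minimal sketch of the baseline computation, assuming RefLen is counted in frames and the reference area ends right before the target area. The function name `reference_baseline` and the argument `ref_end` are hypothetical.

```python
def reference_baseline(abs_deriv, ref_end, ref_len):
    """Mean of the `ref_len` absolute-derivative values just before `ref_end`.

    Assumption: RefLen is a number of frames, clipped at the start of the signal.
    """
    start = max(0, ref_end - ref_len)
    reference = abs_deriv[start:ref_end]
    if not reference:
        raise ValueError("reference area is empty")
    return sum(reference) / len(reference)


# The two frames before index 4 form the reference area.
print(reference_baseline([1, 3, 2, 2, 10, 12], ref_end=4, ref_len=2))
```

A shorter RefLen needs fewer silent frames before the onset, which is why it suits short silences, at the cost of a noisier baseline estimate.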
Threshold is the most used parameter. The core idea of the thresholding method is "hitting the cliff": whenever a talker speaks, the (absolute) amplitude rises and creates a "cliff" (in amplitude or in other features).
Threshold has a minimum of 1.00, which corresponds to the mean value of the background-noise reference. However, background noise is not smooth but "spiky". That is why Threshold is usually set slightly larger than 1.00.
In addition, I suggest paying extra attention to aspirated sounds, because this type of sound has a very gentle slope. With too large a Threshold, the detected onset can end up in the middle of that slope, which is not what you want. In that case the audio can sound odd, starting like a burst rather than fading in gradually.
After the reference area and threshold are set, Praditor (1) sets up a starting frame and (2) scans frame by frame, beginning with the frame right after the reference area. It repeats this process until a valid starting frame (i.e., the onset) is found.
Usually, we would compare the value (the absolute first derivative) with the threshold: if it surpasses the threshold, we call it valid; if not, invalid. Praditor does this a little differently, using kernel smoothing. It borrows information from later frames by setting up a window (kernel) whose length is KernelSize.
To guard against extreme values, Praditor ignores the few largest values in the window (kernel). In other words, it retains only KernelFrm% of the frames (e.g., 80% of them). If there are extreme values, we successfully avoid them; if there are none, nothing is lost, since the discarded values are at a similar level to the rest.
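The kernel check might look like the sketch below. The function name `kernel_is_valid` is hypothetical, and two details are assumptions: the retained values are averaged, and that average is compared with Threshold times the baseline.

```python
def kernel_is_valid(abs_deriv, start, kernel_size, kernel_frm, baseline, threshold):
    """Decide whether the window starting at `start` is above the threshold.

    Only the smallest `kernel_frm` fraction of the window is kept, so the
    largest (possibly extreme) values are ignored.
    Assumption: the kept values are averaged and compared with
    threshold * baseline.
    """
    window = abs_deriv[start:start + kernel_size]
    if not window:
        return False
    keep = max(1, int(len(window) * kernel_frm))
    kept = sorted(window)[:keep]  # drop the largest values
    return sum(kept) / len(kept) > threshold * baseline


# A single spike (100) does not make a quiet window valid.
print(kernel_is_valid([1, 1, 1, 1, 100], 0, 5, 0.8, baseline=1.0, threshold=1.5))
# A window that is consistently above the threshold is valid.
print(kernel_is_valid([3, 3, 3, 3, 3], 0, 5, 0.8, baseline=1.0, threshold=1.5))
```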
How do we decide that an onset really is an onset? After a true onset, many consecutive frames stay above the threshold.
As mentioned above, while Praditor scans frame by frame (window by window, or kernel by kernel), each frame is either above or below the threshold. If the current frame (kernel) surpasses the threshold, it is valid and counts as +1; if it fails to surpass the threshold, it is invalid and counts as -1 * Penalty.
Praditor then adds these values up to get a running sum. Whenever the sum drops to zero or below, the scan aborts and we move on to the next starting frame. In other words, we only want a starting frame whose scanning sum stays positive.
Penalty here is like a "knob" for tuning noise sensitivity. Higher Penalty means higher sensitivity to below-threshold frames.
In summary, each scan has a starting frame (i.e., an onset candidate), and we check whether this starting frame is "valid". By "valid", we mean that the scanning sum stays positive and eventually reaches CountValid.
Then, we can say, this is the exact time point (onset/offset) we want.
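The scanning logic described above can be sketched as follows. This is an illustration under stated assumptions, not Praditor's actual code: the function name `find_onset` is hypothetical, the per-frame validity flags are taken as given, and the starting frame itself is returned as the onset.

```python
def find_onset(valid, count_valid, penalty):
    """Return the first starting frame whose running sum stays positive
    and reaches `count_valid`, or None if no such frame exists.

    `valid` holds one boolean per frame: True if the frame (kernel)
    surpassed the threshold, False otherwise.
    """
    for start in range(len(valid)):
        total = 0.0
        for is_valid in valid[start:]:
            total += 1.0 if is_valid else -penalty
            if total <= 0:
                break  # abort and try the next starting frame
            if total >= count_valid:
                return start
    return None


flags = [False, True, False, True, True, True, False, True]
# With a full penalty, the isolated valid frame at index 1 is rejected.
print(find_onset(flags, count_valid=3, penalty=1.0))
# With a milder penalty, the scan from index 1 survives the dip.
print(find_onset(flags, count_valid=3, penalty=0.5))
```

Raising Penalty makes a scan abort sooner on below-threshold frames, while raising CountValid demands a longer run of evidence before a candidate is accepted.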
If you would like to download the datasets that were used in developing Praditor, please refer to our OSF storage.