Praditor: A DBSCAN-Based Automation for Speech Onset Detection

Download Praditor | English · 中文


Features

Praditor is a speech onset detector that automatically finds the boundaries between silence and sound.

audio2textgrid.png

Praditor works for both single-onset and multi-onset audio files without any language limitation. It generates output as PointTiers in .TextGrid format.

  • Onset/Offset Detection
  • Silence Detection

Praditor also lets users adjust parameters in the Dashboard to improve detection performance.

We have prepared test_audio.wav for you to give it a try.

Authors

Praditor is written and maintained by Tony, Liu Zhengyuan from the Centre for Cognitive and Brain Sciences, University of Macau.

If you have any questions about how to use Praditor or about the details of its algorithm, or if you would like help writing additional scripts (e.g., to export audio files or Excel tables), feel free to contact me at [email protected] or [email protected].

How to use Praditor?

1. Import your audio

File -> Read files... -> Select your target audio file

import_audio.png

2. Play with Praditor

gui.png

For onset/offset...

  • Run: Apply the Praditor algorithm to the current audio
  • Prev/Next: Go to the previous/next audio
  • Read: Read time points from the current audio's .TextGrid results
  • Clear: Clear the time points being displayed (the .TextGrid is not changed)
  • Onset/Offset: Show/hide onsets/offsets

For parameters...

  • Current/Default: Display the default parameters or the parameters for the current file
  • Save: Save the displayed parameters as Current/Default
  • Reset: Reset the displayed parameters to their last saved values

On the menu...

  • File > Read files... > Select an audio file
  • Help > Parameters > Show a quick guide to how the parameters work

In case you want to zoom in/out

  • Wheel ↑/Wheel ↓ to zoom-in/zoom-out in timeline
  • Ctrl+Wheel ↑/Wheel ↓ to zoom-in/zoom-out (for Windows users)
  • Command+Wheel ↑/Wheel ↓ to zoom-in/zoom-out (for Mac users)

How does Praditor work?

The audio signal is first band-pass filtered to remove some high- and low-frequency noise. It is then downsampled with a max-pooling strategy (i.e., each piece of the signal is represented by its maximum value).

ds_maxp.png
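The max-pooling step can be sketched in a few lines of Python. This is only an illustration, not Praditor's actual code: the function name `max_pool_downsample` and the parameter `piece_len` are hypothetical, and taking the maximum of the absolute values (rather than of the raw samples) is an assumption.

```python
def max_pool_downsample(signal, piece_len):
    """Represent each consecutive piece of `piece_len` samples by one value.

    Assumption: the "max value" of a piece is the largest absolute amplitude.
    """
    if piece_len <= 0:
        raise ValueError("piece_len must be positive")
    return [
        max(abs(sample) for sample in signal[i:i + piece_len])
        for i in range(0, len(signal), piece_len)
    ]


# Toy signal: four quiet samples followed by four loud ones.
signal = [0.01, -0.02, 0.00, 0.03, 0.50, -0.80, 0.60, -0.10]
print(max_pool_downsample(signal, piece_len=2))  # [0.02, 0.03, 0.8, 0.6]
```

A trailing piece shorter than `piece_len` is still pooled, so no samples are dropped.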

Praditor runs DBSCAN on two-dimensional points, so how do we transform a 1-D audio signal into a 2-D array? Every two consecutive pieces are grouped into one point, whose two dimensions are the previous frame and the next frame. Praditor then applies DBSCAN clustering to this array of points. Noise points usually gather around (0, 0) because of their relatively small amplitudes.

DBSCAN_small.png
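As a rough sketch of the 1-D to 2-D transformation, the snippet below pairs each downsampled piece with its successor. The helper name `to_points` is hypothetical, and using overlapping pairs (piece i with piece i+1) is an assumption about how "two consecutive pieces" are grouped.

```python
def to_points(pooled):
    """Turn a 1-D sequence into 2-D points of the form (previous, next).

    Assumption: pairs overlap, so piece i appears as "next" in one point
    and as "previous" in the following one.
    """
    return list(zip(pooled[:-1], pooled[1:]))


pooled = [0.02, 0.03, 0.8, 0.6]
for point in to_points(pooled):
    print(point)
```

In this toy input only the first point, (0.02, 0.03), lies close to (0, 0); the others involve at least one loud piece and fall farther from the origin.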

At this point, the noise areas have been found, which means we have roughly pinpointed the probable locations of onsets (i.e., the target areas).

From here on, we no longer use the original amplitudes but their first derivatives. First-derivative thresholding is a common technique in other signal-processing areas (e.g., ECG). It keeps the trend but removes the noisy ("spiky") part, which helps to improve performance.

scan.png
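For illustration, an absolute first derivative can be approximated by the difference between neighbouring samples. The function name `abs_first_derivative` is hypothetical, and the simple forward difference is an assumption; the source does not say how the derivative is computed.

```python
def abs_first_derivative(signal):
    """Absolute difference between each sample and the one that follows it."""
    return [abs(nxt - cur) for cur, nxt in zip(signal, signal[1:])]


print(abs_first_derivative([0, 1, 3, 2]))  # [1, 2, 1]
```

The result is one element shorter than the input, because each value describes the change between two neighbouring samples.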

For every target area, we apply the following procedure:

  1. Set up a noise reference. Its mean absolute first derivative serves as the baseline.
  2. Set a starting frame as the onset candidate (beginning with the frame immediately after the noise reference).
  3. Scan from the starting frame. We use kernel smoothing to decide whether the current frame (or, more precisely, the current kernel/window) is valid or invalid.
  4. When we have gathered enough valid frames, the exact frame/time point where we stop is the answer we want. Otherwise, we move on to the next starting frame.

Parameters

HighPass/LowPass

Before downsampling and clustering, a band-pass filter is applied to the original signal. The idea is that we do not need all the frequencies: bands that are too high or too low can be contaminated by noise.

choose_freq.png

What we need is the middle part that has high contrast between silence and sound.

Note that LowPass must not exceed the highest valid frequency, which is half of the sample rate (see the Nyquist theorem).
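The Nyquist constraint can be checked before any filtering takes place. The sketch below is a hypothetical helper (`validate_band` is not part of Praditor); it assumes HighPass is the lower cutoff and LowPass the upper cutoff of the band-pass filter, both in Hz.

```python
def validate_band(high_pass, low_pass, sample_rate):
    """Check that the pass band is ordered and fits below the Nyquist frequency."""
    nyquist = sample_rate / 2  # highest frequency the sampled signal can represent
    if not 0 < high_pass < low_pass:
        raise ValueError("expected 0 < HighPass < LowPass")
    if low_pass > nyquist:
        raise ValueError(
            f"LowPass ({low_pass} Hz) exceeds the Nyquist frequency ({nyquist} Hz)"
        )
    return high_pass, low_pass


# A 44.1 kHz recording has a Nyquist frequency of 22050 Hz.
print(validate_band(200, 10000, sample_rate=44100))
```

For a 16 kHz recording the same call with LowPass set to 10000 would fail, since the Nyquist frequency is only 8000 Hz.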

EPS%

DBSCAN clustering requires two parameters: EPS and MinPt. DBSCAN visits every point, takes it as the center of a circle with radius EPS, and counts how many points fall within that circle. If the count reaches MinPt, the point is counted as valid (a core point).

DBSCAN.png
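The circle-and-count test described above can be written out directly. This is a minimal sketch of the core-point check only, not a full DBSCAN implementation and not Praditor's code; `is_core_point` is a hypothetical name, and counting the center point itself as a neighbour is an assumed convention.

```python
import math


def is_core_point(points, index, eps, min_pt):
    """Return True if at least `min_pt` points lie within `eps` of points[index].

    Assumption: the center point itself is included in the count.
    """
    cx, cy = points[index]
    neighbours = sum(
        1 for x, y in points if math.hypot(x - cx, y - cy) <= eps
    )
    return neighbours >= min_pt


# Three quiet points near (0, 0) and one loud outlier.
points = [(0.01, 0.02), (0.02, 0.01), (0.015, 0.015), (0.8, 0.6)]
print(is_core_point(points, 0, eps=0.05, min_pt=3))  # True
print(is_core_point(points, 3, eps=0.05, min_pt=3))  # False
```

The quiet points support each other and form a dense group, while the isolated loud point has no neighbours within EPS.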

Praditor lets users adjust EPS%. Since audio files differ in amplitude level and in silence-sound contrast, Praditor scales EPS to each file: EPS = Current Audio's Largest Amplitude * EPS%.
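As a small worked example of this formula, the sketch below computes EPS from a toy signal. The name `compute_eps` is hypothetical, and it assumes EPS% is given as a percentage (e.g., 10 means 10%) and that the "largest amplitude" is the largest absolute sample value.

```python
def compute_eps(signal, eps_percent):
    """EPS = largest absolute amplitude * EPS% (EPS% assumed to be a percentage)."""
    largest = max(abs(sample) for sample in signal)
    return largest * eps_percent / 100.0


# Largest amplitude is 0.8, so 10% of it gives an EPS of about 0.08.
print(compute_eps([0.1, -0.8, 0.5], eps_percent=10))
```

Because EPS is relative, the same EPS% yields a larger radius for a loud recording and a smaller one for a quiet recording.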

RefLen

After Praditor has confirmed the target areas, the original amplitudes are transformed into absolute first derivatives. For each target area, Praditor sets up a Reference Area, whose mean value serves as the baseline for later thresholding.

reflen.png

The length of this reference area is determined by RefLen. If you want to capture very short silences, it is better to turn RefLen down a little as well.
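The baseline computation might look like the following sketch. The function `reference_baseline` and its arguments are hypothetical; it assumes the reference area is the `ref_len` frames that end just before index `ref_end`, and that the input already holds absolute first derivatives.

```python
def reference_baseline(abs_deriv, ref_end, ref_len):
    """Mean of the absolute first derivatives inside the reference area.

    Assumption: the reference area covers the `ref_len` frames before `ref_end`.
    """
    start = max(0, ref_end - ref_len)
    reference = abs_deriv[start:ref_end]
    if not reference:
        raise ValueError("reference area is empty")
    return sum(reference) / len(reference)


abs_deriv = [1, 2, 3, 4, 10]
print(reference_baseline(abs_deriv, ref_end=4, ref_len=2))  # 3.5
```

A shorter `ref_len` uses fewer frames, which makes it possible to fit the reference area inside a short silence, at the cost of a noisier baseline estimate.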

Threshold

It is the most frequently used parameter. The core idea of the thresholding method is "hitting the cliff": whenever a talker starts to speak, the (absolute) amplitude rises and creates a "cliff" (in amplitude, or in other features).

threshold_possibly_close.png

Threshold has a lower limit of 1.00, which corresponds to the mean value of the background-noise reference. However, background noise is not smooth but "spiky", which is why Threshold is usually set slightly above 1.00.

asp_sound.png

In addition, I suggest paying extra attention to aspirated sounds, as this type of sound has a very gradual slope. If Threshold is too large, the detected onset can end up in the middle of that slope, which is not what you want. In that case, the audio can sound really weird, like a sudden burst rather than a gradual, smooth lead-in.

KernelSize, KernelFrm%

After the reference area and threshold are set, Praditor (1) sets up a starting frame and (2) begins scanning frame by frame (starting from the frame right next to the reference area). It repeats this process until a valid starting frame (i.e., the onset) is found.

The usual approach is to compare each value (absolute first derivative) with the threshold: if it surpasses the threshold, the frame is valid; if not, it is invalid. Praditor does this a little differently, using kernel smoothing. It borrows information from later frames by setting up a window (kernel) of length KernelSize.

kernel.png

To guard against extreme values, Praditor neglects the few largest values in the window (kernel). In other words, it retains only KernelFrm% of all frames (e.g., 80% of them). If the window does contain extreme values, they are successfully avoided; if not, nothing is lost, since the discarded values are at a similar level to the others.
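The windowing and trimming described above can be sketched as follows. The names `kernel_value` and `is_valid` are hypothetical, averaging the retained values is an assumption (the source does not say how the kernel is summarised), and comparing against `threshold * baseline` is inferred from Threshold being defined relative to the mean of the reference area.

```python
def kernel_value(abs_deriv, start, kernel_size, kernel_frm_percent):
    """Summarise a window after dropping its largest values.

    Keeps the smallest KernelFrm% of the frames and averages them
    (the averaging step is an assumption).
    """
    window = sorted(abs_deriv[start:start + kernel_size])
    if not window:
        raise ValueError("kernel window is empty")
    keep = max(1, int(len(window) * kernel_frm_percent / 100))
    kept = window[:keep]  # the largest values sit at the end and are dropped
    return sum(kept) / len(kept)


def is_valid(abs_deriv, start, baseline, threshold, kernel_size, kernel_frm_percent):
    """A frame is valid if its kernel value exceeds threshold * baseline."""
    value = kernel_value(abs_deriv, start, kernel_size, kernel_frm_percent)
    return value > threshold * baseline


spiky = [1.0, 1.0, 1.0, 1.0, 100.0]  # one extreme value in otherwise quiet frames
print(kernel_value(spiky, 0, kernel_size=5, kernel_frm_percent=80))  # 1.0
print(is_valid(spiky, 0, baseline=1.0, threshold=1.2,
               kernel_size=5, kernel_frm_percent=80))  # False
```

With 80% retained, the single spike of 100.0 is ignored, so the quiet window is not mistaken for speech.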

CountValid, Penalty

How do we decide that an onset really is an onset? After a true onset, many consecutive frames stay above the threshold.

As mentioned above, Praditor scans frame by frame (window by window, or kernel by kernel), and each frame is either above or below the threshold. If the current frame (kernel) surpasses the threshold, it is valid and counted as +1; if it fails to surpass it, it is invalid and counted as -1 * Penalty.

Praditor then adds these values up into a running sum. Whenever the sum hits zero or drops below zero, the scan aborts and we move on to the next starting frame. In other words, we only want a starting frame whose scanning sum stays positive.

Penalty here is like a "knob" for tuning noise sensitivity. Higher Penalty means higher sensitivity to below-threshold frames.

count_valid.png

In summary, each scan has a starting frame (i.e., an onset candidate). What we do is check whether this starting frame is valid. By "valid", we mean that the scanning sum stays positive and eventually reaches CountValid.

Then, we can say, this is the exact time point (onset/offset) we want.
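The scanning logic summarised above can be sketched as a pair of nested loops. This is an illustration under stated assumptions, not Praditor's implementation: `find_onset` is a hypothetical name, the input is a precomputed list of valid/invalid flags (one per frame or kernel), and returning the starting frame as the onset follows the summary above.

```python
def find_onset(valid_flags, count_valid, penalty):
    """Return the first starting frame whose running sum reaches count_valid.

    Each valid frame adds +1 and each invalid frame adds -1 * penalty.
    The scan for a starting frame aborts once the sum is zero or below.
    Returns None if no starting frame qualifies.
    """
    for start in range(len(valid_flags)):
        total = 0.0
        for flag in valid_flags[start:]:
            total += 1 if flag else -penalty
            if total <= 0:
                break  # abort this scan and try the next starting frame
            if total >= count_valid:
                return start
    return None


flags = [False, True, False, True, True, True, True]
print(find_onset(flags, count_valid=3, penalty=1))    # 3
print(find_onset(flags, count_valid=3, penalty=0.5))  # 1
```

With Penalty at 1, the single below-threshold frame at index 2 cancels the valid frame before it, so the onset is placed at frame 3. With Penalty at 0.5, that dip is tolerated and frame 1 is accepted instead.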

Data and Materials

If you would like to download the datasets that were used in developing Praditor, please refer to our OSF storage.