Time Delay Estimation / Speaker Tracking

Here, we distinguish speaker localization from speaker tracking. Speaker localization is based only on an instantaneous estimate of the speaker's position. Speaker tracking may likewise use single observation windows on the order of 25 milliseconds, but it combines multiple observations in order to track the speaker's trajectory over frames.

GCC-PHAT Time Delay Estimation

Many conventional speaker localization techniques use the time delay of arrival (TDOA) estimated with multiple sensors. Among TDOA estimation methods [CBH06], the most computationally efficient is arguably the phase transform (PHAT) [CAR81] [OS94] [DSB01] [AWPA05], which belongs to the family of generalized cross-correlation (GCC) methods. The TDOA can be obtained efficiently by taking the time delay associated with the maximum of the PHAT-weighted cross-correlation, computed through the FFT. It was also shown in [DSB01] that the GCC-PHAT weighting improves robustness against noise and reverberation. The GCC-PHAT TDOA estimator can be easily implemented with PHATFeature and TDOAFeature. A sample script is provided as unit_test/test_tdoa_estimator.py in the repository.
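
For illustration only, the following NumPy sketch shows the core GCC-PHAT computation between two signal frames. It is not the BTK implementation (which uses the PHATFeature and TDOAFeature classes mentioned above), and the frame length, sampling rate and maximum delay in the usage comment are arbitrary assumptions.

import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    # Estimate the TDOA (in seconds) between two frames with GCC-PHAT.
    n = len(x1) + len(x2)                       # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cps = X1 * np.conj(X2)                      # cross-power spectrum
    cps /= np.abs(cps) + 1e-12                  # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cps, n=n)                 # GCC-PHAT function
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = np.argmax(np.abs(cc)) - max_shift   # lag (in samples) of the correlation peak
    return delay / fs

# e.g. for two 25 ms frames sampled at 16 kHz:
# tau = gcc_phat_tdoa(frame_ch1, frame_ch2, fs=16000, max_tau=0.001)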

The following command performs GCC-PHAT-based speaker localization.

$ cd ${YOUR_GIT_REPOSITORY}/unit_test
$ python test_tdoa_estimator.py \
      -c confs/gcc_phat_tdoae.json \
      -i data/CMU/R1/M1005/KINECT/RAW/segmented/U1001_1M_16k_b16_c1.wav \
         data/CMU/R1/M1005/KINECT/RAW/segmented/U1001_1M_16k_b16_c2.wav \
         data/CMU/R1/M1005/KINECT/RAW/segmented/U1001_1M_16k_b16_c3.wav \
         data/CMU/R1/M1005/KINECT/RAW/segmented/U1001_1M_16k_b16_c4.wav \
      -o out/U1001_1M_sl

This will output three files: a TDOA estimate file, a speaker trajectory position file and an average position file. The extensions of these files will be “.tdoa.json”, “.trj.pos.json” and “.ave.pos.json”, respectively.

Direct Kalman Filtering Speaker Tracking

In the case of the simplest array geometry, a linear array, it is trivial to calculate the direction of arrival (DOA) from a TDOA measurement. For more complicated geometries, however, it is not so straightforward. Computing a source position requires the spherical intersection estimator [SR87], the spherical interpolation estimator [AS87], the linear intersection estimator [BRA95] or the propagation vector method [YKA96]. These methods fall into the category of speaker localization techniques, inasmuch as they return an instantaneous position estimate that is then smoothed into a position trajectory. In other words, speaker localization and tracking are treated as two separate problems.
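
For reference, with a far-field source and a pair of microphones on a linear array, the DOA follows directly from the TDOA, as in the short sketch below; the spacing and delay values are assumed purely for illustration.

import math

C = 343.0      # speed of sound in m/s
d = 0.15       # microphone spacing in meters (assumed value)
tau = 0.0002   # estimated TDOA in seconds (assumed value)

# Far-field assumption: the wavefront is planar, so the extra path length
# between the two microphones is d * sin(theta) = C * tau.
theta = math.asin(max(-1.0, min(1.0, C * tau / d)))
print(math.degrees(theta))   # DOA relative to broadside, in degrees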

Instead, it is possible to adopt a unified approach whereby the time series of the speaker's positions is estimated without recourse to any intermediate localization step [KGM06]. Klee, Gehrig and McDonough showed that the TDOAs and the speaker position can be estimated simultaneously within an iterated extended Kalman filtering (IEKF) framework. The BTK implements the IEKF-based speaker tracking algorithm as well as an EKF-based tracker as a reference.
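
Conceptually, one IEKF step predicts the speaker position with a simple motion model and then repeatedly relinearizes the TDOA observation model around the refined estimate. The following standalone sketch illustrates such a step for a random-walk motion model; it is a schematic only, not the BTK implementation, and the noise parameters merely echo the sigmaU2 and sigmaV2 keys of the configuration file described below.

import numpy as np

C = 343.0  # speed of sound in m/s

def tdoa_model(x, mics, pairs):
    # Predicted TDOAs (in seconds) between microphone pairs for source position x.
    d = np.linalg.norm(mics - x, axis=1)
    return np.array([(d[i] - d[j]) / C for i, j in pairs])

def tdoa_jacobian(x, mics, pairs):
    # Derivative of each predicted TDOA with respect to the source position x.
    diff = x - mics
    d = np.linalg.norm(diff, axis=1)
    unit = diff / d[:, None]
    return np.array([(unit[i] - unit[j]) / C for i, j in pairs])

def iekf_step(x_est, P, z, mics, pairs, sigma_u2=10.0, sigma_v2=4e-4, n_iter=3):
    # One prediction/update step of an iterated EKF with a random-walk motion model.
    x_pred = x_est                      # random walk: the predicted mean is unchanged
    P_pred = P + sigma_u2 * np.eye(3)   # process noise inflates the covariance
    R = sigma_v2 * np.eye(len(pairs))   # measurement noise covariance
    x_i = x_pred
    for _ in range(n_iter):             # local iterations: relinearize around x_i
        H = tdoa_jacobian(x_i, mics, pairs)
        K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
        x_i = x_pred + K @ (z - tdoa_model(x_i, mics, pairs) - H @ (x_pred - x_i))
    P_new = (np.eye(3) - K @ H) @ P_pred
    return x_i, P_new

# Per frame: x_est, P = iekf_step(x_est, P, measured_tdoas, mic_positions, pair_ids)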

To set up the IEKF-based speaker tracker, create the following JSON file:

$ cd ${YOUR_GIT_REPOSITORY}/unit_test
$ cat confs/iekfst.json
{
    "array_type": "linear",
    "microphone_positions": [[-113.0, 0.0, 2.0], [36.0, 0.0, 2.0], [76.0, 0.0, 2.0], [113.0, 0.0, 2.0]],
    "tracker": {
        "type": "iekf",
        "shiftlen": 4096,
        "fftlen": 8192,
        "energy_threshold": 100,
        "minimum_pairs": 3,
        "cc_threshold": 0.11,
        "boundaries": [[-3.141592653589793, 3.141592653589793], [-3.141592653589793, 3.141592653589793], [-3.141592653589793, 3.141592653589793]],
        "sigmaV2": 0.0004,
        "sigmaK2": 10000000000.0,
        "gate_prob": 0.95,
        "initial_estimate": [0],
        "sigmaU2": 10.0,
        "num_iterations": 3,
        "iteration_threshold": 1e-4,
        "pair_ids": [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    }
}

This JSON file is saved as unit_test/confs/iekfst.json in the BTK Git repository. The configuration file specifies the parameters for IEKF-based speaker tracking. Table 3 describes what each parameter controls; see Sec. 4.3.5 in [WM09] for the relationship between the script parameters and the IEKF equations. A short example of adjusting these parameters from Python follows the table.

Table 3 JSON Parameters for IEKF-based speaker tracking

JSON key                  What does this value mean?
["array_type"]            Array type: "linear", "circular" or "nf"
["microphone_positions"]  List of the coordinates of each microphone position
["tracker"]               Sub-object holding the tracker parameters listed below
["type"]                  Type of tracking algorithm
["shiftlen"]              Frame shift of the analysis window
["fftlen"]                FFT length of the analysis window
["energy_threshold"]      Skip a frame if the signal energy is below this threshold
["pair_ids"]              List of channel index pairs used for the GCC computation
["cc_threshold"]          Ignore a frame if the GCC between two channels is below this value
["minimum_pairs"]         Minimum number of microphone pairs whose GCC must exceed the threshold
["boundaries"]            Search range for each position component
["sigmaV2"]               Variance of the measurement noise covariance matrix
["sigmaU2"]               Variance of the process noise covariance matrix
["sigmaK2"]               Initial Kalman gain
["gate_prob"]             Probability that the correct observation falls within the gate
["initial_estimate"]      Initial position estimate vector
["num_iterations"]        Number of local iterations of the IEKF
["iteration_threshold"]   Quit the local iterations if the change in the estimate falls below this value
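
If you want to experiment with these parameters programmatically, the configuration can be loaded, modified and written back with the standard json module. The snippet below is only a convenience sketch; the alternative parameter values and the output file name are assumptions, not recommendations.

import json

# Load the configuration shown above and derive a variant for a quick experiment.
with open("confs/iekfst.json") as fp:
    config = json.load(fp)

config["tracker"]["num_iterations"] = 5   # assumed alternative value
config["tracker"]["sigmaU2"] = 1.0        # assumed alternative value

with open("confs/iekfst_experiment.json", "w") as fp:  # hypothetical output file
    json.dump(config, fp, indent=4)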

Then, you can run IEKF source tracking by passing the JSON file to the script unit_test/test_source_tracking.py:

$ cd ${YOUR_GIT_REPOSITORY}/unit_test
$ python test_source_tracking.py \
       -c confs/iekfst.json \
       -i data/CMU/R1/M1005/KINECT/RAW/segmented/U1001_1M_16k_b16_c1.wav \
          data/CMU/R1/M1005/KINECT/RAW/segmented/U1001_1M_16k_b16_c2.wav \
          data/CMU/R1/M1005/KINECT/RAW/segmented/U1001_1M_16k_b16_c3.wav \
          data/CMU/R1/M1005/KINECT/RAW/segmented/U1001_1M_16k_b16_c4.wav \
       -o out/U1001_1M_iekf

As with the GCC-PHAT speaker localizer, the script will generate three result files: a TDOA estimate file, a speaker trajectory position file and an average position file. The file extensions will be “.tdoa.json”, “.trj.pos.json” and “.ave.pos.json”, respectively.
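
The exact schema of these JSON files is determined by the scripts themselves; a quick way to inspect a result is simply to load and pretty-print it. The path below is an assumption, formed from the -o prefix used in the command above plus the ".ave.pos.json" extension; adjust it to your actual output file.

import json

# Assumed output path: the "-o" prefix from the command above with the
# ".ave.pos.json" extension appended.
with open("out/U1001_1M_iekf.ave.pos.json") as fp:
    result = json.load(fp)

print(json.dumps(result, indent=2))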

SRP-PHAT Source Localization