One of our “northstar” projects at cgn.ai is indexing the podcast
universe, and more generally any data of spoken words. An important piece in our
pipeline is speaker segmentation (diarization), i.e. the segmentation of an audio
track into different speaker tracks, answering *who* spoke *when*.

Along those lines we explored different approaches to this problem,
including building a probabilistic speaker segmentation model in *Gen*,
a general-purpose probabilistic programming system embedded in Julia.

The notebook can be found on our public repo at:

We define a model that generates speaker similarity matrices of a given size. In practice such matrices are constructed by creating voice embeddings $X$ for a given wav file and computing their pairwise similarities, i.e.

\[D = X X^T.\]

To be more precise, let $w_i$ and $w_j$ be two audio snippets from a given wav file and let $x_i$ and $x_j$ denote their vector embeddings; then the $ij$-th entry is given by

\[d_{i,j} = \langle x_i , x_j \rangle.\]

Here is a real-world example of such a similarity matrix:
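As a quick sketch of this construction: if the embeddings are stacked as rows of $X$ and each row is normalized to unit length, $D$ is just the Gram matrix $X X^T$. The helper name below is illustrative, not from the notebook:

```julia
# Illustrative sketch (not from the notebook): build a similarity
# matrix from per-snippet voice embeddings.
# `embeddings` is a T×d matrix, one row per audio snippet.
function similarity_matrix(embeddings::AbstractMatrix{<:Real})
    # Normalize each row so that d_ii = ⟨x_i, x_i⟩ = 1.
    X = embeddings ./ sqrt.(sum(abs2, embeddings, dims = 2))
    # d_ij = ⟨x_i, x_j⟩ for all pairs at once.
    return X * X'
end

D = similarity_matrix(randn(5, 16))
```

With unit-norm rows, the diagonal of `D` is 1 and every entry is an inner product of unit vectors, i.e. a cosine similarity in $[-1, 1]$.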

In contrast, here are a few samples from our probabilistic model:

### Inference - Reversible Jump MCMC:

### Excerpt from the notebook:

```
"""
    matrix_model(M::Int, T::Int, k::Int, poisson_rate::Float64)

Speaker similarity-matrix model.

# Arguments
- `M`: max number of speakers
- `T`: size of the matrix (or length of the track)
- `k`: size of the observation band around the diagonal
- `poisson_rate`: parameter shaping the `:num_segments` distribution
"""
@gen function matrix_model(M::Int, T::Int, k::Int, poisson_rate::Float64)
    #
    # Sample the number of segments `N` and
    # their lengths `ls`, and compute
    # their endpoints `ps`.
    #
    N = {:num_segments} ~ poisson_plus_one(poisson_rate)
    ls = {:len_segments} ~ dirichlet(N, 2.0)
    ps = cumsum(ls)
    ps[end] = 1.0  # Correct a numerical issue: `ps[end]` might be 0.999999
    #
    # Sample speaker ids for each segment.
    #
    ids = [{(:id, i)} ~ categorical(ones(M) ./ M) for i = 1:N]
    #
    # Map speaker ids to individual ticks.
    #
    xs = collect(0:1/(T-1):1)
    I = [findfirst(p -> p >= x, ps) for x in xs]
    ys = ids[I]
    #
    # Sample entries of the similarity matrix. We distinguish
    # 3 different cases:
    #   - diagonal entries
    #   - same speaker
    #   - different speaker
    #
    # TODO: One could extend this by varying the values based on
    # speaker similarity.
    #
    D = zeros(T, T)
    for i in 1:T, j in max(i - k, 1):min(i + k, T)
        if i == j
            # The voice embeddings are normalized, so the
            # diagonal entries are 1.
            # TODO: Don't really need a random choice here ...
            D[i, j] = {:D => (i, j)} ~ normal(1, sqrt(0.001))
        else
            # Off-diagonal entries for the same speaker ramp down
            # from 0.8 towards 0.5.
            off = 6
            r = 1 - min(abs(i - j), off) / off
            r = r^2
            if ys[i] == ys[j]
                D[i, j] = {:D => (i, j)} ~ normal(r * 0.8 + (1 - r) * 0.5, sqrt(0.01))
            else
                D[i, j] = {:D => (i, j)} ~ normal(0.15, sqrt(0.007))
            end
        end
    end
    return Dict(
        :T => T,
        :N => N,
        :ids => ids,
        :xs => xs,
        :ys => ys,
        :D => D,
        :ps => ps,
        :ls => ls)
end;
```
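The off-diagonal means in the model above follow a simple quadratic ramp in the distance to the diagonal. Isolated as a pure function (the name `offdiag_mean` is ours, for illustration only), the mean of an entry reads:

```julia
# Illustrative sketch (helper name is ours, not from the notebook):
# the mean of an off-diagonal entry D[i, j] in the model above.
function offdiag_mean(i::Int, j::Int, same_speaker::Bool; off::Int = 6)
    # Quadratic ramp: r = 1 at |i - j| = 0, and r = 0 once |i - j| >= off.
    r = (1 - min(abs(i - j), off) / off)^2
    # Same speaker: interpolate from 0.8 (near the diagonal) down to 0.5.
    # Different speaker: a flat, low similarity around 0.15.
    return same_speaker ? r * 0.8 + (1 - r) * 0.5 : 0.15
end
```

With Gen in scope, a prior sample from the model itself can be drawn with `Gen.simulate(matrix_model, (M, T, k, poisson_rate))`, and the returned dictionary read off the trace via `Gen.get_retval`.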