.. -*- coding: utf-8 -*-

.. image:: https://img.shields.io/badge/docs-stable-yellow.svg
   :target: https://www.idiap.ch/software/bob/docs/bob/bob.pyannote/stable/index.html
.. image:: https://img.shields.io/badge/docs-latest-orange.svg
   :target: https://www.idiap.ch/software/bob/docs/bob/bob.pyannote/master/index.html
.. image:: https://gitlab.idiap.ch/bob/bob.pyannote/badges/master/build.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote/commits/master
.. image:: https://gitlab.idiap.ch/bob/bob.pyannote/badges/master/coverage.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote/commits/master
.. image:: https://img.shields.io/badge/gitlab-project-0000c0.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote
.. image:: https://img.shields.io/pypi/v/bob.pyannote.svg
   :target: https://pypi.python.org/pypi/bob.pyannote


==========================================
 Bob's wrapper for the pyannote framework
==========================================

This package is part of the signal-processing and machine learning toolbox Bob_.
It provides the scripts and the documentation for a speaker diarization
pipeline that relies on the pyannote_ framework.

Essentially, this package is a set of helper scripts and documentation
on how to run and reproduce diarization experiments using the pyannote_
diarization framework. The documentation lets you set up the pyannote_
environment to reproduce the training and evaluation of several pyannote_-based
models for voice activity detection (VAD), speaker change detection (SCD),
and embeddings (EMB). We also provide several scripts for computing various
metrics that measure the performance of the trained models.

In short, this package can be considered a one-stop summary of the
diarization experiments with pyannote_.


Installation
============

First install Conda_ and create a new environment::

  $ conda create --name pyannote python=3.6
  $ conda activate pyannote

Then, follow Bob's `installation`_ instructions and install this
package::

  $ buildout

Depending on which speaker database is used in the experiments, the
installation instructions may vary, but a necessary step is to
install the pyannote.audio_ package together with the torch_ framework::

  $ conda install pip
  $ conda install pytorch torchvision -c pytorch

Then, install the pyannote.audio_ package::

  $ git clone https://github.com/pyannote/pyannote-audio.git
  $ cd pyannote-audio
  $ git checkout develop
  $ pip install .
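
If the installation went well, the pyannote_ command-line tools used below
should be available in the environment; as a quick sanity check, the following
should print a usage message::

  $ pyannote-speech-detection --help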


Report of the experiments
=========================

The PDF report describing the experiments in more detail and discussing the
evaluation results is available in the ``bob/pyannote/report`` folder.

Running the diarization experiments
===================================

We provide the database configuration file ``database.yml`` inside the
``bob/pyannote/config`` folder; it can be placed in ``~/.pyannote/`` to tell
the pyannote.audio_ package how to read and process the databases that we
present in the next section.
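
For example, assuming the commands are run from the root of this package
(a minimal sketch; adjust the paths to your checkout), the file can be put
in place with::

  $ mkdir -p ~/.pyannote
  $ cp ./bob/pyannote/config/database.yml ~/.pyannote/database.yml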


Databases
---------

In our experiments, we have used the following databases:

- CallHomeSRE_, a subset of NIST SRE 2000 with speakers of different languages,
  which was used as a standalone database for training.
- CallHome_, the English subset of the CABank corpora, which was used for
  training, validation (development subset), and the selection of
  hyper-parameters (development subset).
- SmallMeta - a smaller collection of databases, including CallHomeSRE_
  (full database), CallHome_ (train subset), LibriSpeech_ (other-train subset),
  and AMI_ (train subset). This collection was used for training different
  models of the diarization pipeline.
- LargeMeta - a larger collection of databases, including SmallMeta
  plus REPERE_ (train subsets of phases 1 and 2), ESTER (train subsets of
  ESTER_ and ESTER2_), and LibriSpeech_ (clean-train subset). This collection
  was used for training different models of the diarization pipeline.
- AMI_ corpus, which was used for model validation (development subset) and
  the selection of hyper-parameters (development subset).
- DIHARD_ database from the DIHARD2 challenge of 2019. This dataset has
  several subsets of challenging speech recordings of different types, including
  restaurant conversations, discussions with children, and recordings of
  clinical trials. It is one of the more challenging datasets as of 2019.
- ODESSA_ VoIP database, a database of multi-party speech data over Internet
  telephony. The whole database was used to test all of the above pre-trained
  models and hyper-parameters.

Almost all of these databases have a pyannote_ database package available,
which makes them easy to use (once the data itself has been obtained
separately from each database's copyright holder). Here is the list of
corresponding database packages that can be installed (see the installation
sketch after the list):

- CallHomeSRE:
  `https://github.com/yinruiqing/pyannote-db-callhome <https://github.com/yinruiqing/pyannote-db-callhome>`_
- CallHome:
  `https://github.com/hbredin/pyannote-db-callhome <https://github.com/hbredin/pyannote-db-callhome>`_
- LibriSpeech:
  `https://github.com/ironiksk/pyannote-db-librispeech <https://github.com/ironiksk/pyannote-db-librispeech>`_
- ESTER:
  `https://github.com/pyannote/pyannote-db-ester <https://github.com/pyannote/pyannote-db-ester>`_
- REPERE:
  `https://github.com/pyannote/pyannote-db-repere <https://github.com/pyannote/pyannote-db-repere>`_
- AMI:
  `https://github.com/pyannote/pyannote-db-odessa-ami <https://github.com/pyannote/pyannote-db-odessa-ami>`_
Pavel KORSHUNOV's avatar
Pavel KORSHUNOV committed
124
125
126
- ODESSA:
  `https://github.com/pyannote/pyannote-db-odessa-ip <https://github.com/pyannote/pyannote-db-odessa-ip>`_
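
These are regular Python packages, so, as a minimal sketch (assuming the conda
environment created above is active), each one can be installed directly from
its repository with pip, for example::

  $ pip install git+https://github.com/hbredin/pyannote-db-callhome.git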
  

The database configuration file ``database.yml`` inside ``bob/pyannote/config``
lists all the databases we use; it can be placed in ``~/.pyannote/`` so that the
pyannote.audio_ package knows how to read and process these databases.
Just make sure you indicate the correct paths to the actual database speech files.

Training and validating the models
----------------------------------

For training any of the models, it is practically necessary to have a GPU, so
we assume that the training commands are run in an environment with a GPU
available and the NVIDIA drivers installed.
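
As a quick sanity check (a sketch, not part of this package), you can verify
that the drivers and PyTorch actually see the GPU before launching a long
training run::

  $ nvidia-smi
  $ python -c "import torch; print(torch.cuda.is_available())"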

Any model training with pyannote_ requires a configuration file that
describes the components and parameters of the training. The templates for
the voice activity detection (VAD), speaker change detection (SCD), and
embedding (EMB) models are available inside the ``bob/pyannote/config``
folder. For instance, ``bob/pyannote/config/vad/config.yml`` contains the
configuration for VAD model training.

VAD training
------------

Once we have the folder for VAD training and the ``config.yml`` placed inside it,
the command for training the voice activity detection model is as follows
(assuming here that we are training on the train set of the protocol
``X.SpeakerDiarization.MetaProtocolSAD``, which is described in the database
config ``bob/pyannote/config/database.yml``)::

  $ pyannote-speech-detection train --gpu --to=1000 ./bob/pyannote/config/vad X.SpeakerDiarization.MetaProtocolSAD

The training will be done for 1000 epochs and the models for all of the epochs
will be saved inside the folder
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/weights``.

To validate the trained models on the validation/development set of a
database/protocol (or the same one) and to select the model/epoch that
gives the best detection error rate on the validation set, run the following::

  $ pyannote-speech-detection validate --gpu --from=0 --to=1000 --every=5 --chronological  ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset

The validation will be done on every 5th trained model and the best one
will be written (along with other parameters) inside the file
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/params.yml``.
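
To see which epoch was selected and the associated parameters, you can simply
inspect that file (a sketch; the exact fields depend on the pyannote.audio_
version)::

  $ cat ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/params.yml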


SCD training
------------

Similarly to the VAD model, the commands for training and validating speaker change detection (SCD) are as follows::

  $ pyannote-change-detection train --gpu --to=1000 ./bob/pyannote/config/scd X.SpeakerDiarization.MetaProtocolSAD
  $ pyannote-change-detection validate --gpu --from=0 --to=1000 --every=5 --chronological  ./bob/pyannote/config/scd/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset

Training and validation for SCD are the same as for VAD.


Training embeddings
-------------------

Training and validating the embedding models is similar to VAD and SCD.
All the differences are inside the ``config.yml`` files; for embeddings the
template is located in ``bob/pyannote/config/emb/config.yml``::

  $ pyannote-speaker-embedding train --gpu --to=1000 ./bob/pyannote/config/emb X.SpeakerDiarization.MetaProtocolSAD
  $ pyannote-speaker-embedding validate --gpu --from=0 --to=1000 --every=5 --chronological  ./bob/pyannote/config/emb/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset


Applying the pre-trained models
-------------------------------

The pre-trained and validated models can be used to compute the speech
regions (VAD task) or the speaker changes (SCD task), or to cluster the
embeddings (EMB task), for the test set of the desired database.

Assuming we did the training and validation for VAD as in the section above,
the command to apply the selected model to, say, the DIHARD evaluation data
would be as follows::

  $ pyannote-speech-detection apply ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development DIHARD2.SpeakerDiarization.All

The command will look inside
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development``
for the ``params.yml`` file, which specifies the selected model located inside the
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/weights`` folder.
The model will be applied to all files in the test set of the DIHARD data and the
results will be written inside
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/apply/model_number/``.

Note that the scores with the ``Detection Error Rate`` are available in the
``DIHARD2.SpeakerDiarization.All.test.eval`` file inside the results folder.
A corresponding RTTM file with the detected speech for all the files from the
test set of the DIHARD dataset is also available.
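
As a sketch, the results can be inspected directly from the shell
(``model_number`` below is a placeholder for the epoch number of the selected
model)::

  $ ls ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/apply/model_number/
  $ cat ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/apply/model_number/DIHARD2.SpeakerDiarization.All.test.eval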

For SCD and EMB, the procedure is similar. Please run
``pyannote-change-detection --help`` and ``pyannote-speaker-embedding --help`` for more details.

Additional scripts
------------------

The folders ``./bob/pyannote/scripts`` and ``./bob/pyannote/bash_scripts`` contain
additional miscellaneous scripts for computing scores in batch and for running
the above training and application commands in batch for the different evaluation
scenarios described in the accompanying report in ``./bob/pyannote/report``.

Converting NIST SRE Sphere files to WAV format
----------------------------------------------

First convert the Sphere files of the database to WAV files and then make sure
they are sampled at 16 kHz.

* Convert the Sphere files to WAV format. The best way to do it is by using
  NIST/LDC's own sph2pipe software. Other approaches (we tried SPHFile,
  SoundFile, and audioop) do not always work, mostly because of the confusing
  headers that Sphere files can have in different LDC datasets. So, download
  and compile sph2pipe v2.5 and then run the provided
  ``./bob/pyannote/bash_scripts/convert_sph2wav.sh`` script.

.. code:: sh

  $ ./convert_sph2wav.sh $(find path/to/sphere/formatted/database/ -name '*.sph')

* Use ffmpeg to upsample the speech to 16 kHz and convert it to a single channel
  using the script ``./bob/pyannote/bash_scripts/convert_audio.sh``.

.. code:: sh

  $ ./convert_audio.sh $(find /path/to/converted/database/ -name '*.wav')
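
Under the hood, such a conversion typically amounts to a single ffmpeg call per
file; a rough sketch of the equivalent command (the provided script may differ
in its details) is:

.. code:: sh

  $ ffmpeg -i input.wav -ar 16000 -ac 1 output.wav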


Contact
=======

For questions or to report issues with this software package, contact our
development `mailing list`_.


.. Place your references here:
.. _idiap: http://www.idiap.ch
.. _bob: http://www.idiap.ch/software/bob
.. _installation: https://www.idiap.ch/software/bob/install
.. _mailing list: https://www.idiap.ch/software/bob/discuss
.. _pyannote: https://github.com/pyannote/
.. _pyannote.audio: https://github.com/pyannote/pyannote-audio/
.. _torch: http://torch.ch/
.. _conda: https://docs.conda.io/en/latest/
.. _CallHomeSRE: https://catalog.ldc.upenn.edu/LDC2001S97
.. _CallHome: https://ca.talkbank.org/browser/index.php?url=CallHome/eng/
.. _Librispeech: http://www.openslr.org/12
.. _AMI: http://groups.inf.ed.ac.uk/ami/download/
.. _REPERE: https://www.aclweb.org/anthology/L12-1410/
.. _ESTER: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0241/
.. _ESTER2: http://catalog.elra.info/en-us/repository/browse/ELRA-S0338/
.. _DIHARD: https://coml.lscp.ens.fr/dihard/index.html
.. _ODESSA: https://www.idiap.ch/dataset/odessa/