.. -*- coding: utf-8 -*-

.. image:: https://img.shields.io/badge/docs-stable-yellow.svg
   :target: https://www.idiap.ch/software/bob/docs/bob/bob.pyannote/stable/index.html
.. image:: https://img.shields.io/badge/docs-latest-orange.svg
   :target: https://www.idiap.ch/software/bob/docs/bob/bob.pyannote/master/index.html
.. image:: https://gitlab.idiap.ch/bob/bob.pyannote/badges/master/build.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote/commits/master
.. image:: https://gitlab.idiap.ch/bob/bob.pyannote/badges/master/coverage.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote/commits/master
.. image:: https://img.shields.io/badge/gitlab-project-0000c0.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote
.. image:: https://img.shields.io/pypi/v/bob.pyannote.svg
   :target: https://pypi.python.org/pypi/bob.pyannote


======================================
 Bob's wrapper for pyannote framework
======================================

This package is part of the signal-processing and machine learning toolbox Bob_.
It provides the scripts and the documentation for a speech diarization
pipeline that relies on the pyannote_ framework.

Essentially, this package is a set of helper scripts and documentation
on how to run and reproduce diarization experiments using the pyannote_
diarization framework. The documentation lets you set up the pyannote_
environment to reproduce training and evaluation of several pyannote_-based
models for voice activity detection (VAD), speaker change detection (SCD),
and speaker embeddings (EMB). We also provide several scripts for computing
various metrics that measure the performance of the trained models.

In short, this package can be considered a one-stop quick summary for
diarization experiments with pyannote_.


Installation
------------

First install Conda_ and create a new environment::

  $ conda create --name pyannote python=3.6
  $ conda activate pyannote

Then, follow Bob's `installation`_ instructions and install this
package::

  $ buildout

Depending on which speaker database is used in the experiments, the
installation instructions may vary, but a necessary step is to
install the pyannote.audio_ package together with the torch_ framework::

  $ conda install pip
  $ conda install pytorch torchvision -c pytorch

Then, install the pyannote.audio_ package::

  $ git clone https://github.com/pyannote/pyannote-audio.git
  $ cd pyannote-audio
  $ git checkout develop
  $ pip install .
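
As an optional sanity check (this step is our own suggestion, not part of the
official instructions), you can verify that the packages import correctly and
that PyTorch sees a GPU::

  $ python -c "import torch, pyannote.audio; print(torch.cuda.is_available())"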


Report of the experiments
=========================

The PDF report describing the experiments in more detail, together with a
discussion of the evaluation results, is available in the ``bob/pyannote/report`` folder.

Running the diarization experiments
===================================

We provide the database configuration file ``database.yml`` inside the
``bob/pyannote/config`` folder; it can be placed in ``~/.pyannote/`` to tell
the pyannote.audio_ package how to read and process the databases that we
present in the next section.
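
For example, assuming this repository is checked out in the current working
directory, placing the configuration file amounts to::

  $ mkdir -p ~/.pyannote
  $ cp ./bob/pyannote/config/database.yml ~/.pyannote/database.yml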


Databases
---------

In our experiments, we have used the following databases:

- CallHomeSRE_, a subset of NIST SRE 2000 with speakers of different languages,
  which was used as a standalone database for training.
- CallHome_, the English subset from the CABank corpora, which was used for
  training, validation (development subset), and the selection of
  hyper-parameters (development subset).
- SmallMeta - a smaller collection of databases, including CallHomeSRE_
  (full database), CallHome_ (train subset), LibriSpeech_ (other-train subset),
  and AMI_ (train subset). This collection was used for training different
  models of the diarization pipeline.
- LargeMeta - a larger collection of databases, including SmallMeta
  plus REPERE_ (train subsets of phase 1 and 2), ESTER (train subsets of
  ESTER1_ and ESTER2_), and LibriSpeech_ (clean-train subset). This collection
  was used for training different models of the diarization pipeline.
- AMI_ corpus, which was used for model validation (development subset) and
  the selection of hyper-parameters (development subset).
- DIHARD_ database for the DIHARD2 challenge of 2019. This dataset has
  several subsets of challenging speech recordings of different types, including
  restaurant conversations, discussions with children, and recordings of
  clinical trials. This is one of the more challenging datasets as of 2019.
- ODESSA_ VoIP database, a database of multi-party speech data over Internet
  telephony. The whole database was used to test all of the above pre-trained
  models and hyper-parameters.

Almost all of these databases have a pyannote_ database package available, so
it is easy to use them (once the data itself is downloaded, which should be
done separately with each database's copyright holders). Here is the list of
the corresponding database packages that can be installed (e.g. with ``pip``,
as sketched after the list):

- CallHomeSRE:
  `https://github.com/pkorshunov/pyannote-db-callhomesre <https://github.com/pkorshunov/pyannote-db-callhomesre>`_
- CallHome:
  `https://github.com/hbredin/pyannote-db-callhome <https://github.com/hbredin/pyannote-db-callhome>`_
- LibriSpeech:
  `https://github.com/pkorshunov/pyannote-db-librispeech <https://github.com/pkorshunov/pyannote-db-librispeech>`_
- ESTER:
  `https://github.com/pkorshunov/pyannote-db-ester <https://github.com/pkorshunov/pyannote-db-ester>`_
- REPERE:
  `https://github.com/pyannote/pyannote-db-repere <https://github.com/pyannote/pyannote-db-repere>`_
- AMI:
  `https://github.com/pyannote/pyannote-db-odessa-ami <https://github.com/pyannote/pyannote-db-odessa-ami>`_
- ODESSA:
  `https://github.com/pyannote/pyannote-db-odessa-ip <https://github.com/pyannote/pyannote-db-odessa-ip>`_
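
For example (a sketch only; a specific package version may be needed to match
the installed pyannote.audio_ version), a database package can be installed
directly from its GitHub repository with ``pip``::

  $ pip install git+https://github.com/pyannote/pyannote-db-odessa-ami.git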

The database configuration file ``database.yml`` inside ``bob/pyannote/config``
lists all the databases we use; it can be placed in ``~/.pyannote/`` so that the
pyannote.audio_ package knows how to read and process these databases.
Just make sure you indicate the correct paths to the actual database speech files.
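
For illustration only, each entry of such a file typically maps a database name
to a path template, where ``{uri}`` is replaced by the name of each speech file.
The excerpt below is a hypothetical sketch with placeholder paths; the provided
``bob/pyannote/config/database.yml`` is the authoritative template, and the
exact keys depend on the installed pyannote version.

.. code:: yaml

  # Hypothetical sketch -- adapt the paths to your local copies of the data.
  Databases:
    AMI: /path/to/amicorpus/*/audio/{uri}.wav
    CallHome: /path/to/callhome/eng/{uri}.wav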

Training and validating the models
----------------------------------

Training any of the models practically requires a GPU, so we assume that the
training commands are run in an environment with a GPU available and the
NVIDIA drivers installed.

Any model training with pyannote_ requires a configuration file that
describes the components and parameters of the training. The templates
for training the voice activity detection (VAD), speaker change
detection (SCD), and embedding (EMB) models are available inside the
``bob/pyannote/config`` folder. For instance,
``bob/pyannote/config/vad/config.yml`` contains the configuration for
VAD model training.

VAD training
------------

Once we have the folder for VAD training with ``config.yml`` placed inside it,
the command for training the voice activity detection model is as follows
(assuming here that we train on the train set of the protocol
``X.SpeakerDiarization.MetaProtocolSAD``, which is described in the database
config ``bob/pyannote/config/database.yml``)::

  $ pyannote-speech-detection train --gpu --to=1000 ./bob/pyannote/config/vad X.SpeakerDiarization.MetaProtocolSAD

The training will be done for 1000 epochs, and the models for all of the epochs
will be saved inside the folder
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/weights``.

To validate the trained model on the validation/development set of a
database/protocol (or the same one) and to select the model/epoch that
results in the best detection error rate metric on the validation set, run the following::

  $ pyannote-speech-detection validate --gpu --from=0 --to=1000 --every=5 --chronological  ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset

The validation will be done for every 5th trained model, and the best one
will be written (along with other parameters) inside the file
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/params.yml``.
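
The selected model/epoch and the other parameters mentioned above can be
inspected directly in that file, e.g.::

  $ cat ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/params.yml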


SCD training
------------

Similarly to the VAD model, the commands for training and validating speaker change detection are as follows::

  $ pyannote-change-detection train --gpu --to=1000 ./bob/pyannote/config/scd X.SpeakerDiarization.MetaProtocolSAD
  $ pyannote-change-detection validate --gpu --from=0 --to=1000 --every=5 --chronological  ./bob/pyannote/config/scd/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset

Training and validation for SCD are the same as for VAD.


Training embeddings
-------------------

Training and validating embedding models is similar to VAD and SCD.
The only differences are inside the ``config.yml`` files; for embeddings, the template is located in
``bob/pyannote/config/emb/config.yml``::

  $ pyannote-speaker-embedding train --gpu --to=1000 ./bob/pyannote/config/emb X.SpeakerDiarization.MetaProtocolSAD
  $ pyannote-speaker-embedding validate --gpu --from=0 --to=1000 --every=5 --chronological  ./bob/pyannote/config/emb/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset


Applying the pre-trained models
-------------------------------

The pre-trained and validated models can be used to compute speech activity
(VAD task) or speaker changes (SCD task), or to cluster embeddings (EMB task),
for the test set of the desired database.

Assuming we did the training and validation for VAD as in the section above,
the command to apply the selected model
to, say, the DIHARD evaluation data is as follows::

  $ pyannote-speech-detection apply ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development DIHARD2.SpeakerDiarization.All

The command will look inside
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development``
for ``params.yml`` file,
which specifies the selected model located inside
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/weights`` folder.
The model will be used on all files in the test set of the DIHARD data and the
results will be written inside
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/apply/model_number/``.

Note that the score files with the ``Detection Error Rate`` are
available in the ``DIHARD2.SpeakerDiarization.All.test.eval`` file inside the results folder.
A corresponding RTTM file with the detected speech for all the files from the
test set of the DIHARD dataset is also available.
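
For instance, to have a quick look at the resulting score file
(``model_number`` stands for the sub-folder created by the apply step)::

  $ less ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/apply/model_number/DIHARD2.SpeakerDiarization.All.test.eval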

For SCD and EMB, the procedure is similar. Please run
``pyannote-change-detection --help`` and ``pyannote-speaker-embedding --help`` for more details.

Additional scripts
------------------

The folders ``./bob/pyannote/scripts`` and ``./bob/pyannote/bash_scripts`` contain
additional miscellaneous scripts for batch computing of scores and for batch
running of the above training and application commands for the different
evaluation scenarios described in the accompanying report in ``./bob/pyannote/report``.

Converting NIST SRE Sphere files to WAV format
----------------------------------------------

First, convert the Sphere files of the database to WAV files and then make sure
they are sampled at 16 kHz.

* Convert the Sphere files to WAV format. The best way to do this is
  with NIST/LDC's own sph2pipe software. Other approaches (we tried
  SPHFile, SoundFile, and audioop) do not always work, mostly because
  of the confusing headers that Sphere files can have in different LDC
  datasets. So, download and compile sph2pipe v2.5 and then run the
  provided ``./bob/pyannote/bash_scripts/convert_sph2wav.sh`` script.

.. code:: sh

  $ ./convert_sph2wav.sh $(find path/to/sphere/formatted/database/ -name '*.sph')

* Use ffmpeg to upsample the speech and convert it to a single channel
  using the script ``./bob/pyannote/bash_scripts/convert_audio.sh``.

.. code:: sh

  $ ./convert_audio.sh $(find /path/to/converted/database/ -name '*.wav')


Contact
-------

For questions or to report issues with this software package, contact our
development `mailing list`_.


.. Place your references here:
.. _idiap: http://www.idiap.ch
.. _bob: http://www.idiap.ch/software/bob
.. _installation: https://www.idiap.ch/software/bob/install
.. _mailing list: https://www.idiap.ch/software/bob/discuss
.. _pyannote: https://github.com/pyannote/
.. _pyannote.audio: https://github.com/pyannote/pyannote-audio/
.. _torch: https://pytorch.org/
.. _conda: https://docs.conda.io/en/latest/
.. _CallHomeSRE: https://catalog.ldc.upenn.edu/LDC2001S97
.. _CallHome: https://ca.talkbank.org/browser/index.php?url=CallHome/eng/
.. _Librispeech: http://www.openslr.org/12
.. _AMI: http://groups.inf.ed.ac.uk/ami/download/
.. _REPERE: https://www.aclweb.org/anthology/L12-1410/
.. _ESTER1: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0241/
.. _ESTER2: http://catalog.elra.info/en-us/repository/browse/ELRA-S0338/
.. _DIHARD: https://coml.lscp.ens.fr/dihard/index.html
.. _ODESSA: https://www.idiap.ch/dataset/odessa/