.. -*- coding: utf-8 -*-

.. image:: https://img.shields.io/badge/docs-stable-yellow.svg
   :target: https://www.idiap.ch/software/bob/docs/bob/bob.pyannote/stable/index.html
.. image:: https://img.shields.io/badge/docs-latest-orange.svg
   :target: https://www.idiap.ch/software/bob/docs/bob/bob.pyannote/master/index.html
.. image:: https://gitlab.idiap.ch/bob/bob.pyannote/badges/master/build.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote/commits/master
.. image:: https://gitlab.idiap.ch/bob/bob.pyannote/badges/master/coverage.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote/commits/master
.. image:: https://img.shields.io/badge/gitlab-project-0000c0.svg
   :target: https://gitlab.idiap.ch/bob/bob.pyannote
.. image:: https://img.shields.io/pypi/v/bob.pyannote.svg
   :target: https://pypi.python.org/pypi/bob.pyannote


======================================
 Bob's wrapper for pyannote framework
======================================

This package is part of the signal-processing and machine learning toolbox
Bob_. It provides the scripts and the documentation for a speech diarization
pipeline that relies on the pyannote_ framework.

Essentially, this package is a set of helper scripts and documentation on how
to run and reproduce the diarization experiments using the pyannote_
diarization framework. The documentation lets you set up the pyannote_
environment to reproduce the training and evaluation of several pyannote_
based models for voice activity detection (VAD), speaker change detection
(SCD), and speaker embeddings (EMB). We also provide several scripts that
compute various metrics for measuring the performance of the trained models.
In short, this package can be considered a one-stop summary of the
diarization experiments with pyannote_.

Installation
============

First install Conda_ and create a new environment::

  $ conda create --name pyannote python=3.6
  $ conda activate pyannote

Then, follow Bob's `installation`_ instructions and build this package::

  $ buildout

Depending on which speaker database is used in the experiments, the
installation instructions may vary, but a necessary step is to install the
pyannote.audio_ package together with the torch_ framework::

  $ conda install pip
  $ conda install pytorch torchvision -c pytorch

Then, install the pyannote.audio_ package::

  $ git clone https://github.com/pyannote/pyannote-audio.git
  $ cd pyannote-audio
  $ git checkout develop
  $ pip install .

Report of the experiments
=========================

The PDF report describing the experiments in more detail and discussing the
evaluation results is available in the ``bob/pyannote/report`` folder.

Running the diarization experiments
===================================

We provide the database configuration file ``database.yml`` inside the
``bob/pyannote/config`` folder; it can be placed in ``~/.pyannote/`` to tell
the pyannote.audio_ package how to read and process the databases that we
present in the next section.

Databases
---------

In our experiments, we have used the following databases:

- CallHomeSRE_, a subset of NIST SRE 2000 with speakers of different
  languages, which was used as a standalone database for training.
- CallHome_, the English subset of the CABank corpora, which was used for
  training, validation (development subset), and the selection of
  hyper-parameters (development subset).
- SmallMeta - a smaller collection of databases, including CallHomeSRE_
  (full database), CallHome_ (train subset), LibriSpeech_ (other-train
  subset), and AMI_ (train subset). This collection was used for training
  different models of the diarization pipeline.
- LargeMeta - a larger collection of databases, including SmallMeta plus
  REPERE_ (train subsets of phases 1 and 2), ESTER (train subsets of ESTER1_
  and ESTER2_), and LibriSpeech_ (clean-train subset). This collection was
  also used for training different models of the diarization pipeline.
- AMI_ corpus, which was used for model validation (development subset) and
  the selection of hyper-parameters (development subset).
- DIHARD_ database from the DIHARD2 challenge of 2019. This dataset has
  several subsets of challenging speech recordings of different types,
  including restaurant conversations, discussions with children, and
  recordings of clinical trials. It is one of the more challenging datasets
  as of 2019.
- ODESSA_ VoIP database, a database of multi-party speech data over Internet
  telephony. The whole database was used to test all of the above pre-trained
  models and hyper-parameters.

Almost all of these databases have a pyannote_ database package available,
which makes them easy to use (once the data itself has been downloaded, which
should be arranged separately with each database's copyright holder). Here is
the list of the corresponding database packages that can be installed:

- CallHomeSRE: https://github.com/yinruiqing/pyannote-db-callhome
- CallHome: https://github.com/hbredin/pyannote-db-callhome
- LibriSpeech: https://github.com/ironiksk/pyannote-db-librispeech
- ESTER: https://github.com/pyannote/pyannote-db-ester
- REPERE: https://github.com/pyannote/pyannote-db-repere
- AMI: https://github.com/pyannote/pyannote-db-odessa-ami
- ODESSA: https://github.com/pyannote/pyannote-db-odessa-ip

The database configuration file ``database.yml`` inside ``bob/pyannote/config``
lists all the databases we use; place it in ``~/.pyannote/`` so that the
pyannote.audio_ package knows how to read and process these databases. Just
make sure you indicate the correct paths to the actual speech files of each
database.
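For example, assuming this package has been checked out into the current
directory, the file can be put in place like this (remember to then edit the
paths inside the copied file so that they point to your local copies of the
data)::

  $ mkdir -p ~/.pyannote
  $ cp ./bob/pyannote/config/database.yml ~/.pyannote/database.yml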
Training and validating the models
----------------------------------

For training any of the models it is practically necessary to have a GPU
available, so we assume that the training commands are run in an environment
with a GPU and the NVIDIA drivers installed.

Any model training with pyannote_ requires a configuration file that
describes the components and parameters of the training. The templates for
training voice activity detection (VAD), speaker change detection (SCD), and
embedding (EMB) models are available inside the ``bob/pyannote/config``
folder. For instance, ``bob/pyannote/config/vad/config.yml`` contains the
configuration for VAD model training.
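To give an idea of what such a configuration file contains, here is a rough
sketch of a VAD training configuration. The block and parameter names below
are only illustrative (they follow the pyannote.audio_ tutorials of that era
and are not taken from this package), so always start from the actual
template in ``bob/pyannote/config/vad/config.yml``:

.. code:: yaml

   # Illustrative sketch only -- use the template shipped in bob/pyannote/config/vad/config.yml.
   task:
      name: SpeechActivityDetection   # which problem the network is trained on
      params:
         duration: 2.0                # length of the audio chunks fed to the network (seconds)
         batch_size: 64

   feature_extraction:
      name: LibrosaMFCC               # acoustic features computed on the fly
      params:
         coefs: 19

   architecture:
      name: StackedRNN                # recurrent sequence-labeling network
      params:
         rnn: LSTM
         recurrent: [128, 128]
         bidirectional: True

   scheduler:
      name: CyclicScheduler           # learning-rate schedule
      params:
         learning_rate: auto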
VAD training
------------

Once we have the folder for VAD training and the ``config.yml`` placed inside
it, the command for training the voice activity detection model is as follows
(assuming here that we are training on the train set of the protocol
``X.SpeakerDiarization.MetaProtocolSAD``, which is described in the database
configuration ``bob/pyannote/config/database.yml``)::

  $ pyannote-speech-detection train --gpu --to=1000 ./bob/pyannote/config/vad X.SpeakerDiarization.MetaProtocolSAD

The training will run for 1000 epochs and the models for all of the epochs
will be saved inside the folder
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/weights``.

To validate the trained model on the validation/development set of a
database/protocol (possibly the same one) and to select the model/epoch that
gives the best detection error rate on the validation set, run the
following::

  $ pyannote-speech-detection validate --gpu --from=0 --to=1000 --every=5 --chronological ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset

The validation will be run for every 5th trained model and the best one will
be written (along with other parameters) to the file
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/params.yml``.

SCD training
------------

Similarly to the VAD model, the commands for training and validating speaker
change detection are as follows::

  $ pyannote-change-detection train --gpu --to=1000 ./bob/pyannote/config/scd X.SpeakerDiarization.MetaProtocolSAD
  $ pyannote-change-detection validate --gpu --from=0 --to=1000 --every=5 --chronological ./bob/pyannote/config/scd/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset

Training and validation for SCD work the same way as for VAD.

Training embeddings
-------------------

Training and validating embedding models is similar to VAD and SCD. All the
differences are inside the ``config.yml`` files; for embeddings, the template
is located in ``bob/pyannote/config/emb/config.yml``::

  $ pyannote-speaker-embedding train --gpu --to=1000 ./bob/pyannote/config/emb X.SpeakerDiarization.MetaProtocolSAD
  $ pyannote-speaker-embedding validate --gpu --from=0 --to=1000 --every=5 --chronological ./bob/pyannote/config/emb/train/X.SpeakerDiarization.MetaProtocolSAD.train AMI.SpeakerDiarization.MixHeadset

Applying the pre-trained models
-------------------------------

The pre-trained and validated models can be used to compute either speech
regions (VAD task) or speaker changes (SCD task), or to perform clustering on
the embeddings (EMB task), for the test set of the desired database. Assuming
we did training and validation for VAD as in the section above, the command
to apply the selected model to, say, the DIHARD evaluation data would be the
following::

  $ pyannote-speech-detection apply ./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development DIHARD2.SpeakerDiarization.All

The command will look inside
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development``
for the ``params.yml`` file, which specifies the selected model located
inside the
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/weights``
folder. The selected model will be applied to all files in the test set of
the DIHARD data and the results will be written inside
``./bob/pyannote/config/vad/train/X.SpeakerDiarization.MetaProtocolSAD.train/validate/AMI.SpeakerDiarization.MixHeadset.development/apply/model_number/``.

Note that the scores (``Detection Error Rate``) are available in the
``DIHARD2.SpeakerDiarization.All.test.eval`` file inside the results folder.
A corresponding RTTM file with the detected speech regions for all files from
the test set of the DIHARD dataset is also available there.

For SCD and EMB, the procedure is similar. Please run
``pyannote-change-detection --help`` and
``pyannote-speaker-embedding --help`` for more details.
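If you want to double-check the reported numbers, the same kind of metric can
be recomputed offline with the ``pyannote.metrics`` package (installed as a
dependency of pyannote.audio_). The snippet below is only a sketch: the path
and the file name of the RTTM produced by the ``apply`` command are
hypothetical, and the exact API may differ between pyannote versions.

.. code:: python

   from pyannote.database import get_protocol
   from pyannote.database.util import load_rttm
   from pyannote.metrics.detection import DetectionErrorRate

   # Hypothetical path: point this to the RTTM file written by "pyannote-speech-detection apply".
   hypotheses = load_rttm("/path/to/apply/output/DIHARD2.SpeakerDiarization.All.test.rttm")

   protocol = get_protocol("DIHARD2.SpeakerDiarization.All")
   metric = DetectionErrorRate()

   for test_file in protocol.test():
       reference = test_file["annotation"]        # ground-truth speaker segments
       uem = test_file["annotated"]               # part of the recording that is evaluated
       hypothesis = hypotheses[test_file["uri"]]  # detected speech regions for this file
       metric(reference, hypothesis, uem=uem)     # accumulate the per-file error

   # Aggregated detection error rate over the whole test set.
   print("Detection error rate = {:.3f}".format(abs(metric)))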
Additional scripts
------------------

The folders ``./bob/pyannote/scripts`` and ``./bob/pyannote/bash_scripts``
contain additional miscellaneous scripts for batch computation of scores and
for batch running of the above training and application commands for the
different evaluation scenarios described in the accompanying report in
``./bob/pyannote/report``.

Converting NIST SRE Sphere files to WAV format
----------------------------------------------

First convert the Sphere files of the database to WAV files and then make
sure they are sampled at 16 kHz.

* Convert Sphere files to WAV format. The best way to do it is with
  NIST/LDC's own sph2pipe software. Other approaches we tried (SPHFile,
  SoundFile, and audioop) do not always work, mostly because of the confusing
  headers that Sphere files can have in different LDC datasets. So, download
  and compile sph2pipe_v2.5 and then run the provided
  ``./bob/pyannote/bash_scripts/convert_sph2wav.sh`` script.

  .. code:: sh

     $ ./convert_sph2wav.sh $(find path/to/sphere/formatted/database/ -name '*.sph')

* Use ffmpeg to upsample the speech to 16 kHz and convert it to a single
  channel using the script ``./bob/pyannote/bash_scripts/convert_audio.sh``.

  .. code:: sh

     $ ./convert_audio.sh $(find /path/to/converted/database/ -name '*.wav')

Contact
=======

For questions or reporting issues to this software package, contact our
development `mailing list`_.


.. Place your references here:
.. _idiap: http://www.idiap.ch
.. _bob: http://www.idiap.ch/software/bob
.. _installation: https://www.idiap.ch/software/bob/install
.. _mailing list: https://www.idiap.ch/software/bob/discuss
.. _pyannote: https://github.com/pyannote/
.. _pyannote.audio: https://github.com/pyannote/pyannote-audio/
.. _torch: http://torch.ch/
.. _conda: https://docs.conda.io/en/latest/
.. _CallHomeSRE: https://catalog.ldc.upenn.edu/LDC2001S97
.. _CallHome: https://ca.talkbank.org/browser/index.php?url=CallHome/eng/
.. _LibriSpeech: http://www.openslr.org/12
.. _AMI: http://groups.inf.ed.ac.uk/ami/download/
.. _REPERE: https://www.aclweb.org/anthology/L12-1410/
.. _ESTER1: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0241/
.. _ESTER2: http://catalog.elra.info/en-us/repository/browse/ELRA-S0338/
.. _DIHARD: https://coml.lscp.ens.fr/dihard/index.html
.. _ODESSA: https://www.idiap.ch/dataset/odessa/