Start by creating a new conda environment:

$ conda create -n pyannote python=3.6 anaconda
$ source activate pyannote

Then, install pyannote-video and its dependencies:

$ pip install pyannote-video

Finally, download the sample video and dlib models:

$ git clone
$ git clone
$ bunzip2 dlib-models/dlib_face_recognition_resnet_model_v1.dat.bz2
$ bunzip2 dlib-models/shape_predictor_68_face_landmarks.dat.bz2

To execute this notebook locally:

$ git clone
$ jupyter notebook --notebook-dir="pyannote-video/doc"
In [4]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Shot segmentation

In [5]:
! --help
Video structure

The standard pipeline is the following:

    shot boundary detection ==> shot threading ==> segmentation into scenes

Usage:
  shot [options] <video> <output.json>
  thread [options] <video> <shot.json> <output.json>
  scene [options] <video> <thread.json> <output.json>
  (-h | --help)
  --version

  --height=<n_pixels>    Resize video frame to height <n_pixels> [default: 50].
  --window=<n_seconds>   Apply median filtering on <n_seconds> window [default: 2.0].
  --threshold=<value>    Set threshold to <value> [default: 1.0].
  --min-match=<n_match>  Set minimum number of matches to <n_match> [default: 20].
  --lookahead=<n_shots>  Look at up to <n_shots> following shots [default: 24].
  -h --help              Show this screen.
  --version              Show version.
  --verbose              Show progress.
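The `--window` and `--threshold` options suggest that shot boundaries are found by thresholding a median-filtered frame-dissimilarity signal. The following is only a toy sketch of that idea on a synthetic signal, not pyannote-video's actual implementation; the function name and signature are made up for illustration.

```python
# Toy sketch of threshold-based shot boundary detection, in the spirit of
# the --window and --threshold options: smooth a per-frame dissimilarity
# signal with a sliding median, then flag frames that stand out from it.
# Assumption: this is NOT pyannote-video's actual algorithm.
import numpy as np

def detect_shot_boundaries(dissimilarity, fps=25.0, window=2.0, threshold=1.0):
    """Return boundary times (seconds) where the signal exceeds its local median by threshold."""
    half = max(1, int(window * fps / 2))
    smoothed = np.array([
        np.median(dissimilarity[max(0, i - half):i + half + 1])
        for i in range(len(dissimilarity))
    ])
    # a frame is a boundary if its raw dissimilarity stands out locally
    peaks = np.flatnonzero(dissimilarity - smoothed > threshold)
    return peaks / fps

# synthetic signal: mostly flat, with two abrupt jumps (hard cuts)
signal = np.zeros(250)
signal[100] = 5.0
signal[200] = 5.0
print(detect_shot_boundaries(signal))  # boundaries at 4.0s and 8.0s
```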
In [7]:
! shot --verbose ../../pyannote-data/TheBigBangTheory.mkv \
752frames [00:32, 23.2frames/s]                                                 

Detected shot boundaries can be visualized using pyannote.core notebook support:

In [8]:
from pyannote.core.json import load_from
shots = load_from('../../pyannote-data/TheBigBangTheory.shots.json')
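Once loaded, `shots` can be iterated like a timeline of segments with `start` and `end` attributes. The snippet below only illustrates the kind of inspection one might do, with plain `(start, end)` tuples standing in for pyannote.core segments and made-up durations:

```python
# Toy illustration of inspecting shot boundaries; plain tuples stand in
# for pyannote.core segments here, and the times are made up.
shots = [(0.0, 4.2), (4.2, 8.9), (8.9, 12.5)]  # (start, end) in seconds

for start, end in shots:
    print(f"shot from {start:5.1f}s to {end:5.1f}s ({end - start:.1f}s long)")

# average shot duration
durations = [end - start for start, end in shots]
print(f"average shot duration: {sum(durations) / len(durations):.2f}s")
```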

Face processing

In [9]:
! --help
Face detection and tracking

The standard pipeline is the following:

      face tracking => feature extraction => face clustering

  pyannote-face track [options] <video> <shot.json> <tracking>
  pyannote-face extract [options] <video> <tracking> <landmark_model> <embedding_model> <landmarks> <embeddings>
  pyannote-face demo [options] <video> <tracking> <output>
  pyannote-face (-h | --help)
  pyannote-face --version

General options:

  -h --help                 Show this screen.
  --version                 Show version.
  --verbose                 Show processing progress.

Face tracking options (track):

  <video>                   Path to video file.
  <shot.json>               Path to shot segmentation result file.
  <tracking>                Path to tracking result file.

  --min-size=<ratio>        Approximate size (in video height ratio) of the
                            smallest face that should be detected. Default is
                            to try and detect any object [default: 0.0].
  --every=<seconds>         Only apply detection every <seconds> seconds.
                            Default is to process every frame [default: 0.0].
  --min-overlap=<ratio>     Associates face with tracker if overlap is greater
                            than <ratio> [default: 0.5].
  --min-confidence=<float>  Reset trackers with confidence lower than <float>
                            [default: 10.].
  --max-gap=<float>         Bridge gaps with duration shorter than <float>
                            [default: 1.].

Feature extraction options (extract):

  <video>                   Path to video file.
  <tracking>                Path to tracking result file.
  <landmark_model>          Path to dlib facial landmark detection model.
  <embedding_model>         Path to dlib feature extraction model.
  <landmarks>               Path to facial landmarks detection result file.
  <embeddings>              Path to feature extraction result file.

Visualization options (demo):

  <video>                   Path to video file.
  <tracking>                Path to tracking result file.
  <output>                  Path to demo video file.

  --height=<pixels>         Height of demo video file [default: 400].
  --from=<sec>              Encode demo from <sec> seconds [default: 0].
  --until=<sec>             Encode demo until <sec> seconds.
  --shift=<sec>             Shift result files by <sec> seconds [default: 0].
  --landmark=<path>         Path to facial landmarks detection result file.
  --label=<path>            Path to track identification result file.

Face tracking

In [10]:
! track --verbose --every=0.5 ../../pyannote-data/TheBigBangTheory.mkv \
                                              ../../pyannote-data/TheBigBangTheory.shots.json \
752frames [00:23, 32.0frames/s]                                                 

Face tracks can be visualized using demo mode:

In [12]:
! demo ../../pyannote-data/TheBigBangTheory.mkv \
                       ../../pyannote-data/TheBigBangTheory.track.txt \
[MoviePy] >>>> Building video ../../pyannote-data/TheBigBangTheory.track.mp4
[MoviePy] Writing audio in TheBigBangTheory.trackTEMP_MPY_wvf_snd.mp3
100%|████████████████████████████████████████| 664/664 [00:01<00:00, 425.86it/s]
[MoviePy] Done.
[MoviePy] Writing video ../../pyannote-data/TheBigBangTheory.track.mp4
100%|████████████████████████████████████████▉| 752/753 [00:08<00:00, 87.38it/s]
[MoviePy] Done.
[MoviePy] >>>> Video ready: ../../pyannote-data/TheBigBangTheory.track.mp4 

In [14]:
import io
import base64
from IPython.display import HTML
video = io.open('../../pyannote-data/TheBigBangTheory.track.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''.format(encoded.decode('ascii')))
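The face clustering step mentioned in the pipeline is not demonstrated above. As a hedged sketch: the dlib model used here produces 128-dimensional embeddings that are commonly compared by Euclidean distance, with roughly 0.6 as the usual same-person threshold. The greedy scheme and synthetic vectors below are purely illustrative, not pyannote-face's actual clustering method.

```python
# Toy greedy clustering of face embeddings by Euclidean distance,
# sketching the "face clustering" step of the pipeline. dlib's resnet
# model yields 128-d vectors; ~0.6 is the commonly cited same-person
# threshold. Assumption: this is NOT pyannote-face's actual method.
import numpy as np

def cluster_embeddings(embeddings, threshold=0.6):
    """Assign each embedding to the first cluster whose seed is close enough."""
    seeds, labels = [], []
    for e in embeddings:
        e = np.asarray(e, dtype=float)
        dists = [np.linalg.norm(e - s) for s in seeds]
        if dists and min(dists) < threshold:
            labels.append(int(np.argmin(dists)))
        else:
            labels.append(len(seeds))
            seeds.append(e)
    return labels

# synthetic "embeddings": two tight groups far apart in 128-d space
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.02, size=(3, 128))          # "person A"
b = rng.normal(0.0, 0.02, size=(3, 128)) + 1.0    # "person B"
print(cluster_embeddings(np.vstack([a, b])))      # -> [0, 0, 0, 1, 1, 1]
```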