BigGAN Audio Visualizer

Description

This visualizer explores BigGAN (Brock et al., 2018) latent space by using pitch/tempo of an audio file to generate and interpolate between noise/class vector inputs to the model. Classes are chosen manually or optionally using semantic similarity on BERT encodings of a lyrics corpus.
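To make "interpolate between class vectors" concrete, here is a minimal pure-Python sketch. It assumes the class input is a 1000-dimensional vector over the ImageNet classes, as in class-conditional BigGAN. The `one_hot` and `lerp` helpers and the class ids are illustrative only and are not taken from `visualizer.py`, which may blend vectors differently.

```python
NUM_IMAGENET_CLASSES = 1000  # class-conditional BigGAN is trained on ImageNet-1k


def one_hot(class_id, size=NUM_IMAGENET_CLASSES):
    """Return a class vector with all weight on a single ImageNet id."""
    vec = [0.0] * size
    vec[class_id] = 1.0
    return vec


def lerp(a, b, t):
    """Linearly blend two equal-length vectors; t=0 gives a, t=1 gives b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


# Five frames moving from class 1 to class 9 (ids chosen arbitrarily).
start, end = one_hot(1), one_hot(9)
frames = [lerp(start, end, i / 4) for i in range(5)]

# Show how the weight shifts between the two classes across frames.
for frame in frames:
    print(frame[1], frame[9])
```

Each intermediate vector splits its weight between the two classes, so the generated image morphs gradually rather than switching abruptly.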

Usage:

usage: visualizer.py [-h] -s SONG [-r {128,256,512}] [-d DURATION]
                     [-ps [200-295]] [-ts [0.05-0.8]]
                     [-c CLASSES [CLASSES ...]] [-n NUM_CLASSES] [-j [0-1]]
                     [-fl i*2^6] [-t [0.1-1]] [-sf [10-30]] [-bs BATCH_SIZE]
                     [-o OUTPUT_FILE] [--use_last_vectors]
                     [--use_last_classes] [--sort_pitch] [-l LYRICS]
                     [-e {sbert,doc2vec}] [-es {best,random,ransac}]
  • To reduce runtime, the code can be run on Google Colab GPUs (or other cloud notebook providers) using biggan_music_visualizer.ipynb (hosted here).
  • The [-n NUM_CLASSES] parameter selects the number of classes to interpolate between.
  • Default behavior is to select [-n NUM_CLASSES] random classes. The [-c CLASSES [CLASSES ...]] parameter can be used to select specific ImageNet classes. A full list can be found here, and a list categorized by coarse descriptors here. Be sure to use the int ids and not the string labels, and set [-n NUM_CLASSES] to the number of chosen classes.
  • Use the [--sort_pitch] flag to map classes to the [-n NUM_CLASSES] highest-power pitches. By default, classes are mapped to a chromatic scale.
  • The [-d DURATION] parameter is useful for generating short videos while tweaking other parameters. Once the desired parameters are set, add the [--use_last_vectors] flag and remove the [-d DURATION] parameter to generate the same video at full length.
  • Reducing the output resolution with [-r {128,256,512}] and/or increasing the frame length with [-fl i*2^6] can help reduce the runtime.
  • To compute classes through semantic similarity to a lyrics file, use the [-l LYRICS] parameter. The embedding technique and strategy for choosing classes can be set with [-e {sbert,doc2vec}] and [-es {best,random,ransac}] respectively.
  • Pitch and tempo sensitivity can be set with [-ps [200-295]] and [-ts [0.05-0.8]] respectively. Jitter, truncation and smooth factor can be set with [-j [0-1]], [-t [0.1-1]] and [-sf [10-30]] respectively.
  • See the help column of the arguments section for details on all parameters.
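The preview-then-render workflow described above can be sketched as a small Python helper that assembles the argument list. `build_command` is a hypothetical convenience function, not part of the repository. It uses only flags that appear in the usage text, and the file name, duration value, and class ids are placeholders.

```python
# Hypothetical helper, not part of the repository: builds argument lists
# for visualizer.py using only flags shown in the usage text.
def build_command(song, duration=None, resolution=None, classes=None,
                  use_last_vectors=False):
    cmd = ["python", "visualizer.py", "-s", song]
    if resolution is not None:
        cmd += ["-r", str(resolution)]
    if duration is not None:
        cmd += ["-d", str(duration)]
    if classes:
        # Pass integer ImageNet ids and keep -n equal to the number of ids.
        cmd += ["-c", *[str(c) for c in classes], "-n", str(len(classes))]
    if use_last_vectors:
        cmd.append("--use_last_vectors")
    return cmd


# Short, low-resolution preview while tuning parameters.
preview = build_command("song.mp3", duration=30, resolution=128,
                        classes=[1, 9, 130])

# Full-length render: drop -d and reuse the vectors saved by the last run.
full = build_command("song.mp3", resolution=128, classes=[1, 9, 130],
                     use_last_vectors=True)

print(" ".join(preview))
print(" ".join(full))
```

Either list can be passed to `subprocess.run` or copied into a shell. Building it as a list avoids quoting problems when the song path contains spaces.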

Arguments

| short | long | default | range | help |
| --- | --- | --- | --- | --- |
| -h | --help | | | show this help message and exit |
| -s | --song | | | path to input audio file [REQUIRED] |
| -r | --resolution | 512 | {128,256,512} | output video resolution |
| -d | --duration | None | int | output video duration |
| -ps | --pitch_sensitivity | 220 | [200-295] | controls the sensitivity of the class vector to changes in pitch |
| -ts | --tempo_sensitivity | 0.25 | [0.05-0.8] | controls the sensitivity of the noise vector to changes in volume and tempo |
| -c | --classes | None | | manually specify [--num_classes] ImageNet classes |
| -n | --num_classes | 12 | [1-12] | number of unique classes to use |
| -j | --jitter | 0.5 | [0-1] | controls jitter of the noise vector to reduce repetition |
| -fl | --frame_length | 512 | i*2^6 | number of audio frames per video frame in the output |
| -t | --truncation | 1 | [0.1-1] | BigGAN truncation parameter; controls complexity of structure within frames |
| -sf | --smooth_factor | 20 | [10-30] | controls interpolation between class vectors to smooth rapid fluctuations |
| -bs | --batch_size | 20 | int | BigGAN batch_size |
| -o | --output_file | | | name of output file stored in output/, defaults to [--song] path base_name |
| | --use_last_vectors | False | bool | set flag to use previous saved class/noise vectors |
| | --use_last_classes | False | bool | set flag to use previous classes |
| | --sort_pitch | False | bool | set flag to sort pitches by the ordering of classes |
| -l | --lyrics | None | | path to lyrics file; setting [--lyrics LYRICS] computes classes by semantic similarity under BERT encodings |
| -e | --encoding | sbert | {sbert,doc2vec} | controls choice of sentence embeddings technique |
| -es | --encoding_strategy | None | {random,best,ransac} | controls strategy for choosing classes: [-e sbert] can use best or random while [-e doc2vec] can use ransac |
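The documented ranges can be checked before starting a long render. The sketch below copies the ranges from the arguments table. `check_params` is a hypothetical helper, and reading `i*2^6` as "a positive multiple of 64" is an assumption. The script itself may validate differently, or not at all.

```python
# Ranges copied from the arguments table; the checker itself is a
# hypothetical helper and not part of visualizer.py.
RANGES = {
    "pitch_sensitivity": (200, 295),
    "tempo_sensitivity": (0.05, 0.8),
    "num_classes": (1, 12),
    "jitter": (0, 1),
    "truncation": (0.1, 1),
    "smooth_factor": (10, 30),
}
RESOLUTIONS = {128, 256, 512}


def check_params(**params):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []
    for name, value in params.items():
        if name in RANGES:
            lo, hi = RANGES[name]
            if not lo <= value <= hi:
                problems.append(f"{name}={value} is outside [{lo}-{hi}]")
        elif name == "resolution" and value not in RESOLUTIONS:
            problems.append(
                f"resolution={value} must be one of {sorted(RESOLUTIONS)}")
        elif name == "frame_length" and (value <= 0 or value % 64 != 0):
            # Assumption: i*2^6 means a positive multiple of 64.
            problems.append(
                f"frame_length={value} is not a positive multiple of 64")
    return problems


for problem in check_params(frame_length=500, jitter=1.5, resolution=256):
    print(problem)
```

Parameters not listed in the table's range column (for example `batch_size`) are passed through unchecked.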

Acknowledgments

Thanks to Matt Siegelman for providing the inspiration as well as a boilerplate for the project.

References