diff --git a/docs/source/faqs.rst b/docs/source/faqs.rst
index 8d1eeca8..bdf47b36 100644
--- a/docs/source/faqs.rst
+++ b/docs/source/faqs.rst
@@ -63,6 +63,8 @@ FAQs
 
     Note that you can also run the ``ffmpeg`` command directly from the command line.
 
+.. _faq_oom:
+
 .. dropdown:: What if I encounter a CUDA out of memory error?
 
     Model training can be GPU-memory-intensive, particularly when using unsupervised losses, the
@@ -82,6 +84,7 @@ FAQs
     See :ref:`The configuration file ` section for more information about the above
     parameters.
 
+
 .. dropdown:: Why does the network produce high confidence values for keypoints even when they are occluded?
 
     Generally, when a keypoint is briefly occluded and its location can be resolved by the network,
diff --git a/docs/source/user_guide/config_file.rst b/docs/source/user_guide/config_file.rst
index 33f6b9bc..c0e2df17 100644
--- a/docs/source/user_guide/config_file.rst
+++ b/docs/source/user_guide/config_file.rst
@@ -27,44 +27,84 @@ Data parameters
 All of these parameters except ``downsample_factor`` are dataset-specific and will need to be
 provided.
 
-* ``data.image_resize_dims.height/width`` (int): images (and videos) will be resized to the specified
-  height and width before being processed by the network.
+* ``data.image_resize_dims.height/width`` (*int*): images (and videos) will be resized to the
+  specified height and width before being processed by the network.
   Supported values are {64, 128, 256, 384, 512}.
   The height and width need not be identical.
   Some points to keep in mind when selecting these values: if the resized images are too small,
   you will lose resolution/details; if they are too large, the model takes longer to train and
   might not train as well.
 
-* ``data.data_dir/video_dir`` (str): update these to reflect your (absolute) local paths
+* ``data.data_dir/video_dir`` (*str*): update these to reflect your (absolute) local paths
 
-* ``data.csv_file`` (str): location of labels csv file; this should be relative to ``data_dir``
+* ``data.csv_file`` (*str*): location of labels csv file; this should be relative to
+  ``data.data_dir``
 
-* ``data.downsample_factor`` (int, default: 2): factor by which to downsample the heatmaps relative to ``image_resize_dims``
+* ``data.downsample_factor`` (*int, default: 2*): factor by which to downsample the heatmaps
+  relative to ``data.image_resize_dims``
 
-* ``data.num_keypoints`` (int): the number of body parts.
+* ``data.num_keypoints`` (*int*): the number of body parts.
   If using a mirrored setup, this should be the number of body parts summed across all views.
   If using a multiview setup, this number should indicate the number of keyponts per view
   (must be the same across all views).
 
-* ``data.keypoint_names`` (list): keypoint names should reflect the actual names/order in the csv file.
+* ``data.keypoint_names`` (*list*): keypoint names should reflect the actual names/order in the
+  csv file.
   This field is necessary if, for example, you are running inference on a machine that does not
   have the training data saved on it.
 
-* ``data.mirrored_column_matches`` (list): see the :ref:`Multiview PCA documentation `
+* ``data.mirrored_column_matches`` (*list*): see the
+  :ref:`Multiview PCA documentation `
 
-* ``data.columns_for_singleview_pca`` (list): see the :ref:`Pose PCA documentation `
+* ``data.columns_for_singleview_pca`` (*list*): see the
+  :ref:`Pose PCA documentation `
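+
+For reference, the ``data`` section of a config file might then look like the following sketch;
+the paths, number of keypoints, and keypoint names below are placeholders for your own dataset:
+
+.. code-block:: yaml
+
+    data:
+      image_resize_dims:
+        height: 256
+        width: 256
+      data_dir: /absolute/path/to/dataset            # placeholder path
+      video_dir: /absolute/path/to/dataset/videos    # placeholder path
+      csv_file: CollectedData.csv                    # placeholder; relative to data_dir
+      downsample_factor: 2
+      num_keypoints: 3                               # placeholder
+      keypoint_names: [nose, paw_l, paw_r]           # placeholders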
 
 Training parameters
 ===================
 
 The following parameters relate to model training.
-Reasonable defaults are provided, though parameters like the batch sizes may need modification
-depending on the size of the data and the available compute resources.
+Reasonable defaults are provided, though parameters like the batch sizes
+(``train_batch_size``, ``val_batch_size``, ``test_batch_size``)
+may need modification depending on the size of the data and the available compute resources.
+See the :ref:`FAQs <faq_oom>` for more information on memory management.
 
-* ``training.train_batch_size``: batch size for labeled data
+* ``training.imgaug`` (*str, default: dlc*): select from one of several predefined image/video
+  augmentation pipelines:
 
-* ``training.min_epochs`` / ``training.max_epochs``: length of training.
+  * default: resizing only
+  * dlc: imgaug pipeline implemented in DLC 2.0 package
+  * dlc-top-down: dlc augmentations plus additional vertical and horizontal flips
+
+* ``training.train_batch_size`` (*int, default: 16*): batch size for labeled data during training
+
+* ``training.val_batch_size`` (*int, default: 32*): batch size for labeled data during validation
+
+* ``training.test_batch_size`` (*int, default: 32*): batch size for labeled data during testing
+
+* ``training.train_prob`` (*float, default: 0.95*): fraction of labeled data used for training
+
+* ``training.val_prob`` (*float, default: 0.05*): fraction of labeled data used for validation;
+  any remaining frames not assigned to the train or validation sets are assigned to the test set
+
+* ``training.train_frames`` (*float or int, default: 1*): this parameter determines how many of the
+  frames assigned to the training data (using ``train_prob``) are actually used for training.
+  This option is generally more useful for testing new algorithms rather than training production
+  models.
+  If the value is a float between 0 and 1 then it is interpreted as the fraction of total train frames.
+  If the value is an integer greater than 1 then it is interpreted as the number of total train frames.
+
+.. _config_num_gpus:
+* ``training.num_gpus`` (*int, default: 1*): the number of GPUs for
+  :ref:`multi-GPU training `
+
+* ``training.num_workers`` (*int, default: 4*): number of CPU workers for data loaders
+
+* ``training.unfreezing_epoch`` (*int, default: 20*): epoch at which backbone network weights begin
+  updating. A value >0 allows the smaller number of parameters in the heatmap head to adjust to
+  the backbone outputs first.
+
+* ``training.min_epochs`` / ``training.max_epochs`` (*int, default: 300*): length of training.
   An epoch is one full pass through the dataset.
   As an example, if you have 400 labeled frames, and ``training.train_batch_size=10``, then your
   dataset is divided into 400/10 = 40 batches.
@@ -72,23 +112,55 @@ depending on the size of the data and the available compute resources.
   Therefore, 300 epochs, at 40 batches per epoch, is equal to 300*40=12k total batches (or
   iterations).
 
-.. _config_num_gpus:
-* ``training.num_gpus``: the number of GPUs for :ref:``multi-GPU training ``.
+* ``training.log_every_n_steps`` (*int, default: 10*): frequency to log training metrics for
+  tensorboard (one step is one batch)
 
-* ``training.accumulate_grad_batches``: (experimental) number of batches to accumulate gradients
-  for before updating weights. Simulates larger batch sizes with memory-constrained GPUs. This
-  parameter is not included in the config by default and should be added manually to the
+* ``training.check_val_every_n_epochs`` (*int, default: 5*): frequency to log validation metrics
+  for tensorboard
+
+* ``training.ckpt_every_n_epochs`` (*int or null, default: null*): save model weights every n
+  epochs; must be divisible by ``training.check_val_every_n_epochs`` above.
+  If null, only the best weights will be saved after training, where "best" is defined as the
+  weights from the epoch with the lowest validation loss.
+
+* ``training.early_stopping`` (*bool, default: false*): if false, the default is to train for the
+  max number of epochs and save out the best model according to the validation loss; if true, early
+  stopping will exit training if the validation loss does not improve over a given number of
+  validation checks (see ``training.early_stop_patience`` below).
+
+* ``training.early_stop_patience`` (*int, default: 3*): number of validation checks over which to
+  assess validation metrics for early stopping; this number, multiplied by
+  ``training.check_val_every_n_epochs``, gives the number of epochs over which the validation loss
+  must fail to improve before training exits.
+
+* ``training.rng_seed_data_pt`` (*int, default: 0*): rng seed for splitting labeled data into
+  train/val/test
+
+* ``training.rng_seed_model_pt`` (*int, default: 0*): rng seed for weight initialization of the head
+
+* ``training.lr_scheduler`` (*str, default: multisteplr*): reduce the learning rate by a certain
+  factor after a given number of epochs (see ``training.lr_scheduler_params.multisteplr`` below)
+
+* ``training.lr_scheduler_params.multisteplr``: milestones: epochs at which to reduce the learning
+  rate; gamma: factor by which to multiply the learning rate at each milestone
+
+* ``training.uniform_heatmaps_for_nan_keypoints`` (*bool, default: true*): how to treat missing
+  hand labels; false to drop, true to force uniform heatmaps. True will lead to better confidence
+  values, while false allows for incompletely labeled data.
+
+* ``training.accumulate_grad_batches`` (*int, default: 1*): (experimental) number of batches to
+  accumulate gradients for before updating weights. Simulates larger batch sizes with
+  memory-constrained GPUs.
+  This parameter is not included in the config by default and should be added manually to the
   ``training`` section.
 
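+For example, a ``training`` section that keeps the default values described above might contain:
+
+.. code-block:: yaml
+
+    training:
+      imgaug: dlc
+      train_batch_size: 16
+      val_batch_size: 32
+      test_batch_size: 32
+      train_prob: 0.95
+      val_prob: 0.05
+      train_frames: 1
+      num_gpus: 1
+      num_workers: 4
+      unfreezing_epoch: 20
+      min_epochs: 300
+      max_epochs: 300
+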
-* ``model.model_type``:
+Model parameters
+================
+
+The following parameters relate to model architecture and unsupervised losses.
 
-  * regression: model directly outputs an (x, y) prediction for each keypoint; not recommended
-  * heatmap: model outputs a 2D heatmap for each keypoint
-  * heatmap_mhcrnn: the "multi-head convolutional RNN", this model takes a temporal window of
-    frames as input, and outputs two heatmaps: one "context-aware" and one "static".
-    The prediction with the highest confidence is automatically chosen.
-* ``model.losses_to_use``: defines the unsupervised losses.
+* ``model.losses_to_use`` (*list, default: []*): defines the unsupervised losses.
   An empty list indicates a fully supervised model.
   Each element of the list corresponds to an unsupervised loss.
   For example, ``model.losses_to_use=[pca_multiview,temporal]`` will fit both a pca_multiview loss
@@ -98,12 +170,14 @@ depending on the size of the data and the available compute resources.
   * pca_singleview: penalize implausible body configurations
   * temporal: penalize large temporal jumps
 
-* ``model.checkpoint``: to initialize weights from an existing checkpoint, update this parameter
-  to the absolute path of a pytorch .ckpt file
+  See the :ref:`unsupervised losses` page for more details on the various
+  losses and their associated hyperparameters.
 
-* ``model.backbone``: a variety of pretrained backbones are available:
-  * resnet50_animal_ap10k (recommended): ResNet-50 pretrained on the AP-10k dataset (Yu et al 2021, AP-10k: A Benchmark for Animal Pose Estimation in the Wild)
+* ``model.backbone`` (*str, default: resnet50_animal_ap10k*): a variety of pretrained backbones are
+  available:
+
+  * resnet50_animal_ap10k: ResNet-50 pretrained on the AP-10k dataset (Yu et al 2021, AP-10k: A Benchmark for Animal Pose Estimation in the Wild)
   * resnet18: ResNet-18 pretrained on ImageNet
   * resnet34: ResNet-34 pretrained on ImageNet
   * resnet50: ResNet-50 pretrained on ImageNet
@@ -120,27 +194,72 @@ depending on the size of the data and the available compute resources.
   * efficientnet_b2: EfficientNet-B2 pretrained on ImageNet
   * vit_b_sam: Segment Anything Model (Vision Transformer Base)
 
-See the :ref:`Unsupervised losses ` section for more details on the various
-losses and their associated hyperparameters.
+  Note: the file size for a single ResNet-50 network is approximately 275 MB.
+
+
+* ``model.model_type`` (*str, default: heatmap*):
+
+  * regression: model directly outputs an (x, y) prediction for each keypoint; not recommended
+  * heatmap: model outputs a 2D heatmap for each keypoint
+  * heatmap_mhcrnn: the "multi-head convolutional RNN", this model takes a temporal window of
+    frames as input, and outputs two heatmaps: one "context-aware" and one "static".
+    The prediction with the highest confidence is automatically chosen.
+    See the :ref:`Temporal Context Network <mhcrnn>` page for more information.
 
-A note on model checkpointing: by default the "best" model will be saved out according to the
-validation loss.
-If you would like to additionally save out checkpoints after a specified number of epochs, set the
-field ``training.ckpt_every_n_epochs``.
-The file size for a single ResNet-50 network is approximately 275 MB.
+* ``model.heatmap_loss_type`` (*str, default: mse*): (experimental) loss to compute difference
+  between ground truth and predicted heatmaps
 
-You may also utilize early stopping, in which model training exits early if the validation loss
-does not improve after a certain number of epochs, by setting ``training.early_stopping`` to true.
-Model checkpointing is still handled as described above.
+* ``model.model_name`` (*str, default: test*): directory name for model saving
+
+* ``model.checkpoint`` (*str or null, default: null*): to initialize weights from an existing
+  checkpoint, update this parameter to the absolute path of a pytorch .ckpt file
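+
+For orientation, a ``model`` section combining the defaults above might look like the following
+sketch (the losses shown are illustrative; an empty list gives a fully supervised model):
+
+.. code-block:: yaml
+
+    model:
+      model_type: heatmap
+      backbone: resnet50_animal_ap10k
+      losses_to_use: [pca_singleview, temporal]  # illustrative; use [] for fully supervised
+      heatmap_loss_type: mse
+      model_name: test
+      checkpoint: null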
 
 Video loading parameters
 ========================
 
-Some arguments relate to video loading, both for semi-supervised models and when predicting new
-videos with any of the models:
+Some parameters relate to video loading, both for semi-supervised models and when predicting new
+videos with any of the models.
+The parameters may need modification depending on the size of the data and the available compute
+resources.
+See the :ref:`FAQs <faq_oom>` for more information on memory management.
+
+* ``dali.base.train.sequence_length`` (*int, default: 32*): number of unlabeled frames per batch in
+  "regression" and "heatmap" models (i.e. "base" models that do not use temporal context frames)
+* ``dali.base.predict.sequence_length`` (*int, default: 96*): batch size when predicting on a new
+  video with a "base" model
+* ``dali.context.train.batch_size`` (*int, default: 16*): number of unlabeled frames per batch in
+  the heatmap_mhcrnn model (i.e. "context" models that utilize temporal context frames)
+* ``dali.context.predict.sequence_length`` (*int, default: 96*): batch size when predicting on a
+  new video with a "context" model
+
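+For example, a ``dali`` section using the default values above might look like:
+
+.. code-block:: yaml
+
+    dali:
+      base:
+        train:
+          sequence_length: 32   # unlabeled batch size for "base" models
+        predict:
+          sequence_length: 96   # batch size for video inference with "base" models
+      context:
+        train:
+          batch_size: 16        # unlabeled batch size for "context" models
+        predict:
+          sequence_length: 96   # batch size for video inference with "context" models
+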
"base" models that do not use temporal context frames) +* ``dali.base.predict.sequence_length`` (*int, default: 96*): batch size when predicting on a new + video with a base model +* ``dali.context.train.batch_size`` (*int, default: 16*): number of unlabeled frames per batch in + heatmap_mhcrnn model (i.e. "context" models that utilize temporal context frames) +* ``dali.context.predict.sequence_length`` (*int, default: 96*): batch size when predicting on a + new video with a "context" model + +Evaluation +========== + +The following parameters are used for general evaluation. + +* ``eval.predict_vids_after_training`` (*bool, default: true*): if true, after training (when using + scripts/train_hydra.py) run inference with the best model on all videos located in + ``eval.test_videos_directory`` (see below) + +* ``eval.test_videos_directory`` (*str, default: null*): absolute path to a video directory + containing videos for prediction; used in scripts/train_hydra.py and scripts/predict_new_vids.py + +* ``eval.save_vids_after_training`` (*bool, default: false*): save out an mp4 file with predictions + overlaid after running inference; used in scripts/train_hydra.py and scripts/predict_new_vids.py + +* ``eval.colormap`` (*str, default: cool*): colormap options for labeled videos; options include + sequential colormaps (viridis, plasma, magma, inferno, cool, etc) and diverging colormaps (RdBu, + coolwarm, Spectral, etc) + +* ``eval.confidence_thresh_for_vid`` (*float, default: 0.9*): predictions with confidence below this + value will not be plotted in the labeled videos + +* ``eval.hydra_paths`` (*list, default: []*): absolute paths to hydra output folders for use with + scripts/predict_new_vids.py (see :ref:`inference ` docs) and + scripts/create_fiftyone_dataset.py (see :ref:`FiftyOne ` docs) + +* ``eval.fiftyone.dataset_name`` (*str, default: test*): name of the FiftyOne dataset -* ``dali.base.train.sequence_length`` - number of unlabeled frames per batch in ``regression`` and ``heatmap`` models (i.e. "base" models that do not use temporal context frames) -* ``dali.base.predict.sequence_length`` - batch size when predicting on a new video with a "base" model -* ``dali.context.train.batch_size`` - number of unlabeled frames per batch in ``heatmap_mhcrnn`` model (i.e. "context" models that utilize temporal context frames); each frame in this batch will be accompanied by context frames, so the true batch size will actually be larger than this number -* ``dali.context.predict.sequence_length`` - batch size when predicting on a new video with a "context" model +* ``eval.fiftyone.model_display_names`` (*list, default: [test_model]*): shorthand name for each of + the models specified in ``hydra_paths`` diff --git a/docs/source/user_guide_advanced/context_frames.rst b/docs/source/user_guide_advanced/context_frames.rst index 04b0d805..c59db2f7 100644 --- a/docs/source/user_guide_advanced/context_frames.rst +++ b/docs/source/user_guide_advanced/context_frames.rst @@ -1,3 +1,5 @@ +.. 
diff --git a/lightning_pose/utils/scripts.py b/lightning_pose/utils/scripts.py
index 10fe13d3..c2e64909 100644
--- a/lightning_pose/utils/scripts.py
+++ b/lightning_pose/utils/scripts.py
@@ -756,5 +756,5 @@ def export_predictions_and_labeled_video(
         ys_arr=ys_arr,
         mask_array=mask_array,
         filename=labeled_mp4_file,
-        colormap=colormap=cfg.eval.get("colormap", "cool")
+        colormap=cfg.eval.get("colormap", "cool")
     )
diff --git a/scripts/configs/config_default.yaml b/scripts/configs/config_default.yaml
index 67c67f9c..b263dc31 100644
--- a/scripts/configs/config_default.yaml
+++ b/scripts/configs/config_default.yaml
@@ -27,9 +27,9 @@ data:
 
 training:
   # select from one of several predefined image/video augmentation pipelines
-  # default- resizing only
-  # dlc- imgaug pipeline implemented in DLC 2.0 package
-  # dlc-top-down- dlc augmentations plus vertical and horizontal flips
+  # default: resizing only
+  # dlc: imgaug pipeline implemented in DLC 2.0 package
+  # dlc-top-down: dlc augmentations plus vertical and horizontal flips
   imgaug: dlc
   # batch size of labeled data during training
   train_batch_size: 16
@@ -148,36 +148,36 @@ losses:
     prob_threshold: 0.05
 
 eval:
-  # paths to the hydra config files in the output folder, OR absolute paths to such folders.
-  # used in scripts/predict_new_vids.py and scripts/create_fiftyone_dataset.py
-  hydra_paths: [" "]
   # predict? used in scripts/train_hydra.py
   predict_vids_after_training: true
+  # str with an absolute path to a directory containing videos for prediction.
+  # set to null to skip automatic video prediction from train_hydra.py script
+  # used in scripts/train_hydra.py and scripts/predict_new_vids.py
+  test_videos_directory: null
   # save labeled .mp4? used in scripts/train_hydra.py and scripts/predict_new_vids.py
   save_vids_after_training: false
+  # matplotlib sequential or diverging colormap name for prediction visualization
+  # sequential options: viridis, plasma, magma, inferno, cool, etc.
+  # diverging options: RdBu, coolwarm, Spectral, etc.
+  colormap: "cool"
+  # confidence threshold for plotting a vid
+  confidence_thresh_for_vid: 0.90
+
+  # paths to the hydra config files in the output folder, OR absolute paths to such folders.
+  # used in scripts/predict_new_vids.py and scripts/create_fiftyone_dataset.py
+  hydra_paths: [" "]
+
   fiftyone:
-    # will be the name of the dataset (Mongo DB) created by FiftyOne. for video dataset, we will append dataset_name + "_video"
+    # will be the name of the dataset (Mongo DB) created by FiftyOne
     dataset_name: test
     # if you want to manually provide a different model name to be displayed in FiftyOne
     model_display_names: ["test_model"]
     # whether to launch the app from the script (True), or from ipython (and have finer control over the outputs)
     launch_app_from_script: false
-    remote: true # for LAI, must be False
     address: 127.0.0.1 # ip to launch the app on.
     port: 5151 # port to launch the app on.
-  # str with an absolute path to a directory containing videos for prediction.
-  # set to null to skip automatic video prediction from train_hydra.py script
-  # used in scripts/train_hydra.py and scripts/predict_new_vids.py
-  test_videos_directory: null
-  # matplotlib sequential or diverging colormap name for prediction visualization
-  # sequential options: viridis, plasma, magma, inferno, cool, etc.
-  # diverging options: RdBu, coolwarm, Spectral, etc.
-  colormap: "cool"
-  # confidence threshold for plotting a vid
-  confidence_thresh_for_vid: 0.90
-
 callbacks:
   anneal_weight:
     attr_name: total_unsupervised_importance