Remove Sparsity from metrics logging #124

Merged
merged 1 commit on Oct 31, 2024
2 changes: 1 addition & 1 deletion docs/algos/performances.md
@@ -12,7 +12,7 @@ For single-policy algorithms, the metric used will be the scalarized return of t

### Multi-policy algorithms
For multi-policy algorithms, we propose to rely on various metrics to assess the quality of the **discounted** Pareto Fronts (PF) or Convex Coverage Set (CCS). In general, we want to have a metric that is able to assess the convergence of the PF, a metric that is able to assess the diversity of the PF, and a hybrid metric assessing both. The metrics are implemented in `common/performance_indicators`. We propose to use the following metrics:
- * (Diversity) Sparsity: average distance between each consecutive point in the PF. From the PGMORL paper [1]. Keyword: `eval/sparsity`.
+ * **[Do not use]** (Diversity) Sparsity: average distance between each consecutive point in the PF. From the PGMORL paper [1]. Keyword: `eval/sparsity`.
* (Diversity) Cardinality: number of points in the PF. Keyword: `eval/cardinality`.
* (Convergence) IGD: a SOTA metric from Multi-Objective Optimization (MOO) literature. It requires a reference PF that we can compute a posteriori. That is, we do a merge of all the PFs found by the method and compute the IGD with respect to this reference PF. Keyword: `eval/igd`.
* (Hybrid) Hypervolume: a SOTA metric from MOO and MORL literature. Keyword: `eval/hypervolume`.
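
For illustration, a minimal sketch (editor's addition, not part of this PR) of how the remaining indicators could be computed on a toy two-objective front. It reuses only the call shapes visible in this PR's diff, `hypervolume(hv_ref_point, front)`, `igd(known_front, current_estimate)` and `cardinality(front)`, and assumes all three are importable from `morl_baselines.common.performance_indicators`; the numeric values are invented for the example.

```python
# Sketch: computing the logged indicators on a small, hand-made Pareto front.
import numpy as np

from morl_baselines.common.performance_indicators import cardinality, hypervolume, igd

# Toy 2-objective (maximization) front found by some algorithm.
current_front = [np.array([1.0, 3.0]), np.array([2.0, 2.0]), np.array([3.0, 1.0])]

# Hypervolume needs a reference point that is dominated by every point of the front.
hv_ref_point = np.array([0.0, 0.0])

# IGD needs a reference front, e.g. the a-posteriori merge of all fronts found by all methods.
reference_front = [np.array([1.0, 3.5]), np.array([2.5, 2.5]), np.array([3.5, 1.0])]

print("eval/hypervolume:", hypervolume(hv_ref_point, current_front))  # hybrid indicator
print("eval/cardinality:", cardinality(current_front))                # diversity indicator
print("eval/igd:", igd(reference_front, current_front))               # convergence indicator
```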
4 changes: 0 additions & 4 deletions morl_baselines/common/evaluation.py
@@ -16,7 +16,6 @@
hypervolume,
igd,
maximum_utility_loss,
- sparsity,
)
from morl_baselines.common.weights import equally_spaced_weights

@@ -156,7 +155,6 @@ def log_all_multi_policy_metrics(

Logged metrics:
- hypervolume
- - sparsity
- expected utility metric (EUM)
If a reference front is provided, also logs:
- Inverted generational distance (IGD)
@@ -172,14 +170,12 @@
"""
filtered_front = list(filter_pareto_dominated(current_front))
hv = hypervolume(hv_ref_point, filtered_front)
- sp = sparsity(filtered_front)
eum = expected_utility(filtered_front, weights_set=equally_spaced_weights(reward_dim, n_sample_weights))
card = cardinality(filtered_front)

wandb.log(
{
"eval/hypervolume": hv,
"eval/sparsity": sp,
"eval/eum": eum,
"eval/cardinality": card,
"global_step": global_step,
3 changes: 3 additions & 0 deletions morl_baselines/common/performance_indicators.py
@@ -42,6 +42,9 @@ def igd(known_front: List[np.ndarray], current_estimate: List[np.ndarray]) -> float:
def sparsity(front: List[np.ndarray]) -> float:
"""Sparsity metric from PGMORL.

+ (!) This metric only considers the points from the PF identified by the algorithm, not the full objective space.
+ Therefore, it is misleading (e.g. learning only one point is considered good) and we recommend not using it when comparing algorithms.

Basically, the sparsity is the average distance between each point in the front.

Args:
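
To make the new warning concrete, here is a minimal sketch (editor's addition, not the library's exact implementation) of a PGMORL-style sparsity computation: the average squared gap between consecutive points when each objective is sorted independently, with lower values read as "better". It shows why the docstring calls the metric misleading: a degenerate single-point front gets the best possible score.

```python
# Sketch of a PGMORL-style sparsity metric (illustrative only).
from typing import List

import numpy as np


def sparsity_sketch(front: List[np.ndarray]) -> float:
    """Average squared gap between consecutive points, per objective (lower is "better")."""
    if len(front) < 2:
        return 0.0  # a single-point front already achieves the lowest possible value
    points = np.array(front)
    total = 0.0
    for dim in range(points.shape[1]):
        sorted_vals = np.sort(points[:, dim])
        total += float(np.sum(np.square(np.diff(sorted_vals))))
    return total / (len(front) - 1)


# A well-spread 3-point front scores worse than a degenerate 1-point front,
# even though the former is clearly the more useful result.
print(sparsity_sketch([np.array([1.0, 3.0]), np.array([2.0, 2.0]), np.array([3.0, 1.0])]))  # 2.0
print(sparsity_sketch([np.array([3.0, 3.0])]))                                              # 0.0
```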