yardstick has reached 1.0.0 status and is stable.
For the majority of users, there are 3 types of metrics, each of which has an internal class that is defined through `new_metric()`:

- Numeric metrics
- Class metrics
- Class probability metrics

yardstick is somewhat unusual in that the functions it exports, like `accuracy()`, have extra classes and attributes attached to them. This allows them to be used in `metric_set()`, which has to decide whether two metric functions are allowed to be combined in the same metric set. For example, two numeric metric functions can be combined, but you can't combine a numeric metric with a class metric. The only exception is that you can combine a class metric with a class probability metric: in the function you get back from `metric_set()`, the class metric will use the `estimate` interface and the class probability metric will use the `…` one.
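
The combination rules above can be illustrated with a short session. This is a sketch that assumes the yardstick package is installed and uses its bundled `two_class_example` data; the object names `class_and_prob`, `res`, and `mixed` are chosen only for illustration.

```r
library(yardstick)

# A class metric and a class probability metric can share one metric set.
class_and_prob <- metric_set(accuracy, roc_auc)

# `estimate` feeds the class metric; the unnamed column passed through the
# dots (Class1) feeds the class probability metric.
res <- class_and_prob(
  two_class_example,
  truth = truth,
  Class1,
  estimate = predicted
)
res

# Mixing a numeric metric with a class metric is rejected when the set is built.
mixed <- tryCatch(metric_set(rmse, accuracy), error = function(e) e)
inherits(mixed, "error")
```

The result `res` has one row per metric, and the failed combination surfaces as an error at construction time rather than when the metric set is later called on data.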
For the current public-facing API, I don't see any major changes needing to be made. I'm fairly happy with how the 3 core metric classes work. I think most of the work for yardstick could be done on improving the internal helpers (see below in Known issues), or on extending yardstick with new metric class types (see Future directions). The internal helpers will likely have to be improved before you can add new metric class types, because they are quite complex as it is, which makes extending yardstick fairly difficult.
There is a very similar problem with `metric_vec_template()`. It currently tries to handle validation and function calling for all of the different types of metrics. This makes it extremely complex, hard to extend, and probably a bit brittle. In particular, `validate_truth_estimate_checks()` does some fairly complex S3 dispatch to perform its validation (kind of a home-grown double dispatch on `truth` and `estimate`), which could perhaps be rewritten in a cleaner way if we had separate `metric_vec_template()` functions for the different kinds of metric types.
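
To make the "home-grown double dispatch" idea concrete, here is a generic base R illustration of the pattern. Every function name in it is hypothetical and it is not yardstick's internal code; it only shows why each new `truth`/`estimate` combination tends to need another generic or method.

```r
# Hypothetical names throughout; this is NOT yardstick's internal code.

# First dispatch: on the class of `truth`.
validate_pair <- function(truth, estimate) {
  UseMethod("validate_pair")
}

# Each `truth` method hands off to a second generic whose first argument is
# `estimate`, so the second dispatch happens on the class of `estimate`.
validate_pair.factor <- function(truth, estimate) {
  validate_factor_truth(estimate, truth)
}

validate_pair.numeric <- function(truth, estimate) {
  validate_numeric_truth(estimate, truth)
}

validate_factor_truth <- function(estimate, truth) {
  UseMethod("validate_factor_truth")
}

validate_factor_truth.factor <- function(estimate, truth) {
  if (!identical(levels(truth), levels(estimate))) {
    stop("`truth` and `estimate` must have the same levels.")
  }
  invisible(TRUE)
}

validate_factor_truth.default <- function(estimate, truth) {
  stop("A factor `truth` needs a factor `estimate`.")
}

validate_numeric_truth <- function(estimate, truth) {
  UseMethod("validate_numeric_truth")
}

validate_numeric_truth.numeric <- function(estimate, truth) {
  invisible(TRUE)
}

validate_numeric_truth.default <- function(estimate, truth) {
  stop("A numeric `truth` needs a numeric `estimate`.")
}
```

Even in this toy version, supporting one more `truth` type means adding a new method on the first generic plus a whole new second-level generic, which is the kind of overhead described above.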
There are a few issues where this high cognitive overhead comes into play, making it hard to add these features:
The complexity of `validate_truth_estimate_checks()` could be reduced by instead creating a few `check_*()` helpers that we require metric writers to call themselves. If we provide useful ones, they would call them in their `metric_vec()` function, and we'd avoid the double dispatch altogether because they'd be in charge of calling the correct `check_*()` function based on the type of `truth` and `estimate` that their metric works with. Something like `check_factor_truth_factor_estimate(truth, estimate)`. That would probably help with #305.
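
A helper-based approach along these lines might look like the following base R sketch. The helper name is taken from the suggestion above, but its body is an assumption about what such a check could cover, and `my_accuracy_vec()` is a hypothetical metric used only to show the calling pattern.

```r
# Hypothetical helper: the name comes from the suggestion above, the body is
# an assumption about what it might check.
check_factor_truth_factor_estimate <- function(truth, estimate) {
  if (!is.factor(truth)) {
    stop("`truth` must be a factor.")
  }
  if (!is.factor(estimate)) {
    stop("`estimate` must be a factor.")
  }
  if (length(truth) != length(estimate)) {
    stop("`truth` and `estimate` must have the same length.")
  }
  if (!identical(levels(truth), levels(estimate))) {
    stop("`truth` and `estimate` must have the same levels.")
  }
  invisible(TRUE)
}

# Hypothetical metric: the author picks the check that matches the input
# types, so no dispatch on `truth` and `estimate` is needed.
my_accuracy_vec <- function(truth, estimate) {
  check_factor_truth_factor_estimate(truth, estimate)
  mean(truth == estimate)
}

truth <- factor(c("a", "b", "a", "b"), levels = c("a", "b"))
estimate <- factor(c("a", "b", "b", "b"), levels = c("a", "b"))
my_accuracy_vec(truth, estimate)
```

The trade-off is that responsibility moves to the metric author: forgetting to call the right `check_*()` helper would skip validation entirely, so the helpers would need to be well documented.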
- We'd like to officially hard deprecate support for `yardstick.event_first`, a global option that has been soft deprecated for a while in favor of an explicit `event_level` argument. You should definitely run revdeps after removing it, but I would be fairly aggressive about getting rid of it at this point. You can probably remove it in the next minor release. #173
- A number of people seem interested in calibration metrics and calibration curves. This might be combined with the probably package in some way. #150
- Fairness metrics seem fairly popular and might fit in yardstick, but we don't have a clear sense of how they'd be implemented. #176
- We know that we want to add survival metrics to yardstick at some point. There is a POC pull request that adds some of the basic infrastructure, but it isn't fully fleshed out (#222). Adding this to yardstick is a bit challenging for a few reasons:
  - What metrics do we actually care about adding? It isn't entirely clear if we have a list of metrics we really care about. I think it will be critical to scope our survival support narrowly; for example, the PR only supports right censoring right now. Trying to add everything at once will be hard.
  - What does the API for this look like? Right now, there is a well-defined API for how yardstick functions work. Most of them take `truth` and `estimate` vectors, which are columns in a data frame. The probability metrics take `…` instead of a single `estimate` column when they need to supply multiple estimate columns. For survival metrics, things are a bit more complex because we have to consider the censorship aspect. The PR linked above requires that `truth` be a `Surv` object, which has the censorship baked into it. For the `estimate` column, the PR requires that it is a list-of tibbles (the same length as `truth`), i.e. the output of the `predict()` function in censored. This makes the API rather complex for normal usage, but I think the goal was to align nicely for usage within tidymodels.
  - The metric in the PR is a dummy/naive ROC AUC curve for survival analysis. I imagine that other "real" survival metrics might need to take some kind of `time` argument, which would probably be a required argument, which might make the API a little tricky for use with tune and in `metric_set()`s.
  - The metric in the PR also adds a new `.time` column to the output of the metric. No other yardstick functions have this.
  - Because of the special nature of the function signature for survival metrics, and the extra `.time` column in the output, they probably can't be combined with non-survival metrics in `metric_set()`. For this reason, survival metric functions should get their own class, i.e. from `new_metric()`, and `metric_set()` should only allow metrics with that class to be combined together.
  - I think the `_vec()` variant in the PR is a little strange. All other `_vec()` variants return a single numeric value, but this one would return a data frame. It might just be how this function would have to work, but we might also consider not exporting a `_vec()` variant for survival metrics.
  - The other thing to remember about adding new `truth` types like `Surv` is that the validation and estimator helpers will have to gain new S3 methods to handle it. For example, `finalize_estimator_default()` and `validate_truth_estimate_types()` are internal generics that needed new S3 methods to handle `Surv` (they are in the PR).
-