
document definitions of column values (pvalue, zscore, etc.) #25

Open

bob-carpenter opened this issue Jul 12, 2019 · 11 comments

@bob-carpenter

The example in the README.md shows output columns

  • term
  • estimate
  • sd
  • zscore
  • lower
  • upper
  • pvalue

I could not find documentation for any of them other than pvalue, which is documented in R/utils.R as "@return A number indicating the p-value."

I hate to double-up issues like this, but I would also suggest changing some of these column names to be more informative at first glance.

  • lower and upper are presumably quantiles, but the column headers don't say which ones. In my experience, users want to be able to control the quantiles, for instance to access the median or follow Richard McElreath's quirky lead in reporting prime number intervals in Statistical Rethinking.

  • estimate doesn't say which estimate the package is using: is it a posterior mean (minimize expected square error), posterior median (minimize expected absolute error), or something else? I'm guessing it would be the posterior mean, since sd is one of the other columns.

  • pvalue and zscore don't mean anything to me in this context, other than for use in posterior predictive checks. I would have guessed that the pvalue is just the standard normal cdf of the zscore, but the values show that's not the case. This is the proximate cause for me opening the issue: to ask what these mean.

P.S. I actually came here because we're looking for general output analysis packages to recommend for use with (R)Stan.

@joethorley
Member

@bob-carpenter Thanks for your interest in mcmcr.

I will add more documentation as to what the columns return.

FWIW, the lower and upper are quantiles with the confidence level set by the user; by default they are the 2.5% and 97.5% quantiles (a 95% interval).

coef(object, conf_level = 0.95, estimate = stats::median, ...)

The estimate is defined by the user and by default is the median, although the user could of course define it to be the mean or mode.
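For concreteness, here is a minimal sketch of how such a summary could be computed from a numeric vector x of MCMC samples for a single term (a hypothetical summarise_term(), not the mcmcr internals):

summarise_term <- function(x, conf_level = 0.95, estimate = stats::median) {
  alpha <- (1 - conf_level) / 2
  c(estimate = estimate(x),
    sd = stats::sd(x),
    lower = unname(stats::quantile(x, alpha)),
    upper = unname(stats::quantile(x, 1 - alpha)))
}

summarise_term(rnorm(1000, mean = 2))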

The zscore is sd/mean.

As you surmised, the pvalue is not the standard normal cdf of the zscore. In this context it's the smallest credible interval that includes 0. I'm considering replacing it with the equivalent svalue.
I find it useful as a quick and dirty model-selection tool for nuisance parameters.

I understand that information is lost through the use of lower and upper rather than lower95 and upper95, and estimate rather than median. However, this is intentional: I want my code that tabulates or plots estimates to be independent of the method used to construct the credible intervals or the estimate. Also, I don't want my coefficient tables to have an overwhelming number of columns.

I'm a big fan of (R)Stan and the team's work, including your own.

@bob-carpenter
Author

bob-carpenter commented Jul 23, 2019 via email

@joethorley
Member

We're struggling with mean vs. median in reporting.

I prefer the median due to its insensitivity to skew in the posterior distribution, and because choosing the 50% quantile is logically similar to using the 2.5% and 97.5% quantiles for the lower and upper bounds. I used to use the mean by default until I encountered a posterior probability distribution that was so skewed the mean was greater than the 97.5% quantile! Obviously this model had serious problems, but it was nonetheless enlightening.

The zscore is sd/mean.

Why print this?

So the user can get the mean if they are so inclined (as the sd is provided), plus it's provided in a form that may have some meaning to users who are more familiar with maximum-likelihood-based methods.

Also I should add that I use the same coefficient table for both Bayesian and maximum-likelihood-based methods, and the ML users typically want the zscore. This is another reason why I use lower instead of lower95.

How can there be a smallest credible interval that contains 0? If 0 is in the posterior support and it's continuous, there's no smallest posterior interval containing 0.

We may be talking at cross purposes here but to my mind the interval 0 - 1 includes 0.

@bob-carpenter
Author

bob-carpenter commented Jul 23, 2019 via email

@joethorley
Member

I like to think of this more decision-theoretically. The posterior median is the point estimate that minimizes expected absolute error, whereas the posterior mean minimizes expected squared error (just like an average minimizes the sum of squared distances to all points). The MLE doesn't have a simple decision-theoretic interpretation that I know.

Fair enough. However, I also prefer the median over the mean due to its insensitivity to transformation of the posterior probability distribution. This is a more important property to me than the decision-theoretic interpretation.
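As a quick numerical check of the decision-theoretic claims (a sketch with simulated draws, not package code):

set.seed(1)
x <- rexp(1e5)  # deliberately skewed sample

# the point estimate minimising mean squared error is (essentially) the mean
optimize(function(m) mean((x - m)^2), range(x))$minimum

# the point estimate minimising mean absolute error is (essentially) the median
optimize(function(m) mean(abs(x - m)), range(x))$minimum

mean(x)    # ~1
median(x)  # ~0.69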

In some cases, we need the posterior mean. For example, if I want to compute an event probability, it's defined as the posterior mean of an indicator function for the event, e.g.,

Pr[theta > 0 | y] = INTEGRAL I[theta > 0] * p(theta | y) d.theta

which is the posterior mean of the indicator I[theta > 0] under p(theta | y). This is why I've been pushing back against those in the Stan project who want to go all in on medians.
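In sampling terms (a hypothetical illustration, with theta_draws standing in for posterior draws of theta), that integral is just the mean of an indicator:

theta_draws <- rnorm(4000, mean = 0.3, sd = 0.5)  # stand-in posterior draws
mean(theta_draws > 0)  # Monte Carlo estimate of Pr[theta > 0 | y]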

I agree that in this instance the mean is required - hence the z-score (see below). However, in my experience such situations are relatively rare and therefore do not justify exclusive use of the mean. Also I think it's important to distinguish between primary parameters that are being estimated and derived parameters that can be determined from the posterior probability distributions for the primary parameters.

https://github.com/poissonconsulting/mcmcderive

I suspect event probabilities are always derived parameters but would be interested to hear of a counter example.

Medians are also tricky for other discrete parameters.

I suspect the same arguments as above apply.

What's this number mean to an MLE method? I went from an ML world where nobody quantified uncertainty to a Bayesian world, so I don't know what a frequentist might do with sd/mean, which is why I was asking. A reference is fine; I don't need a long explanation.

Sorry, it's mean/sd (not sd/mean as I erroneously stated above)! It was what I came up with (no reference) as a Bayesian equivalent to the frequentist z-score (or z-statistic). I know that mode/sd might be closer to the frequentist form; however, mean/sd has the advantage of allowing the user to quickly extract the mean estimate given the sd.

A reference for the z-score/z-statistic is of course

Millar, R.B. 2011. Maximum likelihood estimation and inference: with examples in R, SAS, and ADMB. Wiley, Chichester, West Sussex.

The interval [0, 1] = { x : 0 <= x <= 1 } contains 0, but (0, 1) = { x : 0 < x < 1 } does not.

True - I was being imprecise.

My point was just that if [0, 1] is an interval that contains 0, there's no smallest such interval because [0, epsilon] contains 0 for epsilon = 1, 1/2, 1/4, 1/8, ... There's no smallest such epsilon.

Point taken. I should have said something like the smallest percentile-based credible interval that includes 0.

@bob-carpenter
Author

bob-carpenter commented Jul 23, 2019 via email

@joethorley
Member

It's not insensitive to transformation. If I have y = (1, 2, 3), median(y) = 2, median(y^2) = 4, and median(y + 1) = 3.

OK, I'm being imprecise again. What I mean is that median(log(x)) = log(median(x)), but this is not the case with the mean. I think this property is highly desirable.

> median(log(1:100))
[1] 3.921924
> log(median(1:100))
[1] 3.921973
> mean(log(1:100))
[1] 3.637394
> log(mean(1:100))
[1] 3.921973

I didn't mean to urge exclusive use of the mean---I was suggesting that which one you want will depend on the decision-theoretic (or summarization) task at hand.

I agree - I just find that in general the median seems to be the estimate that I want.

Why? Often the quantities of interest are derived. For example, I may build a model where the overall risk score for a disease is a derived parameter. In Stan, we have probabilistically derived predictive parameters (in the generated quantities block) that may depend on parameters plus some randomness, but do not affect estimation.

Because they slow model fitting and make models harder to understand. Also after fitting a model that takes a long time to converge the user might think of a derived parameter they want that they forgot to include in the model. Finally, one might want to combine parameters from separate models.

They're usually derived. But if you have theta ~ beta(5, 2) and y ~ bernoulli(theta), then Pr[y = 1] = theta.

Although in this case doesn't the median (or mean) of theta give you what you want, with the advantage of informative credible intervals (without having to get involved with y)?

I'm thinking of discrete parameters like cut points in a change-point model or responsibilities in a mixture model or latent counts in a missing-discrete-data model. All of these look like parameters in general. They'll be derived in expectation in Stan because the parameters are marginalized out, but could also be coded inefficiently as parameters in JAGS or PyMC3.

These are good examples with which I am familiar. However, my depth of understanding of STAN is insufficient to fully understand your statement "They'll be derived in expectation in Stan because the parameters are marginalized out, but could also be coded inefficiently as parameters in JAGS or PyMC3."

I'm not quite curious enough to buy a book. A quick web search revealed something called the "coefficient of variation":

https://en.wikipedia.org/wiki/Coefficient_of_variation

which it describes as a "standardized measure of dispersion". I'm more surprised that any one person would be familiar with R, SAS, and ADMB.

The coefficient of variation is the inverse of the z-score (also known as z-statistic or standard score) where x = 0:

https://en.wikipedia.org/wiki/Standard_score

I don't use SAS, but ADMB is close to being the frequentist equivalent of STAN (it uses C++ to define models), though it also does MCMC sampling. I prefer the newer TMB, which is an R package.
I have written prototypes of a family of packages that use the same functions to interface with STAN, JAGS and TMB. We use them all the time in our work.

https://github.com/poissonconsulting/mbr

You might also be interested in the following package (I've never used it)

https://cran.r-project.org/web/packages/tmbstan/index.html

Is there a worked example somewhere where I don't have to read R code?

No, but the R function is relatively straightforward (where x is a vector of the MCMC samples for a single term):

pvalue <- function(x) {
  n <- length(x)
  d <- sum(x >= 0)          # draws on the non-negative side of zero
  p <- min(d, n - d) * 2    # two-sided: double the smaller tail count
  p <- max(p, 1)            # floor at one draw so p is never exactly 0
  round(p / n, 4)
}

See https://github.com/poissonconsulting/mcmcr/blob/master/R/utils.R
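For illustration, a quick usage sketch (my example, not from the package documentation):

set.seed(42)
pvalue(rnorm(1000, mean = 3))  # small: zero lies far in the lower tail
pvalue(rnorm(1000))            # ~1: zero sits in the middle of the draws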

@bob-carpenter
Author

bob-carpenter commented Jul 24, 2019 via email

@joethorley
Member

That's only strictly true if size(x) is odd. There's a lot of latitude in how medians are calculated if there are even numbers of entries. Here's the default in R:

median(log(c(3, 7, 110, 111))) == 3.3
log(median(c(3, 7, 110, 111))) == 4.1

You are correct, although in the case of MCMC samples, which tend to have very similar values around the 50% quantile, the problem is likely to be negligible. Plus users generally tend to generate an even number of samples.
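A quick check of that claim (a sketch with simulated draws): with thousands of samples the two central order statistics are nearly identical, so the interpolation barely matters even for an even-length vector.

set.seed(1)
x <- rlnorm(4000)  # an even number of positive draws
median(log(x))     # nearly identical to ...
log(median(x))     # ... because the adjacent central draws are so close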

To try to deal with the other issues (complex models, deciding to do other posterior analysis later), we just added a feature in 2.19 that lets you take the parameter draws from one run and then rerun them to recompute generated quantities. It's in CmdStan now and should be out soon in CmdStanPy and maybe even in RStan or PyStan.

Good to know.

The posterior mean of theta is Pr[y = 1]. The median is different because it's an asymmetric beta distribution.

OK I understand.
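For concreteness, a quick check of the beta(5, 2) example in R:

c(mean = 5 / (5 + 2),        # Pr[y = 1] = E[theta | y], ~0.714
  median = qbeta(0.5, 5, 2)) # ~0.736, differs because beta(5, 2) is skewed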

There's a nice paper by Cole Monnahan where he shows how to do the marginalizations in JAGS too, which greatly speeds up JAGS models.

Thanks for the reference.

Thanks for the link. I just got the pun in your name now that I'm thinking about fisheries models.

It amuses me.

Thanks for the interesting and informative discussion.

I'm thinking that the best way forward will be for me to introduce a function coef_stan() that returns a coefficient table consistent with the table output by STAN?

@bob-carpenter
Author

bob-carpenter commented Jul 24, 2019 via email

@joethorley
Member

The even-sized sequences are where you need interpolation for medians.

Face palm.

I think it'd make more sense for your package to make it self-consistent rather than consistent with Stan. Now if it's possible to have a mode of output that looks like our output, that might be helpful for users of all interfaces.

What I am proposing is to keep the current coef.mcmcr() function but add a second one, coef_stan.mcmcr(), that includes the mean and explicitly labelled quantiles. I would make it as close as possible to the Stan output.

The obstacle is that we haven't even settled on a unified output design. All of our interfaces provide different default output. We haven't even been able to do that among just the Stan developers yet.

I guess I would try to make it as close as possible to what I consider most Stanish.

The biggest development we want to make going forward around this is to use ragged chains of different lengths. Right now, all of our R-hat and ESS calculations rely on the chains being the same length. The trick's going to be weighting them, with obvious extremes being by-draw and by-chain. We need to avoid the problem of just throwing away a chain if it gets stuck, as that can introduce bias into the posterior draws. The motivation's running a bunch of parallel chains for a fixed time, rather than a fixed number of iterations.

Interesting... I guess mcmcr could be expanded to pad out shorter chains with missing values... I need to think this over.
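One way that padding could look (a hypothetical sketch, not an existing mcmcr feature), where chains is a list of numeric vectors of unequal length:

pad_chains <- function(chains) {
  n <- max(lengths(chains))
  vapply(chains, function(x) c(x, rep(NA_real_, n - length(x))), numeric(n))
}

pad_chains(list(rnorm(100), rnorm(80)))  # 100 x 2 matrix, NA-padded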

P.S. "Stan" isn't an acronym---it was named after Stanislaw Ulam with a wink at the Eminem song.

Nice
