bumbl submission #2

Open
mpadge opened this issue Sep 23, 2020 · 23 comments

mpadge commented Sep 23, 2020

Thanks @Aariq for volunteering to submit your bumbl package for trial "soft submission" to rOpenSci's new system for peer review of statistical software. This issue includes output from our automated assessment and reporting tools developed as part of this new system. These currently function as the following two distinct components (which will be integrated later):

packgraph

library(packgraph)
packageVersion ("packgraph")
#> [1] '0.0.0.7'
package <- "/<local>/<path>/<to>/bumbl"

g <- pg_graph (package, plot = FALSE)
pg_report (g)
#> ══ bumbl ═══════════════════════════════════════════════════════════════════════
#> 
#> The package has 3 exported functions, and 2 non-exported functions. The exported
#> functions are structured into the following 0 primary cluster containing
#> function
#> 
#> There are also 3 isolated functions:
#> 
#> 
#> |  n|name       | loc|
#> |--:|:----------|---:|
#> |  1|bumbl      | 115|
#> |  2|bumbl_plot |  42|
#> |  3|bumbl.nb   | 115|
#> 
#> ── Documentation of non-exported functions ─────────────────────────────────────
#> 
#> 
#> |value  | doclines| cmtlines|
#> |:------|--------:|--------:|
#> |mean   |       20|        7|
#> |median |       20|        7|

autotest

package <- "/<local>/<path>/<to>/bumbl"
library (autotest)
packageVersion ("autotest")
#> [1] '0.0.1.230'
x <- autotest_package (package)
#> Loading bumbl
#> ✔ [1 / 1]: bumbl
x$yaml_hash <- substring (x$yaml_hash, 1, 6)
knitr::kable (x, row.names = TRUE)
|   |type       |fn_name |parameter |operation                                          |content                                                                                                     |yaml_hash |
|--:|:----------|:-------|:---------|:--------------------------------------------------|:------------------------------------------------------------------------------------------------------------|:---------|
|  1|diagnostic |bumbl   |data      |tabular structure with new class structure         |Function [bumbl] should error when class structure of data.frame input is removed.                            |0f67b3    |
|  2|error      |bumbl   |formula   |(unquoted) formula param as (quoted) character      |$ operator is invalid for atomic vectors                                                                       |0f67b3    |
|  3|diagnostic |bumbl   |augment   |substitute integer values for logical parameter    |Parameter augment of function [bumbl] is assumed to be logical, but responds to general integer values.       |0f67b3    |
|  4|diagnostic |bumbl   |augment   |substitute character values for logical parameter  |Parameter augment of function [bumbl] is assumed to be logical, but responds to character input               |0f67b3    |
|  5|diagnostic |bumbl   |augment   |length 2 vector for single-length parameter        |parameter [augment] is assumed to be a single value of logical type, yet admits vectors of length > 1         |0f67b3    |
|  6|diagnostic |bumbl   |NA        |compare class of return value with description     |Function returns an object of class [data.frame] yet documentation describes class of value as [dataframe]    |0f67b3    |
|  7|diagnostic |bumbl   |NA        |compare class of return value with description     |Function returns an object of primary class [bumbldf] yet documentation says value is of class [dataframe]    |0f67b3    |
summary (x)
#> autotesting package [bumbl, v0.0.1.9000] generated 7 rows of output of the following types:
#>      1 error
#>      0 warnings
#>      0 messages
#>      6 other diagnostics
#> That corresponds to 7 messages per documented function (which has examples)
#> 
#>   fn_name num_errors num_warnings num_messages num_diagnostics
#> 1   bumbl          1           NA           NA               6
#> 
#>     git hash for package as analysed here:
#>     [1bf981190f6d17a98e866b31138a1b10127942bc]

Created on 2020-09-27 by the reprex package (v0.3.0)

The output of autotest includes a column yaml_hash. This in turn refers to the yaml specification used to generate the autotests, which can be generated locally by running examples_to_yaml (<path>/<to>/<package>). Those contain the yaml_hash values, and finding the matching value will show you the base code used to trigger the diagnostic messages. The operation column should then provide a sufficient description of what has been mutated with regard to the structure defined in the yaml.
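
For instance, a minimal sketch of how the matching specification might be located (the structure of the examples_to_yaml() return value is assumed here, so treat this purely as an illustration):

library (autotest)
yaml <- examples_to_yaml ("/<local>/<path>/<to>/bumbl")
# find the specification(s) containing the (truncated) hash reported above:
matches <- vapply (yaml, function (y) any (grepl ("0f67b3", unlist (y))), logical (1))
yaml [matches]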

As there is only one primary function here, you should be able to address most of the aspects raised in the autotest output.

Aariq commented Sep 25, 2020

I'm not sure I understand tests #2 or 6. For 2, is it suggesting that you should be able to pass a formula as a character vector? For 6, I have documented what is returned with @return, but maybe it doesn't pick this up because it uses an itemized list? Or am I just misunderstanding? Similarly, bumbl_plot does have an example.

mpadge commented Sep 27, 2020

Thanks for the response @Aariq. This should help clarify:

For 2, is it suggesting that you should be able to pass a formula as a character vector?

Not necessarily, but it should certainly be read as flagging an unhelpful error message, with the corresponding implication that either formulae passed as character vectors should be appropriately processed, or that an appropriately informative error should be issued when that is attempted.
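
A generic sketch (not bumbl's actual code) of the two acceptable behaviours described above:

check_formula <- function (formula) {
    if (is.character (formula)) {
        formula <- stats::as.formula (formula) # option 1: coerce character input
    }
    if (!inherits (formula, "formula")) {
        # option 2: fail with an informative message rather than an obscure one
        stop ("`formula` must be a formula, or a character string coercible ",
              "to one (for example \"d.mass ~ week\").", call. = FALSE)
    }
    formula
}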

For 6, I have documented what is returned with @return, but maybe it doesn't pick this up because it uses an itemized list? Or am I just misunderstanding?

Thanks for that - this workflow is literally being test-run by, and developed in response to, these submissions. The checking of return values has been updated in the output above, and should now make your suggested course of action clearer.

Similarly, bumbl_plot does have an example.

That was also a glitch which has now been rectified, and no longer appears in your output. Please feel free to ask here (or via private slack channels) about any further aspects which we or the code might have failed to clarify. Thanks!

mpadge commented Oct 13, 2020

Thanks @Aariq for your submission and the improvements that have already been made to the package. The few remaining diagnostic messages produced by autotest can safely be ignored. (autotest will soon allow user control of which tests are actually performed, so the messages currently given for bumbl will be able to be switched off, and you should get a perfectly clean result. The switching-off will also enable you to explain why, with that information potentially used to inform reviewers.)

That means we are now able to proceed to the formal review stage, for which members of the project's advisory board @rkillick and @Paula-Moraga have kindly agreed to review your package. They are now requested to perform a two-stage review, the first part involving assessment of the package against the standards as they are currently drafted, with the second being a more "traditional" review. We hope, by the time we proceed to this second component, that many aspects which might otherwise have arisen within a "traditional" unstructured review will already have been covered, and will thereby make the review process notably easier.

Our review system will ultimately perform much of the preceding automated assessment prior to actual submission, and reviewers will be provided with a high-level interactive graphical overview of the package's functions and their inter-relationships. Fortunately here, the package has sufficiently few functions that the graph can just be pasted here in static form:

[static image: function call graph of bumbl generated by packgraph]

(That can be reproduced by running packgraph::pg_graph("/<local>/<path>/<to>/bumbl").)


Instructions for review

@rkillick and @Paula-Moraga, could you please now assess the bumbl package with respect to the current General Standards for Statistical Software, and the category-specific standards for Time Series Software.

Please do this in two phases:

  • Review the package as it stood at the time of initial assessment (git hash 1bf981, as given above).
  • Having done that, check out the package in its current state, and note any differences.

In each case, please only note those standards which you judge the package not to conform to, along with a description of what you would expect this particular software package to do in order to conform to each standard. When you do that, please provide sufficient information on which standard you are referring to. (The standards themselves are all enumerated, but not yet at a necessarily stable state, so please provide enough information for anyone to clearly know which standard you are referring to regardless of potential changes in nomenclature.) Please also note as a separate list all those standards which you think should not apply to this package, along with brief explanations of why.

Importantly, to aid us in refining the standards which will ultimately guide the peer review of statistical software, we also ask you to please consider whether you perceive any aspects of software (design, functionality, algorithmic implementations or applications, testing, and any other aspects you can think of) which you think might be able to be addressed by standards, and yet which are not addressed by our standards in their current form.

To sum up, please post the following in this issue:

  1. Your 2 itemized lists of standards which the software initially failed to meet, along with an indication of which of that subset of standards the most recent version then actually meets. Please provide git hashes for the repository head in each case.
  2. An indication of any aspects of the software which you think are not addressed by the current standards yet could (potentially) be.

Once you've done that, we'll ask you to proceed to a more general review of the software, for which we'll provide more detail at that time. Thanks all for agreeing to be part of this!

mpadge commented Oct 16, 2020

@rkillick, @Paula-Moraga I neglected one piece of important information:

Due date

We would like to have this review phase completed within 4 weeks, which we'll time from today, so by the 13th of November 2020. We accordingly suggest that you aim to have the first of the two tasks completed within two weeks, by the 30th October.

Could you both please also record approximately how much time you have spent on each review stage. Thank you!

mpadge commented Oct 20, 2020

Update for reviewers @rkillick, @Paula-Moraga, note that this repo now includes an R package which enables you to get a pre-formatted checklist for your reviews (inspired by, and with gratitude to, co-board member @stephaniehicks) by running the following lines:

remotes::install_github("ropenscilabs/statistical-software-review")
library(statsoftrev) # the name of the package
rssr_standards_checklist (category = "time-series")

That will produce a markdown-formatted checklist in your clipboard ready to paste where you like, or you can use a filename parameter to specify a local file.
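
For example (the filename argument is as described above; the file name itself is arbitrary):

rssr_standards_checklist (category = "time-series", filename = "bumbl-review-checklist.md")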

ping @noamross so you'll be notified of these conversations.

rkillick commented Oct 27, 2020

Initial Review of Bumbl (git hash 1bf981):
I've not used this package previously so this is a review as a newbie.

Applying General Software Standards:

  • G1.0 Statistical Software should list at least one primary reference from published academic literature.
    Not met as no paper reference in the documentation or package description

  • G1.1 All statistical terminology should be clarified and unambiguously defined.
    Not met. As one example the bumbl function states "Fits models" with no reference in the details for what model forms or assumptions are made - mathematically speaking.

  • G1.3 Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.
    Not met. When I build the vignette I get errors on the convergence of the glm fit.

  • G2.1 Provide explicit secondary documentation of expectations on data types of all vector inputs (see the above list).
    Not met. It is not clear that the columns of the data frame for input into the bumbl function need to be numeric

  • G2.10 Statistical Software should implement appropriate checks for missing data as part of initial pre-processing prior to passing data to analytic algorithms.
    Not met. Checks are there for missing colony names and missing output but not for missing data in rows.

  • G2.11 Where possible, all functions should provide options for users to specify how to handle missing (NA) data
    Not met. No options for user-defined NA handling exist.

  • G2.12 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters
    Not met. mutate() is called on line 324 of colony-growth.R with na.rm = TRUE.

  • G2.13 All functions should also provide options to handle undefined values (e.g., NaN, Inf and -Inf), including potentially ignoring or removing such values.
    Not met. No handling of this in functions.

Test coverage:
bumbl Coverage: 89.96%
R/plotting.R: 76.67%
R/colony-growth.R: 91.63%

  • G4.2b Explicit tests should demonstrate conditions which trigger every one of those messages, and should compare the result with expected values.
    Not met. Example: the warning "Some taus were not used because they were outside of range of" is not covered.

  • G4.4 Correctness tests to test that statistical algorithms produce expected results to some fixed test data sets
    Not met, no correctness tests in testing, just class and no warnings. Skipping the rest of the guidelines for parameter correctness etc. as none are given. All not met.

  • G4.5 Correctness tests should be run with a fixed random seed
    Not met. No seed in the sample command for the plotting example

  • G4.8 Edge condition tests to test that these conditions produce expected behaviour such as clear warnings or errors when confronted with data with extreme properties
    Not met. None included.

  • G4.9 Noise susceptibility tests Packages should test for expected stochastic behaviour
    Not met. None included, I would expect to see small perturbations in the input data using jitter() to give the same output for tau and very similar parameter estimates.

  • G.4.10-12 Extended testing.
    All not met as no extended tests are provided. For this package longer datasets will cause problems with the default "taus" argument behaviour.

Applying Time Series software standards:

  • TS1.0 Time Series Software should explicitly document the types and classes of input data able to be passed to each function
    Not met. Type assumptions in df are not described.

  • TS1.1 Time Series Software should accept input data in as many time series specific classes as possible.
    No time series classes are supported. As such the user could enter the rows in the "wrong" unnatural time order and this is not checked by the software. This drastically affects all output and plotting.

  • TS1.2 Time Series Software should implement validation routines to confirm that inputs are of acceptable classes
    Not met. No validation of inputs completed.

  • TS1.7 Accept inputs defined via the units package for attributing SI units to R vectors.
    Not clear if this is supported or whether it should be.

  • TS1.8 Where time intervals or periods may be days or months, be explicit about the system used to represent such, particularly regarding whether a calendar system is used, or whether a year is presumed to have 365 days, 365.2422 days, or some other value.
    Documentation for bumbl function is woolly on this, needs clarification.

  • TS2.0 Appropriate checks for missing data, and associated transformation routines, should be performed as part of initial pre-processing prior to passing data to analytic algorithms.
    No checks for missing data are completed.

  • TS2.1 Time Series Software which presumes or requires regular data should only allow explicit* missing values, and should issue appropriate diagnostic messages, potentially including errors, in response to any implicit missing values.*
    No handling as no checks are made.

  • TS2.2 Where possible, all functions should provide options for users to specify how to handle missing data,
    No options are given to the user

  • TS2.3 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters
    Not met. Example: mutate() is called on line 324 of colony-growth.R with na.rm = TRUE.

  • TS2.5 Explicitly document all assumptions and/or requirements of stationarity
    Not met. Assumptions of independence also required to be clarified.

  • TS2.6 Implement appropriate checks for all relevant forms of stationarity,
    Not met. In this instance, some model diagnostics would be appropriate.

  • TS5.0 Implement default plot methods for any implemented class system.
    Not met. Should use the plot method.

  • TS5.5 Provide options to determine whether plots of data with missing values should generate continuous or broken lines
    No handling of missing values so not met.

@rkillick

NOTE: all my issues still arise in the new version (git hash ba1371); only the line-number references differ.

@rkillick

Not covered in standards but should be:

  • Visualizations being appropriate; this should be in the general standards section. Within the review the only bit currently covered was that the time axes should be labelled appropriately and should usually be on the x-axis. This isn't enough.
  • Mapping some of the specific standards to 1-d input variables onto tabular input. For instance, 1-d input specifies the type should be checked. This should also apply to the type for tabular input (necessarily each column for data frames).

mpadge commented Oct 28, 2020

Thanks @rkillick both for the checklist of points, and those final suggestions for improving the standards. It would be great if you could update current standards by inserting any new aspects which you think could or should be inserted - there are explicit instructions on the source repo for the standards.

@Aariq the next steps for you will be to make any modifications you deem necessary in response both to @rkillick's points above, and those of @Paula-Moraga. It's up to you whether you decide to already respond to these comments, or wait for equivalent comments from @Paula-Moraga.

@Paula-Moraga

Standards that package fails to meet

git hash ba1371a

Documentation

  • G1.2 Software should use roxygen to document all functions.

Documentation should be improved. For example, the documentation of functions bumbl() and bumbl.nb() is the same and the user cannot understand the difference between the two functions unless they read the R code. The documentation says bumbl.nb(): passes arguments to MASS::glm.nb() rather than glm(). This is not enough to understand the differences between the two functions. The same happens with brkpt() and brkpt.nb(). The documentation should clearly state the purpose of each function.

Also the documentation of brkpt() says the value is "a tibble with a column for the winning tau and a column for the winning model". This description could be improved with an explanation of what tau is and what the winning model is.

  • G1.3 Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.

The package does not provide the code to reproduce the results given in the reference provided (Crone and Williams (2016)).

Input Structures

  • G2.3a Use match.arg() or equivalent where applicable to only permit expected values.

Functions should check the type of arguments. For example, right now we can call bumbl() by providing a character value for taus.

bumbl(bombus, colonyID = colony, t = week, formula = d.mass ~ week, taus = "a")
  • G2.10 Statistical Software should implement appropriate checks for missing data as part of initial pre-processing prior to passing data to analytic algorithms.

Missing data are not checked.

  • G2.11 Where possible, all functions should provide options for users to specify how to handle missing (NA) data.

The package does not provide options to handle missing data.

  • G2.12 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters (such as mean(), sd() or cor()).

There are functions that could pass data with missing values to base routines with default na.rm = FALSE. For example, in
line 45 of https://github.com/Aariq/bumbl/blob/master/R/colony-growth.R
bumbl() executes

tvec <- data[[tvar]]
taus <- seq(min(tvec), max(tvec), length.out = 50)

where data[[tvar]] may have missing values:

bombus2 <- bombus
bombus2$week[1] <- NA 
out <- bumbl(bombus2, colonyID = colony, t = week, formula = d.mass ~ week)
  • G2.13 All functions should also provide options to handle undefined values (e.g., NaN, Inf and -Inf), including potentially ignoring or removing such values.

The package does not provide options to handle undefined data.

Input data structures and validation

  • TS1.0 Time Series Software should explicitly document the types and classes of input data able to be passed to each function.

This is not well documented. For example, bumbl() documentation says argument data should be a data.frame but does not describe what should be the columns of this data.frame.

  • TS1.1 Time Series Software should accept input data in as many time series specific classes as possible.

This is not done. The documentation does not explain the type of data that should be passed, and the R code does not check the arguments passed.

  • TS1.2 Time Series Software should implement validation routines to confirm that inputs are of acceptable classes (or represented in otherwise appropriate ways for software which does not use class systems).

This is not done. For example, one can call bumbl() with data that has a character in argument t and this error is not handled.

bombus2 <- bombus
bombus2$week[1] <- "a" 
out <- bumbl(bombus2, colonyID = colony, t = week, formula = d.mass ~ week)
  • TS1.3 Time Series Software should implement a single pre-processing routine to validate input data, and to appropriately transform it to a single uniform type to be passed to all subsequent data-processing functions (the tsbox package provides one convenient approach for this).

This is not done. There is some validation code at the beginning of the bumbl() function. Then other functions such as brkpt() do not execute this validation code.

Pre-processing and Variable Transformation

  • TS2.0 Appropriate checks for missing data, and associated transformation routines, should be performed as part of initial pre-processing prior to passing data to analytic algorithms.

The package does not check for missing data.
For example, one can execute bumbl() where one d.mass value is NA.

bombus2 <- bombus
bombus2$d.mass[1] <- NA
out <- bumbl(bombus2, colonyID = colony, t = week, formula = d.mass ~ week)
  • TS2.2 Where possible, all functions should provide options for users to specify how to handle missing data.

The package does not provide functions to specify how to handle missing data.

  • TS2.3 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters (such as mean(), sd() or var()).

There are functions that could pass data with missing values to base routines with default na.rm = FALSE. For example, in
line 45 of https://github.com/Aariq/bumbl/blob/master/R/colony-growth.R
bumbl() executes

tvec <- data[[tvar]]
taus <- seq(min(tvec), max(tvec), length.out = 50)

where data[[tvar]] may have missing values:

bombus2 <- bombus
bombus2$week[1] <- NA 
out <- bumbl(bombus2, colonyID = colony, t = week, formula = d.mass ~ week)
  • TS2.5 Explicitly document all assumptions and/or requirements of stationarity

Documentation should be improved to explain the type of arguments to be passed and possible assumptions.

Visualization

  • TS5.5 Provide options to determine whether plots of data with missing values should generate continuous or broken lines.

bumbl_plot() does not provide options to determine how to visualize missing data.

  • TS5.8 By default provide clear visual distinction between model (input) values and forecast (output) values.

Plots clearly distinguish between data which are represented by points and model predictions which are represented with red lines, but a legend would help in understanding the plot better.

Other aspects not covered in the current standards

  • It would be helpful to include in the documentation a description of the problems that the package can solve and a summary of the methods implemented without the need to look at the reference provided. This description should be written in a way that is understandable for general users. For example, in this package terms such as "gyne" could be unknown for some users.

  • There is one reference to the method implemented but it is behind a paywall. If possible the reference should be accessible by everyone.

  • I think repeated code should be avoided. In this package, functions brkpt() and brkpt.nb(), and functions bumbl() and bumbl.nb() are very similar.

mpadge commented Oct 29, 2020

Thanks so much @Paula-Moraga! Could you please give an estimate of how long you spent on that stage of the review?

And @Aariq, this checklist was intentionally compared with your package at the time this issue was opened, not with the current version. Could you please indicate whether the current version has addressed all of those issues; if not, please notify here once you have either addressed all issues or explained why you deem them not to be relevant. Thanks!

@Paula-Moraga

@mpadge It took me around 3 hours to read the paper and review the package.

@rkillick

@mpadge Similar amount of time for me, around 3.5 hours for the first pass and then an hour for the second pass. Faster the second time as I'd just done the first and was now familiar with the package!

Aariq commented Nov 4, 2020

Thank you @rkillick and @Paula-Moraga for the comments! I made a lot of improvements (I think) based on your suggestions and I'll try to post a response to your comments in the next few days. Since there are a lot of comments, I think I will address both of your suggestions together, but break up my responses by the standard (e.g. G1, G2, G4, TS1, etc.)

Aariq commented Nov 4, 2020

G1

G1.0 Statistical Software should list at least one primary reference from published academic literature.

@rkillick:
Not met as no paper reference in the documentation or package description

  • added a literature reference to bumbl documentation

G1.1 All statistical terminology should be clarified and unambiguously defined.

@rkillick:
Not met. As one example the bumbl function states "Fits models" with no reference in the details for what model forms or assumptions are made - mathematically speaking.

  • added "generalized linear models" to the function description for bumbl
  • "Details" section now refers users to the vignette which has the math laid out.
  • Clarified meaning of "decay" in output of bumbl in documentation.

G1.2 Software should use roxygen to document all functions.

@Paula-Moraga:
Documentation should be improved. For example, the documentation of functions bumbl() and bumbl.nb() is the same and the user cannot understand the difference between the two functions unless they read the R code. The documentation says bumbl.nb(): passes arguments to MASS::glm.nb() rather than glm(). This is not enough to understand the differences between the two functions.

  • I've removed bumbl.nb() and brkpt.nb() and instead made it so when family = "negbin", glm.nb() is used.
  • I've clarified the behavior and meaning of the family argument in the documentation.

@Paula-Moraga:
The same happens with brkpt() and brkpt.nb(). The documentation should clearly state the purpose of each function. Also the documentation of brkpt() says the value is "a tibble with a column for the winning tau and a column for the winning model". This description can be improved with an explanation of what is tau and what is the winning model.

  • brkpt is not exported, so I didn't spend much time on the documentation of this function. I've added a note that this function is typically only used internally by bumbl(). bumbl() has better documentation on what tau and the winning model are.

G1.3 Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.

@rkillick:
Not met. When I build the vignette I get errors on the convergence of the glm fit.

  • For me these are warnings, not errors, and I can build the vignette just fine.
  • I've added a column to the output of bumbl() that shows whether the final model converged or not
  • It turns out these warnings were not coming from any of the winning models (with the tau that maximizes likelihood).
  • I've silenced all warnings and messages from these intermediate models in brkpt(). Convergence warnings in the "winning" model should still show up.

@Paula-Moraga:
The package does not provide the code to reproduce the results given in the reference provided (Crone and Williams (2016)).

  • The reference did extensive comparative testing of several methods for modeling bumblebee colony growth. The bumbl package is intended to be a simplified adaptation of the code used in Crone and Williams (2016) that implemented the best of those methods.
  • The bumbl() function is based on glm() while Crone and Williams (2016) used generalized linear mixed effects models with the lme4 package. Therefore I can't reproduce their results with bumbl().

Aariq commented Nov 5, 2020

G2

G2.1 Provide explicit secondary documentation of expectations on data types of all vector inputs (see the above list).

@rkillick:
Not met. It is not clear that the columns of the data frame for input into the bumbl function need to be numeric

  • added a description of the minimum requirements of the data frame passed in as data

G2.3a Use match.arg() or equivalent where applicable to only permit expected values.

@Paula-Moraga:
Functions should check the type of arguments. For example, right now we can call bumbl() by providing a character value for taus.

  • bumbl() should now check that taus are numeric
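
A minimal sketch of that kind of check (not necessarily the exact implementation; a NULL default for taus is assumed):

if (!is.null (taus) && !is.numeric (taus)) {
    stop ("`taus` must be a numeric vector", call. = FALSE)
}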

G2.10 Statistical Software should implement appropriate checks for missing data as part of initial pre-processing prior to passing data to analytic algorithms.

@rkillick:
Not met. Checks are there for missing colony names and missing output but not for missing data in rows.
@Paula-Moraga:
Missing data are not checked.

  • I've added a check that there are no missing values in the time variable. However, it doesn't make sense to me for bumbl to check for missing data in other variables because it passes data to glm which has methods for handling missing data.
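
A sketch of that check, assuming the time column has already been resolved to a character name tvar (not bumbl's exact code):

if (anyNA (data [[tvar]])) {
    stop ("The time variable '", tvar, "' contains missing values", call. = FALSE)
}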

G2.11 Where possible, all functions should provide options for users to specify how to handle missing (NA) data

@rkillick:
Not met. No options for user-defined NA handling exist.
@Paula-Moraga:
The package does not provide options to handle missing data.

  • The user can pass na.action to glm or glm.nb to control handling of NAs

G2.12 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters

@rkillick:
Not met. mutate() is called on line 324 of colony-growth.R with na.rm = TRUE.

  • I removed this particular na.rm=TRUE because I'm not sure it was doing anything. It was not removing missing values from the data, just ensuring that the maximum value returned by predict() was numeric and not NA. But I don't know why predict would ever return an NA.

@Paula-Moraga:
There are functions that could pass data with missing values to base routines with default na.rm = FALSE. For example, in line 45 of https://github.com/Aariq/bumbl/blob/master/R/colony-growth.R

  • Thank you for pointing me to this specific error. I added a check for missing values in the time variable. I can't think of a reason that time should ever contain NAs in the context of measuring bumblebee colony growth.

G2.13 All functions should also provide options to handle undefined values (e.g., NaN, Inf and -Inf), including potentially ignoring or removing such values.

@rkillick:
Not met. No handling of this in functions.

@Paula-Moraga:
The package does not provide options to handle undefined data.

  • I've added a check for undefined values in the time variable (with is.finite())
  • Otherwise, I think I'll leave handling of these values to glm().

Aariq commented Nov 5, 2020

G4

G4.2b Explicit tests should demonstrate conditions which trigger every one of those messages, and should compare the result with expected values.

@rkillick:
Not met. Example: the warning "Some taus were not used because they were outside of range of" is not covered.

  • I added a test for this

G4.4 Correctness tests to test that statistical algorithms produce expected results to some fixed test data sets.

@rkillick:
Not met, no correctness tests in testing, just class and no warnings. Skipping the rest of the guidelines for parameter correctness etc. as none are given. All not met.

  • I've added a function sim_colony() to simulate colony growth and decay data with the parameters used in the simulation returned as attributes.
  • sim_colony() is used to test that expected parameters are returned within some tolerance.
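
A hypothetical outline of such a test; sim_colony()'s arguments, the attribute names, and the simulated column names are all assumed here:

library (testthat)
library (bumbl)
test_that ("bumbl() recovers the simulated switchpoint", {
    set.seed (100)
    sim <- sim_colony ()                  # simulated growth and decay data
    true_tau <- attr (sim, "tau")         # parameter used in the simulation
    fit <- bumbl (sim, colonyID = colony, t = week, formula = mass ~ week)
    expect_equal (fit$tau, true_tau, tolerance = 0.1)
})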

G4.5 Correctness tests should be run with a fixed random seed.

@rkillick:
Not met. No seed in the sample command for the plotting example

  • added set.seed() to example
  • new correctness test run with fixed seed

G4.8 Edge condition tests to test that these conditions produce expected behaviour such as clear warnings or errors when confronted with data with extreme properties.

@rkillick:
Not met. None included.

  • Data is passed to glm() with only slight modification, so I'm letting glm() handle those edge cases. I'll think about better ways to pass glm() errors to the user (e.g. additional columns in the output containing errors or warnings for each colony's GLM?).
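
One possible shape for the idea in parentheses, purely as an illustration (none of this is in bumbl):

fit_with_warnings <- function (dat) {
    # capture warnings from one colony's GLM so they can be returned as an
    # extra column rather than printed to the console
    warn <- character (0)
    mod <- withCallingHandlers (
        stats::glm (d.mass ~ week, data = dat),
        warning = function (w) {
            warn <<- c (warn, conditionMessage (w))
            invokeRestart ("muffleWarning")
        })
    data.frame (converged = mod$converged,
                warnings  = paste (warn, collapse = "; "))
}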

G4.9 Noise susceptibility tests. Packages should test for expected stochastic behaviour.

@rkillick:
Not met. None included, I would expect to see small perturbations in the input data using jitter() to give the same output for tau and very similar parameter estimates.

  • I added these suggested tests
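
Roughly, the added tests look like this (assuming testthat and the bombus example data; the tolerance is illustrative):

library (testthat)
library (bumbl)
test_that ("small perturbations of the input give essentially the same tau", {
    set.seed (42)
    perturbed <- bombus
    perturbed$d.mass <- jitter (perturbed$d.mass, factor = 1e-5)
    out1 <- bumbl (bombus, colonyID = colony, t = week, formula = d.mass ~ week)
    out2 <- bumbl (perturbed, colonyID = colony, t = week, formula = d.mass ~ week)
    expect_equal (out1$tau, out2$tau, tolerance = 1e-3)
})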

G.4.10-12 Extended testing.

@rkillick:
All not met as no extended tests are provided. For this package longer datasets will cause problems with the default "taus" argument behaviour.

  • I'm not sure what is meant by "extended tests". I suspect the bombus dataset I included in the package is a typical size for bumblebee colony growth data.
  • In terms of length of the timeseries, I can see how only 50 values for tau might not be enough to find the maximum likelihood estimate for the switchpoint correctly. I've added a note to the documentation saying as much.

Aariq commented Nov 5, 2020

TS1

TS1.0 Time Series Software should explicitly document the types and classes of input data able to be passed to each function

@rkillick:
Not met. Type assumptions in df are not described.

@Paula-Moraga:
This is not well documented. For example, bumbl() documentation says argument data should be a data.frame but does not describe what should be the columns of this data.frame.

  • Minimum assumptions added to documentation.

TS1.1 Time Series Software should accept input data in as many time series specific classes as possible.

@rkillick:
No time series classes are supported. As such the user could enter the rows in the "wrong" unnatural time order and this is not checked by the software. This drastically affects all output and plotting.

  • I don't think this is true, and I've added a relevant test to show that row order doesn't matter.
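
The test is roughly of this form (the output is assumed to retain a colony identifier column named colony; details may differ from the actual test):

library (testthat)
library (bumbl)
test_that ("row order of the input data does not affect results", {
    set.seed (1)
    shuffled <- bombus [sample (nrow (bombus)), ]
    out1 <- bumbl (bombus, colonyID = colony, t = week, formula = d.mass ~ week)
    out2 <- bumbl (shuffled, colonyID = colony, t = week, formula = d.mass ~ week)
    expect_equal (out1$tau [order (out1$colony)], out2$tau [order (out2$colony)])
})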

@Paula-Moraga:
This is not done. The documentation does not explain the type of data that should be passed, and the R code does not check the arguments passed.

  • bumbl() currently doesn't support time series.

  • I don't plan to implement support for time series because I'm not very familiar with using time series with glm()

TS1.2 Time Series Software should implement validation routines to confirm that inputs are of acceptable classes

@rkillick:
Not met. No validation of inputs completed.

@Paula-Moraga:
This is not done. For example, one can call bumbl() with data that has a character in argument t and this error is not handled.

  • Added a check that the time variable, t, is numeric.
  • Other checks done by glm() (e.g. you can't have a character response variable).

TS1.3 Time Series Software should implement a single pre-processing routine to validate input data, and to appropriately transform it to a single uniform type to be passed to all subsequent data-processing functions (the tsbox package provides one convenient approach for this).

@Paula-Moraga:
This is not done. There is some validation code at the beginning of the bumbl() function. Then other functions such as brkpt() do not execute this validation code.

  • This is actually how I would interpret this standard (but correct me if I'm wrong @mpadge). I feel OK doing validation in bumbl() and not in brkpt(). This way it's only done once for the whole dataset, not one time per colony.
  • I can't pre-process the data in bumbl() because the pre-processing (primarily the creation of the .post column, representing time past the switchpoint, tau) is done iteratively for each colony separately as part of model fitting.

TS1.7 Accept inputs defined via the units package for attributing SI units to R vectors.

@rkillick:
Not clear if this is supported or whether it should be.

  • This doesn't seem appropriate to me as the units for time, colony growth, and covariates are not important.

TS1.8 Where time intervals or periods may be days or months, be explicit about the system used to represent such, particularly regarding whether a calendar system is used, or whether a year is presumed to have 365 days, 365.2422 days, or some other value.

@rkillick:
Documentation for bumbl function is woolly on this, needs clarification.

  • Time intervals are arbitrary, in the sense that the output tau will be in whatever units are used for time in the model.

  • The interpretation of tau is noted in the documentation already.

  • I now notice that if Dates were supplied, they were silently coerced to numeric by an ifelse() statement. To be safe I've restricted the time column to numeric for now, and opened an issue to properly deal with date input.
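
For reference, the silent coercion comes from base ifelse() dropping the attributes of its result:

d <- as.Date ("2020-09-23")
ifelse (TRUE, d, d) # the Date class is dropped, leaving the underlying day count
#> [1] 18528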

Aariq commented Nov 5, 2020

TS2

TS2.0 Appropriate checks for missing data, and associated transformation routines, should be performed as part of initial pre-processing prior to passing data to analytic algorithms.

@rkillick:
No checks for missing data are completed.

  • I'm not sure this is relevant as glm() handles the missing data (see previous comments)

@Paula-Moraga:
The package does not check for missing data.
For example, one can execute bumbl() where one d.mass value is NA.

  • This is intended. Missing values are common in ecological data, and users of bumbl() should be used to understanding how glm() handles missing data. I want users to be able to have missing values in their data frame.

TS2.1 Time Series Software which presumes or requires regular data should only allow explicit* missing values, and should issue appropriate diagnostic messages, potentially including errors, in response to any implicit missing values.*

@rkillick:
No handling as no checks are made.

  • bumbl does not presume or require regular data, so I don't think this standard applies.

TS2.2 Where possible, all functions should provide options for users to specify how to handle missing data,

@rkillick:
No options are given to the user

@Paula-Moraga:
The package does not provide functions to specify how to handle missing data.

  • I've noted in the documentation that additional arguments can be passed through to glm(). This includes arguments for handling missing data.
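
For example (assuming additional arguments are passed through unchanged to glm(), as described above):

out <- bumbl (bombus, colonyID = colony, t = week, formula = d.mass ~ week,
              na.action = na.omit)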

TS2.3 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters

  • See comments in response to G2.12

TS2.5 Explicitly document all assumptions and/or requirements of stationarity

@rkillick:
Not met. Assumptions of independence also required to be clarified.

@Paula-Moraga:
Documentation should be improved to explain the type of arguments to be passed and possible assumptions.

  • The assumptions are the same as for a GLM, so I think homogeneity of variance covers stationarity (a new term to me).

  • I now mention that GLM assumptions apply in the documentation.

TS2.6 Implement appropriate checks for all relevant forms of stationarity,

@rkillick:
Not met. In this instance, some model diagnostics would be appropriate.

  • I've added a keep.model argument that when TRUE returns the GLMs fit for each colony as a list-column. In the documentation for bumbl() I've referenced the vignette from tidyr for working with models in list-columns. Users should be able to follow this to perform model diagnostics.
  • In the updated vignette for this package, I try to emphasize that model validation is up to the user, just as it is with any GLM and show one simple example of using the list-column of glms to get R^2 values for each colony's model.
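
A rough sketch of that workflow (the name of the list-column and the exact output structure are assumed):

library (bumbl)
library (dplyr)
out <- bumbl (bombus, colonyID = colony, t = week,
              formula = d.mass ~ week, keep.model = TRUE)
# one simple per-colony diagnostic: deviance explained (a pseudo R^2) for each
# colony's GLM, extracted from the list-column of fitted models
out %>%
    mutate (dev_expl = purrr::map_dbl (model, ~ 1 - .x$deviance / .x$null.deviance))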

Aariq commented Nov 5, 2020

TS5

TS5.0 Implement default plot methods for any implemented class system.

@rkillick:
Not met. Should use the plot method.

  • I've removed bumbl_plot() and replaced it with plot() and autoplot() methods.

TS5.5 Provide options to determine whether plots of data with missing values should generate continuous or broken lines

@rkillick:
No handling of missing values so not met.

@Paula-Moraga:
bumbl_plot() does not provide options to determine how to visualize missing data.

  • Users can obtain data for making plots by running bumbl() with augment = TRUE and handle plotting of missing values.

  • Now that time can no longer contain NAs, I think this issue is resolved.

TS5.8 By default provide clear visual distinction between model (input) values and forecast (output) values.

@Paula-Moraga:
Plots clearly distinguish between data which are represented by points and model predictions which are represented with red lines, but a legend would help in understanding the plot better.

  • I've described what the lines and points mean in the documentation for plot.bumbldf() and autoplot.bumbldf() (as well as links to those functions from the bumbl() documentation).

Aariq commented Nov 5, 2020

Other Comments

It would be helpful to include in the documentation a description of the problems that the package can solve and a summary of the methods implemented without the need to look at the reference provided.

  • The revised vignette contains a summary of the mathematical model implemented by bumbl() and the methods it uses.

This description should be written in a way that is understandable for general users. For example, in this package terms such as "gyne" could be unknown for some users.

  • I've clarified that the switch is from producing workers to producing reproductive individuals.

There is one reference to the method implemented but it is behind a paywall. If possible the reference should be accessible by everyone.

  • Unfortunately, I don't know of another reference that implements this technique. This is the paper that created this method.

I think repeated code should be avoided. In this package, functions brkpt() and brkpt.nb(), and functions bumbl() and bumbl.nb() are very similar.

  • I agree. I originally did this to mirror the way glm() and MASS::glm.nb() exist/work, but now I've removed bumbl.nb(), and running bumbl() with family = "negbin" now eventually calls glm.nb(). All other values for family get passed to glm().

rkillick commented Nov 6, 2020

G.4.10-12 Extended testing.

@rkillick:
All not met as no extended tests are provided. For this package longer datasets will cause problems with the default "taus" argument behaviour.

  • I'm not sure what is meant by "extended tests". I suspect the bombus dataset I included in the package is a typical size for bumblebee colony growth data.
  • In terms of length of the timeseries, I can see how only 50 values for tau might not be enough to find the maximum likelihood estimate for the switchpoint correctly. I've added a note to the documentation saying as much.

For the extended testing maybe you could include a catch in the function that returns a warning if you think that 50 values will be too small for larger datasets. In the longer term you may want to look at other packages such as changepoint and consider how your cost function may work inside their search functions. As the creator of changepoint I haven't thought about extensions to GLMs but would be interested in pursuing this separately if you are interested too. We have LM functionality currently being tested so would be a natural extension.

Aariq commented Dec 2, 2020

For the extended testing maybe you could include a catch in the function that returns a warning if you think that 50 values will be too small for larger datasets. In the longer term you may want to look at other packages such as changepoint and consider how your cost function may work inside their search functions. As the creator of changepoint I haven't thought about extensions to GLMs but would be interested in pursuing this separately if you are interested too. We have LM functionality currently being tested so would be a natural extension.

I agree that the way bumbl finds the changepoint is not great, but I feel like improving it is beyond the scope of what I'm trying to do right now. I like the idea of adding a warning for larger datasets to increase the number of values to test. I will think about what the threshold for that warning should be.
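
For illustration only, a sketch of the kind of warning being discussed (the variable names and the threshold are entirely hypothetical):

n_times <- length (unique (data [[tvar]]))
if (n_times > 50) {
    warning ("The default grid of 50 candidate values for tau may be too coarse for ",
             n_times, " distinct time points; consider supplying a finer `taus` vector.",
             call. = FALSE)
}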
