This repository has been archived by the owner on May 19, 2021. It is now read-only.

Code Inspection with Non-Standard Evaluation (NSE) #14

Open
HenrikBengtsson opened this issue Mar 3, 2016 · 11 comments

Comments

@HenrikBengtsson

Is it possible to provide metadata to R functions that use non-standard evaluation (NSE) in order to help static code inspection to identify global/unknown variables? For instance, consider

dataS <- subset(data, subset = {x < 3})

In this piece of code the expression {x < 3} is ambiguous. From experience, documentation, or manual code inspection we know that x could be either (i) a global variable or (ii) an element of the data object, i.e. data$x, because subset() for data frames does:

    e <- substitute(subset)
    r <- eval(e, x, parent.frame())

In other words, the expression {x < 3} is basically evaluated as eval(quote({x < 3}), envir = data, enclos = parent.frame()), such that if data$x exists then that is used for x; otherwise x is searched for in the calling environment and, eventually, among the globals.

However, without this "human" knowledge, any static code inspection can really only assume the former, i.e. it will identify x as a global/unknown object.
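The lookup order described above can be checked with a minimal sketch. The objects data, x and z below are made up for illustration, and environment() stands in for the parent.frame() that subset() uses internally.

```r
# A data frame with a column named x, plus global variables x and z
data <- data.frame(x = 1:5, y = 6:10)
x <- 100
z <- 9

# Roughly what subset() does: take the unevaluated expression and evaluate
# it with the data frame as the first place to look up names
e1 <- quote(x < 3)
r1 <- eval(e1, data, environment())  # x resolves to data$x, not the global x
print(r1)

# z is not a column of data, so the lookup falls through to the enclosure
e2 <- quote(y < z)
r2 <- eval(e2, data, environment())
print(r2)
```

A static inspector that only sees the text x < 3 cannot tell these two cases apart without knowing what data contains.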

Some thoughts:

  • How could we declare (e.g. via attributes) that argument subset of subset() should only be interpreted as a non-evaluated expression?
  • How could we declare that any globals/unknown found in the expression could also be an object/field part of the data argument?
  • What can be inferred if we know what type data is up front? And what if we know data contains field x?
  • Could some of this even be inferred from code inspection of subset() itself?
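One way the first two questions could look in practice is sketched below. The "nse" attribute and the inspect_call() helper are purely hypothetical (no such convention exists in R); they only illustrate how per-function metadata could let an inspector separate ordinary inputs from symbols that may live in the data argument.

```r
# Toy NSE function with the same shape as subset()
subset3 <- function(x, subset, ...) {
  rows <- eval(substitute(subset), x, parent.frame())
  x[rows, , drop = FALSE]
}

# Hypothetical metadata: which arguments are captured unevaluated, and
# which argument supplies the object their symbols may be looked up in
attr(subset3, "nse") <- list(args = "subset", data = "x")

# Split the symbols of a call into ordinary inputs and symbols that may be
# either globals or fields of the data argument
inspect_call <- function(call, fun) {
  meta <- attr(fun, "nse")
  args <- as.list(match.call(fun, call))[-1]
  is_nse <- names(args) %in% meta$args
  list(
    inputs        = unique(unlist(lapply(args[!is_nse], all.vars), use.names = FALSE)),
    maybe_in_data = unique(unlist(lapply(args[is_nse], all.vars), use.names = FALSE)),
    data_arg      = meta$data
  )
}

info <- inspect_call(quote(subset3(df, x < sqrt(y))), subset3)
print(info)
```

Here df is reported as a regular input, while x and y are flagged as symbols that may be resolved inside the data argument instead of being globals.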
@maelle
Member

maelle commented Mar 4, 2016

What difference is there between using subset and dplyr filter? If you use filter you can use its SE version filter_.

@HenrikBengtsson
Author

Sorry, I should have been more clear - subset() was just an example. There are functions out there using NSE and I'm looking for a way to give some metadata guidance for static/semi-static code inspection to be aware of these functions and how they might possibly work. Other examples are:

## sum(data$x)
y <- with(data, sum(x))

## aes(mtcars$wt, mtcars$mpg)
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

What I am personally prioritizing is the challenge of identifying global (aka unknown) objects in R expressions. Being able to identify global objects is important in areas such as:

  • Memoization - Identifying all input objects that dictate the value of an expression. If the exact same set of objects/values was used previously, look up the value in an internal cache and return it instantly.
  • Distributed processes - Identifying and exporting/serializing global variables to another R session/process such that the expression can be evaluated there instead of in the main R process.
  • Code validation - Identifying coding errors, e.g. undefined or misspelled variable names, cf. R CMD check.

Although the above cannot be done perfectly during static code inspection, I do think it can be improved if the code inspector has some extra information to go by. Also, for tasks such as memoization and distributed processing, we're in the borderland between static and run-time code inspection, i.e. when it comes to identifying globals in an expression that is to be evaluated elsewhere, we know the state of R and its objects at that point in time, which allows us to better infer what the global variables are. For instance, if we know at run-time that data contains x and we know that subset(data, x < 3) will make use of that, then we don't have to worry about the global variable x that may or may not exist; all we need to export is data.
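The run-time case can be illustrated with a small sketch. The helper unresolved_vars() is a made-up name, and it assumes we already know (for example from metadata) that the expression is evaluated inside data.

```r
# Symbols in an NSE expression that remain unresolved once the fields that
# `data` is known to contain at run time have been accounted for
unresolved_vars <- function(expr, data) {
  setdiff(all.vars(expr), names(data))
}

data <- data.frame(x = 1:5)

# x is a field of data, so nothing besides data needs to be exported
need1 <- unresolved_vars(quote(x < 3), data)
print(need1)

# threshold is not a field of data, so it would have to be exported as well
need2 <- unresolved_vars(quote(x < threshold), data)
print(need2)
```

This ignores function names such as < and any local definitions inside the expression; a real implementation would need to handle both.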

I'm not sure if this falls under "Contract Programming" (with terms like preconditions, postconditions, errors, and invariants).

@maelle
Member

maelle commented Mar 8, 2016

Would one notice some problems related to this using plotProfileCallGraph (which works on Rprof output; I couldn't get this example http://www.r-bloggers.com/profiling-r-code/ working because of the graph dependency)?

@jcheng5

jcheng5 commented Mar 9, 2016

cc @kevinushey

@gmbecker

@HenrikBengtsson My fork of Duncan Temple Lang's CodeDepends package (here: https://github.com/gmbecker/CodeDepends) has facilities for dealing with piping and non-standard evaluation in a way that, while not fully automatic, is specifiable at the function level and has defaults for most

> library(CodeDepends)

> x = "df = data.frame(x=1:10, y = 21:30); subset(df, x<sqrt(y))"
> scr = readScript(txt=x)
> scr
An object of class "Script"
[[1]]
df = data.frame(x = 1:10, y = 21:30)

[[2]]
subset(df, x < sqrt(y))

Slot "location":
[1] NA

> getInputs(scr)
An object of class "ScriptInfo"
[[1]]
An object of class "ScriptNodeInfo"
<snip>

Slot "inputs":
character(0)

Slot "outputs":
[1] "df"

Slot "updates":
character(0)

Slot "functions":
data.frame          : 
     FALSE      FALSE 

Slot "removes":
character(0)

Slot "nsevalVars":
character(0)

Slot "sideEffects":
character(0)

Slot "code":
df = data.frame(x = 1:10, y = 21:30)


[[2]]
An object of class "ScriptNodeInfo"
<snip>
Slot "inputs":
[1] "df"

Slot "outputs":
character(0)

Slot "updates":
character(0)

Slot "functions":
subset      <   sqrt 
 FALSE  FALSE  FALSE 

Slot "removes":
character(0)

Slot "nsevalVars":
[1] "x" "y"

Slot "sideEffects":
character(0)

Slot "code":
subset(df, x < sqrt(y))

Note the nsevalVars slot (and apologies for the lack of a pretty printing method for the objects).

@HenrikBengtsson
Author

@gmbecker, this looks very interesting. I haven't looked at the code, but are you saying that your version of CodeDepends is doing code inspection of subset() itself to infer that argument subset (x < sqrt(y) in your example) is undergoing non-standard evaluation?

Here are my toy examples (which do not seem to do what I expect):

> subset2 <- function(x, subset, select, drop = FALSE ,...) { x[subset,] }
> code <- "df <- data.frame(x=1:10, y = 21:30); subset2(df, x < sqrt(y))"
> script <- readScript(txt=code)
> getInputs(script)
[...]
Slot "nsevalVars":
character(0)

Slot "sideEffects":
character(0)

Slot "code":
subset2(df, x < sqrt(y))

Below I would expect to pick up something in @nsevalVars:

> subset3 <- function(x, subset, ...) { rows <- eval(substitute(subset), x); x[rows,] }
> code <- "df <- data.frame(x=1:10, y = 21:30); subset3(df, x < sqrt(y))"
> script <- readScript(txt=code)
> getInputs(script)
[...]
Slot "nsevalVars":
character(0)

Slot "sideEffects":
character(0)

Slot "code":
subset3(df, x < sqrt(y))

@gmbecker

@HenrikBengtsson Sorry, I was a bit unclear before, as I wrote that message quickly.

(my version of) CodeDepends does not currently delve into function definitions to attempt to detect non-standard evaluation, but it is parameterized so that customizing default (all function) or specific function behavior is (relatively) easy.

As it works now, it "knows" that subset, the dplyr verbs, etc have non-standard evaluation and I've written handlers for the type of non-standard eval they do, which are the defaults for those functions.

So it's easy to tell it that your function has nse, and have it do the right thing (I need to export and doc nseafterfirst and other specialized handlers, it's a research-stage package atm):

> subset3 <- function(x, subset, ...) { rows <- eval(substitute(subset), x); x[rows,] }
> code <- "df <- data.frame(x=1:10, y = 21:30); subset3(df, x < sqrt(y))"
> collector = inputCollector(subset3 = CodeDepends:::nseafterfirst)
> scr = readScript(txt = code)
> getInputs(scr, collector)
An object of class "ScriptInfo"
[[1]]
An object of class "ScriptNodeInfo"
<snip>
Slot "code":
df <- data.frame(x = 1:10, y = 21:30)


[[2]]
An object of class "ScriptNodeInfo"
<snip>

Slot "inputs":
[1] "df"

Slot "outputs":
character(0)

<snip>

Slot "functions":
subset3       <    sqrt 
  FALSE   FALSE   FALSE 

<snip>

Slot "nsevalVars":
[1] "x" "y"

<snip>

Slot "code":
subset3(df, x < sqrt(y))

It's also "easy" to hack together a function that detects straightforward instances of nse given a function object. (note this is nowhere near production grade, I wrote it in like 3 minutes as an illustrative example)

Here we set a specialized handler for the substitute function that collects up all the things that are passed to it.

> detectEasyNSE = function(fun) {
+     args = formals(fun)
+     e = body(fun)
+     nseargsdetected = character()
+     subst_detect = function(e, collector, basedir, input, formulaInputs, update, pipe = FALSE, nseval = FALSE, ...) {
+         nseargsdetected <<- c(nseargsdetected, as.character(e[[2]]))
+     }
+     coll = inputCollector(substitute=subst_detect)
+     res = getInputs(e, coll)
+     nseargsdetected
+ }
> detectEasyNSE(subset3)
[1] "subset"

(recall that subset is the name of the argument that was non-standardly evaluated using eval(substitute)).

It doesn't do that on its own in one step right now, though it's possible it could be made to by default.

P.S. so they know we're talking about this @duncantl @nick-ulle

@kevinushey

FWIW, RStudio does something similar when detecting whether a function argument is used in an NSE-way -- we just see if it's present in a call to e.g. substitute() and other NSE primitives in the function's body, and if so, we just turn off diagnostics. This obviously doesn't catch all cases, but turns out to be good enough most of the time.
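A heuristic of this kind can be sketched in base R alone. This is not RStudio's actual implementation; nse_args() and its capturers parameter are made-up names, and the sketch only catches the direct substitute(arg) pattern.

```r
# Names of formal arguments that are passed directly to one of the
# "capturing" functions (by default only substitute) somewhere in the body
nse_args <- function(fun, capturers = "substitute") {
  found <- character()
  walk <- function(e) {
    if (!is.call(e)) return(invisible(NULL))
    head <- e[[1]]
    if (is.name(head) && as.character(head) %in% capturers &&
        length(e) >= 2 && is.name(e[[2]])) {
      found <<- c(found, as.character(e[[2]]))
    }
    # Recurse into sub-calls only; indexing with e[[i]] avoids binding the
    # empty arguments of calls such as x[rows, ] to a local variable
    for (i in seq_along(e)) {
      if (is.call(e[[i]])) walk(e[[i]])
    }
  }
  walk(body(fun))
  intersect(names(formals(fun)), found)
}

# The two toy functions from earlier in this thread
subset2 <- function(x, subset, select, drop = FALSE, ...) { x[subset, ] }
subset3 <- function(x, subset, ...) { rows <- eval(substitute(subset), x); x[rows, ] }

res2 <- nse_args(subset2)
res3 <- nse_args(subset3)
print(res2)
print(res3)
```

Like the approach described above, this misses indirect cases such as match.call(), sys.call(), base::substitute(), or arguments forwarded to another NSE function.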

@kevinushey

As an aside, I think the simplest way to offload this validation work would be to allow functions to have an attribute, e.g. validate, which itself would be a function that validates a particular call to that function, e.g.

fn <- function(x) {
    # do some NSE
}

attr(fn, "validate") <- function(call) {
    # perform validation
}

Environments embedding R could supply the current call (as a string, or as a call object) and that function could return a list of diagnostic objects that the environment hosting R could present as appropriate.

Unfortunately, this does become more complicated when considering S3 / S4 dispatch since the host environment also needs to figure out what method would actually be dispatched to. :/
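A minimal sketch of how such a "validate" attribute might be consumed is given below. Everything here is an assumption made for illustration: the shape of the validator (it receives the call and an environment and returns a character vector of diagnostics), the helper validate_call(), and the fact that the validator evaluates the data argument, which makes it a run-time rather than a purely static check. S3/S4 dispatch is not handled.

```r
# A function doing NSE: expr is evaluated inside data
fn <- function(data, expr) {
  eval(substitute(expr), data, parent.frame())
}

# Hypothetical validator: report symbols of expr that are found neither in
# data nor in the environment the call would be evaluated in
attr(fn, "validate") <- function(call, env = parent.frame()) {
  call <- match.call(fn, call)
  data <- eval(call$data, env)
  vars <- all.vars(call$expr)
  unknown <- Filter(function(v) !(v %in% names(data)) && !exists(v, envir = env), vars)
  if (length(unknown) > 0) paste0("unknown variable '", unknown, "'") else character()
}

# What a host environment could do before presenting diagnostics
validate_call <- function(call, env = parent.frame()) {
  fun <- eval(call[[1]], env)
  validator <- attr(fun, "validate")
  if (is.function(validator)) validator(call, env) else character()
}

df <- data.frame(x = 1:3)
d1 <- validate_call(quote(fn(df, x + 1)))
d2 <- validate_call(quote(fn(df, x + no_such_var_zz)))
print(d1)
print(d2)
```

Returning plain strings keeps the sketch short; richer diagnostic objects (severity, source position) would be a natural extension.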

@gmbecker

@kevinushey I wouldn't think it is possible/coherent to use nse for arguments that are dispatched on. They seem mutually exclusive. Are there examples of this being done that you know of?

@HenrikBengtsson
Author

For the record, the codetools package (used by R CMD check) handles NSE with custom usage-handler functions for each known case. For instance, the reason tools in a call library(tools) is not identified as a global/unknown variable is that codetools contains:

addCollectUsageHandler("library", "base", function(e, w) {
    w$enterGlobal("function", "library", e, w)
    if (length(e) > 2)
        for(a in dropMissings(e[-(1:2)])) walkCode(a, w)
})
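The effect of that handler can be observed with codetools::findGlobals(). The sketch below assumes the codetools package is installed (it normally ships with R). As far as I know there is no corresponding handler for subset(), so symbols in its second argument are expected to be reported as globals.

```r
library(codetools)

# `tools` is skipped by the "library" handler, so only the function name
# `library` should be reported here
g1 <- findGlobals(function() library(tools))
print(g1)

# Without a handler for subset(), the inspector has no way of knowing that
# x may refer to a column of df
g2 <- findGlobals(function(df) subset(df, x < 3))
print(g2)
```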

I've started my own discussion on this over at futureverse/globals#12

5 participants