Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessor and executable specifications #823

Open
yav opened this issue Jan 8, 2025 · 5 comments
Open

Preprocessor and executable specifications #823

yav opened this issue Jan 8, 2025 · 5 comments
Labels
cn CN-exec Related to CN executable spec generation, called using `cn instrument`

Comments

@yav
Copy link
Collaborator

yav commented Jan 8, 2025

There are some unfortunate interactions between the C pre-processor and CN's testing infrastructure (e.g., see #393, #654).
This ticket aims to describe the underlying problem and some potential design choices that we may want to consider.

Status Quo

The following diagram shows the steps that CN takes to produce executable specifications:

graph LR
  input.c --> |cpp| input.i
  input.i --> |parser| AIL
  AIL --> |semantics| core
  core --> |synthesize| checks
  checks --> |inject| output.c
  input.c --> |inject| output.c
Loading

Here AIL is a data structure, which is an abstract representation of C code, and the checks is a bunch of C code (currently represented as a mixture of AIL and raw strings) that needs to be added to the original program to validate user assertions. The checks are added to the original program by the inject algorithm.

Problems and Considerations

Incorrect Column Numbers Note that because the parser runs after the pre-processor has been applied, the location information we have is not quite right. We are using an external pre-processor, which annotates its output with file+line information, however column information is lost, so the locations that are stored in the AIL have correct collumns for input.i but NOT for input.c

The effect of this is two-fold:

  1. Error messages may contain location information where the column is wrong
  2. The inject algorithm gets confused, resulting in incorrect ouput.c

Multiple Files Because input.c may contain #include directives, when we perform injection we may have to perform injection in multiple files. We do have some code that attempts to do this, but it doesn't seem to work quite correctly. This is a tricky problem, because generally we don't want to modify the original files, but rather a copy of the files. We then have to ensure that those modified files are correctly included in output.c when building it, but we have no easy way to modify the #includes, so we have to achieve it only by changing the flags to the pre-processor. In principle this will never be fully correct, for example, if a file #included and absolute path, we can't change that. Fortunately, this is not something that usually happens in normal programs.

Potential Solutions

Here are a few possibilities we may want to consider:

(A) Modify the inejct algorithm to be line-based and not consider columns. This will only work if the injections can always go on new lines, which---in general---is not the case. Consider, for example, a program that has multiple statements on the same line: we may have to inject stuff in-between those statements.

(B) Modify the inject algorithm to inject in input.i (i.e., the pre-processed filed) rather than the original C source. To make this work, we'd need to modify the parser to keep track of 2 locations per construct: the location in the original C file, which we'd use for error reporting, and the location in the actual pre-processed file, which we'd use for injection. This should solve the issue of incorrect injections and having to deal with #include. We would still report incorrect columns error messages, however. Also, if someone wanted to debug the generated code, they'd be looking at the pre-processed file rather than the original code.

(C) Modify the inject algorithm to inject directly in the AIL AST and then pretty-print that. This has similar trade-offs to option (B) but it wouldn't need to track 2 source locations for things. The trade off is that the generated code would be not only pre-processed but also somewhat desugared and miss things like comments, for example. It is unclear if the resulting code would be better or worse than option (B), however, because pretty printing the code would at least result in nice regular structure, while injecting directly into the text often leads to code that is not properly indented and is quite hard to read. We may also be able to preserve some of the comments in the source if we added a "comment" constructor to AIL, although all consumers of AIL would then have to deal with it...

(D) Write our own pre-processor, which manipulates tokens appropriately so that we have correct column information. This seems to be the only option that would report correct column information. In theory, this should also make it possible to do injection on non-pre-processed source file. We would still have to deal with the issues of #includes, but if we had our own pre-processor, we might be able to also use it to modify the original source program to just change the #include in the output.c. One other benefit to having a custom pre-processor is that we could apply it to the CN specifications (the magic comments) to expand constants too (although this has its own issues, for examples numbers in C and numbers in CN do not use the same notation at the moment, so that feature won't be so useful without some further changes to CN, which we may want to do anyway).

@yav yav added cn CN-exec Related to CN executable spec generation, called using `cn instrument` labels Jan 8, 2025
@bcpierce00
Copy link
Collaborator

Couple questions...

For (A), if we were willing to rewrite programs so that multiple statements never appear on the same line, would the problems then be completely solved? (I guess this is asking: Can the inject algorithm really be modified to work only with lines?)

Do you have a sense which of (B) and (C) is easier to implement?

For (D), what does "manipulates tokens appropriately" mean?

@yav
Copy link
Collaborator Author

yav commented Jan 10, 2025

On further thought and inspection, here are some more thoughts:

  • I don't really think (A) is a viable option, because we also inject things in expressions. For example, if x is a local variable, in the executable spec this becomes CN_LOAD(x), and I don't think it is reasonable to have each expression on its own line.

  • I think injecting in the original non-preprocessor source seems really hard, even if we had our own pre-processor.
    Consider, for example, the following code snippet:

int x = 0;
int y = 1;
#define CMP x < y
if (CMP) { return 1; } else { return 2; }

The issue here is that if we wanted to rewrite x and y it is not at all clear where one should do it. Perhaps one should modify the macro, but I suspect there are certainly cases where this would not be obvious especially if the macro is concatenating tokens, or if it was one of the side-effecting expressions (e.g. ++x).

  • I don't have a good feeling of which, between (B) and (C) would be easier to do, I am planning to look at AIL a bit more to see.

  • I think either (B) or (C) will fix the injection issues, but not the problem of reporting incorrect column numbers in error messages.

@kmemarian do you have some thoughts on what might be a good way to proceed?

@bcpierce00
Copy link
Collaborator

Personally I can live with wrong column information in error messages. There is a danger here of letting the best be the enemy of the good...

@yav
Copy link
Collaborator Author

yav commented Jan 10, 2025

Agreed. I think I am leaning toward option (B), but I'll setup a meeting with the other developers to see what they think, and if there's some other options.

@cp526
Copy link
Collaborator

cp526 commented Jan 14, 2025

Miscellaneous comments:

  • (A) Indeed, as in Iavor's second comment, this seems to not be an option. It's not only the (quite unusual) multiple statements per line that would break this: inject also instruments all memory accesses in the code (to do the appropriate ownership checks), and those can be arbitrarily nested within compound expressions, so they're usually not at line boundaries.

  • (C) There is potentially also option (C.2), which is to use Cabs rather than Ail to do the injection. Cabs ("C abstract syntax") comes before the desugaring to Ail in the pipeline and may (@kmemarian ?) retain all structure of the original program (minus, perhaps, comments)? One could extend Cabs to remember comments, do the injection into that, and build a pretty-printer for Cabs, if that doesn't currently exist. (Comments in the Cabs AST could be "forgotten" by the translation to Ail, and so not much of the pipeline would have to change.)

    RE Readability of injected checks: we have never optimised this for readability, and there is probably a lot of scope for making this better. One option could be to have the injection define macros for function pre-and post-conditions, and inject macro "invocations" rather than inline pre/post-condition checks into the generated code (and so, to produce a file that has to be pre-processed again after the injection).

  • (D) As in Iavor's comment, this is the only option that also lets us refer to macros in CN specifications. Changes to sufficiently align CN+C syntax for typical cases may be desirable anyway. The obvious downside is that the custom pre-processor has to be quite good in matching the behaviour of Clang and GCC to fit into existing project's build machinery.

  • The CMP example, with the #define in the middle of a block looks quite unnatural, and hopefully that's something that doesn't actually happen in practice? Although, any assumptions about the code we end up relying on (such as, potentially, the absence of this), we should have corresponding checks that notify the user about code breaking those assumptions.

  • AIUI, Dhruv has also been looking into options for how to invoke the existing preprocessor to record column information (e.g. [CN] error message locations do not account for the preprocessor #393 (comment) ). Dhruv also links to relevant VeriFast publications.

  • We should understand what Frama-C does. E.g. in their manual, Section 5.2 "Preprocessing the Source Files": This allows to have macros in ACSL annotations while using a non-GNU-like preprocessor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cn CN-exec Related to CN executable spec generation, called using `cn instrument`
Projects
None yet
Development

No branches or pull requests

3 participants