Preprocessor and executable specifications #823

yav · 2025-01-08T22:12:59Z

There are some unfortunate interactions between the C pre-processor and CN's testing infrastructure (e.g., see #393, #654).
This ticket aims to describe the underlying problem and some potential design choices that we may want to consider.

Status Quo

The following diagram shows the steps that CN takes to produce executable specifications:

graph LR
  input.c --> |cpp| input.i
  input.i --> |parser| AIL
  AIL --> |semantics| core
  core --> |synthesize| checks
  checks --> |inject| output.c
  input.c --> |inject| output.c

Here AIL is a data structure, which is an abstract representation of C code, and the checks is a bunch of C code (currently represented as a mixture of AIL and raw strings) that needs to be added to the original program to validate user assertions. The checks are added to the original program by the inject algorithm.

Problems and Considerations

Incorrect Column Numbers Note that because the parser runs after the pre-processor has been applied, the location information we have is not quite right. We are using an external pre-processor, which annotates its output with file+line information, however column information is lost, so the locations that are stored in the AIL have correct collumns for input.i but NOT for input.c

The effect of this is two-fold:

Error messages may contain location information where the column is wrong
The inject algorithm gets confused, resulting in incorrect ouput.c

Multiple Files Because input.c may contain #include directives, when we perform injection we may have to perform injection in multiple files. We do have some code that attempts to do this, but it doesn't seem to work quite correctly. This is a tricky problem, because generally we don't want to modify the original files, but rather a copy of the files. We then have to ensure that those modified files are correctly included in output.c when building it, but we have no easy way to modify the #includes, so we have to achieve it only by changing the flags to the pre-processor. In principle this will never be fully correct, for example, if a file #included and absolute path, we can't change that. Fortunately, this is not something that usually happens in normal programs.

Potential Solutions

Here are a few possibilities we may want to consider:

(A) Modify the inejct algorithm to be line-based and not consider columns. This will only work if the injections can always go on new lines, which---in general---is not the case. Consider, for example, a program that has multiple statements on the same line: we may have to inject stuff in-between those statements.

(B) Modify the inject algorithm to inject in input.i (i.e., the pre-processed filed) rather than the original C source. To make this work, we'd need to modify the parser to keep track of 2 locations per construct: the location in the original C file, which we'd use for error reporting, and the location in the actual pre-processed file, which we'd use for injection. This should solve the issue of incorrect injections and having to deal with #include. We would still report incorrect columns error messages, however. Also, if someone wanted to debug the generated code, they'd be looking at the pre-processed file rather than the original code.

(C) Modify the inject algorithm to inject directly in the AIL AST and then pretty-print that. This has similar trade-offs to option (B) but it wouldn't need to track 2 source locations for things. The trade off is that the generated code would be not only pre-processed but also somewhat desugared and miss things like comments, for example. It is unclear if the resulting code would be better or worse than option (B), however, because pretty printing the code would at least result in nice regular structure, while injecting directly into the text often leads to code that is not properly indented and is quite hard to read. We may also be able to preserve some of the comments in the source if we added a "comment" constructor to AIL, although all consumers of AIL would then have to deal with it...

(D) Write our own pre-processor, which manipulates tokens appropriately so that we have correct column information. This seems to be the only option that would report correct column information. In theory, this should also make it possible to do injection on non-pre-processed source file. We would still have to deal with the issues of #includes, but if we had our own pre-processor, we might be able to also use it to modify the original source program to just change the #include in the output.c. One other benefit to having a custom pre-processor is that we could apply it to the CN specifications (the magic comments) to expand constants too (although this has its own issues, for examples numbers in C and numbers in CN do not use the same notation at the moment, so that feature won't be so useful without some further changes to CN, which we may want to do anyway).

The text was updated successfully, but these errors were encountered:

bcpierce00 · 2025-01-10T03:18:18Z

Couple questions...

For (A), if we were willing to rewrite programs so that multiple statements never appear on the same line, would the problems then be completely solved? (I guess this is asking: Can the inject algorithm really be modified to work only with lines?)

Do you have a sense which of (B) and (C) is easier to implement?

For (D), what does "manipulates tokens appropriately" mean?

yav · 2025-01-10T18:58:52Z

On further thought and inspection, here are some more thoughts:

I don't really think (A) is a viable option, because we also inject things in expressions. For example, if x is a local variable, in the executable spec this becomes CN_LOAD(x), and I don't think it is reasonable to have each expression on its own line.
I think injecting in the original non-preprocessor source seems really hard, even if we had our own pre-processor.
Consider, for example, the following code snippet:

int x = 0;
int y = 1;
#define CMP x < y
if (CMP) { return 1; } else { return 2; }

The issue here is that if we wanted to rewrite x and y it is not at all clear where one should do it. Perhaps one should modify the macro, but I suspect there are certainly cases where this would not be obvious especially if the macro is concatenating tokens, or if it was one of the side-effecting expressions (e.g. ++x).

I don't have a good feeling of which, between (B) and (C) would be easier to do, I am planning to look at AIL a bit more to see.
I think either (B) or (C) will fix the injection issues, but not the problem of reporting incorrect column numbers in error messages.

@kmemarian do you have some thoughts on what might be a good way to proceed?

bcpierce00 · 2025-01-10T20:11:27Z

Personally I can live with wrong column information in error messages. There is a danger here of letting the best be the enemy of the good...

yav · 2025-01-10T21:43:20Z

Agreed. I think I am leaning toward option (B), but I'll setup a meeting with the other developers to see what they think, and if there's some other options.

cp526 · 2025-01-14T14:00:41Z

Miscellaneous comments:

(A) Indeed, as in Iavor's second comment, this seems to not be an option. It's not only the (quite unusual) multiple statements per line that would break this: inject also instruments all memory accesses in the code (to do the appropriate ownership checks), and those can be arbitrarily nested within compound expressions, so they're usually not at line boundaries.
(C) There is potentially also option (C.2), which is to use Cabs rather than Ail to do the injection. Cabs ("C abstract syntax") comes before the desugaring to Ail in the pipeline and may (@kmemarian ?) retain all structure of the original program (minus, perhaps, comments)? One could extend Cabs to remember comments, do the injection into that, and build a pretty-printer for Cabs, if that doesn't currently exist. (Comments in the Cabs AST could be "forgotten" by the translation to Ail, and so not much of the pipeline would have to change.)

RE Readability of injected checks: we have never optimised this for readability, and there is probably a lot of scope for making this better. One option could be to have the injection define macros for function pre-and post-conditions, and inject macro "invocations" rather than inline pre/post-condition checks into the generated code (and so, to produce a file that has to be pre-processed again after the injection).
(D) As in Iavor's comment, this is the only option that also lets us refer to macros in CN specifications. Changes to sufficiently align CN+C syntax for typical cases may be desirable anyway. The obvious downside is that the custom pre-processor has to be quite good in matching the behaviour of Clang and GCC to fit into existing project's build machinery.
The CMP example, with the #define in the middle of a block looks quite unnatural, and hopefully that's something that doesn't actually happen in practice? Although, any assumptions about the code we end up relying on (such as, potentially, the absence of this), we should have corresponding checks that notify the user about code breaking those assumptions.
AIUI, Dhruv has also been looking into options for how to invoke the existing preprocessor to record column information (e.g. [CN] error message locations do not account for the preprocessor #393 (comment) ). Dhruv also links to relevant VeriFast publications.
We should understand what Frama-C does. E.g. in their manual, Section 5.2 "Preprocessing the Source Files": This allows to have macros in ACSL annotations while using a non-GNU-like preprocessor.

yav added cn CN-exec Related to CN executable spec generation, called using `cn instrument` labels Jan 8, 2025

septract mentioned this issue Jan 17, 2025

[CN-test-gen] Crash when using accesses and --with-static-hack on a static global array #831

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preprocessor and executable specifications #823

Preprocessor and executable specifications #823

yav commented Jan 8, 2025

bcpierce00 commented Jan 10, 2025

yav commented Jan 10, 2025 •

edited

Loading

bcpierce00 commented Jan 10, 2025

yav commented Jan 10, 2025

cp526 commented Jan 14, 2025

Preprocessor and executable specifications #823

Preprocessor and executable specifications #823

Comments

yav commented Jan 8, 2025

Status Quo

Problems and Considerations

Potential Solutions

bcpierce00 commented Jan 10, 2025

yav commented Jan 10, 2025 • edited Loading

bcpierce00 commented Jan 10, 2025

yav commented Jan 10, 2025

cp526 commented Jan 14, 2025

yav commented Jan 10, 2025 •

edited

Loading