How To Contribute CodeQL Queries For Vert.X

Andrew Eisenberg edited this page Aug 18, 2023 · 2 revisions

This article explains the most commonly used terms in CodeQL and how you can contribute to building and improving our support for Vert.X in CodeQL.

TL;DR

  • The code can be found here:
  • Terminology
    • Database : A collection of code and data that has been indexed and prepared for analysis.
    • Predicate : A named function that resolves to a set of tuples that can be constants or values in the database.
    • Source : A potential origin of insecure or untrusted input in a codebase. Examples are request handlers or reading from a socket.
    • Sink : A location in the codebase where untrusted data should not go. Examples are database queries or writing to the file system.
    • Tainting : The process of tracking the flow of potentially untrusted or insecure data from its origin (sources) through various parts of the codebase until it reaches sensitive operations or sinks where it may cause security vulnerabilities.
  • How to implement a query

CodeQL Terminology

Below you can find a list of the most common terms used in CodeQL with a bit more detailed explanations.

What is a database?

A "database" refers to a collection of code and data that has been indexed and prepared for analysis. It serves as the foundation for performing code analysis and vulnerability detection using CodeQL queries.

When performing static analysis on a codebase, CodeQL first creates a database by indexing the source code and associated metadata. For compiled languages, like Java, this process involves hooking into the compiler and extracting the program's abstract syntax tree (AST), types, and references as each file is compiled. This structure is inserted into the database. From here, higher-level information such as control flow graphs (CFGs), data flow graphs (DFGs), and static single assignment (SSA) form is computed as needed.

Once the database is constructed, it serves as a powerful, structured representation of the codebase. It allows developers and security analysts to perform complex and efficient queries to search for patterns, identify potential vulnerabilities, and gain insights into the behavior of the code.

You can create a CodeQL database in the root of a repository like this:

codeql database create ${THE_NAME_OF_THE_DATABASE} --language=java

What is a predicate?

A predicate is a fundamental concept used for defining queries and expressing patterns over source code or databases. It plays a crucial role in CodeQL queries, which are used for code analysis and vulnerability discovery. A predicate describes logical relations between tuples of constants and values in a CodeQL database. It is analogous to a method in Java.

In other words, predicates allow you to define rules or conditions that you want to check for in your codebase or database. These rules can be as simple or complex as needed, and they serve as building blocks for creating more complex queries.

Here's a brief overview of how predicates work in CodeQL:

  1. Definition: Predicates are defined using the predicate keyword followed by a unique name and a list of parameters in parentheses. The predicate's body contains the logic that evaluates the desired condition.

  2. Parameters: Predicates can take zero or more parameters, and these parameters represent elements of your code or database that the predicate operates on. For example, a predicate may take a variable as a parameter and check if it's uninitialized or if it contains certain values.

  3. Result: Strictly speaking, predicates do not "return" a value. A predicate declared with the predicate keyword has no result: it simply holds or does not hold for a given tuple of arguments. A predicate with a result is declared with its result type in place of the predicate keyword, and it evaluates to the set of all values of that type for which its body holds.

  4. Re-usability: Predicates can be reused in multiple queries, making it easier to express complex patterns by combining simpler predicates together.

This is a basic example of a predicate in CodeQL that checks whether a given variable is assigned a constant value:

import java

/** Holds if `v` is assigned a compile-time constant somewhere in the code. */
predicate isAssignedConstant(Variable v) {
  exists(CompileTimeConstantExpr constant | constant = v.getAnAssignedValue())
}

In this example, isAssignedConstant is the predicate name, and it takes a variable (Variable) as a parameter. It uses the exists keyword to check whether some expression assigned to the variable (v) is a compile-time constant (CompileTimeConstantExpr).

Predicates provide a powerful way to abstract and modularize your CodeQL queries, making them easier to read, write and maintain. By creating reusable predicates, you can build a library of code analysis rules that can be applied across different projects and codebases.
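
To make the "set of tuples" idea more concrete, here is a rough analogy written in plain Java rather than CodeQL. It is only a sketch: the PredicateAnalogy class, its Assignment type, and the three-entry DATABASE list are hypothetical names invented for this illustration, and a real CodeQL database is far richer. The sketch shows that evaluating a predicate amounts to collecting every value for which a condition holds, rather than computing a single answer.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class PredicateAnalogy {
    // Hypothetical stand-in for one database fact: "this variable is assigned a value".
    static final class Assignment {
        final String variable;
        final boolean constantValue;

        Assignment(String variable, boolean constantValue) {
            this.variable = variable;
            this.constantValue = constantValue;
        }
    }

    // Toy "database" holding three assignment facts (assumed data, for illustration only).
    static final List<Assignment> DATABASE = List.of(
        new Assignment("timeout", true),
        new Assignment("name", false),
        new Assignment("retries", true));

    // Analogue of a predicate without a result: it holds or does not hold for a variable.
    static boolean isAssignedConstant(String variable) {
        return DATABASE.stream()
            .anyMatch(a -> a.variable.equals(variable) && a.constantValue);
    }

    // What evaluation conceptually yields: the set of all variables for which the predicate holds.
    static Set<String> allConstantVariables() {
        Set<String> result = new TreeSet<>();
        for (Assignment a : DATABASE) {
            if (isAssignedConstant(a.variable)) {
                result.add(a.variable);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allConstantVariables()); // prints [retries, timeout]
    }
}
```

In CodeQL you never write the loop yourself: you state the condition, and the evaluator finds every matching tuple in the database.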

What are sources and sinks?

A "source" refers to a particular kind of data that represents a potential origin of insecure or untrusted input in a codebase. Sources are an essential concept in code analysis and vulnerability detection as they help identify potential points where external data might enter a system, which could lead to security vulnerabilities if not properly handled.

For example, consider a web application that takes user input through HTTP requests. The user input, such as data submitted through web forms or query parameters in URLs, could be a source of security issues if not handled correctly. This input data could be used in various parts of the application, such as database queries or command execution, and if it is not properly sanitized or validated, it could lead to security vulnerabilities such as SQL injection or command injection.

CodeQL allows you to track untrusted data in your codebase by defining both sources and sinks. A "sink" is a location in the codebase where the data from a source is used in a potentially unsafe manner. Sink locations represent places in the code where external or untrusted data is processed or used without proper validation, sanitization, or handling, making the application susceptible to security vulnerabilities.

Identifying sources and sinks is crucial for understanding how data flows through the application and how it is processed. By connecting sources to sinks, CodeQL can pinpoint potential security issues and vulnerabilities in the codebase.

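Here's a simplified, pseudocode-style illustration of a source and a corresponding sink in CodeQL. The Source and Sink base classes and the hasTypeName, hasMethodName, and hasParameter predicates are placeholders used for illustration, not part of the standard CodeQL libraries; a real query expresses the same idea through the isSource and isSink predicates of a data flow configuration: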

// Define a source that represents user input from HTTP requests
class HttpRequest extends Source {
  HttpRequest() {
    this.hasTypeName("javax.servlet.http.HttpServletRequest");
  }
}

// Define a sink that represents a potentially unsafe usage of the user input (e.g., SQL query)
class DatabaseQuerySink extends Sink {
  DatabaseQuerySink() {
    this.hasMethodName("executeQuery");
  }
}

// Connect the source to the sink to detect potential security issues
from Source source, Sink sink
where
  source.hasParameter(sink)
select sink, source

Here, we define a source HttpRequest that represents an instance of javax.servlet.http.HttpServletRequest, which is commonly used to handle incoming HTTP requests in web applications. We also define a sink DatabaseQuerySink that represents a method call to executeQuery, which could be a database query operation.

The from clause declares the source and sink variables, and the where clause restricts the results to cases where the source is used as a parameter of the sink. If CodeQL finds any matches for this pattern, it would indicate a potential security issue where user input from an HTTP request is directly used in a database query without proper validation or sanitization.

By identifying sources and their potential unsafe usages (sinks), CodeQL helps developers and security analysts detect and fix security vulnerabilities and prevent attacks (such as SQL injection and command injection) in their codebases.

What is tainting?

"Tainting" refers to the process of tracking the flow of potentially untrusted or insecure data from its origin (sources) through various parts of the codebase until it reaches sensitive operations or sinks where it may cause security vulnerabilities. The concept of tainting is a fundamental technique used in security analysis to identify potential security risks and vulnerabilities related to data flow in a software system.

Here's how tainting works in CodeQL:

  • Source Taints: At the start of the analysis, specific data sources (e.g., user input, environment variables, file input) are marked or "tainted" to indicate that they might contain untrusted data.

  • Propagation: As the analysis proceeds, the taint information associated with a tainted source spreads through the data flow in the codebase. If a tainted value is used in a computation, assignment, or passed as an argument to a function, the result of that operation is also marked as tainted.

  • Sink Detection: The analysis identifies sinks where the tainted data may cause security issues if it is not handled properly. Sinks are places in the code where external data is used in a potentially unsafe manner, such as in database queries, command executions, or web output.

  • Vulnerability Identification: The goal of tainting is to trace the flow of tainted data to sinks. If a tainted value reaches a sink without proper validation, sanitization, or escaping, it indicates a potential security vulnerability, such as SQL injection, cross-site scripting (XSS), or command injection.

By tracing the taint flow in the codebase and connecting tainted sources to sinks, CodeQL enables security analysts to perform comprehensive security analysis, identify potential security issues, and assist developers in fixing these vulnerabilities.
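
The four steps above can be mimicked with a toy model in plain Java. This is only a sketch of the idea, not how CodeQL is implemented: the TaintSketch class and all of its methods are hypothetical names invented here, and the quote-doubling "sanitizer" is a placeholder (real code should prefer parameterized queries). Each value carries a flag saying whether it may contain untrusted data, the flag spreads through concatenation, and the sink reports a finding when a flagged value arrives.

```java
class TaintSketch {
    // A string paired with a flag saying whether it may hold untrusted data.
    static final class Value {
        final String text;
        final boolean tainted;

        Value(String text, boolean tainted) {
            this.text = text;
            this.tainted = tainted;
        }
    }

    // Source taint: anything read from the user starts out tainted.
    static Value source(String userInput) {
        return new Value(userInput, true);
    }

    // A literal written by the developer is trusted.
    static Value literal(String text) {
        return new Value(text, false);
    }

    // Propagation: the result is tainted if either operand is tainted.
    static Value concat(Value a, Value b) {
        return new Value(a.text + b.text, a.tainted || b.tainted);
    }

    // Placeholder sanitizer that the model chooses to trust: doubles quotes and clears the flag.
    static Value sanitize(Value v) {
        return new Value(v.text.replace("'", "''"), false);
    }

    // Sink detection and vulnerability identification: flag tainted data reaching the sink.
    static boolean sinkIsVulnerable(Value query) {
        return query.tainted;
    }

    public static void main(String[] args) {
        Value input = source("x' OR '1'='1");
        Value unsafe = concat(literal("SELECT * FROM t WHERE c = '"), input);
        Value safe = concat(literal("SELECT * FROM t WHERE c = '"), sanitize(input));
        System.out.println(sinkIsVulnerable(unsafe)); // prints true
        System.out.println(sinkIsVulnerable(safe));   // prints false
    }
}
```

The model clears the flag only because it was told to trust sanitize(); whether a given function deserves that trust is a decision the query author has to make.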

Here's a simple example of how taint flows through Java code. Consider the following code snippet:

String userInput = getUserInput();
String sanitizedInput = sanitize(userInput);
executeQuery("SELECT * FROM table WHERE column = " + sanitizedInput);

In this example, getUserInput() is considered a source, as it takes user input, and executeQuery() is considered a sink, as it uses the input in a database query. The variable sanitizedInput is used to store the result of the sanitize() function, which presumably removes any potentially dangerous characters from the user input.

During taint analysis, the tainted status of userInput propagates to sanitizedInput because sanitizedInput is computed from userInput by the sanitize() call, and from there into the string passed to executeQuery(). If the sanitize() function is inadequate or improperly implemented, the tainted data may reach the database query, potentially leading to a SQL injection vulnerability.

By understanding the taint flow in the code, CodeQL can identify this potential security issue and help developers ensure that the user input is appropriately handled and sanitized before being used in sensitive operations like database queries.

What IDEs Support CodeQL?

The only IDE that currently supports CodeQL is Visual Studio Code (VSCode).

To write or test custom queries in VSCode, you will need to:

  • Install the CodeQL extension.
  • Create a database for the repository you would like to test against (for example, vertx-vulns).
  • Add the CodeQL database.

If you would like to get support for CodeQL in IntelliJ IDEA, you may consider voting for it under the IDEA-281216: Add support for CodeQL to IntelliJ Idea issue.

How To Test A CodeQL Query Via The CLI

  1. Create a CodeQL database in the root of the repository you would like to test against by executing the following (for example, inside the root of the vertx-vulns repository):
codeql database create vertx-vulns --language=java
  2. To execute a single CodeQL query called InsecureHttpServer.ql against the database, execute:
codeql query run -v --database=/java/vertx-vulns/vertx-vulns/ InsecureHttpServer.ql

Contributing CodeQL Queries For Vert.X To carlspring/vertx-codeql-queries

The code for the custom CodeQL query pack for Vert.X can be found in the carlspring/vertx-codeql-queries GitHub repository.

We are collecting sample code for anti-patterns, as well as illustrations of what the correct implementation should look like. If you would like to contribute such examples, you can do so in the carlspring/vertx-vulns GitHub repository.

If you would like to help expand the support for Vert.X, you can:

For each query, there needs to be:

  • An example of the "bad" (anti-pattern) code
  • An example of the "good" code
  • A CodeQL query
  • A CodeQL query help file
  • At least one test case

Adding Query Help Files

A query help file (.qhelp) is used to explain what a given security vulnerability or anti-pattern is and how to fix it, if it is discovered in the scanned code. It should contain one or more examples of "bad" and "good" code.

Query help files are written in a custom XML format which is then converted to Markdown during the build. These help files should include:

  • Examples of bad code (these should be references to actual source files).
  • Examples of proper implementations (these should be references to actual source files).
  • References to articles with further explanations, if possible.

To generate the Markdown for the .qhelp files, execute the following in the directory where your queries and help files are located:

codeql generate query-help *.qhelp --format=markdown -o .

The actual Markdown files should not be added to Git, as they should be produced by the CI/CD pipeline instead.

You can find more about query help files in the CodeQL: Query Help Files article, as well as real-life examples of qhelp files here.

Writing Test Cases

Test cases should be placed under the vertx-codeql-queries/ql/test/query-tests directory.

When you initially set up the tests, you will need to create:

  • A directory with the same name as the CodeQL query to store the test's resources, so that they are isolated from those of other tests.
  • An options file that looks something like this:
//semmle-extractor-options: --javac-args -cp ${testdir}/../../stubs/ -source 17
  • A pom.xml file that defines your third-party dependencies. (Please note that only Maven pom.xml files are supported at present.) This build file is only used once, to generate stubs for the sources of your third-party dependencies. This means that the third-party dependencies are not downloaded as source artifacts; instead, CodeQL generates no-op methods, which reduces processing time. To generate the stubs for the required sources of your dependencies, invoke the makeStubs.py script like this:
carlspring@carlspring:/java/vertx-codeql-queries/vertx-codeql-queries/ql/test$ mkdir stubs
carlspring@carlspring:/java/vertx-codeql-queries/vertx-codeql-queries/ql/test$ cd query-tests/InsecureCorsHttpOrigin
carlspring@carlspring:/java/vertx-codeql-queries/vertx-codeql-queries/ql/test/query-tests/InsecureCorsHttpOrigin$ python3 \
   /java/codeql/java/ql/src/utils/stub-generator/makeStubs.py . ../../stubs/ pom.xml 
  • An .expected file which should have the same name as the test and directory name. This should contain the output the test is expecting to be produced. This is the actual "test". You can leave this file blank the first time you execute the tests and wait for CodeQL to generate an .actual file for you, as it will contain the actual output. If this looks correct, you can copy this over to the .expected file. The .actual files should not be added to Git.
  • A .qlref file which should have the same name as the test and directory name. This should contain the name of the respective CodeQL query that is being tested. For example, if your query is called InsecureCorsHttpOrigin.ql it should look like this:
InsecureCorsHttpOrigin.ql
  • An example of a source file with bad code. (Obviously, the test SHOULD identify this as bad code).
  • An example of a source file with good code. (Obviously, the test SHOULD NOT identify this as bad code).

To execute the tests:

carlspring@carlspring:/java/vertx-codeql-queries/ql/test$ codeql test run . --additional-packs=../src
Executing 3 tests in 3 directories.
Extracting test database in /java/vertx-codeql-queries/ql/test/query-tests/InsecureHttpServer.
Compiling queries in /java/vertx-codeql-queries/ql/test/query-tests/InsecureHttpServer.
Executing tests in /java/vertx-codeql-queries/ql/test/query-tests/InsecureHttpServer.
[1/3 comp 9.5s eval 943ms] PASSED /java/vertx-codeql-queries/ql/test/query-tests/InsecureHttpServer/InsecureHttpServer.qlref
Extracting test database in /java/vertx-codeql-queries/ql/test/query-tests/InsecureCorsHttpOrigin.
Compiling queries in /java/vertx-codeql-queries/ql/test/query-tests/InsecureCorsHttpOrigin.
Executing tests in /java/vertx-codeql-queries/ql/test/query-tests/InsecureCorsHttpOrigin.
[2/3 comp 124ms eval 459ms] PASSED /java/vertx-codeql-queries/ql/test/query-tests/InsecureCorsHttpOrigin/InsecureCorsHttpOrigin.qlref
Extracting test database in /java/vertx-codeql-queries/ql/test/query-tests/InsecureCorsWildcardOrigin.
Compiling queries in /java/vertx-codeql-queries/ql/test/query-tests/InsecureCorsWildcardOrigin.
Executing tests in /java/vertx-codeql-queries/ql/test/query-tests/InsecureCorsWildcardOrigin.
[3/3 comp 104ms eval 476ms] PASSED /java/vertx-codeql-queries/ql/test/query-tests/InsecureCorsWildcardOrigin/InsecureCorsWildcardOrigin.qlref
All 3 tests passed.

A complete working example of this can be found here. You can also use carlspring/vertx-codeql-queries#2 as inspiration.

Useful Links