Tech Hub

@ Solution Architecture Works

Advanced Security on GitHub – Part 2 of 2

Prepare a Database for CodeQL

CodeQL treats code as data. You create a database by extracting queryable data from your codebase. Then, you can run CodeQL queries on this database to identify security vulnerabilities, bugs, and other errors. You can write your own queries or use those provided by GitHub researchers and the community.

In this unit, you will learn how to create a CodeQL database. This step is required before you can analyze your code, because the database holds all the data needed to run queries on it.

CodeQL analysis relies on extracting relational data from your code to build a CodeQL database. These databases contain all the important information about a codebase.

You can use the CodeQL command-line tool (CLI) to analyze the code and generate a database representation. Once the database is ready, you can query it or run a suite of queries to generate a set of results in SARIF format (Static Analysis Results Interchange Format).
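
As a rough sketch of that last step, the shell function below wraps codeql database analyze to write SARIF output. The function name and the example paths are hypothetical. When no query suite is given, the CLI is expected to fall back to the default queries for the database's language.

```shell
# Analyze an existing CodeQL database and write the results as SARIF.
# $1: path to the database, $2: path of the SARIF file to write
# (both paths are placeholders chosen for this sketch)
analyze_to_sarif() {
    codeql database analyze "$1" --format=sarif-latest --output="$2"
}

# Example call, assuming a database already exists at ./codeql-db:
#   analyze_to_sarif ./codeql-db results.sarif
```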

Preparing the Database for CodeQL
Before generating a CodeQL database, you need to install and configure the CodeQL CLI. Then, you must check out the version of the code you want to analyze.

The directory must be ready to build, with all dependencies installed. CodeQL starts by extracting a relational representation of each source file to create the database.
For compiled languages, extraction works by monitoring the normal build process: each time the compiler is invoked on a source file, CodeQL makes a copy of that file and collects all relevant information about it.
For interpreted languages, the extractor runs directly on the source code and resolves dependencies to give an accurate representation of the codebase.

Configuring the CLI
Here are the steps to configure the CodeQL CLI:

  • Download the .zip archive of the CodeQL CLI bundle
    It is recommended to download the full bundle (CLI + queries) to ensure compatibility and better performance.
    The bundle contains: the CodeQL CLI, compatible versions of queries and libraries from the GitHub CodeQL repository, and precompiled versions of the included queries.
  • Go to the Releases page of the public CodeQL repository.
  • Download the bundle specific to your platform under the Assets section.
    You can also download codeql-bundle.tar.gz, which contains the CLI for all supported platforms.
  • Extract the archive
    On Linux, Windows, or macOS, extract the archive into the directory of your choice.
    macOS Catalina (or later) users must follow additional steps (see the CodeQL documentation).

Running CodeQL Processes
After extraction, you can:

  • Run <extraction-path>/codeql/codeql
  • Or add <extraction-path>/codeql to your environment PATH variable to simply run codeql.

Verifying CLI Configuration

  • Run codeql resolve packs (or the full path if not added to PATH) to display available CodeQL packs.
  • Run codeql resolve languages to see the languages supported by default.
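
The PATH setup and the two verification commands can be combined as in the sketch below. The extraction directory $HOME/codeql-home is an assumption made for illustration; substitute the directory you actually extracted the bundle into.

```shell
# Hypothetical extraction location; change it to where you unpacked the bundle.
CODEQL_DIR="$HOME/codeql-home/codeql"

# Make the codeql launcher resolvable for the current shell session.
export PATH="$CODEQL_DIR:$PATH"

# Run the two verification commands only if the launcher can be found.
if command -v codeql >/dev/null 2>&1; then
    codeql resolve packs
    codeql resolve languages
else
    echo "codeql not found on PATH; check the extraction path $CODEQL_DIR" >&2
fi
```

The export only affects the current session. To make it permanent, the same line would go into your shell's startup file.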

Creating the Database
Create a CodeQL database by running this command from the root of the cloned project:

codeql database create <database> --language=<language-identifier>

In the command:
Replace <database> with the path to the new database to be created.
Replace <language-identifier> with the identifier of the language you are creating the database for. When used with --db-cluster, the --language option accepts a comma-separated list or can be specified more than once.
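
For instance, a single-language invocation might look like the sketch below. The function name, the database path codeql-db, and the choice of python are illustrative assumptions, not values taken from a real project.

```shell
# Create a database for one language from the current directory.
# $1: path of the database to create, $2: language identifier
create_db() {
    codeql database create "$1" --language="$2"
}

# Example call from the root of the cloned project:
#   create_db codeql-db python
```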

You can also specify the following options, depending on the location of the source files, whether your code needs to be compiled, or if you want to create CodeQL databases for multiple languages:

  • Use --source-root to indicate the root folder of the main source files for database creation.
  • Use --db-cluster for multilingual codebases when you want to create databases for multiple languages.
  • Use --command when creating a database for one or more compiled languages. This option is not required if you are only using Python or JavaScript.
  • Use --no-run-unnecessary-builds with --db-cluster to avoid running the build command for languages where the CodeQL CLI does not need to monitor compilation.

After successfully creating the database, a new directory appears at the location specified in the command. If you used the --db-cluster option to create multiple databases, a subdirectory is created for each language.
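
The options above can be combined for a mixed codebase, as in this sketch. The function name, the directory codeql-dbs, the language pair, and the Maven build command are all hypothetical and stand in for your own project's values.

```shell
# Create one database per language for a multilingual codebase.
# $1: parent directory for the cluster, $2: source root,
# $3: comma-separated language identifiers, $4: build command
create_db_cluster() {
    codeql database create "$1" \
        --db-cluster \
        --source-root="$2" \
        --language="$3" \
        --command="$4" \
        --no-run-unnecessary-builds
}

# Example call (hypothetical project with Java and Python sources, built by Maven):
#   create_db_cluster codeql-dbs . java,python "mvn clean package"
#   ls codeql-dbs    # should list one subdirectory per language
```

Passing --no-run-unnecessary-builds here means the build command is skipped for languages, such as Python, whose extraction does not depend on monitoring a build.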

Each CodeQL database directory contains several subfolders, including the relational data used for analysis and a source archive. This archive is a copy of the source files at the time of database creation, used by CodeQL to display analysis results.

Extractors
An extractor is a tool that produces relational data and references to sources for each input file, from which a CodeQL database can be built. Each language supported by CodeQL has its own extractor, ensuring extraction is as accurate as possible.

Each extractor defines its own set of configuration options.
Running codeql resolve extractor --format=betterjson produces data formatted like the following example:

{
    "extractor_root" : "/home/user/codeql/java",
    "extractor_options" : {
        "option1" : {
            "title" : "Java extractor option 1",
            "description" : "An example string option for the Java extractor.",
            "type" : "string",
            "pattern" : "[a-z]+"
        },
        "group1" : {
            "title" : "Java extractor group 1",
            "description" : "An example option group for the Java extractor.",
            "type" : "object",
            "properties" : {
                "option2" : {
                    "title" : "Java extractor option 2",
                    "description" : "An example array option for the Java extractor",
                    "type" : "array",
                    "pattern" : "[1-9][0-9]*"
                }
            }
        }
    }
}

To view the available options for your language’s extractor, run one of the following commands:

codeql resolve languages --format=betterjson

or

codeql resolve extractor --format=betterjson

The betterjson output format also provides the extractor’s root path along with other language-specific options.
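
To look at a single extractor, a small wrapper like the following can help. It is only a sketch: the function name is hypothetical, and it assumes the extractor is selected with the --language option of codeql resolve extractor.

```shell
# Print the extractor root and option schema for one language as JSON.
# $1: language identifier, for example java
show_extractor_options() {
    codeql resolve extractor --language="$1" --format=betterjson
}

# Example call:
#   show_extractor_options java
```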

Data in a CodeQL Database

A CodeQL database is a single directory containing all the data required for analysis. This data includes:

  • Relational data
  • A copy of the source files
  • A language-specific database schema that defines relationships between the data

CodeQL imports this data after extraction.

CodeQL databases provide a snapshot of queryable language data extracted from a codebase. This data hierarchically represents the entire code, including:

  • The Abstract Syntax Tree (AST)
  • The Data Flow Graph
  • The Control Flow Graph

For multilingual codebases, databases are generated language by language, each with its own schema. The schema serves as an interface between the initial lexical analysis and the complex analysis performed by CodeQL.

For example, a CodeQL database for a Java program contains two key tables:

  • expressions: one row for each expression parsed in the source code
  • statements: one row for each statement parsed in the source code

The CodeQL library defines the classes Expr and Stmt to provide an abstraction layer over these tables and their related auxiliary tables.

Potential Limitations of CodeQL

Database creation as part of code scanning may have certain limitations, especially with the GitHub CodeQL Action:

  • To have autobuild compile each compiled language in the repository, you must use a language matrix. A matrix lets you create jobs for multiple versions of a language, operating system, or tool.
  • Without a matrix, autobuild attempts to compile only the supported compiled language with the most source files. Analysis of the other compiled languages then fails (except for Go) unless you provide explicit build commands before the analysis step.
  • The behavior of the autobuild step varies by operating system. It tries to automatically detect a build method, which can lead to unreliable results or failures.

Recommendation:

Configure a build step in your workflow file before analysis instead of relying on autobuild to compile automatically. This makes the analysis more reliable and tailored to your project.

You can refer to the CodeQL documentation on autobuild for more details by language.

VS Code Extension

You can use Visual Studio Code (version 1.39 or later) with the CodeQL extension to compile and run queries.

The extension uses the installed CLI (if it is in the PATH). Otherwise, it automatically manages access to the executable, ensuring compatibility with the extension.
Download the extension from the Visual Studio Code Marketplace or via the CodeQL VSIX file.
