semgrep-core
contributing
The following explains how to build semgrep-core
so you can make and test changes to the OCaml code. Once you have semgrep-core
installed, you can refer to semgrep-contributing to see how to build and run the Semgrep application.
Build semgrep-core
This document assumes you are building on macOS and have already installed the Homebrew package manager. Installation commands and package names may vary slightly on other operating systems.
Check out the code
Begin by cloning the Semgrep repository from GitHub. Each parser's tree-sitter code is managed as a separate submodule, so pass --recurse-submodules
to ensure they are cloned as well.
git clone --recurse-submodules https://github.com/semgrep/semgrep
cd semgrep
If you have already cloned without submodules, you can check them out as a separate step from the root of the repository:
git submodule update --init --recursive
Prerequisites
semgrep-core
is written primarily in OCaml. You must install OCaml and its package manager, opam, and pin the current compiler version. On macOS, do this with the following steps:
brew install opam
opam init
opam switch create semgrep 5.3.0
eval $(opam env)
Next, install some base packages required for setup and compilation.
brew install pkg-config bash
Lastly, you will almost certainly want the Python environment for semgrep-cli
configured before proceeding. Please refer to the Set up the environment documentation.
Once you've returned here, ensure that your shell is able to enter the Python virtual environment.
cd cli; pipenv shell # enter the virtual environment
cd .. # from within the virtual environment, return to the repo root
First-time installation
The root Makefile
contains targets that take care of building the right components. It is commented; please refer to it and keep it up to date.
To install all necessary dependencies, run
make setup
Next, to install semgrep-core, run
make core
Finally, test the installation with
bin/semgrep-core -help
If you would like to finish the Semgrep installation, return to the Python-side instructions.
Rebuild after a change
Unless there is a significant dependency change, you won't need to run make setup
again.
The Semgrep team has provided useful targets to help you build and link the entire semgrep project, including both semgrep-core and semgrep. You may find these helpful.
To install the latest OCaml binaries and semgrep
binary after pulling source code changes from Git, run:
make rebuild
To install after you make a change locally, run
make build # or just `make`
After making either of these targets, semgrep
runs with all your local changes, OCaml and Python both.
Because this updates the `semgrep` binary, you will encounter errors when running these commands if your Python environment is not configured properly. Follow the procedure under [Development](#development).
Development
In practice, it is not always convenient to use make build or make rebuild. make rebuild updates everything within the project; make build compiles and installs all the binaries. You can do this yourself in a more targeted fashion.
Below is a flow appropriate for frequent developers of semgrep-core.
After you pull, run
git submodule update --recursive
This will update internal dependencies. (We suggest aliasing it to uu.)
After tree-sitter
is updated, you may need to reconfigure it. If so, run
make config
Develop semgrep-core
If you are developing semgrep-core, use the Makefile in the repository root for the core and core-test targets; the code is primarily in src/.
The following assumes you are in the repository root.
After you pull or make a change, compile using
make
This will build an executable for semgrep-core in _build/default/src/main/Main.exe (we suggest aliasing this to sc). Try it out by running
_build/default/src/main/Main.exe -help
When you are done, test your changes with
make core-test
Finally, to update the semgrep-core binary used by semgrep, run
make copy-core-for-cli
Test semgrep-core
Running make test in the repository root directory runs tests that check that code is parsed correctly and that patterns perform as expected. To add a test, go to the appropriate language subdirectory, tests/patterns/[LANG], and create a target file (with the file extension expected for that language) and a .sgrep file containing a pattern. The test suite checks that every location marked with a comment containing ERROR is matched by the pattern in the .sgrep file. See existing tests for examples.
If you are diagnosing test failures, re-running the entire test suite is time-consuming. Running make retest re-runs only the tests that failed.
Development environment
OCaml installations include a language server that most modern editors like Neovim and Emacs support out of the box.
You can also use Visual Studio Code (vscode) to edit the code of Semgrep. The reason-vscode Marketplace extension adds support for OCaml/Reason.
The OCaml and Reason IDE extension by @freebroccolo is another valid extension, but it does not seem to be as actively maintained as reason-vscode.
The Semgrep source also contains a .vscode/ directory at its root with a task file to automatically build Semgrep from vscode.
Note that dune and ocamlmerlin must be in your PATH for vscode to correctly build and provide cross-references for the code. If you run into problems, run:
cd /path/to/semgrep
eval $(opam env)
dune --version # just checking dune is in your PATH
ocamlmerlin -version # just checking ocamlmerlin is in your PATH
code .
Test Semgrep performance
Explore results from a slow run of Semgrep
Interpret the result object
For full timing information, run Semgrep with the --time and --json flags. In addition, you can add time at the beginning of the command to get the true wall time. The --json argument produces a large amount of output, so redirecting the output to a file with -o is recommended.
See the following example for the full command:
time semgrep --config=auto --time --json -o result.json PATH/TO/SRC
Substitute the optional placeholder PATH/TO/SRC
with the path to your source code.
Here is an example result object.
{ "results": [],
"paths": {},
"errors": [],
"time": {
"max_memory_bytes": 48693248,
"profiling_times": {
"config_time": 0.0624239444732666,
"core_time": 0.11341428756713867,
"ignores_time": 0.00017690658569335938,
"total_time": 0.17628788948059082
},
"rules": [
{
"id": "test-rule"
}
],
"rules_parse_time": 0.0013418197631835938,
"targets": [
{
"match_times": [
5.9604644775390625e-06
],
"num_bytes": 340,
"parse_times": [
0.0071868896484375
],
"path": "test_functions.java",
"run_time": 0.011521100997924805
}
],
"total_bytes": 340
}
}
All the information about timing is contained under time.
The first section is profiling_times. This contains wall time durations of various relevant steps:
- Getting the rule config files (config_time)
- Running the main engine (core_time)
- Processing the ignores (ignores_time)
The total_time
field represents the sum of these steps.
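As a quick sanity check, the relationship between the individual steps and total_time can be verified directly. The following is a minimal sketch that uses the profiling_times values from the example result object; the function name profiling_breakdown is hypothetical and not part of Semgrep. In the example data, the recorded total is slightly larger than the sum of the three steps, so a small unaccounted remainder appears to be normal.

```python
import json

# Values copied from the "profiling_times" section of the example result object.
SAMPLE = """
{"time": {"profiling_times": {
    "config_time": 0.0624239444732666,
    "core_time": 0.11341428756713867,
    "ignores_time": 0.00017690658569335938,
    "total_time": 0.17628788948059082}}}
"""


def profiling_breakdown(result):
    """Return (sum of the steps, total_time, unaccounted remainder)."""
    times = result["time"]["profiling_times"]
    # Add up every step except the recorded total itself.
    step_sum = sum(v for k, v in times.items() if k != "total_time")
    total = times["total_time"]
    return step_sum, total, total - step_sum


step_sum, total, unaccounted = profiling_breakdown(json.loads(SAMPLE))
print(f"steps: {step_sum:.6f}s  total: {total:.6f}s  unaccounted: {unaccounted:.6f}s")
```

To run this against a real result, replace SAMPLE with the contents of the file written by -o.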
The remaining fields report engine performance. Together, rules_parse_time and targets should capture all the time spent running semgrep-core.
rules_parse_time is straightforward. It records the time spent parsing the rules file.
targets is harder to interpret. Because files are processed in parallel, the time spent parsing (parse_times) and matching (match_times) cannot be meaningfully compared against total_time or core_time. Therefore, the total run time (run_time) of each target for each rule is taken within the parallel run. This helps contextualize the time spent parsing and matching each target. The sum of the run times can thus be (and usually should be) longer than the total time.
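One way to put run_time to use is to add it up across targets and list the slowest files. The sketch below assumes the structure of the time section shown in the example result object. Only the first target comes from that example; the other two targets and the helper name summarize_run_times are invented for illustration.

```python
# The first target is copied from the example result object.
# The other two targets are hypothetical entries invented for illustration.
TIME_SECTION = {
    "profiling_times": {"core_time": 0.11341428756713867},
    "targets": [
        {"path": "test_functions.java", "run_time": 0.011521100997924805},
        {"path": "hypothetical_a.java", "run_time": 0.09},
        {"path": "hypothetical_b.java", "run_time": 0.07},
    ],
}


def summarize_run_times(time_section, top_n=2):
    """Return the summed run_time and the paths of the top_n slowest targets."""
    targets = time_section["targets"]
    run_time_sum = sum(t["run_time"] for t in targets)
    slowest = sorted(targets, key=lambda t: t["run_time"], reverse=True)[:top_n]
    return run_time_sum, [t["path"] for t in slowest]


run_time_sum, slowest_paths = summarize_run_times(TIME_SECTION)
core_time = TIME_SECTION["profiling_times"]["core_time"]
print(f"sum of run_time: {run_time_sum:.4f}s  core_time: {core_time:.4f}s")
print("slowest targets:", slowest_paths)
```

With these made-up numbers the summed run time exceeds core_time, which is the expected outcome when targets are processed in parallel.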
The lists match_times and parse_times are in the same order as rules. That is, the match time of rule rules[0] is match_times[0].
Note that parse_times is given for each rule, but a file should only be parsed once (the first number). Afterwards, the parse time represents the time spent retrieving the file's AST from the cache.
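Because the lists share the order of rules, per-rule totals can be computed by pairing each rule id with the matching index of match_times. The following sketch uses a trimmed copy of the example result object; the helper names match_time_per_rule and split_parse_times are hypothetical. It also separates the first parse time from the later entries, on the assumption stated above that those later entries are cache lookups.

```python
from collections import defaultdict

# Trimmed copy of the "time" section from the example result object.
TIME_SECTION = {
    "rules": [{"id": "test-rule"}],
    "targets": [
        {
            "path": "test_functions.java",
            "match_times": [5.9604644775390625e-06],
            "parse_times": [0.0071868896484375],
        }
    ],
}


def match_time_per_rule(time_section):
    """Sum match times across all targets, keyed by rule id."""
    rule_ids = [rule["id"] for rule in time_section["rules"]]
    totals = defaultdict(float)
    for target in time_section["targets"]:
        # match_times[i] belongs to rules[i], so zip pairs them up.
        for rule_id, seconds in zip(rule_ids, target["match_times"]):
            if seconds >= 0:  # skip negative placeholders (unmeasured times)
                totals[rule_id] += seconds
    return dict(totals)


def split_parse_times(target):
    """Return (first parse time, total of the later entries) for one target."""
    times = target["parse_times"]
    if not times:
        return None, 0.0
    first, *rest = times
    return first, sum(t for t in rest if t >= 0)


per_rule = match_time_per_rule(TIME_SECTION)
first_parse, cache_total = split_parse_times(TIME_SECTION["targets"][0])
print(per_rule, first_parse, cache_total)
```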
Negative values in the metrics
When a time is not measured, it has the default value -1. It is common to have a normal run time but -1 for the parse time or match time; this indicates an error in parsing.
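A short filter can surface the targets affected by this. The sketch below assumes the targets structure from the example result object. The second target, broken.java, and the function name targets_with_unmeasured_times are invented for illustration.

```python
# The first target is copied from the example result object.
# The second target is a hypothetical entry showing the -1 placeholder.
TIME_SECTION = {
    "targets": [
        {
            "path": "test_functions.java",
            "match_times": [5.9604644775390625e-06],
            "parse_times": [0.0071868896484375],
            "run_time": 0.011521100997924805,
        },
        {
            "path": "broken.java",
            "match_times": [-1],
            "parse_times": [-1],
            "run_time": 0.002,
        },
    ]
}


def targets_with_unmeasured_times(time_section):
    """Return the paths of targets with a negative parse or match time."""
    flagged = []
    for target in time_section["targets"]:
        times = target["parse_times"] + target["match_times"]
        if any(t < 0 for t in times):
            flagged.append(target["path"])
    return flagged


flagged_paths = targets_with_unmeasured_times(TIME_SECTION)
print("targets with unmeasured times:", flagged_paths)
```

Excluding the flagged targets before averaging or summing keeps the -1 placeholders from skewing the statistics.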
Tips for exploring Semgrep results
Several scripts already exist to analyze and summarize this timing data. Find them in scripts/processing-output. If you have a timing file, you can run
python read_timing.py [your_timing_file]
You may need to adjust the line result_times = results depending on whether you have a timing file or the full results (in which case it should be result_times = results["time"]).
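Instead of editing that line by hand, the choice can be made at load time. This is a minimal sketch, not the code in read_timing.py. It assumes that a timing-only file holds the contents of the time section at the top level, and the function name load_time_section is hypothetical.

```python
import json


def load_time_section(text):
    """Return the timing dict from a full result object or a timing-only file."""
    data = json.loads(text)
    # Full results nest the timing data under "time".
    # A timing-only file is assumed to hold that data at the top level.
    return data["time"] if "time" in data else data


full_results = '{"results": [], "errors": [], "time": {"total_bytes": 340}}'
timing_only = '{"total_bytes": 340}'

from_full = load_time_section(full_results)
from_timing = load_time_section(timing_only)
print(from_full, from_timing)
```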