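name: semgrep
description: Run Semgrep static analysis for fast security scanning and pattern matching. Use when scanning code, writing custom YAML rules, quickly finding vulnerabilities, using taint mode, or setting up Semgrep in a CI/CD pipeline.
allowed-tools:
  - Bash
  - Read
  - Glob
  - Grep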

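Semgrep Static Analysis

When to use Semgrep

Ideal scenarios:

Fast security scans (minutes, not hours)
Pattern-based bug detection
Enforcing coding standards and best practices
Finding known vulnerability patterns
Single-file analysis that does not need complex data flow
A first pass before heavier analysis tools

Consider CodeQL instead when:

You need interprocedural taint tracking across files
You need complex data-flow analysis
You are analyzing custom proprietary frameworks

When not to use

Do not use this skill for:

Complex interprocedural data-flow analysis (use CodeQL instead)
Binary analysis, or compiled code without source
Custom deep semantic analysis that requires AST/CFG traversal
Tracking taint across many function boundaries

Installation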

# pip
python3 -m pip install semgrep

# Homebrew
brew install semgrep

# Docker
docker run --rm -v "${PWD}:/src" returntocorp/semgrep semgrep --config auto /src

# Upgrade
python3 -m pip install --upgrade semgrep

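Core workflow

1. Quick scan

semgrep --config auto .                      # auto-select rules for the project (requires metrics)
semgrep --config p/default --metrics=off .   # telemetry off for proprietary code; needs an explicit ruleset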

2. Using rulesets

semgrep --config p/<ruleset> .                              # a single ruleset
semgrep --config p/security-audit --config p/trailofbits .  # multiple rulesets

Ruleset	Description
`p/default`	General security and code quality
`p/security-audit`	Comprehensive security rules
`p/owasp-top-ten`	OWASP Top 10 vulnerabilities
`p/cwe-top-25`	CWE Top 25 vulnerabilities
`p/r2c-security-audit`	r2c security audit rules
`p/trailofbits`	Trail of Bits security rules
`p/python`	Python-specific
`p/javascript`	JavaScript-specific
`p/golang`	Go-specific
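When scans are launched from a script, the repeated `--config` flags can be assembled programmatically. The sketch below is one possible way to do that; the helper name `build_scan_command` and its parameters are hypothetical, and it only builds the argument list without running Semgrep.

```python
from typing import List


def build_scan_command(rulesets: List[str], target: str = ".",
                       metrics_off: bool = True) -> List[str]:
    """Build a `semgrep` argument list that applies every ruleset in one run.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if not rulesets:
        raise ValueError("at least one ruleset is required")
    cmd = ["semgrep"]
    for ruleset in rulesets:
        # Each ruleset gets its own --config flag.
        cmd += ["--config", ruleset]
    if metrics_off:
        # Explicit rulesets work with telemetry disabled.
        cmd.append("--metrics=off")
    # Ask for machine-readable output and scan the target path.
    cmd += ["--json", target]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where Semgrep is installed.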

3. Output formats

semgrep --config p/security-audit --sarif -o results.sarif .   # SARIF
semgrep --config p/security-audit --json -o results.json .     # JSON
semgrep --config p/security-audit --dataflow-traces .          # show data-flow traces for taint findings
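The JSON output is convenient for post-processing. As a minimal sketch, assuming the report keeps findings under a top-level `results` list with the severity stored in each finding's `extra.severity` field, a per-severity summary could look like this. `SAMPLE_REPORT` is hand-written illustrative data, not real scan output, and `summarize_findings` is a hypothetical helper name.

```python
from collections import Counter


def summarize_findings(report: dict) -> Counter:
    """Count findings per severity in a parsed Semgrep JSON report."""
    counts = Counter()
    for finding in report.get("results", []):
        # Fall back to UNKNOWN if a finding carries no severity.
        severity = finding.get("extra", {}).get("severity", "UNKNOWN")
        counts[severity] += 1
    return counts


# Hand-written sample shaped like `semgrep --json` output (illustrative only).
SAMPLE_REPORT = {
    "results": [
        {"check_id": "hardcoded-password", "path": "app.py",
         "start": {"line": 3}, "extra": {"severity": "ERROR"}},
        {"check_id": "sql-injection", "path": "db.py",
         "start": {"line": 10}, "extra": {"severity": "ERROR"}},
        {"check_id": "debug-enabled", "path": "settings.py",
         "start": {"line": 1}, "extra": {"severity": "WARNING"}},
    ],
    "errors": [],
}
```

For a real report, load the file first with `json.load(open("results.json"))` and pass the resulting dictionary to `summarize_findings`.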

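4. Scanning specific paths

semgrep --config p/python app.py           # a single file
semgrep --config p/javascript src/         # a directory
semgrep --config auto --include='**/test/**' .  # scan only paths matching the glob; .semgrepignore may still skip test directories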

Writing custom rules

Basic structure

rules:
  - id: hardcoded-password
    languages: [python]
    message: "Hardcoded password detected: $PASSWORD"
    severity: ERROR
    pattern: password = "$PASSWORD"

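Pattern syntax

Syntax	Description	Example
`...`	Matches any sequence (arguments, statements, and so on)	`func(...)`
`$VAR`	Metavariable that captures a value	`$FUNC($INPUT)`
`<... ...>`	Deep expression match	`<... user_input ...>`

Pattern operators

Operator	Description
`pattern`	Match the exact pattern
`patterns`	All must match (AND)
`pattern-either`	Any may match (OR)
`pattern-not`	Exclude matches
`pattern-inside`	Match only inside a context
`pattern-not-inside`	Match only outside a context
`pattern-regex`	Regular-expression match
`metavariable-regex`	Regex match on a captured value
`metavariable-comparison`	Compare captured values

Combining patterns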

rules:
  - id: sql-injection
    languages: [python]
    message: "Potential SQL injection"
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: cursor.execute($QUERY)
          - pattern: db.execute($QUERY)
      - pattern-not: cursor.execute("...", (...))
      - metavariable-regex:
          metavariable: $QUERY
          regex: '.*\+.*|.*\.format\(.*|.*%.*'
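To see what the `metavariable-regex` filter in this rule accepts, the same expression can be tried against sample text for `$QUERY`. This is a rough approximation using Python's `re.match`, on the assumption that the regex is applied left-anchored to the source text bound to the metavariable; the helper name `looks_dynamic` and the sample strings are hypothetical.

```python
import re

# Same expression as the rule's metavariable-regex.
QUERY_REGEX = re.compile(r".*\+.*|.*\.format\(.*|.*%.*")


def looks_dynamic(query_expr: str) -> bool:
    """Return True if the text bound to $QUERY would pass the regex filter."""
    # re.match anchors at the start of the string.
    return QUERY_REGEX.match(query_expr) is not None


SAMPLES = {
    "concat": '"SELECT * FROM users WHERE id = " + user_id',
    "format": '"SELECT * FROM users WHERE id = {}".format(user_id)',
    "percent": '"SELECT * FROM users WHERE id = %s" % user_id,',
    "like_constant": "\"SELECT * FROM users WHERE name LIKE 'a%'\"",
    "fstring": 'f"SELECT * FROM users WHERE id = {user_id}"',
    "variable": "query",
}

for name, expr in SAMPLES.items():
    print(f"{name:14} {looks_dynamic(expr)}")
```

Two things stand out in this sketch: a constant query containing `%` passes the filter (a likely false positive), while an f-string or a query built earlier and passed as a plain variable does not (likely false negatives). Cases like the last one are where taint mode tends to be a better fit.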

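Taint mode (data flow)

Plain pattern matching finds the obvious cases:

# The pattern `os.system(user_input)` catches this:
os.system(user_input)  # found

But it misses indirect flows:

# The same pattern misses this:
cmd = user_input
processed = cmd.strip()
os.system(processed)  # missed: user_input never appears in the call

Taint mode tracks data through assignments and transformations:

Source: where untrusted data enters (user_input)
Propagator: how it flows (cmd = ..., processed = ...)
Sanitizer: what makes it safe (shlex.quote())
Sink: where it becomes dangerous (os.system())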

rules:
  - id: command-injection
    languages: [python]
    message: "User input flows into command execution"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: os.system($SINK)
      - pattern: subprocess.call($SINK, shell=True)
      - pattern: subprocess.run($SINK, shell=True, ...)
    pattern-sanitizers:
      - pattern: shlex.quote(...)
      - pattern: int(...)
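The roles of source, propagator, sanitizer, and sink can be made concrete with a toy tracker built on Python's `ast` module. This is only a conceptual sketch and not how Semgrep is implemented: it handles straight-line, top-level statements, treats the name `user_input` as the only source, and the function names are hypothetical. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast

SOURCES = {"user_input"}        # names treated as untrusted (assumption)
SANITIZERS = {"shlex.quote"}    # calls whose result is considered safe
SINKS = {"os.system"}           # calls that must not receive tainted data


def is_tainted(expr: ast.AST, tainted: set) -> bool:
    """Return True if the expression depends on a tainted name."""
    if isinstance(expr, ast.Call) and ast.unparse(expr.func) in SANITIZERS:
        return False  # a sanitizer call cleans whatever is passed to it
    if isinstance(expr, ast.Name):
        return expr.id in tainted
    return any(is_tainted(child, tainted) for child in ast.iter_child_nodes(expr))


def find_tainted_sinks(code: str) -> list:
    """Return line numbers of sink calls that receive tainted arguments."""
    tainted = set(SOURCES)
    findings = []
    for stmt in ast.parse(code).body:
        if isinstance(stmt, ast.Assign):
            # Propagation: the target inherits the taint status of the value.
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    if is_tainted(stmt.value, tainted):
                        tainted.add(target.id)
                    else:
                        tainted.discard(target.id)
        elif isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            call = stmt.value
            if ast.unparse(call.func) in SINKS and any(
                is_tainted(arg, tainted) for arg in call.args
            ):
                findings.append(stmt.lineno)
    return findings


VULNERABLE = "cmd = user_input\nprocessed = cmd.strip()\nos.system(processed)\n"
SAFE = "cmd = shlex.quote(user_input)\nos.system(cmd)\n"

print(find_tainted_sinks(VULNERABLE))
print(find_tainted_sinks(SAFE))
```

The vulnerable snippet is reported at its `os.system` call because taint propagates through both assignments, while the sanitized snippet produces no findings.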

Complete rule with metadata

rules:
  - id: flask-sql-injection
    languages: [python]
    message: "SQL injection: user input flows into a non-parameterized query"
    severity: ERROR
    metadata:
      cwe: "CWE-89: SQL Injection"
      owasp: "A03:2021 - Injection"
      confidence: HIGH
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: cursor.execute($QUERY)
      - pattern: db.execute($QUERY)
    pattern-sanitizers:
      - pattern: int(...)
    # Autofix template: `params` is a placeholder, so review the result before applying it.
    fix: cursor.execute($QUERY, (params,))

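Testing rules

Test file format

# test_rule.py (Semgrep pairs a test file with the rule file of the same base name, e.g. test_rule.yaml)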
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))

semgrep --test rules/
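The annotation convention is simple: a `# ruleid:` comment marks the next line as an expected finding, and a `# ok:` comment marks the next line as an expected non-finding. The following sketch extracts those expectations from a test file; it is a hypothetical helper for understanding the format and not part of Semgrep.

```python
import re
from typing import Dict, List

# Matches "# ruleid: some-rule" and "# ok: some-rule" comments.
ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*([\w.\-]+)")


def expected_lines(source: str) -> Dict[str, List[int]]:
    """Map 'ruleid' and 'ok' to the 1-based line numbers they refer to."""
    expected = {"ruleid": [], "ok": []}
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match:
            # The annotation refers to the line right after the comment.
            expected[match.group(1)].append(lineno + 1)
    return expected


SAMPLE = '''\
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))
'''

print(expected_lines(SAMPLE))
```

Here line 4 is expected to be flagged and line 9 is expected to stay clean; `semgrep --test` reports a failure when the rule's actual findings disagree with annotations like these.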

CI/CD integration (GitHub Actions)

name: Semgrep

on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 0 1 * *'  # monthly, at 00:00 UTC on the 1st

jobs:
  semgrep:
    runs-on: ubuntu-latest
    container:
      image: returntocorp/semgrep

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, needed for diff-aware scans

      - name: Run Semgrep
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            semgrep ci --baseline-commit ${{ github.event.pull_request.base.sha }}
          else
            semgrep ci
          fi
        env:
          SEMGREP_RULES: >-
            p/security-audit
            p/owasp-top-ten
            p/trailofbits

Configuration

.semgrepignore

tests/fixtures/
**/testdata/
generated/
vendor/
node_modules/

Suppressing false positives

password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
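The two comment forms differ in scope: a bare `nosemgrep` silences every rule on that line, while `nosemgrep: <rule-id>` silences only the named rule. A simplified model of that distinction is sketched below; it only inspects a single line of text, the function name `is_suppressed` is hypothetical, and Semgrep's real matching of comments and rule IDs is more involved.

```python
import re

# Optional ": id1, id2" list after the nosemgrep keyword.
NOSEMGREP = re.compile(
    r"nosemgrep(?::\s*(?P<ids>[\w.\-]+(?:\s*,\s*[\w.\-]+)*))?"
)


def is_suppressed(line: str, rule_id: str) -> bool:
    """Return True if the line carries a suppression that covers rule_id."""
    match = NOSEMGREP.search(line)
    if match is None:
        return False
    ids = match.group("ids")
    if ids is None:
        return True  # bare nosemgrep: every rule is silenced on this line
    return rule_id in [item.strip() for item in ids.split(",")]
```

Prefer the rule-specific form where possible, since a bare `nosemgrep` also hides findings from rules added later.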

Performance

semgrep --config rules/ --time .    # report per-rule timing
ulimit -n 4096                       # raise the file-descriptor limit for large codebases

Path filtering in rules

rules:
  - id: my-rule
    paths:
      include: [src/]
      exclude: [src/generated/]

Third-party rules

pip install semgrep-rules-manager
semgrep-rules-manager --dir ~/semgrep-rules download
semgrep -f ~/semgrep-rules .

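Rationalizations to reject

Shortcut	Why it is wrong
"Semgrep found nothing, so the code is clean"	Semgrep is pattern-based; it cannot follow complex data flow across functions
"I wrote a rule, so I have coverage"	Rules must be tested with `semgrep --test`; false negatives are silent
"Taint mode catches injection"	Only if every source, sink, and sanitizer is defined correctly
"Pro rules are comprehensive"	Pro rules are good but not exhaustive; add custom rules to cover your codebase
"Too many findings = noisy tool"	A high finding count often points to real problems; tune the rules instead of disabling them

Resources

Registry: https://semgrep.dev/explore
Playground: https://semgrep.dev/playground
Docs: https://semgrep.dev/docs/
Trail of Bits rules: https://github.com/trailofbits/semgrep-rules
Blog: https://semgrep.dev/blog/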