Name: Extract Skill
Rating: 5 (12 reviews)
Author: starlake

name: extract
description: Extract schema and data from a JDBC source

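Extract Skill

Combines schema extraction and data extraction in a single command. It first extracts the database schema metadata into Starlake YAML files, then extracts the actual data into files. This is a convenience command that runs extract-schema and extract-data in sequence.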

Usage

starlake extract [options]

Options

Accepts all options of both extract-schema and extract-data.

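Schema Extraction Options

--config <value>: Database tables and connection info
--outputDir <value>: Path where the YML files are written
--tables <value>: Database tables to extract
--connectionRef <value>: JDBC connection reference
--all: Extract all schemas and tables
--external: Output the YML files to the external folder
--parallelism <value>: Parallelism level
--snakecase: Apply snake case to column names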

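Data Extraction Options

--limit <value>: Maximum number of records to extract
--numPartitions <value>: Number of partitions used for parallel extraction
--ignoreExtractionFailure: Continue when an extraction fails
--clean: Clean the target files before extraction
--incremental: Export only data that is new since the last extraction
--includeSchemas <value>: Domains to include
--excludeSchemas <value>: Domains to exclude
--includeTables <value>: Tables to include
--excludeTables <value>: Tables to exclude
--reportFormat <value>: Report output format: console, json, or html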
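
The options above can be assembled programmatically when the command is launched from a script. The following is a minimal sketch, not part of Starlake: the helper name build_extract_args and its parameters are hypothetical. It only uses flags listed above, and it joins table names with commas as the "Extract Specific Tables" example does.

```python
from typing import List, Optional, Sequence

# Report formats listed for --reportFormat in the options above.
REPORT_FORMATS = ("console", "json", "html")


def build_extract_args(
    config: str,
    output_dir: Optional[str] = None,
    tables: Sequence[str] = (),
    limit: Optional[int] = None,
    incremental: bool = False,
    clean: bool = False,
    report_format: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build the argument list for `starlake extract`."""
    args = ["starlake", "extract", "--config", config]
    if output_dir is not None:
        args += ["--outputDir", output_dir]
    if tables:
        # Comma-separated, e.g. sales.orders,sales.customers
        args += ["--tables", ",".join(tables)]
    if limit is not None:
        args += ["--limit", str(limit)]
    if incremental:
        args.append("--incremental")
    if clean:
        args.append("--clean")
    if report_format is not None:
        if report_format not in REPORT_FORMATS:
            raise ValueError(f"unsupported report format: {report_format}")
        args += ["--reportFormat", report_format]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Starlake installed.
    print(" ".join(build_extract_args("source_db", output_dir="/tmp/output", incremental=True)))
```

The resulting list can be passed to subprocess.run on a machine where the starlake executable is on the PATH.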

Configuration Context

The extract command uses a configuration file (metadata/extract/{name}.sl.yml) to define the schemas and tables to extract:

# metadata/extract/externals.sl.yml
version: 1
extract:
  connectionRef: "duckdb"
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "*"              # "*" extracts all tables
      tableTypes:
        - "TABLE"

Advanced Extraction Configuration

# metadata/extract/source_db.sl.yml
version: 1
extract:
  connectionRef: "source_postgres"
  jdbcSchemas:
    - schema: "sales"
      tableTypes:
        - "TABLE"
        - "VIEW"
      tables:
        - name: "orders"
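          fullExport: false          # Incremental extraction
          partitionColumn: "id"      # Column used for parallel extraction
          numPartitions: 4           # Parallelism level
          timestamp: "updated_at"    # Column that tracks incremental changes
          fetchSize: 1000            # JDBC fetch size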
        - name: "customers"
          fullExport: true
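
To give an intuition for what partitionColumn and numPartitions imply, the sketch below splits a numeric key range into contiguous slices that could each be read by a separate query. This is a generic range-splitting illustration under stated assumptions, not Starlake's actual implementation: the function names are hypothetical, and the minimum and maximum ids are assumed to be known in advance.

```python
from typing import List, Tuple


def partition_bounds(min_id: int, max_id: int, num_partitions: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into contiguous inclusive slices."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    parts = min(num_partitions, total)  # never create empty slices
    base, extra = divmod(total, parts)
    bounds = []
    start = min_id
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # the first `extra` slices get one more id
        end = start + size - 1
        bounds.append((start, end))
        start = end + 1
    return bounds


def partition_queries(table: str, column: str, bounds: List[Tuple[int, int]]) -> List[str]:
    """Build one illustrative SELECT per slice (identifiers are assumed to be trusted)."""
    return [
        f"SELECT * FROM {table} WHERE {column} BETWEEN {lo} AND {hi}"
        for lo, hi in bounds
    ]


if __name__ == "__main__":
    # Mirrors the orders table above: partitionColumn "id", numPartitions 4.
    for query in partition_queries("sales.orders", "id", partition_bounds(1, 100, 4)):
        print(query)
```

Each slice covers a disjoint set of ids, so the queries can run concurrently without reading the same row twice.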

Connection Configuration

The connection referenced in the extract configuration must be defined in application.sl.yml:

# metadata/application.sl.yml
version: 1
application:
  connections:
    source_postgres:
      type: jdbc
      options:
        url: "jdbc:postgresql://{{PG_HOST}}:{{PG_PORT}}/{{PG_DB}}"
        driver: "org.postgresql.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
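
The {{...}} placeholders in the connection options are filled in with concrete values at run time. The sketch below shows what that substitution looks like for the JDBC URL above. It is an illustration only: the helper resolve_placeholders is hypothetical, and the values (localhost, 5432, sales) are made-up examples, not defaults from Starlake.

```python
import re
from typing import Mapping

# Matches {{NAME}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_placeholders(template: str, values: Mapping[str, str]) -> str:
    """Replace every {{NAME}} in template with values[NAME]; fail on missing names."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name}")
        return values[name]

    return _PLACEHOLDER.sub(replace, template)


if __name__ == "__main__":
    url = "jdbc:postgresql://{{PG_HOST}}:{{PG_PORT}}/{{PG_DB}}"
    # Example values only; in practice they could come from os.environ.
    print(resolve_placeholders(url, {"PG_HOST": "localhost", "PG_PORT": "5432", "PG_DB": "sales"}))
```

Failing on a missing name, rather than leaving the placeholder in place, surfaces configuration mistakes before a connection attempt is made.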

Examples

Extract Schema and Data

starlake extract --config externals --outputDir metadata/load

Incremental Extraction

starlake extract --config source_db --outputDir /tmp/output --incremental

Extract Specific Tables

starlake extract --config source_db --tables sales.orders,sales.customers
