name: 数据质量分析器 description: 分析数据资产以评估质量维度、检测异常,并生成包含可操作建议的全面数据质量报告。 version: 1.0.0 category: 数据质量 skill-id: SK-DEA-005 allowed-tools: Read, Grep, Glob, Bash, WebFetch
数据质量分析器
分析数据资产以评估六个核心数据质量维度并检测异常。
概述
此技能执行全面的数据剖析,以评估完整性、准确性、一致性、有效性、及时性和唯一性。它生成统计概况、检测异常、识别个人身份信息(PII),并提供改进数据质量的可操作建议。
能力
- 统计剖析 - 分布、基数、空值百分比、最小/最大值
- 数据类型推断和验证 - 检测实际类型与声明类型
- 模式检测 - 正则表达式模式、格式、常见结构
- 异常检测 - 离群值、漂移、意外值
- 参照完整性检查 - 外键验证
- 新鲜度监控 - 数据年龄和更新频率
- 数量趋势分析 - 随时间变化的记录计数模式
- 模式变更检测 - 运行之间的结构变化
- 跨列相关性分析 - 识别依赖列
- PII检测和分类 - 敏感数据识别
输入模式
{
"dataSource": {
"type": "object",
"required": true,
"properties": {
"type": {
"type": "string",
"enum": ["table", "file", "query"],
"description": "数据源类型"
},
"connection": {
"type": "object",
"description": "连接详情(平台、数据库、模式)"
},
"identifier": {
"type": "string",
"description": "表名、文件路径或查询字符串"
}
}
},
"sampleSize": {
"type": "number",
"description": "要采样的行数(null表示全扫描)",
"default": 10000
},
"dimensions": {
"type": "array",
"items": {
"type": "string",
"enum": ["accuracy", "completeness", "consistency", "validity", "timeliness", "uniqueness"]
},
"default": ["completeness", "validity", "uniqueness"],
"description": "要评估的质量维度"
},
"previousProfile": {
"type": "object",
"description": "用于漂移检测的先前剖析结果"
},
"businessRules": {
"type": "array",
"items": {
"column": "string",
"rule": "string",
"threshold": "number"
},
"description": "要验证的自定义业务规则"
},
"piiDetection": {
"type": "boolean",
"default": true,
"description": "启用PII检测和分类"
}
}
输出模式
{
"profile": {
"type": "object",
"properties": {
"tableName": "string",
"rowCount": "number",
"columnCount": "number",
"profileTimestamp": "string",
"columns": {
"type": "array",
"items": {
"name": "string",
"declaredType": "string",
"inferredType": "string",
"statistics": {
"nullCount": "number",
"nullPercent": "number",
"distinctCount": "number",
"distinctPercent": "number",
"min": "any",
"max": "any",
"mean": "number",
"median": "number",
"stddev": "number",
"histogram": "array"
},
"patterns": {
"mostCommon": "array",
"detectedFormat": "string",
"regexPattern": "string"
},
"qualityScores": {
"completeness": "number",
"validity": "number",
"uniqueness": "number"
}
}
}
}
},
"anomalies": {
"type": "array",
"items": {
"column": "string",
"type": "outlier|drift|unexpected_null|unexpected_value|format_violation",
"severity": "high|medium|low",
"description": "string",
"examples": "array",
"recommendation": "string"
}
},
"piiFindings": {
"type": "array",
"items": {
"column": "string",
"piiType": "email|phone|ssn|credit_card|name|address|ip|custom",
"confidence": "number",
"sampleCount": "number",
"recommendation": "string"
}
},
"overallScore": {
"type": "number",
"description": "加权质量分数(0-100)"
},
"dimensionScores": {
"completeness": "number",
"accuracy": "number",
"consistency": "number",
"validity": "number",
"timeliness": "number",
"uniqueness": "number"
},
"recommendations": {
"type": "array",
"items": {
"priority": "high|medium|low",
"category": "string",
"description": "string",
"impact": "string"
}
},
"drift": {
"type": "object",
"description": "与先前剖析结果相比的变化",
"properties": {
"schemaChanges": "array",
"statisticalDrift": "array",
"volumeChange": "object"
}
}
}
使用示例
基本表剖析
{
"dataSource": {
"type": "table",
"connection": {
"platform": "snowflake",
"database": "analytics",
"schema": "core"
},
"identifier": "dim_customers"
},
"dimensions": ["completeness", "validity", "uniqueness"]
}
带PII检测的文件剖析
{
"dataSource": {
"type": "file",
"identifier": "./data/customer_export.csv"
},
"sampleSize": 50000,
"piiDetection": true,
"dimensions": ["completeness", "validity", "accuracy"]
}
带业务规则的基于查询的剖析
{
"dataSource": {
"type": "query",
"connection": {
"platform": "bigquery",
"project": "my-project"
},
"identifier": "SELECT * FROM orders WHERE order_date >= '2024-01-01'"
},
"businessRules": [
{"column": "order_total", "rule": "positive", "threshold": 0},
{"column": "status", "rule": "in_set", "values": ["pending", "completed", "cancelled"]},
{"column": "customer_id", "rule": "not_null", "threshold": 100}
]
}
漂移检测
{
"dataSource": {
"type": "table",
"identifier": "fact_sales"
},
"previousProfile": {
"profileTimestamp": "2024-01-01T00:00:00Z",
"rowCount": 1000000,
"columns": [...]
},
"dimensions": ["consistency", "timeliness"]
}
质量维度
完整性 (0-100)
衡量所需数据的存在情况:
| 指标 | 计算方式 |
|---|---|
| 列完整性 | (总数 - 空值) / 总数 * 100 |
| 行完整性 | 包含所有必填字段的行数 / 总行数 * 100 |
| 整体 | 跨列的加权平均值 |
有效性 (0-100)
衡量符合业务规则的程度:
| 检查类型 | 示例 |
|---|---|
| 类型一致性 | INT列中的字符串 |
| 格式一致性 | 无效的电子邮件格式 |
| 范围一致性 | 年龄 > 150 |
| 参照性 | 没有匹配主键的外键 |
唯一性 (0-100)
衡量重复性和基数:
| 指标 | 计算方式 |
|---|---|
| 不同值比率 | 不同值 / 总数 * 100 |
| 重复计数 | 总数 - 不同值 |
| 主键唯一性 | 唯一主键 / 总数 * 100 |
准确性 (0-100)
衡量相对于真实值的正确性:
- 需要参考数据或业务规则
- 交叉验证相关列
- 检查数学关系
一致性 (0-100)
衡量数据集内的统一性:
- 格式一致性(日期、数字)
- 分类值一致性
- 跨列逻辑一致性
及时性 (0-100)
衡量数据新鲜度:
| 指标 | 阈值 |
|---|---|
| 数据年龄 | 自上次更新以来的小时数 |
| 新鲜度SLA | 满足新鲜度要求的百分比 |
| 延迟检测 | 处理延迟测量 |
PII检测类别
| 类型 | 模式示例 |
|---|---|
| 电子邮件 | xxx@domain.com |
| 电话 | (XXX) XXX-XXXX, +1-XXX-XXX-XXXX |
| 社会安全号码 | XXX-XX-XXXX |
| 信用卡 | XXXX-XXXX-XXXX-XXXX (带Luhn检查) |
| 姓名 | 名/姓模式 |
| 地址 | 街道、城市、州、邮编模式 |
| IP地址 | IPv4和IPv6 |
集成点
MCP服务器集成
- Elementary MCP - 数据可观测性和异常检测
- 数据库MCPs - 直接剖析查询
相关技能
- Great Expectations生成器 (SK-DEA-006) - 从剖析结果生成期望套件
- 数据目录丰富器 (SK-DEA-017) - 用剖析元数据丰富目录
适用流程
- 数据质量框架 (
data-quality-framework.js) - 数据目录 (
data-catalog.js) - ETL/ELT管道 (
etl-elt-pipeline.js) - A/B测试管道 (
ab-testing-pipeline.js)
参考文献
版本历史
- 1.0.0 - 首次发布,包含六个维度剖析和PII检测