name: distribution-fitter
description: Statistical distribution fitting skill for input modeling in simulation and analysis.
allowed-tools: Bash(*) Read Write Edit Glob Grep WebFetch
metadata:
  author: babysitter-sdk
  version: "1.0.0"
  category: simulation
  backlog-id: SK-IE-006
distribution-fitter
You are distribution-fitter, a specialized skill for fitting statistical distributions to data for input modeling in simulation and analysis.
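Overview
This skill supports AI-driven distribution fitting, including:
- Goodness-of-fit tests (chi-square, K-S, Anderson-Darling)
- Maximum likelihood estimation
- Distribution parameter estimation
- Interarrival time analysis
- Service time distribution fitting
- Empirical distribution construction
- Distribution comparison and selection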
Prerequisites
- Python 3.8+ with scipy and fitter installed
- Statistical analysis libraries
- An understanding of probability distributions
Capabilities
1. Automated Distribution Fitting
from fitter import Fitter
import numpy as np

def fit_distribution(data, distributions=None):
    """
    Fit multiple distributions and select the best fit.
    """
    if distributions is None:
        distributions = ['norm', 'expon', 'gamma', 'lognorm',
                         'weibull_min', 'beta', 'uniform', 'triang']
    f = Fitter(data, distributions=distributions)
    f.fit()
    # Summary table of the top-ranked fits
    summary = f.summary()
    # Best distribution by sum of squared errors
    best = f.get_best(method='sumsquare_error')
    best_name = list(best.keys())[0]
    return {
        "best_distribution": best_name,
        "parameters": best[best_name],
        "summary": summary.to_dict(),
        "all_fits": f.fitted_param
    }
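2. Goodness-of-Fit Tests
from scipy import stats
import numpy as np

def goodness_of_fit_tests(data, distribution, params):
    """
    Run several goodness-of-fit tests against a fitted distribution.
    """
    results = {}
    # Kolmogorov-Smirnov test (the p-value is optimistic when params
    # were estimated from the same data)
    ks_stat, ks_pvalue = stats.kstest(data, distribution, args=params)
    results['kolmogorov_smirnov'] = {
        'statistic': ks_stat,
        'p_value': ks_pvalue,
        'conclusion': 'fail to reject' if ks_pvalue > 0.05 else 'reject'
    }
    # Chi-square test
    observed, bins = np.histogram(data, bins='auto')
    dist = getattr(stats, distribution)
    expected = len(data) * np.diff(dist.cdf(bins, *params))
    # Keep only bins with an expected count of at least 5
    # (low-count bins are dropped here, not merged)
    mask = expected >= 5
    obs, exp = observed[mask], expected[mask]
    # chisquare requires matching totals, so rescale the expected counts
    exp = exp * obs.sum() / exp.sum()
    # ddof removes one degree of freedom per estimated parameter
    chi2_stat, chi2_pvalue = stats.chisquare(obs, exp, ddof=len(params))
    results['chi_square'] = {
        'statistic': chi2_stat,
        'p_value': chi2_pvalue,
        'degrees_of_freedom': int(mask.sum()) - len(params) - 1
    }
    # Anderson-Darling test (SciPy supports only certain distributions)
    if distribution in ['norm', 'expon', 'logistic', 'gumbel_l', 'gumbel_r']:
        ad_result = stats.anderson(data, dist=distribution)
        results['anderson_darling'] = {
            'statistic': ad_result.statistic,
            # Significance levels differ by distribution, so read them from the result
            'critical_values': dict(zip(
                [f'{level}%' for level in ad_result.significance_level],
                ad_result.critical_values
            ))
        }
    return results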
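The K-S p-value reported by `stats.kstest` assumes the distribution parameters were fixed in advance. When they are estimated from the same data, one common workaround is a parametric bootstrap. The following is a minimal sketch of that idea, not part of the skill itself; the function name `ks_bootstrap_pvalue` and its arguments are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def ks_bootstrap_pvalue(data, dist_name, n_boot=200, seed=0, **fit_kwargs):
    """Hypothetical helper: parametric-bootstrap p-value for a K-S test
    whose parameters were fitted to the data being tested."""
    rng = np.random.default_rng(seed)
    dist = getattr(stats, dist_name)
    data = np.asarray(data, dtype=float)

    # K-S statistic of the observed data against its own fitted distribution
    params = dist.fit(data, **fit_kwargs)
    observed_stat = stats.kstest(data, dist_name, args=params).statistic

    # Simulate from the fitted model, refit each sample, recompute the statistic
    exceed = 0
    for _ in range(n_boot):
        sim = dist.rvs(*params, size=len(data), random_state=rng)
        sim_params = dist.fit(sim, **fit_kwargs)
        sim_stat = stats.kstest(sim, dist_name, args=sim_params).statistic
        if sim_stat >= observed_stat:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive
    return {
        "statistic": float(observed_stat),
        "p_value": (exceed + 1) / (n_boot + 1),
    }

# Synthetic example data (an assumption for illustration only)
rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=300)
result = ks_bootstrap_pvalue(sample, "expon", floc=0)
```

Because every bootstrap sample is refitted, the reference distribution of the statistic accounts for the estimation step. The cost is `n_boot` extra fits, which may be slow for distributions without a closed-form fit.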
3. Maximum Likelihood Estimation
import numpy as np
from scipy import stats
from scipy.optimize import approx_fprime, minimize

def mle_fit(data, distribution):
    """
    Fit a distribution by maximum likelihood.
    """
    dist = getattr(stats, distribution)
    # Parameter bounds (get_parameter_bounds is assumed to be defined elsewhere)
    bounds = get_parameter_bounds(distribution)
    # Negative log-likelihood
    def neg_log_likelihood(params):
        return -np.sum(dist.logpdf(data, *params))
    # Initial guess (get_initial_params is assumed to be defined elsewhere)
    x0 = get_initial_params(data, distribution)
    # Optimize
    result = minimize(neg_log_likelihood, x0, bounds=bounds,
                      method='L-BFGS-B')
    # Standard errors from a finite-difference Hessian of the negative log-likelihood
    hessian = np.zeros((len(result.x), len(result.x)))
    epsilon = 1e-5
    for i in range(len(result.x)):
        hessian[i] = approx_fprime(
            result.x,
            lambda p: approx_fprime(p, neg_log_likelihood, epsilon)[i],
            epsilon)
    se = np.sqrt(np.diag(np.linalg.inv(hessian)))
    return {
        "distribution": distribution,
        "parameters": result.x.tolist(),
        "standard_errors": se.tolist(),
        "log_likelihood": -result.fun,
        "aic": 2 * len(result.x) + 2 * result.fun,
        "bic": len(result.x) * np.log(len(data)) + 2 * result.fun
    }
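`mle_fit` calls `get_parameter_bounds` and `get_initial_params` without defining them. The sketch below shows one way these two helpers might look. Both bodies are assumptions made for illustration: the initial guess simply reuses SciPy's own `fit`, and the bounds assume that every shape parameter and the scale must be positive, which holds for gamma, lognorm, weibull_min, and beta but not for every SciPy distribution.

```python
import numpy as np
from scipy import stats

def get_initial_params(data, distribution):
    """Hypothetical helper: use SciPy's built-in fit as the starting point."""
    dist = getattr(stats, distribution)
    return np.array(dist.fit(data), dtype=float)

def get_parameter_bounds(distribution):
    """Hypothetical helper: (low, high) bounds per parameter in SciPy's
    order, i.e. shape parameters first, then loc, then scale."""
    dist = getattr(stats, distribution)
    eps = 1e-8
    # Assumption: all shape parameters are strictly positive
    shape_bounds = [(eps, None)] * dist.numargs
    # loc is unbounded, scale must be strictly positive
    return shape_bounds + [(None, None), (eps, None)]

# Synthetic example data (an assumption for illustration only)
rng = np.random.default_rng(0)
x0 = get_initial_params(rng.gamma(2.0, 1.5, size=200), "gamma")
bounds = get_parameter_bounds("gamma")
```

Starting the optimizer at SciPy's own estimate means `minimize` mostly serves to confirm the optimum and to provide a point for the Hessian. A method-of-moments guess would be a lighter alternative.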
4. Interarrival Time Analysis
import numpy as np
from scipy import stats

def analyze_interarrival_times(timestamps):
    """
    Analyze interarrival times from timestamp data.
    Relies on fit_distribution from section 1.
    """
    # Compute interarrival times (timestamps are sorted first)
    timestamps = np.sort(np.array(timestamps))
    interarrivals = np.diff(timestamps)
    # Basic statistics
    stats_summary = {
        "count": len(interarrivals),
        "mean": np.mean(interarrivals),
        "std": np.std(interarrivals),
        "cv": np.std(interarrivals) / np.mean(interarrivals),
        "min": np.min(interarrivals),
        "max": np.max(interarrivals),
        "median": np.median(interarrivals)
    }
    # Fit an exponential distribution (Poisson process check)
    exp_params = stats.expon.fit(interarrivals, floc=0)
    ks_stat, ks_pvalue = stats.kstest(interarrivals, 'expon', args=exp_params)
    # Heuristic: K-S does not reject and the CV is close to 1
    is_poisson = bool(ks_pvalue > 0.05 and 0.8 < stats_summary['cv'] < 1.2)
    # Fit other distributions
    fit_result = fit_distribution(interarrivals)
    return {
        "statistics": stats_summary,
        "poisson_process_test": {
            "ks_statistic": ks_stat,
            "p_value": ks_pvalue,
            "cv_test": stats_summary['cv'],
            "is_poisson": is_poisson
        },
        "best_fit": fit_result,
        "arrival_rate": 1 / stats_summary['mean']
    }
5. Empirical Distribution
import numpy as np

class EmpiricalDistribution:
    """
    Build an empirical distribution from data.
    """
    def __init__(self, data):
        self.data = np.sort(data)
        self.n = len(data)
        self.ecdf = np.arange(1, self.n + 1) / self.n

    def cdf(self, x):
        """Cumulative distribution function."""
        return np.searchsorted(self.data, x, side='right') / self.n

    def ppf(self, q):
        """Percent point function (inverse CDF)."""
        # Smallest order statistic whose ECDF value is at least q
        idx = int(np.ceil(q * self.n)) - 1
        return self.data[min(max(idx, 0), self.n - 1)]

    def sample(self, size=1):
        """Generate random samples by inverse transform."""
        u = np.random.uniform(0, 1, size)
        return np.array([self.ppf(ui) for ui in u])

    def to_dict(self):
        """Export for storage."""
        return {
            "type": "empirical",
            "values": self.data.tolist(),
            "probabilities": self.ecdf.tolist()
        }
6. Distribution Comparison
import numpy as np
from scipy import stats

def compare_distributions(data, candidates):
    """
    Compare multiple distribution fits.
    """
    results = []
    for dist_name in candidates:
        try:
            dist = getattr(stats, dist_name)
            params = dist.fit(data)
            # Log-likelihood
            ll = np.sum(dist.logpdf(data, *params))
            # Information criteria
            k = len(params)
            n = len(data)
            aic = 2 * k - 2 * ll
            bic = k * np.log(n) - 2 * ll
            # K-S test
            ks_stat, ks_pvalue = stats.kstest(data, dist_name, args=params)
            results.append({
                "distribution": dist_name,
                "parameters": params,
                "log_likelihood": ll,
                "aic": aic,
                "bic": bic,
                "ks_statistic": ks_stat,
                "ks_pvalue": ks_pvalue
            })
        except Exception:
            # Skip candidates that fail to fit
            continue
    if not results:
        raise ValueError("No candidate distribution could be fitted")
    # Sort by AIC (lower is better)
    results.sort(key=lambda x: x['aic'])
    return {
        "rankings": results,
        "best_by_aic": results[0]['distribution'],
        "best_by_bic": min(results, key=lambda x: x['bic'])['distribution']
    }
Process Integration
This skill integrates with the following processes:
- discrete-event-simulation-modeling.js
- queuing-system-analysis.js
- demand-forecasting-model-development.js
Output Format
{
  "data_summary": {
    "n": 500,
    "mean": 5.2,
    "std": 2.1,
    "cv": 0.40
  },
  "best_fit": {
    "distribution": "gamma",
    "parameters": {"shape": 6.1, "scale": 0.85},
    "goodness_of_fit": {
      "ks_statistic": 0.032,
      "ks_pvalue": 0.67,
      "aic": 1523.4
    }
  },
  "alternative_fits": [
    {"distribution": "lognorm", "aic": 1528.1},
    {"distribution": "weibull_min", "aic": 1531.2}
  ],
  "recommendation": "Use gamma(6.1, 0.85) as the simulation input"
}
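As a rough illustration of how a report in this shape might be assembled from SciPy fits, consider the sketch below. The function name `build_report`, the choice to fix the location at 0, and the synthetic data are all assumptions; the skill's actual reporting code is not shown in this document.

```python
import json
import numpy as np
from scipy import stats

def build_report(data, best="gamma", alternatives=("lognorm", "weibull_min")):
    """Hypothetical helper: assemble a report dict in the output format."""
    data = np.asarray(data, dtype=float)

    def fit_aic(name):
        dist = getattr(stats, name)
        params = dist.fit(data, floc=0)      # assumption: location fixed at 0
        ll = np.sum(dist.logpdf(data, *params))
        k = len(params) - 1                  # loc was fixed, so it is not counted
        return params, float(2 * k - 2 * ll)

    params, aic = fit_aic(best)
    ks = stats.kstest(data, best, args=params)
    mean, std = float(np.mean(data)), float(np.std(data, ddof=1))
    shape, scale = float(params[0]), float(params[-1])
    return {
        "data_summary": {"n": int(data.size), "mean": mean,
                         "std": std, "cv": std / mean},
        "best_fit": {
            "distribution": best,
            "parameters": {"shape": shape, "scale": scale},
            "goodness_of_fit": {"ks_statistic": float(ks.statistic),
                                "ks_pvalue": float(ks.pvalue),
                                "aic": aic},
        },
        "alternative_fits": [{"distribution": name, "aic": fit_aic(name)[1]}
                             for name in alternatives],
        "recommendation": f"Use {best}({shape:.2f}, {scale:.2f}) as the simulation input",
    }

# Synthetic positive-valued data (an assumption for illustration only)
rng = np.random.default_rng(7)
report = build_report(rng.gamma(shape=6.0, scale=0.85, size=500))
report_json = json.dumps(report, indent=2)
```

All values are converted to plain Python floats and ints before serialization, since `json.dumps` does not accept NumPy scalar types directly.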
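Tools/Libraries
| Library | Description | Use Case |
|---|---|---|
| scipy.stats | Statistical functions | Core fitting |
| fitter | Automated fitting | Quick analysis |
| statsmodels | Advanced statistics | Detailed tests |
| R fitdistrplus | R package | Complex fitting |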
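Best Practices
- Visualize first: always plot histograms and Q-Q plots
- Consider theory: choose distributions based on the underlying process
- Test multiple: compare several candidate distributions
- Check the tails: extreme values matter for simulation
- Document the choice: record the rationale for the selected distribution
- Update regularly: refit as new data becomes available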
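For the "visualize first" practice, the coordinates of a Q-Q plot can be computed with `scipy.stats.probplot` before handing them to any plotting library. This is a minimal sketch on synthetic data; the variable names and the gamma candidate are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for observed service times (illustrative assumption)
rng = np.random.default_rng(1)
service_times = rng.gamma(shape=6.0, scale=0.85, size=500)

# Fit the candidate distribution, then compute Q-Q coordinates against it
shape, loc, scale = stats.gamma.fit(service_times, floc=0)
(theoretical_q, ordered_data), (slope, intercept, r) = stats.probplot(
    service_times, dist=stats.gamma, sparams=(shape, loc, scale)
)

# Points near the line y = x (slope close to 1, intercept close to 0)
# suggest the candidate describes the data well; r is the correlation
# between theoretical and observed quantiles.
qq_points = np.column_stack([theoretical_q, ordered_data])
```

Systematic curvature at either end of the plotted points is usually the first sign that the tails are poorly matched, which ties in with the "check the tails" practice.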
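Constraints
- Report goodness-of-fit statistics, not just parameters
- Document data collection methods
- Consider censored or truncated data
- Test for time-varying parameters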
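To illustrate why censoring matters, the sketch below estimates an exponential rate from right-censored durations, for example service times cut off when observation ended. For the exponential distribution the maximum likelihood estimate has a closed form: the number of fully observed events divided by the total observed time. The function name and the synthetic data are assumptions for illustration.

```python
import numpy as np

def exponential_rate_right_censored(durations, observed):
    """Hypothetical helper: MLE of the exponential rate when some
    durations are right-censored (observed[i] is False for those)."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    n_events = int(observed.sum())
    if n_events == 0:
        raise ValueError("At least one uncensored observation is required")
    # Censored durations still contribute exposure time to the denominator
    return n_events / durations.sum()

# Synthetic example: true rate 0.5, observation stops at t = 3.0
rng = np.random.default_rng(3)
true_times = rng.exponential(scale=2.0, size=2000)
cutoff = 3.0
recorded = np.minimum(true_times, cutoff)
was_observed = true_times <= cutoff

rate_censored_aware = exponential_rate_right_censored(recorded, was_observed)
# Treating censored values as complete observations inflates the rate
rate_naive = 1.0 / recorded.mean()
```

The naive estimate is always at least as large as the censoring-aware one, because it counts every censored record as a completed event.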