Name: CorrelationAnalysisSkill
Rating: 5 (4 reviews)
Author: aj

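Overview

Correlation analysis measures the strength and direction of relationships between variables. It helps identify which features move together and detect multicollinearity.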

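When to use

Quantifying relationships between numeric variables
Detecting multicollinearity before regression modeling
Exploratory data analysis to understand feature dependencies
Feature selection and dimensionality reduction
Testing hypotheses about relationships between variables
Comparing linear and non-linear relationships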

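Core concepts

Correlation coefficient: ranges from -1 to +1
Positive correlation: variables move in the same direction
Negative correlation: variables move in opposite directions
Multicollinearity: high correlation among predictor variables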
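
To make the coefficient's range and sign concrete, here is a minimal sketch that computes Pearson's r directly from its definition (covariance divided by the product of the standard deviations). The helper name `pearson_r` and the toy arrays are illustrative assumptions, not part of the skill itself.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: co-deviation of x and y, scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_pos = pearson_r(x, 2 * x + 1)    # y rises with x: r is at the +1 end
r_neg = pearson_r(x, -3 * x + 10)  # y falls as x rises: r is at the -1 end
print(f"increasing line: r={r_pos:.4f}, decreasing line: r={r_neg:.4f}")
```

Any exact straight-line relationship lands at one end of the range regardless of slope; noise pulls the value toward 0.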

Python implementation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr, kendalltau

# Sample data
np.random.seed(42)
n = 200
age = np.random.uniform(20, 70, n)
income = age * 2000 + np.random.normal(0, 10000, n)
education_years = age / 2 + np.random.normal(0, 3, n)
satisfaction = income / 50000 + np.random.normal(0, 0.5, n)

df = pd.DataFrame({
    'age': age,
    'income': income,
    'education_years': education_years,
    'satisfaction': satisfaction,
    'years_employed': age - education_years - 6
})

# Pearson correlation (linear)
corr_matrix = df.corr(method='pearson')
print("Pearson correlation matrix:")
print(corr_matrix)

# Single pairwise correlation with its p-value
corr_coef, p_value = pearsonr(df['age'], df['income'])
print(f"\nPearson correlation (age vs income): r={corr_coef:.4f}, p-value={p_value:.4f}")

# Spearman correlation (rank-based)
spearman_matrix = df.corr(method='spearman')
print("\nSpearman correlation matrix:")
print(spearman_matrix)

spearman_coef, p_value = spearmanr(df['age'], df['income'])
print(f"Spearman correlation (age vs income): rho={spearman_coef:.4f}, p-value={p_value:.4f}")

# Kendall's tau correlation
kendall_coef, p_value = kendalltau(df['age'], df['income'])
print(f"Kendall correlation (age vs income): tau={kendall_coef:.4f}, p-value={p_value:.4f}")

# Correlation heatmaps
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pearson heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, ax=axes[0], vmin=-1, vmax=1)
axes[0].set_title('Pearson correlation heatmap')

# Spearman heatmap
sns.heatmap(spearman_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, ax=axes[1], vmin=-1, vmax=1)
axes[1].set_title('Spearman correlation heatmap')

plt.tight_layout()
plt.show()

# Significance test for every pair of columns

def correlation_with_pvalue(df):
    rows = []
    for col1 in df.columns:
        for col2 in df.columns:
            if col1 < col2:  # visit each unordered pair only once
                r, p = pearsonr(df[col1], df[col2])
                rows.append({
                    'variable_1': col1,
                    'variable_2': col2,
                    'correlation': r,
                    'p_value': p,
                    'significant': 'yes' if p < 0.05 else 'no'
                })
    return pd.DataFrame(rows)

corr_table = correlation_with_pvalue(df)
print("\nCorrelations with p-values:")
print(corr_table)

# Scatter plots with regression lines
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

pairs = [('age', 'income'), ('age', 'education_years'),
         ('income', 'satisfaction'), ('education_years', 'years_employed')]

for idx, (var1, var2) in enumerate(pairs):
    ax = axes[idx // 2, idx % 2]
    ax.scatter(df[var1], df[var2], alpha=0.5)

    # Add a least-squares regression line
    z = np.polyfit(df[var1], df[var2], 1)
    fit_line = np.poly1d(z)
    x_line = np.linspace(df[var1].min(), df[var1].max(), 100)
    ax.plot(x_line, fit_line(x_line), "r--", linewidth=2)

    r, p_val = pearsonr(df[var1], df[var2])
    ax.set_title(f'{var1} vs {var2}\nr={r:.4f}, p={p_val:.4f}')
    ax.set_xlabel(var1)
    ax.set_ylabel(var2)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

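# Multicollinearity detection (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF assumes the design matrix contains an intercept column, so add one.
# Note: years_employed is defined as age - education_years - 6, an exact
# linear combination, so expect extremely large or infinite VIFs here.
X = add_constant(df[['age', 'education_years', 'years_employed']])
vif_data = pd.DataFrame()
vif_data['variable'] = X.columns[1:]  # skip the intercept column
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]

print("\nVariance inflation factor (VIF):")
print(vif_data)
print("\nVIF > 10: high multicollinearity")
print("VIF > 5: moderate multicollinearity")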

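# Partial correlation (controlling for confounders)
def partial_correlation(df, x, y, control_vars):
    # Design matrix: intercept plus the control variables.
    # (np.polyfit only accepts a 1-D x, so use least squares instead.)
    Z = np.column_stack([np.ones(len(df)), df[control_vars].values])

    # Residuals of x after regressing out the control variables
    beta_x, *_ = np.linalg.lstsq(Z, df[x].values, rcond=None)
    x_residuals = df[x].values - Z @ beta_x

    # Residuals of y after regressing out the control variables
    beta_y, *_ = np.linalg.lstsq(Z, df[y].values, rcond=None)
    y_residuals = df[y].values - Z @ beta_y

    return pearsonr(x_residuals, y_residuals)[0]

partial_corr = partial_correlation(df, 'income', 'satisfaction', ['age'])
print(f"\nPartial correlation (income vs satisfaction, controlling for age): {partial_corr:.4f}")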

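# Distance correlation (captures non-linear relationships)
try:
    from dcor import distance_correlation
    dist_corr = distance_correlation(df['age'], df['income'])
    print(f"Distance correlation (age vs income): {dist_corr:.4f}")
except ImportError:
    print("dcor is not installed; skipping distance correlation")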

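# Correlation over time (rolling window).
# Only meaningful when rows are ordered in time; the sample data is not,
# so this merely illustrates the call.
fig, ax = plt.subplots(figsize=(12, 5))

rolling_corr = df['age'].rolling(window=50).corr(df['income'])
ax.plot(rolling_corr.index, rolling_corr.values)
ax.set_title('Rolling correlation (age vs income, window=50)')
ax.set_ylabel('Correlation coefficient')
ax.grid(True, alpha=0.3)
plt.show()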
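
As a complement to the statsmodels call used for the VIF step, the sketch below computes the variance inflation factor from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the remaining columns plus an intercept. The function name `vif_from_definition` and the synthetic columns are assumptions made for illustration.

```python
import numpy as np

def vif_from_definition(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)            # independent of a
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a

independent_vifs = vif_from_definition(np.column_stack([a, b]))
collinear_vifs = vif_from_definition(np.column_stack([a, b, c]))
print("independent columns:", independent_vifs)
print("with near-duplicate column:", collinear_vifs)
```

Unrelated predictors give VIFs close to 1, while the near-duplicate pair (`a`, `c`) is pushed well past the VIF > 10 threshold. An exact linear combination would drive R_j^2 to 1 and the VIF toward infinity.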

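Interpretation guide

|r| = 0.0-0.3: weak correlation
|r| = 0.3-0.7: moderate correlation
|r| = 0.7-1.0: strong correlation
p < 0.05: statistically significant
High VIF (> 10): multicollinearity problem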
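
The thresholds above can be wrapped in a small helper for labeling results. This is a sketch only: `describe_correlation` is a hypothetical name, and because the listed ranges overlap at 0.3 and 0.7, it assumes a boundary value falls into the higher band.

```python
def describe_correlation(r, p_value, alpha=0.05):
    """Label a correlation using the weak / moderate / strong bands."""
    magnitude = abs(r)
    if magnitude < 0.3:
        strength = "weak"
    elif magnitude < 0.7:  # assumption: exactly 0.3 counts as moderate
        strength = "moderate"
    else:                  # assumption: exactly 0.7 counts as strong
        strength = "strong"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    return {
        "strength": strength,
        "direction": direction,
        "significant": p_value < alpha,
    }

print(describe_correlation(0.85, 0.001))
print(describe_correlation(-0.25, 0.20))
```

Strength and significance are reported separately on purpose: with a large sample, a weak correlation can still be statistically significant.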

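Important notes

Correlation ≠ causation
Pearson can miss non-linear relationships
Outliers can distort correlations
Sample size affects significance
Time trends can produce spurious correlations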
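
Two of these caveats are easy to reproduce on synthetic data. The sketch below uses NumPy only; the helper names and the generated arrays are illustrative assumptions. It shows Pearson missing a perfect quadratic dependence, and a single extreme point manufacturing a strong Pearson correlation between otherwise unrelated variables.

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Spearman's rho is Pearson's r computed on ranks
    # (argsort-of-argsort ranking is valid here because there are no ties)
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

# 1) Non-linear relationship: y is fully determined by x, yet r is about 0
x = np.linspace(-3, 3, 61)
y = x ** 2
r_quadratic = pearson(x, y)

# 2) Outlier: two independent variables plus one extreme shared point
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
r_clean = pearson(a, b)
a_out = np.append(a, 100.0)
b_out = np.append(b, 100.0)
r_outlier = pearson(a_out, b_out)
rho_outlier = spearman(a_out, b_out)

print(f"quadratic: r={r_quadratic:.4f}")
print(f"clean: r={r_clean:.4f}; with outlier: r={r_outlier:.4f}, rho={rho_outlier:.4f}")
```

The rank-based coefficient moves far less than Pearson's r when the outlier is added, which is one reason to report both.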

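Visualization strategies

Heatmaps for an overview
Scatter plots to show individual relationships
Pair plots for multivariate analysis
Rolling correlations for time-varying relationships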

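Deliverables

Correlation matrices (Pearson, Spearman)
Annotated correlation heatmap
Statistical significance table
Scatter plots with regression lines
Multicollinearity assessment (VIF)
Partial correlation analysis
Relationship interpretation report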
