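name: scikit-learn
description: "Machine learning toolkit. Classification, regression, clustering, PCA, preprocessing, pipelines, grid search, cross-validation, random forests, and SVMs for general-purpose machine learning workflows."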

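Scikit-learn: Machine Learning in Python

Overview

Scikit-learn is one of the most widely used machine learning libraries for Python, providing simple and efficient tools for predictive data analysis. Apply this skill to classification, regression, clustering, dimensionality reduction, model selection, preprocessing, and hyperparameter optimization.

When to Use This Skill

Use this skill when:

Building classification models (spam detection, image recognition, medical diagnosis)
Building regression models (price prediction, forecasting, trend analysis)
Performing cluster analysis (customer segmentation, pattern discovery)
Reducing dimensionality (PCA, t-SNE for visualization)
Preprocessing data (scaling, encoding, imputation)
Evaluating model performance (cross-validation, metrics)
Tuning hyperparameters (grid search, randomized search)
Creating machine learning pipelines
Detecting anomalies or outliers
Implementing ensemble methods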

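Core Machine Learning Workflow

Standard ML Pipeline

Follow this general workflow for supervised learning tasks:

Data preparation
- Load and explore the data
- Split into training and test sets
- Handle missing values
- Encode categorical features
- Scale/normalize features
Model selection
- Start with a baseline model
- Try more complex models
- Use domain knowledge to guide the choice
Model training
- Fit the model on the training data
- Use pipelines to prevent data leakage
- Apply cross-validation
Model evaluation
- Evaluate on the test set
- Use appropriate metrics
- Analyze errors
Model optimization
- Tune hyperparameters
- Engineer features
- Use ensemble methods
Deployment
- Save the model with joblib
- Create a prediction pipeline
- Monitor performance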
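
The "start with a baseline model" step can be made concrete with a trivial reference model. The sketch below is illustrative only: the synthetic data from make_classification is an assumption standing in for a real dataset, and the choice of DummyClassifier as the baseline is one reasonable option rather than a prescribed one.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption for illustration only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Trivial baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X, y, cv=5).mean()

# First real model: scaled logistic regression inside a pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"baseline: {baseline_acc:.3f}, logistic regression: {model_acc:.3f}")
```

Any later, more complex model is only worth its cost if it clearly beats both numbers.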

Classification Quick Start

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# X, y: feature matrix and labels, assumed to be loaded already

# Create a pipeline (prevents data leakage)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split the data (stratify to preserve class proportions with imbalanced classes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Cross-validate on the training set for a more robust estimate
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

Regression Quick Start

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
# root_mean_squared_error requires scikit-learn >= 1.4; it replaces the
# removed mean_squared_error(..., squared=False) form
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

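Algorithm Selection Guide

Classification Algorithms

Start with a baseline: LogisticRegression

Fast and interpretable; works well on linearly separable data
Suited to high-dimensional data (text classification)

General purpose: RandomForestClassifier

Handles nonlinear relationships
Robust to outliers
Provides feature importances
A good default choice

Best performance: HistGradientBoostingClassifier

Gradient boosting is often among the strongest methods for tabular data
Fast on large datasets (>10K samples)
Belongs to the histogram-based gradient boosting family that is popular in Kaggle competitions

Special cases:

Small datasets (<1K): SVC with an RBF kernel
Very large datasets (>100K): SGDClassifier or LinearSVC
Interpretability is critical: LogisticRegression or DecisionTreeClassifier
Probability estimates: LogisticRegression or a model wrapped in CalibratedClassifierCV (GaussianNB probabilities are often poorly calibrated)
Text classification: LogisticRegression with TfidfVectorizer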

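Regression Algorithms

Start with a baseline: LinearRegression or Ridge

Fast and interpretable
Works well when the relationship is linear

General purpose: RandomForestRegressor

Handles nonlinear relationships
Robust to outliers
A good default choice

Best performance: HistGradientBoostingRegressor

Often among the strongest methods for tabular data
Fast on large datasets

Special cases:

Regularization needed: Ridge (L2) or Lasso (L1, which also performs feature selection)
Very large datasets: SGDRegressor
Outliers present: HuberRegressor or RANSACRegressor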
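
To illustrate the outlier case, the sketch below fits ordinary least squares and HuberRegressor to the same line after corrupting a few targets. The data is synthetic and the true slope of 2.0 is an assumption chosen for the example.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)

# Synthetic line y = 2x + 1 with mild noise (assumption for illustration)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=50)

# Corrupt the last five targets with large positive errors
y[-5:] += 50.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=1000).fit(X, y)

ols_slope = ols.coef_[0]
huber_slope = huber.coef_[0]
print(f"true slope: 2.0, OLS: {ols_slope:.2f}, Huber: {huber_slope:.2f}")
```

The squared loss lets the corrupted points pull the OLS slope upward, while the Huber loss grows only linearly for large residuals and so limits their influence.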

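Clustering Algorithms

Known number of clusters: KMeans

Fast and scalable
Assumes roughly spherical clusters

Unknown number of clusters: DBSCAN or HDBSCAN (HDBSCAN requires scikit-learn >= 1.3)

Handles arbitrary cluster shapes
Flags outliers automatically as noise

Hierarchical relationships: AgglomerativeClustering

Builds a hierarchy of clusters
Suited to visualization (dendrograms)

Soft (probabilistic) clustering: GaussianMixture

Provides cluster membership probabilities
Handles elliptical clusters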
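
The difference between the spherical-cluster assumption of KMeans and the density-based approach of DBSCAN shows up clearly on non-convex shapes. This is a small sketch on the two-moons toy dataset; the eps and min_samples values are assumptions that happen to suit this data and would need tuning elsewhere.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-circles: clusters that are not spherical
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# KMeans is told k=2; DBSCAN infers the cluster count from density
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"KMeans ARI: {kmeans_ari:.2f}, DBSCAN ARI: {dbscan_ari:.2f}")
```

DBSCAN labels noise points as -1, so downstream code should treat that label separately from real clusters.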

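Dimensionality Reduction

Preprocessing/feature extraction: PCA

Fast and efficient
Linear transformation
Standardize first when features are on different scales

Visualization only: t-SNE or UMAP (UMAP is provided by the separate umap-learn package, not scikit-learn)

Preserves local structure
Nonlinear
Do not use for preprocessing

Sparse data (text): TruncatedSVD

Works directly on sparse matrices
Latent semantic analysis

Non-negative data: NMF

Interpretable components
Topic modeling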
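
For the sparse text case, the following sketch applies TruncatedSVD directly to a TF-IDF matrix (latent semantic analysis). The six short documents are made up for the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus (assumption for illustration only)
docs = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
    "the market rallied on earnings",
]

# TF-IDF produces a scipy sparse matrix
tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(docs)

# TruncatedSVD accepts sparse input without densifying it
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X_sparse)

print(X_sparse.shape, "->", X_lsa.shape)
```

PCA centers the data first, which would turn a sparse matrix dense; TruncatedSVD skips centering, which is why it is the usual choice here.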

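Handling Different Data Types

Numeric Features

Continuous features:

Check the distribution
Handle outliers (remove, clip, or use RobustScaler)
Scale with StandardScaler (most algorithms) or MinMaxScaler (neural networks)

Count data:

Consider a log or square-root transform
Scale after transforming

Skewed data:

Use PowerTransformer (Yeo-Johnson, or Box-Cox for strictly positive data)
Or use QuantileTransformer for stronger normalization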
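
As a quick check of the skewed-data advice, the sketch below measures skewness before and after PowerTransformer. The log-normal sample is synthetic and stands in for a real right-skewed feature.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Strongly right-skewed synthetic feature (assumption for illustration)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Yeo-Johnson also accepts zero and negative values; the output is
# standardized to zero mean and unit variance by default
pt = PowerTransformer(method="yeo-johnson")
X_t = pt.fit_transform(X)

skew_before = skew(X.ravel())
skew_after = skew(X_t.ravel())
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

In a real workflow the transformer belongs inside a Pipeline so that its parameters are estimated on training data only.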

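Categorical Features

Low cardinality (<10 categories):

from sklearn.preprocessing import OneHotEncoder
# sparse_output requires scikit-learn >= 1.2
encoder = OneHotEncoder(drop='first', sparse_output=True)

High cardinality (10 or more categories):

from sklearn.preprocessing import TargetEncoder  # scikit-learn >= 1.3
encoder = TargetEncoder()
# Encodes with target statistics; fit_transform cross-fits internally to limit leakage

Ordinal relationships:

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])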
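
To show what the explicit category order buys, this small sketch encodes a made-up size column. The category names here are example values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# One categorical column with a natural order (made-up values)
sizes = np.array([["small"], ["large"], ["medium"], ["small"]], dtype=object)

# The list order defines the integer codes: small=0, medium=1, large=2
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = encoder.fit_transform(sizes)

print(codes.ravel())
```

Without the categories argument the encoder sorts values alphabetically, which would place "large" before "medium" and lose the intended order.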

Text Data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', MultinomialNB())
])

text_pipeline.fit(X_train_text, y_train)

Mixed Data Types

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Define feature types (example column names in a pandas DataFrame)
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation']

# Separate preprocessing pipelines per feature type
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Full pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)

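Model Evaluation

Classification Metrics

Balanced datasets: use accuracy or F1 score

Imbalanced datasets: use balanced accuracy, weighted F1, or ROC-AUC

from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

balanced_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# ROC-AUC needs probabilities; y_true must be the labels for X_test
y_proba = model.predict_proba(X_test)
# Multiclass: pass the full probability matrix
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
# Binary: pass only the positive-class column instead
# auc = roc_auc_score(y_true, y_proba[:, 1])

Cost-sensitive problems: define a custom scorer or adjust the decision threshold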
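
Both cost-sensitive options can be sketched briefly. The example below assumes synthetic imbalanced data, an illustrative threshold of 0.3, and an F2 score (recall weighted above precision) as the custom metric; the right threshold and metric depend on the actual misclassification costs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data with roughly 10% positives (assumption for illustration)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_pos = clf.predict_proba(X_test)[:, 1]

# Option 1: lower the decision threshold to trade precision for recall
recalls = {}
for threshold in (0.5, 0.3):
    y_pred = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    print(f"threshold {threshold}: recall {recalls[threshold]:.3f}")

# Option 2: a custom scorer usable in cross_val_score or GridSearchCV
f2_scorer = make_scorer(fbeta_score, beta=2)
f2_cv = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, scoring=f2_scorer
)
print(f"CV F2: {f2_cv.mean():.3f}")
```

A threshold should be chosen on validation data rather than the test set, otherwise the reported test performance is optimistic.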

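Comprehensive report:

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

Regression Metrics

Standard use: RMSE and R²

# root_mean_squared_error requires scikit-learn >= 1.4; it replaces the
# removed mean_squared_error(..., squared=False) form
from sklearn.metrics import root_mean_squared_error, r2_score

rmse = root_mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

Outliers present: use MAE (less sensitive to outliers than RMSE)

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)

Percentage errors matter: use MAPE (unreliable when y_true contains values at or near zero)

from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)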

Cross-Validation

Standard approach (5 to 10 folds):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV score: {scores.mean():.3f} (+/- {scores.std():.3f})")

Imbalanced classes (use stratification):

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

Time series (respect temporal order):

from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
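
What "respecting temporal order" means in practice is easiest to see by printing the indices. This sketch uses twelve dummy observations, assumed to be already sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (dummy data for illustration)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

# Each training window ends before its test window begins
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each fold and the model is never evaluated on observations older than its training data, which a shuffled KFold would not guarantee.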

Multiple metrics:

from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f}")

Hyperparameter Tuning

Grid Search (Exhaustive)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,  # use all CPU cores
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Use the best model (refit on the full training set)
best_model = grid_search.best_estimator_
# grid_search.score applies the 'f1_weighted' scorer;
# best_model.score would report plain accuracy instead
test_score = grid_search.score(X_test, y_test)

Randomized Search (Faster)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    # uniform(loc, scale) samples from [loc, loc + scale], here [0.1, 1.0]
    'max_features': uniform(0.1, 0.9)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,  # number of sampled parameter combinations
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

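Pipeline Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# Use a double underscore to address nested parameters: <step name>__<parameter>
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__kernel': ['rbf', 'linear'],
    # gamma is ignored by the linear kernel, so those combinations are redundant fits
    'svm__gamma': ['scale', 'auto', 0.001, 0.01]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)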
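
Because gamma has no effect on a linear SVC, a single dictionary grid spends fits on duplicate linear-kernel candidates. One way to avoid this, sketched below on the bundled iris data with illustrative parameter values, is to pass GridSearchCV a list of dictionaries so each kernel gets only its relevant parameters.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

# One sub-grid per kernel; gamma appears only where it is used
param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10]},
    {"svm__kernel": ["rbf"], "svm__C": [0.1, 1, 10],
     "svm__gamma": ["scale", 0.01, 0.1]},
]

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# 3 linear candidates + 9 rbf candidates
n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

The scaler is part of the pipeline, so it is refit on each training fold during the search rather than on the full dataset.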

Feature Engineering and Selection

Feature Importance

# Tree-based models expose built-in feature importances
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# feature_names: list of column names, e.g. list(X_train.columns) for a DataFrame
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

# Permutation importance (works with any fitted model)
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean,
    'std': result.importances_std
}).sort_values('importance', ascending=False)

Feature Selection Methods

Univariate selection:

from sklearn.feature_selection import SelectKBest, f_classif

# Fit selectors on training data only; selecting on all data leaks test information
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
selected_features = selector.get_support(indices=True)

Recursive feature elimination:

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

selector = RFECV(
    RandomForestClassifier(n_estimators=100),
    step=1,
    cv=5,
    n_jobs=-1
)
X_selected = selector.fit_transform(X_train, y_train)
print(f"Optimal number of features: {selector.n_features_}")

Model-based selection:

from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold='median'  # or '0.5*mean', or a specific value
)
X_selected = selector.fit_transform(X_train, y_train)

Polynomial Features

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])

pipeline.fit(X_train, y_train)

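Common Patterns and Best Practices

Always Use Pipelines

Pipelines prevent data leakage and keep the workflow steps in the correct order:

✅ Correct:

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

❌ Wrong (data leakage):

scaler = StandardScaler().fit(X)  # fitted on all data, including the future test set!
X_train, X_test = train_test_split(scaler.transform(X))

Stratify for Imbalanced Classes

# For imbalanced classification, always stratify the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Scale When Necessary

Scaling needed: SVM, neural networks, KNN, regularized linear models, PCA, models trained with gradient descent

Scaling not needed: tree-based models (random forests, gradient boosting), naive Bayes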
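
The effect of scaling on a distance-based model can be checked directly. This sketch compares KNN with and without StandardScaler on the bundled wine dataset, whose features span very different ranges; the dataset choice is an assumption made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# KNN on raw features: large-valued features dominate the distances
raw_acc = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()

# Same model with scaling inside a pipeline (refit per fold, no leakage)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"raw: {raw_acc:.3f}, scaled: {scaled_acc:.3f}")
```

Running the same comparison with a random forest would show little difference, since tree splits do not depend on feature scale.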

Handling Missing Values

from sklearn.impute import SimpleImputer

# Numeric: use the median (robust to outliers)
imputer = SimpleImputer(strategy='median')

# Categorical: use a constant or the most frequent value
imputer = SimpleImputer(strategy='constant', fill_value='missing')
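
A tiny worked example shows why the median is the safer numeric default and how the fitted statistic carries over to new data. The arrays below are made-up values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Training column with one missing value and one extreme value (made-up data)
X_train = np.array([[1.0], [np.nan], [3.0], [100.0]])
X_test = np.array([[np.nan], [5.0]])

# The median is learned from the training data only
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)

# Missing test values are filled with the training median (3.0),
# not with a statistic recomputed from the test set
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_, X_test_filled.ravel())
```

With strategy="mean" the fill value would be about 34.7, pulled upward by the single extreme observation.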

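Use Appropriate Metrics

Balanced classification: accuracy, F1
Imbalanced classification: balanced accuracy, weighted F1, ROC-AUC
Regression with outliers: MAE rather than RMSE
Cost-sensitive problems: custom scorer

Set the Random State

# For reproducibility
model = RandomForestClassifier(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

Use Parallel Processing

# Use all CPU cores
model = RandomForestClassifier(n_jobs=-1)
grid_search = GridSearchCV(model, param_grid, n_jobs=-1)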

Unsupervised Learning

Clustering Workflow

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Scale first: KMeans is distance-based, so features with large ranges would dominate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow method and silhouette score to choose k
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    # n_init is set explicitly because its default changed across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))

# Plot and choose k
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('k')
ax1.set_ylabel('Inertia')
ax2.plot(K_range, silhouette_scores, 'ro-')
ax2.set_xlabel('k')
ax2.set_ylabel('Silhouette score')
plt.show()

# Fit the final model
optimal_k = 5  # placeholder; choose from the elbow/silhouette plots
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

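Dimensionality Reduction

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale before PCA so high-variance features do not dominate the components
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Specify the fraction of variance to keep
pca = PCA(n_components=0.95)  # keep 95% of the variance
X_pca = pca.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"Reduced features: {pca.n_components_}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.3f}")

# Visualize the cumulative explained variance
import matplotlib.pyplot as plt
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()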

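Visualization with t-SNE

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# X_scaled: standardized features; y: labels used only to color the plot

# Reduce to at most 50 dimensions with PCA first (faster)
pca = PCA(n_components=min(50, X_scaled.shape[1]))
X_pca = pca.fit_transform(X_scaled)

# Apply t-SNE (for visualization only!); perplexity must be less than the number of samples
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)

# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.colorbar()
plt.title('t-SNE visualization')
plt.show()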

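Saving and Loading Models

import joblib

# Save a model or a whole pipeline
joblib.dump(model, 'model.pkl')
joblib.dump(pipeline, 'pipeline.pkl')

# Load (only from trusted sources, and with the same scikit-learn version used to save)
loaded_model = joblib.load('model.pkl')
loaded_pipeline = joblib.load('pipeline.pkl')

# Use the loaded model
predictions = loaded_model.predict(X_new)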
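
A round trip is a cheap sanity check before deployment. The sketch below, which assumes the bundled iris data and a temporary directory instead of a real model path, saves a fitted pipeline and confirms that the restored copy predicts identically.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Persist the whole pipeline so preprocessing travels with the model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
print(same_predictions)
```

Saving only the final estimator would force the serving code to reimplement the scaling step, which is a common source of training/serving mismatch.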

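Reference Documentation

This skill includes comprehensive reference files:

references/supervised_learning.md: detailed coverage of all classification and regression algorithms, parameters, use cases, and selection guidance
references/preprocessing.md: complete guide to data preprocessing, including scaling, encoding, imputation, transformations, and best practices
references/model_evaluation.md: in-depth coverage of cross-validation strategies, metrics, hyperparameter tuning, and validation techniques
references/unsupervised_learning.md: comprehensive guide to clustering, dimensionality reduction, anomaly detection, and evaluation methods
references/pipelines_and_composition.md: complete guide to Pipeline, ColumnTransformer, FeatureUnion, custom transformers, and composition patterns
references/quick_reference.md: quick lookup guide with code snippets, common patterns, and algorithm selection decision trees

Read these files when:

You need detailed parameter explanations for a specific algorithm
Comparing multiple algorithms for a task
Building a deeper understanding of evaluation metrics
Building complex preprocessing workflows
Troubleshooting common problems

Example search patterns:

# Find information on a specific algorithm
grep -r "GradientBoosting" references/

# Find preprocessing techniques
grep -r "OneHotEncoder" references/preprocessing.md

# Find evaluation metrics
grep -r "f1_score" references/model_evaluation.md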

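Common Pitfalls to Avoid

Data leakage: always use pipelines and fit only on training data
Not scaling: scale features for distance-based and gradient-based algorithms (SVM, KNN, neural networks)
Wrong metric: use metrics appropriate for imbalanced data
Skipping cross-validation: a single train/test split can be misleading
Forgetting to stratify: stratify splits for imbalanced classification
Using t-SNE for preprocessing: t-SNE is for visualization only
Not setting random_state: results are not reproducible
Ignoring class imbalance: use stratification, appropriate metrics, or resampling
PCA without scaling: components will be dominated by high-variance features
Testing on training data: always evaluate on a held-out test set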