name: Data Science
description: Data analysis, SQL queries, BigQuery operations, and data insights. Use for data analysis tasks and queries.
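Data Science
Data analysis, SQL, and insight generation.
When to Use
- Writing SQL queries
- Analyzing and exploring data
- Creating visualizations
- Performing statistical analysis
- Building ETL jobs and data pipelines
SQL Patterns
Common Queries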
-- Aggregate with window functions: per-user running total and recency rank
SELECT
  user_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) AS recency_rank
FROM orders;
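When the data is already in a DataFrame, the same two window calculations can be reproduced in pandas. The following is a minimal sketch, not a definitive implementation: the sample rows and the helper name `add_window_columns` are hypothetical, and only the column names come from the query above.

```python
import pandas as pd

# Hypothetical sample rows standing in for the `orders` table.
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-01", "2024-01-10", "2024-03-15"]
    ),
    "amount": [10.0, 25.0, 5.0, 40.0, 60.0],
})


def add_window_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add running_total and recency_rank per user (hypothetical helper)."""
    # Sort once so that cumulative operations follow order_date within each user.
    out = df.sort_values(["user_id", "order_date"]).reset_index(drop=True)
    # Counterpart of SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date).
    out["running_total"] = out.groupby("user_id")["amount"].cumsum()
    # Counterpart of ROW_NUMBER() OVER (... ORDER BY order_date DESC):
    # counting from the end of each group gives the latest order rank 1.
    out["recency_rank"] = out.groupby("user_id").cumcount(ascending=False) + 1
    return out


result = add_window_columns(orders)
print(result)
```

One difference to keep in mind: with only an `ORDER BY` in the window, SQL's default frame treats rows with the same `order_date` as peers and gives them the same running total, whereas `cumsum` accumulates row by row.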
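-- Use CTEs for readability
-- Note: DATE_TRUNC('month', created_at) is PostgreSQL-style syntax;
-- BigQuery expects DATE_TRUNC(created_at, MONTH).
WITH monthly_stats AS (
  SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS total_orders,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY 1
),
growth AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
    -- NULLIF turns a zero previous revenue into NULL to avoid division by zero
    (revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) AS growth_rate
  FROM monthly_stats
)
SELECT * FROM growth;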
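The month-over-month calculation above can also be checked locally in pandas. This is a rough sketch under stated assumptions: the sample rows and the function name `monthly_growth` are hypothetical, and `shift(1)` plays the role of `LAG` while replacing zeros with `NaN` plays the role of `NULLIF`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the `orders` table.
orders = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2024-01-03", "2024-01-28", "2024-02-10", "2024-03-05", "2024-03-20"]
    ),
    "amount": [100.0, 50.0, 300.0, 200.0, 250.0],
})


def monthly_growth(df: pd.DataFrame) -> pd.DataFrame:
    """Monthly revenue with previous-month revenue and growth rate."""
    monthly = (
        df.assign(month=df["created_at"].dt.to_period("M"))
        .groupby("month")
        .agg(total_orders=("amount", "size"), revenue=("amount", "sum"))
        .reset_index()
    )
    # LAG(revenue) OVER (ORDER BY month): groupby output is already sorted by month.
    monthly["prev_revenue"] = monthly["revenue"].shift(1)
    # NULLIF(prev_revenue, 0): a zero denominator becomes NaN instead of producing inf.
    denominator = monthly["prev_revenue"].replace(0, np.nan)
    monthly["growth_rate"] = (monthly["revenue"] - monthly["prev_revenue"]) / denominator
    return monthly


growth = monthly_growth(orders)
print(growth)
```

As in the SQL version, a month with no orders simply does not appear, so the "previous" row is the previous month that has data, not necessarily the previous calendar month.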
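BigQuery-Specific
-- Query a partitioned table
-- (_PARTITIONTIME is the pseudocolumn of ingestion-time partitioned tables)
SELECT *
FROM `project.dataset.events`
WHERE DATE(_PARTITIONTIME) BETWEEN '2024-01-01' AND '2024-01-31';

-- Flatten an array column with UNNEST (one output row per array element)
SELECT
  user_id,
  item
FROM `project.dataset.orders`,
  UNNEST(items) AS item;

-- Approximate distinct count for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM `project.dataset.events`;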
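To reason about what the `UNNEST` query returns before running it on BigQuery, it can help to mimic it on a small local sample. The sketch below uses pandas `explode` as a stand-in; the sample rows are hypothetical, and this is an analogy for the flattening behavior, not BigQuery's own implementation.

```python
import pandas as pd

# Hypothetical sample rows standing in for `project.dataset.orders`.
orders = pd.DataFrame({
    "user_id": [1, 2, 3],
    "items": [["apple", "pear"], ["kiwi"], []],
})

# A comma join with UNNEST behaves like a cross join, so rows whose array
# is empty disappear. explode() keeps such rows as NaN, so drop them to match.
unnested = (
    orders.explode("items")
    .dropna(subset=["items"])
    .rename(columns={"items": "item"})
    .reset_index(drop=True)
)
print(unnested)
```

Here user 3 has an empty `items` list and therefore does not appear in `unnested`, which matches the behavior of the comma-join form used in the query above.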
Python Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load and explore
df = pd.read_csv('data.csv')
df.info()
print(df.describe())  # print explicitly so the summary also shows outside a notebook

# Clean and transform
df['date'] = pd.to_datetime(df['date'])
df = df.dropna(subset=['required_field'])
df['category'] = df['category'].fillna('Unknown')

# Aggregate
summary = df.groupby('category').agg({
    'value': ['mean', 'sum', 'count'],
    'date': ['min', 'max']
}).round(2)

# Visualize
df.groupby('date')['value'].sum().plot(figsize=(12, 6))
plt.title('Daily Value')
plt.savefig('chart.png', dpi=150, bbox_inches='tight')
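Passing a dict of lists to `agg`, as in the aggregation step above, produces two-level column headers such as `('value', 'mean')`. These are often flattened before exporting or joining. A minimal sketch, assuming hypothetical sample data and the `value_mean` naming style:

```python
import pandas as pd

# Hypothetical sample data with the same column names as the snippet above.
df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "value": [1.0, 3.0, 10.0],
})

summary = df.groupby("category").agg({"value": ["mean", "sum", "count"]})

# Join the two header levels: ("value", "mean") becomes "value_mean".
summary.columns = ["_".join(col) for col in summary.columns]
# Turn the group key back into an ordinary column.
summary = summary.reset_index()
print(summary)
```

Named aggregation, for example `agg(value_mean=("value", "mean"))`, is an alternative that yields flat column names directly.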
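Statistical Analysis
from scipy import stats
from sklearn.linear_model import LinearRegression

# Hypothesis test: two-sample t-test
# (group_a and group_b are 1-D arrays of observations defined elsewhere)
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Correlation (Pearson by default)
correlation = df['x'].corr(df['y'])

# Regression (X is a 2-D feature matrix and y the target vector, defined elsewhere)
model = LinearRegression().fit(X, y)
print(f"R² = {model.score(X, y):.3f}")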
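The snippet above leaves `group_a`, `group_b`, `X`, and `y` undefined. The sketch below runs the same three steps end to end on synthetic data so the calls can be tried as-is. All data, sizes, and parameters are made up for illustration. Two substitutions are made deliberately: Welch's t-test (`equal_var=False`) instead of the default pooled-variance test, and `scipy.stats.linregress` instead of scikit-learn's `LinearRegression`, which is sufficient for a single feature.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # fixed seed so reruns give the same numbers

# Hypothetical samples: two groups drawn with different true means.
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=11.0, scale=2.0, size=200)

# Welch's t-test: does not assume the two groups share a variance.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Hypothetical linear relationship y = 3x + 1 plus Gaussian noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Pearson correlation and a single-feature least-squares fit.
r, r_p_value = stats.pearsonr(x, y)
fit = stats.linregress(x, y)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"r = {r:.3f}, slope = {fit.slope:.2f}, "
      f"intercept = {fit.intercept:.2f}, R² = {fit.rvalue ** 2:.3f}")
```

For a single predictor, `fit.rvalue ** 2` is the same R² that `model.score(X, y)` reports for an ordinary least-squares fit evaluated on its training data.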
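Output Format
## Analysis Summary
**Question:** [The question we are answering]
**Data source:** [Tables/files used]
**Date range:** [Time period]
### Key Findings
1. [Finding with supporting metrics]
2. [Finding with supporting metrics]
### Visualizations
[Chart description or embedded image]
### Recommendations
- [Actionable insight]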
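Examples
Input: "Analyze user retention" Action: Query cohort data, calculate retention rates, and visualize trends
Input: "Find top customers" Action: Write SQL for RFM analysis, segment users, and summarize findings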
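The RFM (recency, frequency, monetary) step in the second example calls for SQL, but the scoring logic can be prototyped in pandas first. The following is a sketch under stated assumptions: the sample orders, the column names, the function name `rfm_table`, and the convention of summing three rank-based scores are all illustrative choices, one simple scheme among several.

```python
import pandas as pd

# Hypothetical order history; column names are assumptions.
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-10", "2024-03-01", "2023-11-15",
        "2024-02-01", "2024-03-10", "2024-03-25",
    ]),
    "amount": [20.0, 30.0, 15.0, 100.0, 150.0, 120.0],
})


def rfm_table(df: pd.DataFrame, as_of: pd.Timestamp, n_bins: int = 3) -> pd.DataFrame:
    """Score each user 1..n_bins on recency, frequency, and monetary value."""
    rfm = df.groupby("user_id").agg(
        last_order=("order_date", "max"),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )
    rfm["recency_days"] = (as_of - rfm["last_order"]).dt.days

    def score(series: pd.Series, higher_is_better: bool) -> pd.Series:
        # Rank first so tied values cannot create duplicate bin edges in qcut.
        ranks = series.rank(method="first", ascending=higher_is_better)
        return pd.qcut(ranks, q=n_bins, labels=False) + 1

    # Fewer days since the last order is better; more orders and spend are better.
    rfm["r_score"] = score(rfm["recency_days"], higher_is_better=False)
    rfm["f_score"] = score(rfm["frequency"], higher_is_better=True)
    rfm["m_score"] = score(rfm["monetary"], higher_is_better=True)
    rfm["rfm_total"] = rfm[["r_score", "f_score", "m_score"]].sum(axis=1)
    return rfm.sort_values("rfm_total", ascending=False)


rfm = rfm_table(orders, as_of=pd.Timestamp("2024-04-01"))
print(rfm[["recency_days", "frequency", "monetary", "rfm_total"]])
```

The function needs at least `n_bins` users to form the bins. Segments such as "top customers" can then be defined as a threshold on `rfm_total` or on specific score combinations, depending on the business question.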