Label Studio Setup

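Overview

Label Studio is an open-source data labeling platform with tools for annotating images, text, audio, and video. This skill covers Label Studio installation, project setup, data import/export, labeling interface customization, user management, quality control, ML backend integration, API usage, backup and migration, and production deployment.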

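Prerequisites

- Understanding of Docker and containerization
- Knowledge of Python programming
- Familiarity with data labeling concepts
- Basic knowledge of PostgreSQL and Redis
- Knowledge of web server configuration (Nginx)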

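Core Concepts

Label Studio Components

- Web application: Django-based labeling UI
- Database: PostgreSQL for data storage (SQLite is the default when no database is configured)
- Cache: Redis for session management
- ML backend: optional machine learning model integration for pre-annotation
- Storage: file storage for media assets

Annotation Types

- Image classification: one label per image
- Object detection: bounding box annotation
- Semantic segmentation: pixel-level annotation
- Named entity recognition (NER): text entity extraction
- Video annotation: frame-by-frame annotation
- Audio classification: labeling of audio clips

Quality Control

- Review workflow: multi-stage review process
- Consensus: multiple annotators per task
- Active learning: uncertainty-based sampling
- Inter-annotator agreement: quality metric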
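
As a minimal sketch of what an inter-annotator agreement metric measures, the snippet below computes pairwise percent agreement for single-label classification. The function name and the input layout (one list of labels per task) are assumptions made for illustration; it does not reproduce Label Studio's built-in agreement metrics.

```python
from itertools import combinations


def pairwise_agreement(labels_per_task):
    """Return the fraction of annotator pairs that chose the same label.

    labels_per_task: list of lists, one inner list of labels per task.
    Tasks with fewer than two labels contribute no pairs.
    """
    agreeing_pairs = 0
    total_pairs = 0
    for labels in labels_per_task:
        for first, second in combinations(labels, 2):
            total_pairs += 1
            if first == second:
                agreeing_pairs += 1
    # With no comparable pairs there is nothing to measure; report 0.0
    return agreeing_pairs / total_pairs if total_pairs else 0.0


example = [
    ["Cat", "Cat", "Dog"],  # three pairs, one of them agreeing
    ["Dog", "Dog"],         # one pair, agreeing
]
print(pairwise_agreement(example))
```

Percent agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa is the usual next step when label distributions are skewed.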

Implementation Guide

Installation

Docker Setup

# Pull the Label Studio image
docker pull heartexlabs/label-studio:latest

# Create the data directory
mkdir -p label-studio/data

# Run Label Studio
docker run -it \
  -p 8080:8080 \
  -v "$(pwd)/label-studio/data:/label-studio/data" \
  heartexlabs/label-studio:latest

Docker Compose Setup

# docker-compose.yml
version: '3.3'

services:
  app:
    image: heartexlabs/label-studio:latest
    container_name: label-studio
    ports:
      - 8080:8080
    volumes:
      - ./label-studio/data:/label-studio/data
    environment:
      # DJANGO_DB=default selects PostgreSQL; the POSTGRE_* names are Label Studio's own
      - DJANGO_DB=default
      - POSTGRE_HOST=postgres
      - POSTGRE_PORT=5432
      - POSTGRE_USER=labelstudio
      - POSTGRE_PASSWORD=labelstudio
      - POSTGRE_NAME=labelstudio
      # Example credentials only; change them before exposing the instance
      - LABEL_STUDIO_USERNAME=admin
      - LABEL_STUDIO_PASSWORD=admin
      - LABEL_STUDIO_EMAIL=admin@example.com
    depends_on:
      - postgres

  postgres:
    image: postgres:13-alpine
    container_name: postgres
    volumes:
      - ./label-studio/postgres-data:/var/lib/postgresql/data
    environment:
      - POSTGRES_USER=labelstudio
      - POSTGRES_PASSWORD=labelstudio
      - POSTGRES_DB=labelstudio

  redis:
    image: redis:alpine
    container_name: redis
    ports:
      - 6379:6379

# Start with Docker Compose
docker-compose up -d

# Stop
docker-compose down

# View logs
docker-compose logs -f app

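Local Installation

# Install with pip
pip install label-studio

# Install PostgreSQL support (quoted so the shell does not expand the brackets;
# confirm that this extra exists for your Label Studio version)
pip install "label-studio[postgresql]"

# Install all dependencies (same caveat about the extra name)
pip install "label-studio[all]"

# Start Label Studio
label-studio start

# Start on a custom port
label-studio start --port 9000

# Start with a custom data directory
label-studio start --data-dir ./mydata

# Start with a custom host
label-studio start --host 0.0.0.0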

Configuration

# label_studio_config.py
import os

# Database settings (Django expects a DATABASES dict with a 'default' entry)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': os.getenv('POSTGRES_DB', 'labelstudio'),
        'USER': os.getenv('POSTGRES_USER', 'labelstudio'),
        'PASSWORD': os.getenv('POSTGRES_PASSWORD', 'labelstudio'),
        'HOST': os.getenv('POSTGRES_HOST', 'localhost'),
        'PORT': os.getenv('POSTGRES_PORT', '5432'),
    }
}

# Redis settings
REDIS_LOCATION = os.getenv('REDIS_LOCATION', 'redis://localhost:6379/0')

# Storage settings
MEDIA_ROOT = os.path.join(os.path.dirname(__file__), 'data', 'media')

# Security settings (set a real SECRET_KEY and restrict ALLOWED_HOSTS in production)
SECRET_KEY = os.getenv('SECRET_KEY', 'your-secret-key-here')
ALLOWED_HOSTS = ['*']

# Email settings (for notifications)
EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
EMAIL_HOST = os.getenv('EMAIL_HOST', 'smtp.gmail.com')
EMAIL_PORT = int(os.getenv('EMAIL_PORT', '587'))
EMAIL_USE_TLS = True
EMAIL_HOST_USER = os.getenv('EMAIL_HOST_USER')
EMAIL_HOST_PASSWORD = os.getenv('EMAIL_HOST_PASSWORD')

# ML backend settings
ML_BACKEND_HOST = os.getenv('ML_BACKEND_HOST', 'http://localhost:9090')
ML_BACKEND_TIMEOUT = int(os.getenv('ML_BACKEND_TIMEOUT', '100'))
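
Because the settings above fall back to placeholder values, a deployment can start with an insecure configuration without anyone noticing. The following is a minimal sketch of a pre-start check; `check_settings` is a hypothetical helper, not part of Label Studio, and it only looks at the environment variable names used in the configuration above.

```python
def check_settings(env):
    """Return a list of human-readable problems found in env (a dict of strings)."""
    problems = []

    # The placeholder default from the configuration file must never reach production
    secret = env.get("SECRET_KEY", "")
    if not secret or secret == "your-secret-key-here":
        problems.append("SECRET_KEY is missing or still the placeholder")

    # An unset password means the hard-coded default would be used
    if not env.get("POSTGRES_PASSWORD"):
        problems.append("POSTGRES_PASSWORD is not set")

    # The port is read as a string, so verify it is numeric before use
    port = env.get("POSTGRES_PORT", "5432")
    if not port.isdigit():
        problems.append("POSTGRES_PORT must be a number")

    return problems


if __name__ == "__main__":
    import os

    for problem in check_settings(dict(os.environ)):
        print("config problem:", problem)
```

Passing the environment in as a plain dict keeps the check easy to test without touching the real process environment.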

Project Setup

Image Classification

<!-- Image classification configuration -->
<View>
  <Header value="Image Classification"/>
  <Image name="image" value="$image"/>
  <Choices name="label" toName="image">
    <Choice value="Cat"/>
    <Choice value="Dog"/>
    <Choice value="Bird"/>
    <Choice value="Other"/>
  </Choices>
</View>

# Create an image classification project
# (Client is the legacy SDK interface; method names differ in label-studio-sdk 1.0+)
from label_studio_sdk import Client

# Connect to Label Studio
LABEL_STUDIO_URL = 'http://localhost:8080'
API_KEY = 'your-api-key-here'

client = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)

# Create the project
project = client.create_project(
    title='Image Classification',
    description='Classify images into categories',
    label_config='''
    <View>
      <Image name="image" value="$image"/>
      <Choices name="label" toName="image">
        <Choice value="Cat"/>
        <Choice value="Dog"/>
        <Choice value="Bird"/>
        <Choice value="Other"/>
      </Choices>
    </View>
    '''
)
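
A common mistake in a labeling config is a `toName` that points at a tag that does not exist. The sketch below checks for that before a project is created; `dangling_to_names` is a hypothetical helper built on the standard library XML parser, and it only verifies that each `toName` matches some `name` in the config, not that the target is the right kind of tag.

```python
import xml.etree.ElementTree as ET


def dangling_to_names(label_config):
    """Return the toName values that do not match any tag's name attribute."""
    root = ET.fromstring(label_config)

    # Collect every declared name in the config
    names = {el.get("name") for el in root.iter() if el.get("name")}

    # Report each toName that refers to an undeclared name
    missing = []
    for el in root.iter():
        target = el.get("toName")
        if target and target not in names:
            missing.append(target)
    return missing


good_config = """
<View>
  <Image name="image" value="$image"/>
  <Choices name="label" toName="image">
    <Choice value="Cat"/>
  </Choices>
</View>
"""
bad_config = good_config.replace('toName="image"', 'toName="photo"')

print(dangling_to_names(good_config))
print(dangling_to_names(bad_config))
```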

Object Detection

<!-- Object detection configuration -->
<View>
  <Header value="Object Detection"/>
  <Image name="image" value="$image"/>
  <RectangleLabels name="label" toName="image" strokeWidth="3">
    <Label value="Person" background="#FF0000"/>
    <Label value="Car" background="#00FF00"/>
    <Label value="Bicycle" background="#0000FF"/>
    <Label value="Dog" background="#FFFF00"/>
  </RectangleLabels>
</View>

# Create an object detection project
project = client.create_project(
    title='Object Detection',
    description='Detect objects in images',
    label_config='''
    <View>
      <Image name="image" value="$image"/>
      <RectangleLabels name="label" toName="image" strokeWidth="3">
        <Label value="Person" background="#FF0000"/>
        <Label value="Car" background="#00FF00"/>
        <Label value="Bicycle" background="#0000FF"/>
        <Label value="Dog" background="#FFFF00"/>
      </RectangleLabels>
    </View>
    '''
)
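
Label Studio stores rectangle coordinates as percentages of the image size, so training code usually needs them converted to pixels. The sketch below does that for one result item; the function name is an assumption for illustration, and the item is expected to carry the `original_width` and `original_height` fields that Label Studio adds to image results.

```python
def rectangle_to_pixels(result_item):
    """Convert one RectangleLabels result item to (x_min, y_min, x_max, y_max) in pixels."""
    value = result_item["value"]
    img_w = result_item["original_width"]
    img_h = result_item["original_height"]

    # x, y, width, and height are percentages (0-100) of the image dimensions
    x_min = value["x"] * img_w / 100.0
    y_min = value["y"] * img_h / 100.0
    x_max = x_min + value["width"] * img_w / 100.0
    y_max = y_min + value["height"] * img_h / 100.0
    return x_min, y_min, x_max, y_max


item = {
    "original_width": 200,
    "original_height": 100,
    "type": "rectanglelabels",
    "value": {"x": 10, "y": 20, "width": 50, "height": 25,
              "rectanglelabels": ["Car"]},
}
print(rectangle_to_pixels(item))
```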

Semantic Segmentation

<!-- Semantic segmentation configuration -->
<!-- PolygonLabels outlines regions; BrushLabels is the alternative for painted pixel masks -->
<View>
  <Header value="Semantic Segmentation"/>
  <Image name="image" value="$image"/>
  <PolygonLabels name="label" toName="image" strokeWidth="3">
    <Label value="Background" background="#000000"/>
    <Label value="Person" background="#FF0000"/>
    <Label value="Car" background="#00FF00"/>
    <Label value="Building" background="#0000FF"/>
  </PolygonLabels>
</View>

Named Entity Recognition (NER)

<!-- NER configuration -->
<View>
  <Header value="Named Entity Recognition"/>
  <Text name="text" value="$text"/>
  <Labels name="label" toName="text">
    <Label value="PERSON" background="#FF0000"/>
    <Label value="ORG" background="#00FF00"/>
    <Label value="LOC" background="#0000FF"/>
    <Label value="MISC" background="#FFFF00"/>
  </Labels>
</View>

# Create an NER project
project = client.create_project(
    title='Named Entity Recognition',
    description='Extract named entities from text',
    label_config='''
    <View>
      <Text name="text" value="$text"/>
      <Labels name="label" toName="text">
        <Label value="PERSON" background="#FF0000"/>
        <Label value="ORG" background="#00FF00"/>
        <Label value="LOC" background="#0000FF"/>
        <Label value="MISC" background="#FFFF00"/>
      </Labels>
    </View>
    '''
)
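
NER annotations come back as character spans, each with `start`, `end`, and `labels` in its `value`. The following sketch turns a result list into sorted entity tuples; `extract_entities` is a hypothetical helper, and the sample result is hand-written to mirror the shape Label Studio produces for the `Labels` tag on text.

```python
def extract_entities(text, result):
    """Return (label, start, end, surface) tuples sorted by start offset."""
    entities = []
    for item in result:
        # Skip anything that is not a text span label (e.g. choices)
        if item.get("type") != "labels":
            continue
        value = item["value"]
        start, end = value["start"], value["end"]
        for label in value["labels"]:
            # Slice the source text so the surface form always matches the offsets
            entities.append((label, start, end, text[start:end]))
    return sorted(entities, key=lambda entity: entity[1])


text = "Ada Lovelace worked in London."
london = text.index("London")
result = [
    {"from_name": "label", "to_name": "text", "type": "labels",
     "value": {"start": london, "end": london + len("London"), "labels": ["LOC"]}},
    {"from_name": "label", "to_name": "text", "type": "labels",
     "value": {"start": 0, "end": len("Ada Lovelace"), "labels": ["PERSON"]}},
]
for entity in extract_entities(text, result):
    print(entity)
```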

Custom Templates

<!-- Multi-task configuration (classification + bounding boxes) -->
<View>
  <Header value="Multi-Task Annotation"/>
  <Image name="image" value="$image"/>

  <!-- Classification -->
  <Choices name="category" toName="image">
    <Choice value="Indoor"/>
    <Choice value="Outdoor"/>
    <Choice value="Mixed"/>
  </Choices>

  <!-- Object detection -->
  <RectangleLabels name="objects" toName="image" strokeWidth="3">
    <Label value="Person" background="#FF0000"/>
    <Label value="Car" background="#00FF00"/>
  </RectangleLabels>

  <!-- Attributes: toName must reference the object tag (the image);
       perRegion attaches the choice to the selected box -->
  <Taxonomy name="attributes" toName="image" perRegion="true">
    <Choice value="Occluded"/>
    <Choice value="Truncated"/>
    <Choice value="Crowded"/>
  </Taxonomy>
</View>

<!-- Video annotation configuration -->
<View>
  <Header value="Video Annotation"/>
  <Video name="video" value="$video"/>
  <!-- VideoRectangle draws boxes tracked across frames; Labels supplies the classes -->
  <VideoRectangle name="box" toName="video"/>
  <Labels name="label" toName="video">
    <Label value="Person" background="#FF0000"/>
    <Label value="Car" background="#00FF00"/>
  </Labels>
</View>

<!-- Audio classification configuration -->
<View>
  <Header value="Audio Classification"/>
  <Audio name="audio" value="$audio"/>
  <Choices name="label" toName="audio">
    <Choice value="Speech"/>
    <Choice value="Music"/>
    <Choice value="Noise"/>
    <Choice value="Other"/>
  </Choices>
</View>

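Data Import/Export

Importing Data

# Import images: import_tasks() takes a list of task dicts or a path to a task
# file, so build one task per image URL that the annotator's browser can load
image_urls = [
    'http://example.com/image1.jpg',
    'http://example.com/image2.jpg'
]
project.import_tasks([{'image': url} for url in image_urls])

# Import from JSON
tasks = [
    {
        'image': 'http://example.com/image1.jpg',
        'text': 'Sample text 1'
    },
    {
        'image': 'http://example.com/image2.jpg',
        'text': 'Sample text 2'
    }
]

project.import_tasks(tasks)

# Import from CSV: each column header becomes a key in the task data, so name
# the columns after the variables in the labeling config (e.g. image, text)
project.import_tasks('data.csv')

# Import pre-annotations: when a task carries predictions, its fields must be
# nested under 'data' so that 'predictions' is not treated as a data field
tasks_with_predictions = [
    {
        'data': {'image': 'image1.jpg'},
        'predictions': [
            {
                'result': [
                    {
                        'from_name': 'label',
                        'to_name': 'image',
                        'type': 'choices',
                        'value': {'choices': ['Cat']}
                    }
                ],
                'model_version': 'v1.0'
            }
        ]
    }
]

project.import_tasks(tasks_with_predictions)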
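
When CSV headers differ from the variable names in the labeling config (for example `image_url` versus `$image`), the rows can be renamed client-side before import. The following is a minimal sketch using only the standard library; `csv_to_tasks` and the mapping layout are assumptions for illustration, and the resulting list is in the plain task-dict form shown above.

```python
import csv
import io


def csv_to_tasks(csv_text, column_mapping):
    """Build task dicts from CSV text, renaming columns via column_mapping.

    column_mapping maps a CSV header to the task data key it should become.
    Columns that are not listed in the mapping are dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    tasks = []
    for row in reader:
        tasks.append({target: row[source]
                      for source, target in column_mapping.items()})
    return tasks


csv_text = (
    "image_url,description,ignored\n"
    "http://example.com/image1.jpg,Sample text 1,x\n"
    "http://example.com/image2.jpg,Sample text 2,y\n"
)
tasks = csv_to_tasks(csv_text, {"image_url": "image", "description": "text"})
print(tasks)
```

The returned list can then be passed to the import call in place of a file path.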

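Exporting Data

# Export as JSON (returns the parsed list of tasks)
export = project.export_tasks(
    export_type='JSON',
    download_all_tasks=True,
    download_resources=True
)

# Export in COCO format (in recent SDK versions, non-JSON formats are written
# to the file given by export_location instead of being returned)
project.export_tasks(
    export_type='COCO',
    download_all_tasks=True,
    export_location='export_coco.zip'
)

# Export in YOLO format
project.export_tasks(
    export_type='YOLO',
    download_all_tasks=True,
    export_location='export_yolo.zip'
)

# Export as CSV
project.export_tasks(
    export_type='CSV',
    download_all_tasks=True,
    export_location='export.csv'
)

# Export only annotated tasks (the behavior when download_all_tasks is False)
export = project.export_tasks(
    export_type='JSON',
    download_all_tasks=False
)

# Save to a file
import json
with open('export.json', 'w') as f:
    json.dump(export, f)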
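
A JSON export is a list of tasks, each with an `annotations` list whose entries hold a `result` list. As a small sketch of reading that structure, the function below counts how often each choice was selected; `count_choices` is a hypothetical helper and the sample data is hand-written in the export layout.

```python
from collections import Counter


def count_choices(exported_tasks):
    """Count selected choices across all non-cancelled annotations."""
    counts = Counter()
    for task in exported_tasks:
        for annotation in task.get("annotations", []):
            # Skipped (cancelled) annotations carry no usable labels
            if annotation.get("was_cancelled"):
                continue
            for item in annotation.get("result", []):
                if item.get("type") == "choices":
                    counts.update(item["value"]["choices"])
    return counts


sample_export = [
    {"id": 1, "data": {"image": "a.jpg"}, "annotations": [
        {"was_cancelled": False, "result": [
            {"type": "choices", "value": {"choices": ["Cat"]}}]},
        {"was_cancelled": True, "result": [
            {"type": "choices", "value": {"choices": ["Dog"]}}]},
    ]},
    {"id": 2, "data": {"image": "b.jpg"}, "annotations": [
        {"was_cancelled": False, "result": [
            {"type": "choices", "value": {"choices": ["Cat"]}}]},
    ]},
    {"id": 3, "data": {"image": "c.jpg"}, "annotations": []},
]
print(count_choices(sample_export))
```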

Labeling Interface Customization

Custom CSS

<View style="background-color: #f0f0f0;">
  <Header value="Custom Styling" style="font-size: 24px; color: #333;"/>
  <Image name="image" value="$image" style="max-height: 600px;"/>
  <Choices name="label" toName="image" style="display: flex; gap: 10px;">
    <Choice value="Yes" style="background-color: #4CAF50; color: white; padding: 10px;"/>
    <Choice value="No" style="background-color: #f44336; color: white; padding: 10px;"/>
  </Choices>
</View>

Hotkeys

<View>
  <Header value="Use hotkeys: 1=Cat, 2=Dog, 3=Bird, 4=Other"/>
  <Image name="image" value="$image"/>
  <Choices name="label" toName="image">
    <Choice value="Cat" hotkey="1"/>
    <Choice value="Dog" hotkey="2"/>
    <Choice value="Bird" hotkey="3"/>
    <Choice value="Other" hotkey="4"/>
  </Choices>
</View>
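
For longer label lists, writing the hotkey attributes by hand is error-prone. The sketch below generates an equivalent config from a list of labels; `build_choices_config` is a hypothetical helper, and limiting the list to nine labels is a simplifying assumption so that every hotkey is a single digit.

```python
import xml.etree.ElementTree as ET


def build_choices_config(labels):
    """Return a labeling config string with hotkeys 1..9 assigned in order."""
    if len(labels) > 9:
        raise ValueError("single-digit hotkeys support at most 9 labels")

    view = ET.Element("View")
    ET.SubElement(view, "Image", name="image", value="$image")
    choices = ET.SubElement(view, "Choices", name="label", toName="image")
    for index, label in enumerate(labels, start=1):
        # Hotkeys are assigned by position: first label gets "1", and so on
        ET.SubElement(choices, "Choice", value=label, hotkey=str(index))
    return ET.tostring(view, encoding="unicode")


config = build_choices_config(["Cat", "Dog", "Bird", "Other"])
print(config)
```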

Conditional Logic

<View>
  <Image name="image" value="$image"/>
  <Choices name="has_object" toName="image">
    <Choice value="Yes"/>
    <Choice value="No"/>
  </Choices>

  <!-- Show the box labels only while "Yes" is selected in has_object -->
  <View visibleWhen="choice-selected" whenTagName="has_object" whenChoiceValue="Yes">
    <RectangleLabels name="object_label" toName="image">
      <Label value="Person"/>
      <Label value="Car"/>
    </RectangleLabels>
  </View>
</View>

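User Management

# Note: user and membership management depend on the Label Studio edition and
# SDK version; confirm that these methods exist in your installation

# Create a user
user = client.create_user(
    email='user@example.com',
    username='newuser',
    password='password123',
    first_name='John',
    last_name='Doe'
)

# List users
users = client.get_users()
for u in users:
    print(f"{u.username}: {u.email}")

# Update a user
user = client.update_user(
    user_id=1,
    first_name='Jane'
)

# Delete a user
client.delete_user(user_id=1)

# Assign a user to the project
project.add_member(user_id=1, role='Annotator')

# Remove a user from the project
project.delete_member(user_id=1)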

Quality Control

Review Workflow

# Note: review workflows are a Label Studio Enterprise feature; check the
# setting names below against the API of your version

# Enable the review workflow
project.update_settings({
    'review_mode': True,
    'review_percentage': 0.1  # Review 10% of tasks
})

# Create a review project
review_project = client.create_project(
    title='Review Project',
    description='Review annotations',
    source_project_id=project.id
)

# Get the review tasks
review_tasks = review_project.get_tasks()

# Approve a review
review_task = review_tasks[0]
review_task.update_annotations(
    {
        'result': review_task.annotations[0]['result'],
        'was_cancelled': False
    }
)

Consensus

# Enable consensus
project.update_settings({
    'consensus_type': 'majority_vote',
    'consensus_number_of_annotators': 3  # 3 annotators per task
})

# Get the consensus results
consensus_results = project.get_predictions(
    only_ground_truth=True
)
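
Majority-vote consensus can also be computed offline from exported labels. The following sketch shows the rule for a single task; `majority_vote` is a hypothetical helper, and returning `None` on a tie is a design assumption so that tied tasks can be routed to a reviewer rather than resolved arbitrarily.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None for empty input or a tie."""
    top_two = Counter(labels).most_common(2)
    if not top_two:
        return None
    # A tie between the two most frequent labels means there is no consensus
    if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
        return None
    return top_two[0][0]


print(majority_vote(["Cat", "Cat", "Dog"]))
print(majority_vote(["Cat", "Dog"]))
```

With three annotators per task and two classes, a tie can only occur when an annotator skips the task, which is one reason odd annotator counts are common.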

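ML Backend Integration

Pre-annotation Setup

# ML backend server (Flask example)
# Label Studio also probes /health and /setup when a backend is connected;
# add those routes or build on the label-studio-ml package
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)

# Load the model
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

@app.route('/predict', methods=['POST'])
def predict():
    # Label Studio posts a batch: {"tasks": [{"data": {...}}, ...], ...}
    tasks = request.json['tasks']

    predictions = []
    for task in tasks:
        image_url = task['data']['image']

        # Get the prediction
        result = classifier(image_url)

        # Format it for Label Studio
        predictions.append({
            'result': [{
                'from_name': 'label',
                'to_name': 'image',
                'type': 'choices',
                'value': {
                    'choices': [result[0]['label']]
                },
                'score': result[0]['score']
            }],
            'model_version': 'v1.0'
        })

    # One prediction per task, in request order
    return jsonify({'results': predictions})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9090)

# Connect the ML backend to the project
project.connect_ml_backend(
    url='http://localhost:9090',
    title='Image classifier v1.0'
)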
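
Keeping the prediction formatting in its own function makes it testable without starting a server or loading a model. The sketch below assumes classifier output in the Hugging Face pipeline shape (a list of `{"label", "score"}` dicts); `to_prediction` and its default tag names are illustrative and must match the `name` and `toName` values in the project's labeling config.

```python
def to_prediction(classifier_output, from_name="label", to_name="image",
                  model_version="v1.0"):
    """Convert classifier output into a Label Studio prediction dict."""
    # Pick the highest-scoring class rather than relying on output order
    best = max(classifier_output, key=lambda item: item["score"])
    return {
        "result": [{
            "from_name": from_name,
            "to_name": to_name,
            "type": "choices",
            "value": {"choices": [best["label"]]},
        }],
        # Prediction-level score, shown next to the pre-annotation
        "score": best["score"],
        "model_version": model_version,
    }


output = [{"label": "Dog", "score": 0.2}, {"label": "Cat", "score": 0.7}]
print(to_prediction(output))
```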

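Active Learning

# Active learning with uncertainty sampling
# (this handler replaces the /predict route of the Flask app above)
import numpy as np

@app.route('/predict', methods=['POST'])
def predict():
    # Label Studio posts a batch: {"tasks": [{"data": {...}}, ...], ...}
    tasks = request.json['tasks']

    predictions = []
    for task in tasks:
        image_url = task['data']['image']

        # Get the class probabilities
        result = classifier(image_url, top_k=5)

        # Compute the uncertainty (entropy over the top-k scores)
        probs = [r['score'] for r in result]
        uncertainty = -sum(p * np.log(p) for p in probs if p > 0)

        predictions.append({
            'result': [{
                'from_name': 'label',
                'to_name': 'image',
                'type': 'choices',
                'value': {
                    'choices': [result[0]['label']]
                },
                'score': result[0]['score']
            }],
            'model_version': 'v1.0',
            'score': float(uncertainty)  # Used for active learning
        })

    return jsonify({'results': predictions})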
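
Uncertainty sampling means labeling the tasks whose predicted distribution has the highest entropy first. The following sketch shows that ordering in isolation, using only the standard library; the function names and the input layout (task id mapped to a probability list) are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability list; zeros are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_by_uncertainty(task_probs):
    """Return task ids ordered from most to least uncertain."""
    return sorted(task_probs,
                  key=lambda task_id: entropy(task_probs[task_id]),
                  reverse=True)


task_probs = {
    1: [0.9, 0.1],   # fairly confident
    2: [0.5, 0.5],   # maximally uncertain for two classes
    3: [1.0],        # fully confident
}
print(rank_by_uncertainty(task_probs))
```

Note that entropy over only the top-k scores underestimates the true entropy when the remaining classes hold significant probability mass.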

API Usage

Project Management

from label_studio_sdk import Client

# Initialize the client
client = Client(
    url='http://localhost:8080',
    api_key='your-api-key'
)

# Create a project
project = client.create_project(
    title='My Project',
    description='Project description',
    label_config='<View>...</View>'
)

# Get a project
project = client.get_project(project_id=1)

# List projects
projects = client.get_projects()

# Update a project
project.update(
    title='Updated Title',
    description='Updated description'
)

# Delete a project
client.delete_project(project_id=1)

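Task Management

# The calls below assume the legacy SDK's Project methods, which take task ids
# and return plain dicts; verify them against your SDK version
from label_studio_sdk.data_manager import Filters, Column, Operator, Type

# Create tasks
tasks = [
    {'data': {'image': 'http://example.com/image1.jpg'}},
    {'data': {'image': 'http://example.com/image2.jpg'}}
]
project.import_tasks(tasks)

# Get all tasks
tasks = project.get_tasks()

# Get a specific task (returned as a dict)
task = project.get_task(task_id=1)

# Update a task
project.update_task(
    task_id=1,
    data={'image': 'http://example.com/new_image.jpg'}
)

# Delete a task
project.delete_task(task_id=1)

# Search tasks with a Data Manager filter (here: image URL contains a substring)
filters = Filters.create(Filters.AND, [
    Filters.item(
        Column.data('image'),
        Operator.CONTAINS,
        Type.String,
        Filters.value('search query')
    )
])
tasks = project.get_tasks(filters=filters)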

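Annotation Management

# The calls below assume the legacy SDK's Project methods; verify them against
# your SDK version

# Get a task's annotations (get_task() returns a dict)
task = project.get_task(task_id=1)
annotations = task['annotations']

# Create an annotation
annotation = project.create_annotation(
    task_id=1,
    result=[{
        'from_name': 'label',
        'to_name': 'image',
        'type': 'choices',
        'value': {'choices': ['Cat']}
    }]
)

# Update the annotation
project.update_annotation(
    annotation['id'],
    result=[{
        'from_name': 'label',
        'to_name': 'image',
        'type': 'choices',
        'value': {'choices': ['Dog']}
    }]
)

# Delete the annotation
project.delete_annotation(annotation['id'])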

Backup and Migration

Backup

# Back up the database (pg_dump runs in the postgres container, not the app container)
docker exec postgres pg_dump -U labelstudio labelstudio > backup.sql

# Back up media files
docker cp label-studio:/label-studio/data/media ./backup/media

# Back up with Docker Compose
docker-compose exec postgres pg_dump -U labelstudio labelstudio > backup.sql

# Export the data of every project
import json

projects = client.get_projects()

for project in projects:
    export = project.export_tasks(
        export_type='JSON',
        download_all_tasks=True,
        download_resources=True
    )

    # Save to a file
    filename = f"backup_project_{project.id}.json"
    with open(filename, 'w') as f:
        json.dump(export, f)

Migration

# Migrate to a new instance
from label_studio_sdk import Client

old_client = Client(url='http://old-server:8080', api_key='old-key')
new_client = Client(url='http://new-server:8080', api_key='new-key')

# Get the projects from the old instance
old_projects = old_client.get_projects()

# Migrate each project
for old_project in old_projects:
    # Create the new project
    new_project = new_client.create_project(
        title=old_project.title,
        description=old_project.description,
        label_config=old_project.label_config
    )

    # Fetch the tasks from the old project (get_tasks() returns a list of dicts)
    tasks = old_project.get_tasks()
    task_data = [{'data': t['data']} for t in tasks]

    # Import them into the new project
    new_project.import_tasks(task_data)
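
The migration loop above copies only the task data, so existing annotations are left behind. As a sketch of one way to carry them over, the function below converts exported tasks into import-ready dicts that keep each annotation's `result`; `export_to_import_tasks` is a hypothetical helper, and it assumes the JSON export layout and that the target instance accepts an `annotations` list on imported tasks.

```python
def export_to_import_tasks(exported_tasks):
    """Convert exported tasks into import-ready dicts that keep annotations."""
    tasks = []
    for task in exported_tasks:
        # Keep only the labeling result; ids and user references belong to the
        # old instance and would not be valid on the new one
        annotations = [
            {"result": annotation["result"]}
            for annotation in task.get("annotations", [])
            if not annotation.get("was_cancelled")
        ]
        item = {"data": task["data"]}
        if annotations:
            item["annotations"] = annotations
        tasks.append(item)
    return tasks


exported = [
    {"id": 7, "data": {"image": "a.jpg"}, "annotations": [
        {"id": 70, "completed_by": 3, "was_cancelled": False,
         "result": [{"type": "choices", "value": {"choices": ["Cat"]}}]},
    ]},
    {"id": 8, "data": {"image": "b.jpg"}, "annotations": []},
]
print(export_to_import_tasks(exported))
```

Annotator attribution is dropped in this sketch because user ids differ between instances; mapping users across instances would need a separate lookup.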

Production Deployment

Nginx Reverse Proxy

# /etc/nginx/sites-available/label-studio
server {
    listen 80;
    server_name label-studio.example.com;

    client_max_body_size 100M;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    location /static/ {
        alias /label-studio/data/static/;
    }
}

SSL Configuration

# Redirect HTTP to HTTPS (a separate top-level block; server blocks cannot be
# nested, and this one takes the place of the plain HTTP proxy block)
server {
    listen 80;
    server_name label-studio.example.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name label-studio.example.com;

    ssl_certificate /etc/ssl/certs/label-studio.crt;
    ssl_certificate_key /etc/ssl/private/label-studio.key;

    client_max_body_size 100M;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Systemd Service

# /etc/systemd/system/label-studio.service
[Unit]
Description=Label Studio
After=network.target

[Service]
Type=simple
User=labelstudio
WorkingDirectory=/home/labelstudio
ExecStart=/home/labelstudio/venv/bin/label-studio start --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable and start the service
sudo systemctl enable label-studio
sudo systemctl start label-studio
sudo systemctl status label-studio

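Best Practices

Project Organization
- Use consistent naming conventions
- Give projects descriptive titles
- Organize projects by task type
- Provide appropriate annotation guidelines

Quality Assurance
- Enable the review workflow for critical tasks
- Use consensus for high-risk annotations
- Track quality metrics
- Provide clear annotation guidelines

Performance Optimization
- Use pagination for large datasets
- Run imports asynchronously
- Optimize image loading and serving
- Serve media assets through a CDN

Security
- Use strong passwords and API keys
- Enable SSL/TLS in production
- Enforce proper authentication
- Update dependencies regularly

Backup Strategy
- Back up the database regularly
- Export project data regularly
- Test restore procedures
- Store backups securely

User Management
- Create appropriate user roles
- Assign users to the relevant projects
- Monitor user activity
- Remove inactive users

ML Integration
- Use pre-annotation to speed up labeling
- Use active learning to improve efficiency
- Monitor model performance
- Update models regularly

Documentation
- Document the annotation guidelines
- Create annotation examples
- Maintain project documentation
- Share knowledge with the team

Monitoring
- Track annotation progress
- Monitor system performance
- Set up alerts for problems
- Review quality metrics

Scalability
- Use appropriately sized hardware
- Implement load balancing
- Optimize database queries
- Plan for growth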
