我们团队维护着一个服役了近十年的PHP单体应用,它的产品推荐模块终于在一个季度复盘会上被正式标记为“高优先级技术债”。症结显而易见:它依赖一套陈旧的协同过滤算法,通过cron触发的PHP脚本,在每天凌晨流量低谷时进行批处理计算。这意味着用户的推荐列表24小时内毫无变化,完全无法响应其实时行为。在团队的Scrum流程中,这个问题的业务价值(提升用户参与度和转化率)和技术紧迫性都让它在待办事项列表(Backlog)中迅速攀升。
最初的构想是在现有的PHP技术栈内进行重构,但很快被否决。机器学习生态系统的主流在Python,任何试图用PHP复现现代推荐模型(如图神经网络或深度因子分解机)的努力都无异于重复造轮子,且维护成本高昂。我们决定采用Python,并选择了FastAPI框架,因为它基于Starlette和Pydantic,性能出色且具备类型提示,非常适合构建生产级的API服务。
然而,这立刻引出了一个核心的架构难题:如何将一个全新的、基于Python的、可能需要GPU资源的、有状态(需要加载大模型)的机器学习服务,与一个庞大的、无状态的、运行在传统LAMP环境下的PHP单体应用优雅地集成起来?
我们的数据科学团队已经在使用Kubeflow进行模型训练和实验管理,将其扩展到模型服务部署(KFServing/KServe)是自然而然的选择。这为我们提供了模型版本管理、弹性伸缩和强大的部署编排能力。但问题依然存在:直接从PHP应用内部调用Kubernetes集群内的服务,会引入复杂的网络策略、服务发现和认证问题。这是一个典型的异构系统集成挑战,而我们的Scrum迭代周期只有两周,必须找到一个能够快速验证且风险可控的方案。
在一次Sprint计划会议上,我们对比了多种方案:API网关(如Kong/APISIX,过于重型)、在PHP中引入Guzzle等HTTP客户端直接调用(耦合度高,认证复杂)、以及使用消息队列解耦(引入了异步,不适用于实时推荐场景)。最终,我们锁定了一个看似非主流但极其灵活的方案:使用Vercel Functions作为 serverless 桥接层。
这个决策的驱动力在于:
- 隔离关注点:PHP应用无需知道任何关于Kubernetes的细节。它只需要像调用内部API一样,调用一个Vercel Function的URL。
- 安全边界:认证逻辑被封装在Vercel Function中。它可以安全地存储访问Kubeflow服务的Service Account Token,而不会暴露给前端或PHP后端。
- 轻量与可扩展:Vercel Functions是无服务器的,按需执行,完美应对推荐请求的流量波动,且无需我们管理任何基础设施。
- 敏捷开发:团队的前端开发者对Vercel的开发和部署模式非常熟悉,这使得我们可以并行开发Python模型服务和这个桥接层,完美契合我们的Scrum流程。
于是,我们制定了第一个Sprint的目标:搭建一个最小化的端到端链路。一个部署在Kubeflow上的FastAPI健康检查端点,以及一个能够成功调用它的Vercel Function。
第一步:构建生产级的Python模型服务
在真实项目中,一个模型服务远不止一个预测函数。它必须包含配置管理、模型加载、日志记录、输入验证和优雅的错误处理。我们使用FastAPI和Pydantic来构建这个服务。
项目结构如下:
recommender-service/
├── app/
│   ├── __init__.py
│   ├── main.py          # FastAPI应用入口
│   ├── schemas.py       # Pydantic输入输出模型
│   ├── model_loader.py  # 模型加载与缓存逻辑
│   ├── predictor.py     # 预测核心逻辑
│   └── config.py        # 配置管理
├── tests/
│   └── test_api.py      # 单元测试
├── Dockerfile
└── requirements.txt
app/config.py: 配置管理
硬编码配置是生产环境的大忌。我们使用Pydantic的BaseSettings来从环境变量加载配置。
# app/config.py
import os

from pydantic_settings import BaseSettings


class AppSettings(BaseSettings):
    """
    Application settings loaded from environment variables.
    """
    # Model configuration
    MODEL_PATH: str = "/mnt/models/latest"  # Path inside the container where the model is mounted
    MODEL_WARMUP_TIMEOUT: int = 300  # Seconds to wait for model to be ready

    # Logging configuration
    LOG_LEVEL: str = "INFO"

    # KServe specific environment variables (if needed)
    KSERVE_STORAGE_URI: str | None = os.environ.get("KSERVE_STORAGE_URI")


settings = AppSettings()
app/model_loader.py: 模型加载与预热
模型加载通常是IO密集型和CPU密集型的操作,可能耗时数秒甚至数十秒。一个常见的错误是在每次API请求时都去加载模型。正确的做法是在应用启动时进行预加载(warm-up),并将其缓存为单例。
# app/model_loader.py
import logging
import time
from pathlib import Path


# A placeholder for a real model class, e.g., from scikit-learn, PyTorch, etc.
class RecommenderModel:
    def __init__(self, model_path: Path):
        self._model_path = model_path
        self._model = None
        self._metadata = {}
        self.load()

    def load(self):
        # Simulate loading a large model file
        logging.info(f"Loading model from {self._model_path}...")
        if not self._model_path.exists():
            raise FileNotFoundError(f"Model directory not found at {self._model_path}")
        # In a real scenario, this would be something like:
        # self._model = torch.load(self._model_path / "model.pt")
        # self._metadata = json.load(open(self._model_path / "metadata.json"))
        time.sleep(5)  # Simulating a 5-second load time
        self._model = "A_VERY_LARGE_MODEL_OBJECT"
        self._metadata = {"version": "1.2.3", "trained_at": "2023-10-26"}
        logging.info("Model loaded successfully.")

    def predict(self, user_id: str, context: list) -> list:
        # Simulate prediction logic
        logging.debug(f"Predicting for user_id: {user_id} with context: {context}")
        return [f"product_{user_id}_{i}" for i in range(10)]


# Singleton instance of our model
_model_instance: RecommenderModel | None = None


def get_model() -> RecommenderModel:
    """
    Returns the singleton model instance.
    Raises if the model has not been initialized yet.
    """
    if _model_instance is None:
        raise RuntimeError("Model has not been initialized. Call initialize_model() first.")
    return _model_instance


def initialize_model(model_path: Path) -> None:
    """
    Initializes the model singleton. To be called at application startup.
    """
    global _model_instance
    if _model_instance is None:
        logging.info("Initializing model singleton...")
        _model_instance = RecommenderModel(model_path)
    else:
        logging.warning("Model singleton already initialized.")
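app/schemas.py: 输入输出契约
项目结构和后文的main.py都引用了schemas.py,但原文没有展开。下面是一个与现有端点字段(user_id、session_context_ids、recommendations、model_version、status、model_loaded)保持一致的最小示意,字段约束仅为示例假设,并非团队的最终定义。

# app/schemas.py
# 最小示意:模型名称与字段和 main.py 中的用法保持一致,具体约束为假设值。
from pydantic import BaseModel, Field


class RecommendationRequest(BaseModel):
    user_id: str = Field(..., min_length=1, description="Unique user identifier")
    session_context_ids: list[str] = Field(
        default_factory=list,
        description="Product IDs the user interacted with in the current session",
    )


class RecommendationResponse(BaseModel):
    user_id: str
    recommendations: list[str]
    model_version: str = "unknown"


class HealthResponse(BaseModel):
    status: str
    model_loaded: bool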
app/main.py: FastAPI应用
主应用文件将所有部分组合在一起。我们使用FastAPI的lifespan上下文管理器来处理应用启动和关闭事件,这是执行模型预热的最佳位置。
# app/main.py
import logging
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI, Request, status
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse

from .config import settings
from .model_loader import initialize_model, get_model
from .schemas import RecommendationRequest, RecommendationResponse, HealthResponse

# Configure logging
logging.basicConfig(level=settings.LOG_LEVEL.upper())


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup event
    logging.info("Application startup: Loading model...")
    model_path = Path(settings.MODEL_PATH)
    try:
        initialize_model(model_path)
        logging.info("Application startup complete.")
    except Exception as e:
        logging.critical(f"Failed to initialize model during startup: {e}", exc_info=True)
        # In a real scenario, you might want the pod to crash and restart
        # by raising the exception here.
    yield
    # Shutdown event (not typically used in serverless environments but good practice)
    logging.info("Application shutdown.")


app = FastAPI(
    title="Recommender Service",
    version="1.0.0",
    lifespan=lifespan
)


# Custom exception handler for better error visibility
@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
    return JSONResponse(
        status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
        content={"detail": exc.errors(), "body": exc.body},
    )


@app.get("/health", response_model=HealthResponse, tags=["Monitoring"])
async def health_check():
    """
    Health check endpoint for Kubernetes liveness/readiness probes.
    """
    # Here you could add checks for database connections, etc.
    return {"status": "ok", "model_loaded": get_model() is not None}


@app.post("/v1/recommendations", response_model=RecommendationResponse, tags=["Predictions"])
async def predict_recommendations(request: RecommendationRequest):
    """
    Generates product recommendations for a given user and context.
    """
    try:
        model = get_model()
        # In a real project, there would be more complex feature engineering here
        recommendations = model.predict(request.user_id, request.session_context_ids)
        return RecommendationResponse(
            user_id=request.user_id,
            recommendations=recommendations,
            model_version=model._metadata.get("version", "unknown")
        )
    except RuntimeError as e:
        logging.error(f"Prediction failed: {e}", exc_info=True)
        return JSONResponse(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            content={"detail": "Model is not available. Please try again later."}
        )
    except Exception as e:
        logging.error(f"An unexpected error occurred during prediction: {e}", exc_info=True)
        return JSONResponse(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            content={"detail": "An internal server error occurred."}
        )
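tests/test_api.py: 契约冒烟测试
项目结构中列出的测试文件同样没有在原文展开。下面是一个基于pytest和FastAPI TestClient的最小冒烟测试示意(仅作假设性示例,并非团队的完整测试套件):先用临时目录手动初始化模型单例,再验证两个端点的基本契约。

# tests/test_api.py
# 假设性的最小冒烟测试:依赖 pytest 的 tmp_path fixture 提供一个真实存在的模型目录。
from pathlib import Path

from fastapi.testclient import TestClient

from app.main import app
from app.model_loader import initialize_model


def test_health_and_recommendations(tmp_path: Path):
    # 预先初始化单例,避免 lifespan 中因默认的 /mnt/models 路径不存在而加载失败
    initialize_model(tmp_path)

    with TestClient(app) as client:
        health = client.get("/health")
        assert health.status_code == 200
        assert health.json()["model_loaded"] is True

        resp = client.post(
            "/v1/recommendations",
            json={"user_id": "user-42", "session_context_ids": ["p1", "p2"]},
        )
        assert resp.status_code == 200
        body = resp.json()
        assert body["user_id"] == "user-42"
        assert len(body["recommendations"]) == 10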
第二步:使用Kubeflow进行容器化与部署
我们将上述FastAPI应用容器化,并使用KServe(Kubeflow生态中的模型服务组件,前身为KFServing)进行部署。这里的关键是KServe的InferenceService CRD,它简化了模型服务的部署,并集成了Knative Serving,提供了缩容到零的弹性能力。
Dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Pip behavior: disable the cache and version check, allow a longer timeout for slow networks
ENV PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY ./app ./app

# Expose the port the app runs on
EXPOSE 8080

# Command to run the application using uvicorn.
# KServe expects the service to be running on port 8080.
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
kserve-deployment.yaml
这是我们提交给Kubernetes集群的YAML文件,它定义了一个InferenceService。KServe的控制器会读取这个定义,并创建所有必要的Kubernetes资源(Deployment、Service、Knative Service等)。
# kserve-deployment.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "product-recommender"
  namespace: "ml-services"
spec:
  predictor:
    # We use a custom predictor since our server is self-contained
    containers:
      - name: kserve-container  # This name is mandatory
        image: gcr.io/our-project/recommender-service:sprint1-v0.1
        ports:
          - containerPort: 8080
            protocol: TCP
        env:
          - name: LOG_LEVEL
            value: "DEBUG"
          - name: MODEL_PATH
            # This path is where KServe's storage initializer will download and mount the model.
            # For this example, we assume the model is baked into the image or mounted via a PVC.
            value: "/mnt/models"
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
执行kubectl apply -f kserve-deployment.yaml之后,Kubeflow/KServe会在ml-services命名空间内创建一个名为product-recommender的服务,可以通过kubectl get isvc -n ml-services查看其状态。部署成功后,它会暴露一个集群内部的URL,例如http://product-recommender.ml-services.svc.cluster.local。这就是我们Vercel Function需要调用的目标。
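在把Vercel桥接层接进来之前,可以先对这个集群内部端点做一次快速的冒烟验证。下面是一个假设性的Python脚本示意:前提是已经通过kubectl port-forward或集群内跳板把服务映射到了本地8080端口,端口和转发方式都只是示例假设,脚本本身只验证请求/响应契约。

# smoke_check.py -- 假设性脚本:前提是服务已被转发到 localhost:8080
import requests

resp = requests.post(
    "http://localhost:8080/v1/recommendations",
    json={"user_id": "smoke-test-user", "session_context_ids": []},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # 期望返回 user_id、recommendations、model_version 三个字段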
第三步:Vercel Functions - 敏捷的桥接层
现在到了集成的核心部分。我们在Vercel项目(通常与前端代码库放在一起)的api目录下创建一个TypeScript函数,作为PHP应用和Python服务之间的中间人。
api/recommend.ts:
// api/recommend.ts
import type { VercelRequest, VercelResponse } from '@vercel/node';

// These should be set as Environment Variables in the Vercel project settings.
// A common mistake is to hardcode secrets.
const KUBEFLOW_SERVICE_URL = process.env.KUBEFLOW_RECOMMENDER_URL;
const SERVICE_AUTH_TOKEN = process.env.KUBEFLOW_SERVICE_AUTH_TOKEN;
const REQUEST_TIMEOUT_MS = 5000; // 5 seconds timeout

// Input validation schema (can be more robust with libraries like Zod)
interface RequestBody {
  userId: string;
  sessionContextIds?: string[];
}

function isValidRequest(body: any): body is RequestBody {
  return !!body && typeof body.userId === 'string' && body.userId.length > 0;
}

export default async function handler(
  req: VercelRequest,
  res: VercelResponse,
) {
  // 1. Method and Configuration Check
  if (req.method !== 'POST') {
    res.setHeader('Allow', 'POST');
    return res.status(405).send('Method Not Allowed');
  }
  if (!KUBEFLOW_SERVICE_URL || !SERVICE_AUTH_TOKEN) {
    console.error('Server configuration error: Missing Kubeflow service URL or auth token.');
    return res.status(500).json({ error: 'Internal server configuration error.' });
  }

  // 2. Input Validation
  const body = req.body;
  if (!isValidRequest(body)) {
    return res.status(400).json({ error: 'Invalid request body. `userId` is required.' });
  }

  // 3. Construct the payload for the Python service
  const payload = {
    user_id: body.userId,
    session_context_ids: body.sessionContextIds || [],
  };

  // 4. Call the internal Kubeflow service with a timeout controller
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), REQUEST_TIMEOUT_MS);

  try {
    console.log(`Forwarding request for user ${body.userId} to ${KUBEFLOW_SERVICE_URL}`);
    const backendResponse = await fetch(`${KUBEFLOW_SERVICE_URL}/v1/recommendations`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // In a real production setup, authentication would be more robust.
        // For Istio-enabled clusters, this might be a JWT token.
        'Authorization': `Bearer ${SERVICE_AUTH_TOKEN}`,
      },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    clearTimeout(timeoutId);

    // 5. Handle the response from the Python service
    if (!backendResponse.ok) {
      const errorBody = await backendResponse.text();
      console.error(`Backend service returned an error: ${backendResponse.status}`, errorBody);
      // Do not expose detailed internal errors to the client.
      return res.status(502).json({ error: 'Bad Gateway: The recommendation service failed.' });
    }

    const data = await backendResponse.json();
    // Set caching headers if applicable
    res.setHeader('Cache-Control', 's-maxage=60, stale-while-revalidate=120');
    return res.status(200).json(data);
  } catch (error: any) {
    clearTimeout(timeoutId);
    if (error.name === 'AbortError') {
      console.warn(`Request to backend service timed out after ${REQUEST_TIMEOUT_MS}ms`);
      return res.status(504).json({ error: 'Gateway Timeout: The recommendation service took too long to respond.' });
    }
    console.error('An unexpected error occurred while calling the backend service:', error);
    return res.status(500).json({ error: 'Internal Server Error' });
  }
}
这段代码体现了生产级函数的核心要素:
- 配置外部化: KUBEFLOW_RECOMMENDER_URL和SERVICE_AUTH_TOKEN从Vercel环境变量读取,保证了安全性。
- 严格的输入验证: 确保发往后端服务的请求是干净的。
- 超时处理: 使用AbortController防止函数因后端服务卡死而长时间运行,这对于控制Serverless成本和提供快速失败响应至关重要。
- 精细的错误处理: 区分了配置错误、客户端错误、后端服务错误和超时,向客户端返回适当的HTTP状态码和信息,同时在服务端日志中记录详细的内部错误。
- 日志记录: 在关键节点输出日志,便于在Vercel控制台进行调试。
整合与Scrum流程的反思
这个架构的最终形态可以用下面的图表来表示:
graph TD
    subgraph "Browser"
        A[User] -->|Interacts with page| B(Legacy PHP Application);
    end
    subgraph "Vercel Platform"
        B -->|AJAX call to /api/recommend| C{Vercel Function};
    end
    subgraph "Internal Network (VPC)"
        C -->|Secure API Call via VPC Peering or Service Connector| D(Kubernetes Ingress);
    end
    subgraph "Kubernetes Cluster"
        D -->|Routes traffic| E(KServe/Knative Service);
        E -->|Scales Pods 0 to N| F[Python FastAPI Pod];
        G[Kubeflow Pipelines] -.->|Deploys/Updates| E;
        H[Model Storage GCS/S3] -.->|Loads model| F;
    end
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#99f,stroke:#333,stroke-width:2px
    style F fill:#9f9,stroke:#333,stroke-width:2px
在我们的Scrum流程中,这种分解带来了巨大的好处。第一个Sprint我们只关注FastAPI的健康检查和Vercel Function的连通性;第二个Sprint实现了模型加载和伪预测逻辑;第三个Sprint则专注于集成真实模型并进行端到端测试。每个Sprint都有明确、可交付的增量价值,极大地降低了项目风险。例如,在Sprint 2的评审会上,我们发现模型加载的冷启动时间超过了15秒,这在Vercel Function的默认超时限制内是不可接受的。这个早期发现促使我们在下一个Sprint中加入了一个新的技术故事:研究KServe的minReplicas配置和模型预热策略,确保始终有一个热实例可用,从而解决了这个问题。
这个方案并非没有局限性。Vercel Functions自身也有冷启动问题,虽然通常比容器启动快得多,但在极端情况下仍可能增加延迟。管理跨越Vercel和GCP/AWS两个平台的网络和认证策略,需要清晰的文档和规范。此外,这种架构引入了一个额外的网络跳数,对于延迟极其敏感的应用(如高频交易)可能不适用。
未来的迭代方向很明确。随着我们将更多业务逻辑从PHP单体中剥离出来,这个Vercel Function层可以演变成一个成熟的BFF(Backend for Frontend),统一处理认证、数据聚合和裁剪,为前端提供更友好的API。同时,我们会持续监控模型服务的性能指标,利用Kubeflow的能力自动化模型的再训练和A/B测试部署流程,真正实现敏捷的MLOps闭环。