Architecture

docsfy is a single FastAPI service that combines four major subsystems:

  • an authenticated web/API control plane,
  • a SQLite-backed metadata layer,
  • an asynchronous AI documentation generation pipeline,
  • a static site renderer that emits HTML, Markdown, search data, and LLM index files.

High-Level Component Model

  • Application layer: src/docsfy/main.py
  • Storage layer: src/docsfy/storage.py
  • Generation pipeline: src/docsfy/generator.py, src/docsfy/repository.py, src/docsfy/prompts.py, src/docsfy/ai_client.py, src/docsfy/json_parser.py
  • Static renderer: src/docsfy/renderer.py, src/docsfy/templates/*, src/docsfy/static/*

End-to-end flow:

  1. POST /api/generate receives a GenerateRequest.
  2. Request is authorized (Bearer token or session cookie).
  3. Variant metadata is stored in SQLite (status=generating).
  4. A background asyncio task runs cloning/planning/page generation/rendering.
  5. Output site is written to filesystem under /data/projects/.../site.
  6. Variant status flips to ready.
  7. Docs are served from /docs/{project}/{provider}/{model}/....
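The flow above maps onto a simple client contract. The sketch below shows the request payload and the resulting docs URL; field names follow the excerpts in this document, but the full GenerateRequest schema is an assumption:

```python
# Hypothetical client-side sketch of the end-to-end flow. Field names mirror
# the GenerateRequest fields quoted in this document; the full schema is assumed.

def build_generate_payload(repo_url: str, ai_provider: str, ai_model: str,
                           force: bool = False) -> dict:
    """Build the JSON body for POST /api/generate."""
    return {
        "repo_url": repo_url,
        "ai_provider": ai_provider,
        "ai_model": ai_model,
        "force": force,
    }

def docs_url(project: str, provider: str, model: str,
             page: str = "index.html") -> str:
    """Where the rendered site is served once the variant status flips to 'ready'."""
    return f"/docs/{project}/{provider}/{model}/{page}"
```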

FastAPI App Architecture

The application enforces its startup requirements (a valid ADMIN_KEY), initializes database state, and registers the auth middleware globally:

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    settings = get_settings()
    if not settings.admin_key:
        logger.error("ADMIN_KEY environment variable is required")
        raise SystemExit(1)

    if len(settings.admin_key) < 16:
        logger.error("ADMIN_KEY must be at least 16 characters long")
        raise SystemExit(1)

    _generating.clear()
    await init_db(data_dir=settings.data_dir)
    await cleanup_expired_sessions()
    yield
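The startup checks can be exercised in isolation. This sketch extracts them into a standalone helper (the function name is hypothetical; the checks themselves mirror the lifespan code above):

```python
def validate_admin_key(admin_key: str) -> None:
    """Mirror the lifespan startup checks: ADMIN_KEY must be set and >= 16 chars.

    Raises SystemExit, matching the fail-fast behavior of the lifespan handler.
    """
    if not admin_key:
        raise SystemExit("ADMIN_KEY environment variable is required")
    if len(admin_key) < 16:
        raise SystemExit("ADMIN_KEY must be at least 16 characters long")
```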

Authentication is centralized in AuthMiddleware:

class AuthMiddleware(BaseHTTPMiddleware):
    """Authenticate every request via Bearer token or session cookie."""

    # Paths that do not require authentication
    _PUBLIC_PATHS = frozenset({"/login", "/login/", "/health"})

    async def dispatch(
        self, request: Request, call_next: RequestResponseEndpoint
    ) -> Response:
        if request.url.path in self._PUBLIC_PATHS:
            return await call_next(request)

        settings = get_settings()
        user = None
        is_admin = False
        username = ""

        # 1. Check Authorization header (API clients)
        auth_header = request.headers.get("authorization", "")
        if auth_header.startswith("Bearer "):
            token = auth_header[7:]
            if token == settings.admin_key:
                is_admin = True
                username = "admin"
            else:
                user = await get_user_by_key(token)

The generation endpoint uses a lock + in-memory task registry to prevent duplicate variant runs:

gen_key = f"{owner}/{project_name}/{ai_provider}/{ai_model}"
async with _gen_lock:
    if gen_key in _generating:
        raise HTTPException(
            status_code=409,
            detail=f"Variant '{project_name}/{ai_provider}/{ai_model}' is already being generated",
        )

    await save_project(
        name=project_name,
        repo_url=gen_request.repo_url or gen_request.repo_path or "",
        status="generating",
        ai_provider=ai_provider,
        ai_model=ai_model,
        owner=owner,
    )

    try:
        task = asyncio.create_task(
            _run_generation(
                repo_url=gen_request.repo_url,
                repo_path=gen_request.repo_path,
                project_name=project_name,
                ai_provider=ai_provider,
                ai_model=ai_model,
                ai_cli_timeout=gen_request.ai_cli_timeout
                or settings.ai_cli_timeout,
                force=gen_request.force,
                owner=owner,
            )
        )
        _generating[gen_key] = task

Note: Generated docs under /docs/... are still protected by middleware; only /login and /health are public.
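The lock-plus-registry pattern can be sketched in isolation. Names are illustrative, and the done-callback cleanup is an assumption about how completed tasks leave the registry:

```python
import asyncio

class GenerationRegistry:
    """Sketch of the duplicate-run guard: a lock serializes the check-and-register
    step, and a dict maps gen_key -> running task. Names are illustrative."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._generating: dict[str, asyncio.Task] = {}

    async def start_once(self, gen_key: str, coro) -> asyncio.Task:
        async with self._lock:
            if gen_key in self._generating:
                coro.close()  # release the unscheduled coroutine
                raise RuntimeError(f"'{gen_key}' is already being generated")
            task = asyncio.create_task(coro)
            self._generating[gen_key] = task
            # Assumed cleanup: drop the entry once the task finishes.
            task.add_done_callback(lambda t: self._generating.pop(gen_key, None))
            return task
```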

Static file serving is path-safe (prevents traversal beyond the variant site directory):

file_path = site_dir / path
try:
    file_path.resolve().relative_to(site_dir.resolve())
except ValueError as exc:
    raise HTTPException(status_code=403, detail="Access denied") from exc
if not file_path.exists() or not file_path.is_file():
    raise HTTPException(status_code=404, detail="File not found")
return FileResponse(file_path)
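The containment check can be verified on its own. This sketch wraps the resolve/relative_to idiom above into a predicate (the helper name is hypothetical):

```python
from pathlib import Path

def is_within(base: Path, candidate: Path) -> bool:
    """Return True if candidate resolves inside base (the traversal guard above).

    relative_to raises ValueError when the resolved candidate escapes base,
    which the endpoint translates into a 403.
    """
    try:
        candidate.resolve().relative_to(base.resolve())
    except ValueError:
        return False
    return True
```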

SQLite Storage Layer

The projects table is variant-scoped by (name, ai_provider, ai_model, owner):

CREATE TABLE IF NOT EXISTS projects (
    name TEXT NOT NULL,
    ai_provider TEXT NOT NULL DEFAULT '',
    ai_model TEXT NOT NULL DEFAULT '',
    owner TEXT NOT NULL DEFAULT '',
    repo_url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'generating',
    current_stage TEXT,
    last_commit_sha TEXT,
    last_generated TEXT,
    page_count INTEGER DEFAULT 0,
    error_message TEXT,
    plan_json TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (name, ai_provider, ai_model, owner)
)
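The composite primary key can be exercised with an in-memory database. The upsert below is a sketch of what save_project would need (column list trimmed, helper signature assumed):

```python
import sqlite3

# Trimmed version of the projects DDL, enough to exercise the composite key.
DDL = """
CREATE TABLE projects (
    name TEXT NOT NULL,
    ai_provider TEXT NOT NULL DEFAULT '',
    ai_model TEXT NOT NULL DEFAULT '',
    owner TEXT NOT NULL DEFAULT '',
    repo_url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'generating',
    PRIMARY KEY (name, ai_provider, ai_model, owner)
)
"""

def save_project(conn, name, provider, model, owner, repo_url, status):
    """Variant-scoped upsert: same (name, provider, model, owner) updates in place."""
    conn.execute(
        """INSERT INTO projects (name, ai_provider, ai_model, owner, repo_url, status)
           VALUES (?, ?, ?, ?, ?, ?)
           ON CONFLICT(name, ai_provider, ai_model, owner)
           DO UPDATE SET repo_url = excluded.repo_url, status = excluded.status""",
        (name, provider, model, owner, repo_url, status),
    )
```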

Additional tables:

  • users (role-based accounts, hashed API keys),
  • project_access (per-owner access grants),
  • sessions (hashed session tokens + expiry).

User key hashing uses HMAC with ADMIN_KEY as secret:

def hash_api_key(key: str, hmac_secret: str = "") -> str:
    """Hash an API key with HMAC-SHA256 for storage.

    Uses ADMIN_KEY as the HMAC secret so that even if the source is read,
    keys cannot be cracked without the environment secret.
    """
    # NOTE: ADMIN_KEY is used as the HMAC secret. Rotating ADMIN_KEY will
    # invalidate all existing api_key_hash values, requiring all users to
    # regenerate their API keys.
    secret = hmac_secret or os.getenv("ADMIN_KEY", "")
    if not secret:
        msg = "ADMIN_KEY environment variable is required for key hashing"
        raise RuntimeError(msg)
    return hmac.new(secret.encode(), key.encode(), hashlib.sha256).hexdigest()
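The rotation caveat follows directly from the construction: hashing the same key under a new secret yields a different digest. The demo below reproduces hash_api_key so the snippet is self-contained:

```python
import hashlib
import hmac
import os

def hash_api_key(key: str, hmac_secret: str = "") -> str:
    """Hash an API key with HMAC-SHA256 for storage (copied from above)."""
    secret = hmac_secret or os.getenv("ADMIN_KEY", "")
    if not secret:
        raise RuntimeError("ADMIN_KEY environment variable is required for key hashing")
    return hmac.new(secret.encode(), key.encode(), hashlib.sha256).hexdigest()

# Same key + same secret -> same digest; rotated secret -> different digest,
# which is why rotating ADMIN_KEY invalidates every stored api_key_hash.
```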

Project artifact paths are computed and sanitized:

def get_project_dir(
    name: str, ai_provider: str = "", ai_model: str = "", owner: str = ""
) -> Path:
    if not ai_provider or not ai_model:
        msg = "ai_provider and ai_model are required for project directory paths"
        raise ValueError(msg)
    # Sanitize path segments to prevent traversal
    for segment_name, segment in [("ai_provider", ai_provider), ("ai_model", ai_model)]:
        if (
            "/" in segment
            or "\\" in segment
            or ".." in segment
            or segment.startswith(".")
        ):
            msg = f"Invalid {segment_name}: '{segment}'"
            raise ValueError(msg)
    safe_owner = _validate_owner(owner)
    return PROJECTS_DIR / safe_owner / _validate_name(name) / ai_provider / ai_model
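The per-segment check can be pulled into a standalone predicate (the helper name is hypothetical; the rejected patterns mirror get_project_dir above):

```python
def is_safe_segment(segment: str) -> bool:
    """Mirror the get_project_dir checks: no separators, no '..', no leading dot."""
    return not (
        "/" in segment
        or "\\" in segment
        or ".." in segment
        or segment.startswith(".")
    )
```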

Warning: Rotating ADMIN_KEY invalidates existing api_key_hash records by design.

AI Generation Pipeline

Provider integration is intentionally delegated to ai-cli-runner:

from ai_cli_runner import (
    PROVIDERS,
    VALID_AI_PROVIDERS,
    ProviderConfig,
    call_ai_cli,
    check_ai_cli_available,
    get_ai_cli_timeout,
    run_parallel_with_limit,
)

Main staged flow (_generate_from_path) updates current_stage in DB while progressing through planning, generation, and rendering:

await update_project_status(
    project_name,
    ai_provider,
    ai_model,
    status="generating",
    owner=owner,
    current_stage="planning",
)

plan = await run_planner(
    repo_path=repo_dir,
    project_name=project_name,
    ai_provider=ai_provider,
    ai_model=ai_model,
    ai_cli_timeout=ai_cli_timeout,
)

plan["repo_url"] = source_url
await update_project_status(
    project_name,
    ai_provider,
    ai_model,
    status="generating",
    owner=owner,
    current_stage="generating_pages",
    plan_json=json.dumps(plan),
)

pages = await generate_all_pages(
    repo_path=repo_dir,
    plan=plan,
    cache_dir=cache_dir,
    ai_provider=ai_provider,
    ai_model=ai_model,
    ai_cli_timeout=ai_cli_timeout,
    use_cache=use_cache if use_cache else not force,
    project_name=project_name,
    owner=owner,
)
await update_project_status(
    project_name,
    ai_provider,
    ai_model,
    status="generating",
    owner=owner,
    current_stage="rendering",
    page_count=len(pages),
)

site_dir = get_project_site_dir(project_name, ai_provider, ai_model, owner)
render_site(plan=plan, pages=pages, output_dir=site_dir)
await update_project_status(
    project_name,
    ai_provider,
    ai_model,
    status="ready",
    owner=owner,
    current_stage=None,
    last_commit_sha=commit_sha,
    page_count=page_count,
    plan_json=json.dumps(plan),
)

Parallel page generation is bounded (MAX_CONCURRENT_PAGES = 5):

MAX_CONCURRENT_PAGES = 5
...
results = await run_parallel_with_limit(
    coroutines, max_concurrency=MAX_CONCURRENT_PAGES
)
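run_parallel_with_limit is provided by ai-cli-runner; a semaphore-based equivalent (an assumption about its behavior, not its actual implementation) could look like:

```python
import asyncio

MAX_CONCURRENT_PAGES = 5

async def run_parallel_with_limit(coroutines, max_concurrency: int):
    """Run coroutines concurrently, at most max_concurrency at a time.

    Results come back in submission order, as with asyncio.gather.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coroutines))
```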

Incremental regeneration uses git diff + AI page targeting:

changed_files = get_changed_files(repo_dir, old_sha, commit_sha)
...
pages_to_regen = await run_incremental_planner(
    repo_dir,
    project_name,
    ai_provider,
    ai_model,
    changed_files,
    existing_plan,
    ai_cli_timeout,
)
if pages_to_regen != ["all"]:
    # Delete only the cached pages that need regeneration
    for slug in pages_to_regen:
        ...
        cache_file = cache_dir / f"{slug}.md"
        ...
        if cache_file.exists():
            cache_file.unlink()
    use_cache = True
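The selective cache invalidation step above can be sketched as a standalone helper (the function name and return convention are hypothetical; the ["all"] sentinel and per-slug unlink come from the excerpt):

```python
from pathlib import Path

def invalidate_cached_pages(cache_dir: Path, pages_to_regen: list[str]) -> bool:
    """Delete cached pages needing regeneration; return whether the cache is usable.

    The planner's ["all"] sentinel means too much changed and the whole cache
    is stale, so the caller should fall back to a full rebuild.
    """
    if pages_to_regen == ["all"]:
        return False
    for slug in pages_to_regen:
        cache_file = cache_dir / f"{slug}.md"
        if cache_file.exists():
            cache_file.unlink()
    return True  # remaining cached pages can be reused
```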

Prompt construction explicitly requires source/config/test exploration and README avoidance:

def build_page_prompt(project_name: str, page_title: str, page_description: str) -> str:
    return f"""You are a technical documentation writer. Explore this repository to write
the "{page_title}" page for the {project_name} documentation.

Page description: {page_description}

Explore the codebase as needed. Read source files, configs, tests, and CI/CD pipelines
to write comprehensive, accurate documentation. Do NOT rely on the README.
...
"""

Tip: Use force=true in POST /api/generate to clear cached pages and force a full rebuild.

Static Site Renderer

The renderer converts Markdown to HTML with syntax highlighting and a table of contents, then sanitizes the generated HTML:

md = markdown.Markdown(
    extensions=["fenced_code", "codehilite", "tables", "toc"],
    extension_configs={
        "codehilite": {"css_class": "highlight", "guess_lang": False},
        "toc": {"toc_depth": "2-3"},
    },
)
content_html = _sanitize_html(md.convert(md_text))
toc_html = getattr(md, "toc", "")

URL attribute values (href/src) are allowlisted during sanitization (http, https, #, /, mailto):

def _sanitize_url_attr(match: re.Match) -> str:  # type: ignore[type-arg]
    attr = match.group(1)  # href or src
    quote = match.group(2)  # " or '
    url = match.group(3)  # the URL value
    ...
    if clean_url.startswith(("http://", "https://", "#", "/", "mailto:")):
        return match.group(0)  # Keep as-is
    # Block everything else (javascript:, data:, vbscript:, etc.)
    return f"{attr}={quote}#{quote}"
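The allowlist itself can be tested in isolation. This sketch checks only the URL value, ignoring the attribute-rewriting regex around it; the strip/lower normalization stands in for the elided cleanup step and is an assumption:

```python
ALLOWED_PREFIXES = ("http://", "https://", "#", "/", "mailto:")

def sanitize_url(url: str) -> str:
    """Keep allowlisted URLs; rewrite anything else (javascript:, data:, ...) to '#'.

    The strip/lower normalization here is an assumed stand-in for the elided
    clean_url step in _sanitize_url_attr.
    """
    clean = url.strip().lower()
    return url if clean.startswith(ALLOWED_PREFIXES) else "#"
```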

Site output includes static pages and machine-readable indexes:

# Prevent GitHub Pages from running Jekyll
(output_dir / ".nojekyll").touch()
...
(output_dir / "index.html").write_text(index_html, encoding="utf-8")
...
(output_dir / f"{slug}.html").write_text(page_html, encoding="utf-8")
(output_dir / f"{slug}.md").write_text(md_content, encoding="utf-8")
...
(output_dir / "search-index.json").write_text(
    json.dumps(search_index), encoding="utf-8"
)
...
(output_dir / "llms.txt").write_text(llms_txt, encoding="utf-8")
(output_dir / "llms-full.txt").write_text(llms_full_txt, encoding="utf-8")

The generated UI is enhanced client-side with static assets:

  • search.js (Cmd/Ctrl+K modal search over search-index.json),
  • copy.js (copy buttons on code blocks),
  • callouts.js (blockquote callout classes),
  • theme.js, scrollspy.js, codelabels.js, github.js.

Configuration and Runtime

App settings (Pydantic settings model):

class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )

    admin_key: str = ""  # Required — validated at startup
    ai_provider: str = "claude"
    ai_model: str = "claude-opus-4-6[1m]"  # [1m] = 1 million token context window
    ai_cli_timeout: int = Field(default=60, gt=0)
    log_level: str = "INFO"
    data_dir: str = "/data"
    secure_cookies: bool = True  # Set to False for local HTTP dev

Environment example:

ADMIN_KEY=your-secure-admin-key-here-min-16-chars

AI_PROVIDER=claude
AI_MODEL=claude-opus-4-6[1m]
AI_CLI_TIMEOUT=60

Container compose:

services:
  docsfy:
    build: .
    ports:
      - "8000:8000"
    env_file: .env
    volumes:
      - ./data:/data

Container entrypoint:

ENTRYPOINT ["uv", "run", "--no-sync", "uvicorn", "docsfy.main:app", "--host", "0.0.0.0", "--port", "8000"]

Note: ADMIN_KEY must be set and at least 16 characters, or startup exits.

Testing and CI/CD Posture

The repository has broad unit/integration coverage (tests/test_main.py, tests/test_storage.py, tests/test_generator.py, tests/test_renderer.py, tests/test_auth.py, tests/test_integration.py, etc.).

Local test pipeline (tox.toml):

[env.unittests]
deps = ["uv"]
commands = [["uv", "run", "--extra", "dev", "pytest", "-n", "auto", "tests"]]

Local quality/security checks (.pre-commit-config.yaml) include:

  • ruff + ruff-format,
  • mypy,
  • detect-secrets,
  • gitleaks,
  • flake8 (with project-specific plugin usage).

Warning: No in-repo hosted workflow definitions were found (for example, no .github/workflows), so remote CI/CD orchestration is external to this repository.