# Architecture

docsfy is a single FastAPI service that combines four major subsystems:

- an authenticated web/API control plane,
- a SQLite-backed metadata layer,
- an asynchronous AI documentation generation pipeline,
- a static site renderer that emits HTML, Markdown, search data, and LLM index files.
## High-Level Component Model
- Application layer: `src/docsfy/main.py`
- Storage layer: `src/docsfy/storage.py`
- Generation pipeline: `src/docsfy/generator.py`, `src/docsfy/repository.py`, `src/docsfy/prompts.py`, `src/docsfy/ai_client.py`, `src/docsfy/json_parser.py`
- Static renderer: `src/docsfy/renderer.py`, `src/docsfy/templates/*`, `src/docsfy/static/*`
End-to-end flow:

1. `POST /api/generate` receives a `GenerateRequest`.
2. The request is authorized (Bearer token or session cookie).
3. Variant metadata is stored in SQLite (`status=generating`).
4. A background `asyncio` task runs cloning, planning, page generation, and rendering.
5. The output site is written to the filesystem under `/data/projects/.../site`.
6. The variant status flips to `ready`.
7. Docs are served from `/docs/{project}/{provider}/{model}/...`.
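As a concrete illustration of the first step, a client request body might look like the following minimal sketch. The field names (`repo_url`, `repo_path`, `ai_cli_timeout`, `force`) are taken from the handler excerpts in this document; the values are placeholders:

```python
import json

# Hypothetical GenerateRequest body. Field names match those the endpoint
# reads (repo_url/repo_path, ai_cli_timeout, force); values are examples.
payload = {
    "repo_url": "https://github.com/example/project",  # or "repo_path" for a local checkout
    "ai_cli_timeout": 120,  # seconds; the endpoint falls back to settings.ai_cli_timeout
    "force": False,         # true clears cached pages and forces a full rebuild
}
body = json.dumps(payload)
```

The serialized `body` would then be sent to `POST /api/generate` with a Bearer token or session cookie attached.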
## FastAPI App Architecture
The application enforces startup requirements (`ADMIN_KEY`), initializes DB state, and adds auth middleware globally:

```python
@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    settings = get_settings()
    if not settings.admin_key:
        logger.error("ADMIN_KEY environment variable is required")
        raise SystemExit(1)
    if len(settings.admin_key) < 16:
        logger.error("ADMIN_KEY must be at least 16 characters long")
        raise SystemExit(1)
    _generating.clear()
    await init_db(data_dir=settings.data_dir)
    await cleanup_expired_sessions()
    yield
```
Authentication is centralized in `AuthMiddleware`:

```python
class AuthMiddleware(BaseHTTPMiddleware):
    """Authenticate every request via Bearer token or session cookie."""

    # Paths that do not require authentication
    _PUBLIC_PATHS = frozenset({"/login", "/login/", "/health"})

    async def dispatch(
        self, request: Request, call_next: RequestResponseEndpoint
    ) -> Response:
        if request.url.path in self._PUBLIC_PATHS:
            return await call_next(request)

        settings = get_settings()
        user = None
        is_admin = False
        username = ""

        # 1. Check Authorization header (API clients)
        auth_header = request.headers.get("authorization", "")
        if auth_header.startswith("Bearer "):
            token = auth_header[7:]
            if token == settings.admin_key:
                is_admin = True
                username = "admin"
            else:
                user = await get_user_by_key(token)
```
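The Bearer branch can be distilled into a small pure function. This is a hedged sketch, not the actual middleware: the session-cookie fallback and the `get_user_by_key` database lookup are omitted:

```python
def resolve_bearer(auth_header: str, admin_key: str) -> tuple[bool, str]:
    """Mirror the header check above: the admin key wins outright, and any
    other token would be resolved as a user API key (lookup not shown)."""
    if auth_header.startswith("Bearer "):
        token = auth_header[len("Bearer "):]
        if token == admin_key:
            return True, "admin"
        return False, token  # would go through get_user_by_key() in the real code
    return False, ""
```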
The generation endpoint uses a lock plus an in-memory task registry to prevent duplicate variant runs:

```python
gen_key = f"{owner}/{project_name}/{ai_provider}/{ai_model}"
async with _gen_lock:
    if gen_key in _generating:
        raise HTTPException(
            status_code=409,
            detail=f"Variant '{project_name}/{ai_provider}/{ai_model}' is already being generated",
        )
    await save_project(
        name=project_name,
        repo_url=gen_request.repo_url or gen_request.repo_path or "",
        status="generating",
        ai_provider=ai_provider,
        ai_model=ai_model,
        owner=owner,
    )
    try:
        task = asyncio.create_task(
            _run_generation(
                repo_url=gen_request.repo_url,
                repo_path=gen_request.repo_path,
                project_name=project_name,
                ai_provider=ai_provider,
                ai_model=ai_model,
                ai_cli_timeout=gen_request.ai_cli_timeout
                or settings.ai_cli_timeout,
                force=gen_request.force,
                owner=owner,
            )
        )
        _generating[gen_key] = task
```
> **Note:** Generated docs under `/docs/...` are still protected by the middleware; only `/login` and `/health` are public.
Static file serving is path-safe (prevents traversal beyond the variant site directory):

```python
file_path = site_dir / path
try:
    file_path.resolve().relative_to(site_dir.resolve())
except ValueError as exc:
    raise HTTPException(status_code=403, detail="Access denied") from exc
if not file_path.exists() or not file_path.is_file():
    raise HTTPException(status_code=404, detail="File not found")
return FileResponse(file_path)
```
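The same guard can be exercised standalone. A minimal sketch assuming only `pathlib`, with the `HTTPException` plumbing replaced by a boolean:

```python
from pathlib import Path

def is_within_site(site_dir: Path, requested: str) -> bool:
    """True only if the resolved candidate stays inside site_dir,
    mirroring the relative_to() traversal check above."""
    candidate = (site_dir / requested).resolve()
    try:
        candidate.relative_to(site_dir.resolve())
    except ValueError:
        return False
    return True
```

A request like `../etc/passwd` resolves outside the site directory and is rejected before any filesystem read happens.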
## SQLite Storage Layer
The `projects` table is variant-scoped by `(name, ai_provider, ai_model, owner)`:

```sql
CREATE TABLE IF NOT EXISTS projects (
    name TEXT NOT NULL,
    ai_provider TEXT NOT NULL DEFAULT '',
    ai_model TEXT NOT NULL DEFAULT '',
    owner TEXT NOT NULL DEFAULT '',
    repo_url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'generating',
    current_stage TEXT,
    last_commit_sha TEXT,
    last_generated TEXT,
    page_count INTEGER DEFAULT 0,
    error_message TEXT,
    plan_json TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (name, ai_provider, ai_model, owner)
)
```
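The composite primary key is what lets the same project name coexist across providers and models. A small sketch with Python's `sqlite3`, in memory, with the table abbreviated to the key columns plus `status`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """CREATE TABLE projects (
        name TEXT NOT NULL,
        ai_provider TEXT NOT NULL DEFAULT '',
        ai_model TEXT NOT NULL DEFAULT '',
        owner TEXT NOT NULL DEFAULT '',
        status TEXT NOT NULL DEFAULT 'generating',
        PRIMARY KEY (name, ai_provider, ai_model, owner)
    )"""
)
# The same project name under two providers: two distinct rows, no conflict.
con.execute("INSERT INTO projects (name, ai_provider, ai_model) VALUES ('docs', 'claude', 'model-a')")
con.execute("INSERT INTO projects (name, ai_provider, ai_model) VALUES ('docs', 'openai', 'model-b')")
count = con.execute("SELECT COUNT(*) FROM projects WHERE name = 'docs'").fetchone()[0]
```

Inserting the exact same `(name, ai_provider, ai_model, owner)` tuple twice would raise an integrity error, which is why `save_project` upserts instead.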
Additional tables:

- `users` (role-based accounts, hashed API keys),
- `project_access` (per-owner access grants),
- `sessions` (hashed session tokens plus expiry).
User API keys are hashed with HMAC-SHA256, using `ADMIN_KEY` as the secret:

```python
def hash_api_key(key: str, hmac_secret: str = "") -> str:
    """Hash an API key with HMAC-SHA256 for storage.

    Uses ADMIN_KEY as the HMAC secret so that even if the source is read,
    keys cannot be cracked without the environment secret.
    """
    # NOTE: ADMIN_KEY is used as the HMAC secret. Rotating ADMIN_KEY will
    # invalidate all existing api_key_hash values, requiring all users to
    # regenerate their API keys.
    secret = hmac_secret or os.getenv("ADMIN_KEY", "")
    if not secret:
        msg = "ADMIN_KEY environment variable is required for key hashing"
        raise RuntimeError(msg)
    return hmac.new(secret.encode(), key.encode(), hashlib.sha256).hexdigest()
```
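Verification on the read path is then a recompute-and-compare. A hedged sketch (the actual lookup inside `get_user_by_key` is not shown in this document) using a constant-time comparison:

```python
import hashlib
import hmac

def hash_api_key(key: str, secret: str) -> str:
    # Same construction as above: HMAC-SHA256 over the key, hex-encoded.
    return hmac.new(secret.encode(), key.encode(), hashlib.sha256).hexdigest()

def verify_api_key(candidate: str, stored_hash: str, secret: str) -> bool:
    # compare_digest avoids timing side channels when comparing digests.
    return hmac.compare_digest(hash_api_key(candidate, secret), stored_hash)
```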
Project artifact paths are computed and sanitized:

```python
def get_project_dir(
    name: str, ai_provider: str = "", ai_model: str = "", owner: str = ""
) -> Path:
    if not ai_provider or not ai_model:
        msg = "ai_provider and ai_model are required for project directory paths"
        raise ValueError(msg)
    # Sanitize path segments to prevent traversal
    for segment_name, segment in [("ai_provider", ai_provider), ("ai_model", ai_model)]:
        if (
            "/" in segment
            or "\\" in segment
            or ".." in segment
            or segment.startswith(".")
        ):
            msg = f"Invalid {segment_name}: '{segment}'"
            raise ValueError(msg)
    safe_owner = _validate_owner(owner)
    return PROJECTS_DIR / safe_owner / _validate_name(name) / ai_provider / ai_model
```
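The per-segment rules reduce to a single predicate. A standalone sketch of the same checks (the empty-segment rejection replicates the earlier `if not ai_provider or not ai_model` guard):

```python
def is_safe_segment(segment: str) -> bool:
    """Reject empty segments, path separators, parent references, and
    hidden-file prefixes, matching the checks in get_project_dir above."""
    if not segment:
        return False
    return not (
        "/" in segment
        or "\\" in segment
        or ".." in segment
        or segment.startswith(".")
    )
```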
> **Warning:** Rotating `ADMIN_KEY` invalidates existing `api_key_hash` records by design.
## AI Generation Pipeline
Provider integration is intentionally delegated to `ai-cli-runner`:

```python
from ai_cli_runner import (
    PROVIDERS,
    VALID_AI_PROVIDERS,
    ProviderConfig,
    call_ai_cli,
    check_ai_cli_available,
    get_ai_cli_timeout,
    run_parallel_with_limit,
)
```
The main staged flow (`_generate_from_path`) updates `current_stage` in the DB while progressing through planning, generation, and rendering:

```python
await update_project_status(
    project_name,
    ai_provider,
    ai_model,
    status="generating",
    owner=owner,
    current_stage="planning",
)
plan = await run_planner(
    repo_path=repo_dir,
    project_name=project_name,
    ai_provider=ai_provider,
    ai_model=ai_model,
    ai_cli_timeout=ai_cli_timeout,
)
plan["repo_url"] = source_url

await update_project_status(
    project_name,
    ai_provider,
    ai_model,
    status="generating",
    owner=owner,
    current_stage="generating_pages",
    plan_json=json.dumps(plan),
)
pages = await generate_all_pages(
    repo_path=repo_dir,
    plan=plan,
    cache_dir=cache_dir,
    ai_provider=ai_provider,
    ai_model=ai_model,
    ai_cli_timeout=ai_cli_timeout,
    use_cache=use_cache if use_cache else not force,
    project_name=project_name,
    owner=owner,
)

await update_project_status(
    project_name,
    ai_provider,
    ai_model,
    status="generating",
    owner=owner,
    current_stage="rendering",
    page_count=len(pages),
)
site_dir = get_project_site_dir(project_name, ai_provider, ai_model, owner)
render_site(plan=plan, pages=pages, output_dir=site_dir)

await update_project_status(
    project_name,
    ai_provider,
    ai_model,
    status="ready",
    owner=owner,
    current_stage=None,
    last_commit_sha=commit_sha,
    page_count=page_count,
    plan_json=json.dumps(plan),
)
```
Parallel page generation is bounded (`MAX_CONCURRENT_PAGES = 5`):

```python
MAX_CONCURRENT_PAGES = 5
...
results = await run_parallel_with_limit(
    coroutines, max_concurrency=MAX_CONCURRENT_PAGES
)
```
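`run_parallel_with_limit` comes from `ai-cli-runner` and its internals are not shown here; a plausible sketch of such a bounded gather, built on an `asyncio.Semaphore`, looks like this:

```python
import asyncio

async def bounded_gather(coroutines, max_concurrency: int):
    """Run coroutines concurrently, at most max_concurrency at a time.
    Sketch only; the real helper lives in ai-cli-runner."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coroutines))

async def _demo() -> list[int]:
    async def page(n: int) -> int:
        await asyncio.sleep(0)  # stand-in for an AI CLI call
        return n
    return await bounded_gather([page(i) for i in range(10)], max_concurrency=5)

results = asyncio.run(_demo())
```

Because `asyncio.gather` preserves input order, results line up with the submitted page coroutines regardless of completion order.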
Incremental regeneration uses git diff plus AI page targeting:

```python
changed_files = get_changed_files(repo_dir, old_sha, commit_sha)
...
pages_to_regen = await run_incremental_planner(
    repo_dir,
    project_name,
    ai_provider,
    ai_model,
    changed_files,
    existing_plan,
    ai_cli_timeout,
)
if pages_to_regen != ["all"]:
    # Delete only the cached pages that need regeneration
    for slug in pages_to_regen:
        ...
        cache_file = cache_dir / f"{slug}.md"
        ...
        if cache_file.exists():
            cache_file.unlink()
    use_cache = True
```
Prompt construction explicitly requires exploration of source, config, and test files, and discourages relying on the README:

```python
def build_page_prompt(project_name: str, page_title: str, page_description: str) -> str:
    return f"""You are a technical documentation writer. Explore this repository to write
the "{page_title}" page for the {project_name} documentation.

Page description: {page_description}

Explore the codebase as needed. Read source files, configs, tests, and CI/CD pipelines
to write comprehensive, accurate documentation. Do NOT rely on the README.
...
"""
```
> **Tip:** Use `force=true` in `POST /api/generate` to clear cached pages and force a full rebuild.
## Static Site Renderer
The renderer converts Markdown to HTML with syntax highlighting and a table of contents, then sanitizes the generated HTML:

```python
md = markdown.Markdown(
    extensions=["fenced_code", "codehilite", "tables", "toc"],
    extension_configs={
        "codehilite": {"css_class": "highlight", "guess_lang": False},
        "toc": {"toc_depth": "2-3"},
    },
)
content_html = _sanitize_html(md.convert(md_text))
toc_html = getattr(md, "toc", "")
```
URL attributes are allowlisted during sanitization (`http`, `https`, `#`, `/`, `mailto:`):

```python
def _sanitize_url_attr(match: re.Match) -> str:  # type: ignore[type-arg]
    attr = match.group(1)   # href or src
    quote = match.group(2)  # " or '
    url = match.group(3)    # the URL value
    ...
    if clean_url.startswith(("http://", "https://", "#", "/", "mailto:")):
        return match.group(0)  # Keep as-is
    # Block everything else (javascript:, data:, vbscript:, etc.)
    return f"{attr}={quote}#{quote}"
```
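The scheme check itself is easy to exercise in isolation. A sketch of an equivalent predicate; note that the real `_sanitize_url_attr` derives `clean_url` from `url` in elided code, so the strip/lowercase normalization here is an assumption:

```python
ALLOWED_PREFIXES = ("http://", "https://", "#", "/", "mailto:")

def url_is_allowed(url: str) -> bool:
    # Strip/lowercase approximates whatever normalization the elided
    # clean_url step performs; treat this as an assumption, not the
    # renderer's exact behavior.
    return url.strip().lower().startswith(ALLOWED_PREFIXES)
```

Anything outside the allowlist (`javascript:`, `data:`, `vbscript:`, protocol-relative tricks) falls through to the blocking branch, which rewrites the attribute to a harmless `#`.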
Site output includes static pages and machine-readable indexes:

```python
# Prevent GitHub Pages from running Jekyll
(output_dir / ".nojekyll").touch()
...
(output_dir / "index.html").write_text(index_html, encoding="utf-8")
...
(output_dir / f"{slug}.html").write_text(page_html, encoding="utf-8")
(output_dir / f"{slug}.md").write_text(md_content, encoding="utf-8")
...
(output_dir / "search-index.json").write_text(
    json.dumps(search_index), encoding="utf-8"
)
...
(output_dir / "llms.txt").write_text(llms_txt, encoding="utf-8")
(output_dir / "llms-full.txt").write_text(llms_full_txt, encoding="utf-8")
```
The generated UI is enhanced client-side with static assets:

- `search.js` (Cmd/Ctrl+K modal search over `search-index.json`),
- `copy.js` (copy buttons on code blocks),
- `callouts.js` (blockquote callout classes),
- `theme.js`, `scrollspy.js`, `codelabels.js`, `github.js`.
## Configuration and Runtime
App settings (Pydantic settings model):

```python
class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )

    admin_key: str = ""  # Required — validated at startup
    ai_provider: str = "claude"
    ai_model: str = "claude-opus-4-6[1m]"  # [1m] = 1 million token context window
    ai_cli_timeout: int = Field(default=60, gt=0)
    log_level: str = "INFO"
    data_dir: str = "/data"
    secure_cookies: bool = True  # Set to False for local HTTP dev
```
Environment example:

```
ADMIN_KEY=your-secure-admin-key-here-min-16-chars
AI_PROVIDER=claude
AI_MODEL=claude-opus-4-6[1m]
AI_CLI_TIMEOUT=60
```
Container compose file:

```yaml
services:
  docsfy:
    build: .
    ports:
      - "8000:8000"
    env_file: .env
    volumes:
      - ./data:/data
```
Container entrypoint:

```dockerfile
ENTRYPOINT ["uv", "run", "--no-sync", "uvicorn", "docsfy.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
> **Note:** `ADMIN_KEY` must be set and at least 16 characters long, or startup exits.
## Testing and CI/CD Posture
The repository has broad unit/integration coverage (`tests/test_main.py`, `tests/test_storage.py`, `tests/test_generator.py`, `tests/test_renderer.py`, `tests/test_auth.py`, `tests/test_integration.py`, etc.).
Local test pipeline (`tox.toml`):

```toml
[env.unittests]
deps = ["uv"]
commands = [["uv", "run", "--extra", "dev", "pytest", "-n", "auto", "tests"]]
```
Local quality/security checks (`.pre-commit-config.yaml`) include:

- ruff + ruff-format,
- mypy,
- detect-secrets,
- gitleaks,
- flake8 (with project-specific plugin usage).
> **Warning:** No in-repo hosted workflow definitions were found (for example, no `.github/workflows`), so remote CI/CD orchestration is external to this repository.