Storage and Caching
docsfy uses a two-tier persistence strategy: a SQLite database for project metadata and the local filesystem for generated content. This separation keeps structured queries fast while allowing large markdown and HTML files to live on disk where they can be served directly.
SQLite Database
Location
The database file is stored at {DATA_DIR}/docsfy.db, where DATA_DIR defaults to /data and can be overridden with the DATA_DIR environment variable:
# src/docsfy/storage.py
DB_PATH = Path(os.getenv("DATA_DIR", "/data")) / "docsfy.db"
In Docker deployments, this path lives inside the mounted volume:
# docker-compose.yaml
volumes:
- ./data:/data
Schema
docsfy uses a single projects table, created automatically on application startup via init_db():
CREATE TABLE IF NOT EXISTS projects (
name TEXT PRIMARY KEY,
repo_url TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'generating',
last_commit_sha TEXT,
last_generated TEXT,
page_count INTEGER DEFAULT 0,
error_message TEXT,
plan_json TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
| Column | Type | Description |
|---|---|---|
| name | TEXT PRIMARY KEY | Project identifier, derived from the repository name |
| repo_url | TEXT NOT NULL | Git clone URL or local path used as source |
| status | TEXT NOT NULL | Current state: generating, ready, or error |
| last_commit_sha | TEXT | Git SHA at which documentation was last generated |
| last_generated | TEXT | Timestamp of last successful generation |
| page_count | INTEGER | Number of documentation pages produced |
| error_message | TEXT | Error details when status = 'error' |
| plan_json | TEXT | Serialized JSON of the documentation plan |
| created_at | TIMESTAMP | Record creation time |
| updated_at | TIMESTAMP | Last modification time |
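As a quick way to see the column defaults in action, here is a minimal sketch that creates the table and inserts one row. It uses the standard-library sqlite3 module and an in-memory database so it runs anywhere; the body of the real init_db() is not shown on this page, so init_db_sketch is a hypothetical stand-in rather than the actual implementation.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS projects (
    name TEXT PRIMARY KEY,
    repo_url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'generating',
    last_commit_sha TEXT,
    last_generated TEXT,
    page_count INTEGER DEFAULT 0,
    error_message TEXT,
    plan_json TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""


def init_db_sketch(conn: sqlite3.Connection) -> None:
    """Hypothetical synchronous stand-in for init_db()."""
    conn.execute(SCHEMA)
    conn.commit()


conn = sqlite3.connect(":memory:")
init_db_sketch(conn)
init_db_sketch(conn)  # IF NOT EXISTS makes a second call harmless

# Only the two NOT NULL columns without defaults have to be supplied.
conn.execute(
    "INSERT INTO projects (name, repo_url) VALUES (?, ?)",
    ("demo", "https://example.com/demo.git"),
)
row = conn.execute(
    "SELECT status, page_count, error_message FROM projects WHERE name = ?",
    ("demo",),
).fetchone()
print(row)
```

A freshly inserted row starts in the generating state with a page count of zero and no error message.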
Status Lifecycle
The status column is validated against a fixed set of allowed values:
VALID_STATUSES = frozenset({"generating", "ready", "error"})
A project progresses through these states during generation:
generating ──► ready
    │
    └──────► error
- generating: Set when a generation request is accepted. Updated incrementally as pages are produced (page_count increases).
- ready: Set once all pages are generated and the HTML site is rendered.
- error: Set if any unrecoverable failure occurs. The error_message column captures the details.
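A check against VALID_STATUSES could look like the sketch below. The helper name validate_status and the wording of its error message are assumptions; this page only states that the column is validated against the fixed set.

```python
VALID_STATUSES = frozenset({"generating", "ready", "error"})


def validate_status(status: str) -> str:
    """Hypothetical helper: reject any value outside the allowed set."""
    if status not in VALID_STATUSES:
        raise ValueError(f"Invalid status: '{status}'")
    return status


print(validate_status("ready"))
```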
Database Operations
All database access is asynchronous via aiosqlite. Every query uses parameterized placeholders to prevent SQL injection:
# src/docsfy/storage.py
async def save_project(name: str, repo_url: str, status: str = "generating") -> None:
    async with aiosqlite.connect(DB_PATH) as db:
        await db.execute(
            """INSERT INTO projects (name, repo_url, status, updated_at)
               VALUES (?, ?, ?, CURRENT_TIMESTAMP)
               ON CONFLICT(name) DO UPDATE SET
                   repo_url = excluded.repo_url,
                   status = excluded.status,
                   error_message = NULL,
                   updated_at = CURRENT_TIMESTAMP""",
            (name, repo_url, status),
        )
        await db.commit()
The save_project function uses INSERT ... ON CONFLICT (upsert) so that re-generating an existing project updates the record rather than failing.
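To illustrate the upsert behavior, the sketch below runs the same SQL statement through the synchronous sqlite3 module against a trimmed-down table. The function save_project_sync and the reduced schema are assumptions made for the demonstration; ON CONFLICT requires SQLite 3.24 or newer.

```python
import sqlite3

UPSERT = """
INSERT INTO projects (name, repo_url, status, updated_at)
VALUES (?, ?, ?, CURRENT_TIMESTAMP)
ON CONFLICT(name) DO UPDATE SET
    repo_url = excluded.repo_url,
    status = excluded.status,
    error_message = NULL,
    updated_at = CURRENT_TIMESTAMP
"""


def save_project_sync(conn, name, repo_url, status="generating"):
    """Hypothetical synchronous counterpart of save_project()."""
    conn.execute(UPSERT, (name, repo_url, status))
    conn.commit()


conn = sqlite3.connect(":memory:")
# Reduced schema: only the columns the upsert touches.
conn.execute(
    """CREATE TABLE projects (
           name TEXT PRIMARY KEY,
           repo_url TEXT NOT NULL,
           status TEXT NOT NULL DEFAULT 'generating',
           error_message TEXT,
           updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"""
)

save_project_sync(conn, "demo", "https://example.com/old.git")
# Simulate a failed run that left an error behind.
conn.execute(
    "UPDATE projects SET status = 'error', error_message = 'clone failed' "
    "WHERE name = 'demo'"
)
# Saving the same project again updates the row instead of raising.
save_project_sync(conn, "demo", "https://example.com/new.git")

rows = conn.execute(
    "SELECT name, repo_url, status, error_message FROM projects"
).fetchall()
print(rows)
```

The second save leaves a single row, resets the status to generating, and clears the stale error message.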
The update_project_status function builds its SET clause dynamically, but only from hardcoded column names — never from user input:
fields = ["status = ?", "updated_at = CURRENT_TIMESTAMP"]
values: list[str | int | None] = [status]
if last_commit_sha is not None:
    fields.append("last_commit_sha = ?")
    values.append(last_commit_sha)
# ... additional optional fields
Note: The plan_json column is updated early in the generation pipeline so that API consumers can inspect the documentation structure while pages are still being generated.
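The dynamic SET clause described above can be sketched end to end as follows. This is an assumption-laden illustration: it uses synchronous sqlite3 instead of aiosqlite, the optional parameters beyond last_commit_sha are guesses based on the schema, and the table is reduced to the columns involved.

```python
import sqlite3
from typing import Optional


def update_project_status_sketch(
    conn: sqlite3.Connection,
    name: str,
    status: str,
    last_commit_sha: Optional[str] = None,
    page_count: Optional[int] = None,
    error_message: Optional[str] = None,
) -> None:
    """Hypothetical sketch: column names are literals, values are bound."""
    fields = ["status = ?", "updated_at = CURRENT_TIMESTAMP"]
    values = [status]
    if last_commit_sha is not None:
        fields.append("last_commit_sha = ?")
        values.append(last_commit_sha)
    if page_count is not None:
        fields.append("page_count = ?")
        values.append(page_count)
    if error_message is not None:
        fields.append("error_message = ?")
        values.append(error_message)
    values.append(name)  # bound to the WHERE clause placeholder
    conn.execute(
        f"UPDATE projects SET {', '.join(fields)} WHERE name = ?", values
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE projects (
           name TEXT PRIMARY KEY, status TEXT, last_commit_sha TEXT,
           page_count INTEGER, error_message TEXT, updated_at TIMESTAMP)"""
)
conn.execute("INSERT INTO projects (name, status) VALUES ('demo', 'generating')")

update_project_status_sketch(
    conn, "demo", "ready", last_commit_sha="abc123", page_count=4
)
row = conn.execute(
    "SELECT status, last_commit_sha, page_count, error_message "
    "FROM projects WHERE name = 'demo'"
).fetchone()
print(row)
```

Only the fields that were passed end up in the statement, so columns left at None keep their previous values.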
Filesystem Layout
Each project gets its own directory tree under {DATA_DIR}/projects/ (/data/projects/ with the default setting). The three-level structure separates the documentation plan, cached intermediate files, and the final rendered site:
/data/
├── docsfy.db                        # SQLite database
└── projects/
    └── {project-name}/
        ├── plan.json                # Documentation structure
        ├── cache/
        │   └── pages/
        │       ├── introduction.md  # Cached AI-generated markdown
        │       ├── quickstart.md
        │       ├── configuration.md
        │       └── *.md
        └── site/                    # Final rendered output
            ├── index.html           # Landing page
            ├── introduction.html    # Rendered content pages
            ├── introduction.md      # Raw markdown copies
            ├── quickstart.html
            ├── quickstart.md
            ├── assets/
            │   ├── style.css
            │   ├── search.js
            │   ├── theme.js
            │   ├── copy.js
            │   ├── callouts.js
            │   ├── codelabels.js
            │   ├── scrollspy.js
            │   └── github.js
            ├── search-index.json    # Client-side search data
            ├── llms.txt             # AI-readable page index
            └── llms-full.txt        # All pages concatenated
Path Helper Functions
Three functions in storage.py provide consistent access to each level of the tree. All three validate the project name before constructing a path:
# src/docsfy/storage.py
def get_project_dir(name: str) -> Path:
    return PROJECTS_DIR / _validate_name(name)

def get_project_site_dir(name: str) -> Path:
    return PROJECTS_DIR / _validate_name(name) / "site"

def get_project_cache_dir(name: str) -> Path:
    return PROJECTS_DIR / _validate_name(name) / "cache" / "pages"
plan.json
The documentation plan is a JSON file written to the project root after generation completes. It captures the project's structure, navigation hierarchy, and metadata:
{
"project_name": "my-library",
"tagline": "A fast data processing library",
"repo_url": "https://github.com/org/my-library",
"navigation": [
{
"group": "Getting Started",
"pages": [
{
"slug": "introduction",
"title": "Introduction",
"description": "Overview of the library"
},
{
"slug": "quickstart",
"title": "Quick Start",
"description": "Get started fast"
}
]
}
]
}
This file is also stored in the database's plan_json column for fast API access without filesystem reads.
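A consumer of the plan typically needs the flat list of pages. The sketch below walks the navigation groups of the example plan; the helper name iter_pages is hypothetical, and the plan fields are taken from the example above.

```python
import json

PLAN_JSON = """
{
  "project_name": "my-library",
  "tagline": "A fast data processing library",
  "repo_url": "https://github.com/org/my-library",
  "navigation": [
    {
      "group": "Getting Started",
      "pages": [
        {"slug": "introduction", "title": "Introduction",
         "description": "Overview of the library"},
        {"slug": "quickstart", "title": "Quick Start",
         "description": "Get started fast"}
      ]
    }
  ]
}
"""


def iter_pages(plan):
    """Yield (group name, page dict) pairs in navigation order."""
    for group in plan.get("navigation", []):
        for page in group.get("pages", []):
            yield group["group"], page


plan = json.loads(PLAN_JSON)
slugs = [page["slug"] for _, page in iter_pages(plan)]
print(slugs)
```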
site/ Directory
The renderer (renderer.py) produces the final HTML site by cleaning the output directory and rebuilding it from scratch:
# src/docsfy/renderer.py
def render_site(plan: dict[str, Any], pages: dict[str, str], output_dir: Path) -> None:
    if output_dir.exists():
        shutil.rmtree(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
Each page slug produces two files: {slug}.html (rendered HTML) and {slug}.md (raw markdown). The site also includes:
- search-index.json: An array of objects with slug, title, and content (truncated to 2000 characters) for client-side search.
- llms.txt: An index of all pages in a format optimized for LLM consumption.
- llms-full.txt: The complete concatenated markdown of every page.
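The search index described above could be assembled roughly as follows. This is a sketch under assumptions: build_search_index and MAX_CONTENT_CHARS are illustrative names, and the renderer may derive content differently (for example by stripping markdown first).

```python
import json

MAX_CONTENT_CHARS = 2000  # truncation limit stated in the text above


def build_search_index(plan, pages):
    """Return one {slug, title, content} entry per page in the plan."""
    index = []
    for group in plan.get("navigation", []):
        for page in group.get("pages", []):
            slug = page["slug"]
            index.append(
                {
                    "slug": slug,
                    "title": page["title"],
                    "content": pages.get(slug, "")[:MAX_CONTENT_CHARS],
                }
            )
    return index


plan = {
    "navigation": [
        {
            "group": "Getting Started",
            "pages": [
                {"slug": "introduction", "title": "Introduction"},
                {"slug": "quickstart", "title": "Quick Start"},
            ],
        }
    ]
}
pages = {"introduction": "# Introduction\n\n" + "x" * 5000, "quickstart": "# Quick Start"}

index = build_search_index(plan, pages)
serialized = json.dumps(index)  # what would be written to search-index.json
print([(entry["slug"], len(entry["content"])) for entry in index])
```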
Page-Level Caching Strategy
docsfy caches each documentation page individually as a markdown file under cache/pages/. This allows the system to skip expensive AI generation calls for pages that have already been produced.
Cache Lookup
When generating a page, the system first checks whether a cached version exists:
# src/docsfy/generator.py
cache_file = cache_dir / f"{slug}.md"
if use_cache and cache_file.exists():
    return cache_file.read_text(encoding="utf-8")
If use_cache is True and the file exists, the cached content is returned immediately — no AI call is made.
Cache Write
After a page is generated by the AI, it is always written to the cache directory:
# src/docsfy/generator.py
output = _strip_ai_preamble(output)
cache_dir.mkdir(parents=True, exist_ok=True)
cache_file.write_text(output, encoding="utf-8")
Note: The _strip_ai_preamble() function removes any AI "thinking" or planning text that appears before the first markdown heading, ensuring clean content in the cache.
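Putting the lookup and write steps together, the flow might look like the sketch below. The function names, the generate callback, and the heading regex are assumptions; the real _strip_ai_preamble() may detect headings differently.

```python
import re
import tempfile
from pathlib import Path


def strip_ai_preamble_sketch(text: str) -> str:
    """Drop everything before the first markdown heading, if there is one."""
    match = re.search(r"^#{1,6} ", text, flags=re.MULTILINE)
    return text[match.start():] if match else text


def generate_page_sketch(slug, cache_dir, generate, use_cache=True):
    """Return the cached page when present; otherwise generate and cache it."""
    cache_file = cache_dir / f"{slug}.md"
    if use_cache and cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    output = strip_ai_preamble_sketch(generate(slug))
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(output, encoding="utf-8")
    return output


calls = []


def fake_ai(slug):
    """Stand-in for the expensive AI call; records each invocation."""
    calls.append(slug)
    return "Let me plan this page first.\n\n# Introduction\n\nHello."


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "cache" / "pages"
    first = generate_page_sketch("introduction", cache_dir, fake_ai)
    second = generate_page_sketch("introduction", cache_dir, fake_ai)

print(first == second, calls)
```

The second call is served from the cache file, so the stand-in generator runs only once.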
Progress Tracking
After each page is written to cache, the current count of cached .md files is reported back to the database so API consumers can track generation progress:
# src/docsfy/generator.py
existing_pages = len(list(cache_dir.glob("*.md")))
await update_project_status(
    project_name, status="generating", page_count=existing_pages
)
Concurrent Generation
Pages are generated concurrently with a configurable concurrency limit. Each page independently checks its own cache, so a mix of cached and uncached pages is handled efficiently:
# src/docsfy/generator.py
MAX_CONCURRENT_PAGES = 5

results = await run_parallel_with_limit(
    coroutines, max_concurrency=MAX_CONCURRENT_PAGES
)
Failed pages receive placeholder content rather than blocking the entire generation:
pages[page_info["slug"]] = (
    f"# {page_info['title']}\n\n*Documentation generation failed.*"
)
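One common way to bound concurrency in asyncio is a semaphore wrapped around each coroutine. The sketch below is not the real run_parallel_with_limit, whose body is not shown here; in particular, returning exceptions in place of results is an assumption chosen so that failed pages can be replaced with placeholders afterwards.

```python
import asyncio


async def run_parallel_with_limit_sketch(coroutines, max_concurrency):
    """Run coroutines with at most max_concurrency in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        async with semaphore:
            return await coro

    # return_exceptions=True keeps one failure from cancelling the rest.
    return await asyncio.gather(
        *(guarded(c) for c in coroutines), return_exceptions=True
    )


async def demo():
    active = 0
    peak = 0

    async def fake_page(slug):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # simulated AI latency
        active -= 1
        if slug == "broken":
            raise RuntimeError("AI call failed")
        return f"# {slug}"

    slugs = [f"page-{i}" for i in range(8)] + ["broken"]
    results = await run_parallel_with_limit_sketch(
        [fake_page(s) for s in slugs], max_concurrency=3
    )
    pages = {}
    for slug, result in zip(slugs, results):
        if isinstance(result, Exception):
            pages[slug] = f"# {slug}\n\n*Documentation generation failed.*"
        else:
            pages[slug] = result
    return peak, pages


peak, pages = asyncio.run(demo())
print(peak, len(pages))
```

Because gather preserves input order, results can be zipped back onto the slugs that produced them.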
Cache Invalidation
Cache invalidation happens through two mechanisms:
1. Commit SHA comparison — On non-forced generation, the system compares the repository's current commit SHA against the stored last_commit_sha. If they match and the project status is ready, generation is skipped entirely:
# src/docsfy/main.py
existing = await get_project(project_name)
if (
    existing
    and existing.get("last_commit_sha") == commit_sha
    and existing.get("status") == "ready"
):
    await update_project_status(project_name, status="ready")
    return
2. Force regeneration — When force=True is passed in the API request, the entire cache directory is deleted and page count is reset to zero:
# src/docsfy/main.py
if force:
    cache_dir = get_project_cache_dir(project_name)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    await update_project_status(project_name, status="generating", page_count=0)
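The two mechanisms above can be condensed into a single decision. The sketch below is illustrative only: plan_generation and its return labels are hypothetical, and whether a plain "generate" run reuses cached pages depends on use_cache, which this page does not specify for that path.

```python
def plan_generation(existing, commit_sha, force):
    """Decide what a generate request should do.

    existing: the stored project row as a dict, or None for a new project.
    """
    if force:
        return "clear-cache-and-generate"
    if (
        existing
        and existing.get("last_commit_sha") == commit_sha
        and existing.get("status") == "ready"
    ):
        return "skip"
    return "generate"


ready_row = {"last_commit_sha": "abc123", "status": "ready"}
failed_row = {"last_commit_sha": "abc123", "status": "error"}

decisions = [
    plan_generation(ready_row, "abc123", force=False),   # unchanged and ready
    plan_generation(ready_row, "def456", force=False),   # new commit
    plan_generation(failed_row, "abc123", force=False),  # same commit, not ready
    plan_generation(ready_row, "abc123", force=True),    # forced
    plan_generation(None, "abc123", force=False),        # first run
]
print(decisions)
```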
Tip: Use "force": true in the generate request when you've made changes to AI prompts or want to regenerate documentation even if the repository hasn't changed:
curl -X POST http://localhost:8000/api/generate \
  -H "Content-Type: application/json" \
  -d '{"repo_url": "https://github.com/org/repo", "force": true}'
Deletion Cleanup
When a project is deleted via the API, both the database record and the entire project directory (including cache and site) are removed:
# src/docsfy/main.py
deleted = await delete_project(name)
project_dir = get_project_dir(name)
if project_dir.exists():
    shutil.rmtree(project_dir)
Path Traversal Security
docsfy employs a defense-in-depth strategy with four layers of validation to prevent path traversal attacks.
Layer 1: Project Name Validation (Storage)
All filesystem path construction passes through _validate_name(), which enforces a strict allowlist regex:
# src/docsfy/storage.py
def _validate_name(name: str) -> str:
    """Validate project name to prevent path traversal."""
    if not re.match(r"^[a-zA-Z0-9][a-zA-Z0-9._-]*$", name):
        msg = f"Invalid project name: '{name}'"
        raise ValueError(msg)
    return name
This pattern ensures:
- The name starts with an alphanumeric character (no leading . or -)
- Only alphanumeric characters, dots, underscores, and hyphens are allowed
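- No slashes or spaces can appear, and the name can never be . or .. on its own (consecutive dots inside a name, such as a..b, do match, but without a slash they cannot form a traversal component)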
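Running the pattern against a few candidate names makes its boundaries concrete. The helper is_valid_name is a hypothetical wrapper around the same regular expression; the sample names are made up for illustration.

```python
import re

NAME_PATTERN = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9._-]*$")


def is_valid_name(name: str) -> bool:
    """True when the name matches the allowlist pattern."""
    return NAME_PATTERN.match(name) is not None


candidates = [
    "my-library", "repo_v2.1", "a..b",            # accepted
    "..", ".hidden", "-flag", "../etc", "a/b",    # rejected
    "my repo", "",                                # rejected
]
checks = {name: is_valid_name(name) for name in candidates}
for name, ok in checks.items():
    print(f"{name!r:14} {'ok' if ok else 'rejected'}")
```

One Python detail worth knowing: with re.match, `$` also matches just before a trailing newline, so "name\n" passes this pattern. Using re.fullmatch or ending the pattern with \Z would reject it; whether that matters depends on how names reach the function.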
Layer 2: API Endpoint Validation
A duplicate validation at the HTTP layer catches invalid names early and returns a proper 400 Bad Request response:
# src/docsfy/main.py
def _validate_project_name(name: str) -> str:
    """Validate project name to prevent path traversal."""
    if not _re.match(r"^[a-zA-Z0-9][a-zA-Z0-9._-]*$", name):
        raise HTTPException(status_code=400, detail=f"Invalid project name: '{name}'")
    return name
This is applied to every endpoint that accepts a project name parameter: GET /api/projects/{name}, DELETE /api/projects/{name}, GET /api/projects/{name}/download, and GET /docs/{project}/{path}.
Layer 3: Page Slug Validation
Page slugs — used to construct cache file paths — are validated at both the individual page and batch levels:
# src/docsfy/generator.py
if "/" in slug or "\\" in slug or slug.startswith(".") or ".." in slug:
    msg = f"Invalid page slug: '{slug}'"
    raise ValueError(msg)
In generate_all_pages(), invalid slugs are logged and skipped rather than causing the entire generation to fail:
if "/" in slug or "\\" in slug or slug.startswith(".") or ".." in slug:
    logger.warning(f"[{_label}] Skipping path-unsafe slug: '{slug}'")
    continue
The renderer applies the same check when writing output files:
# src/docsfy/renderer.py
for slug, content in pages.items():
    if "/" in slug or "\\" in slug or slug.startswith(".") or ".." in slug:
        logger.warning(f"Skipping invalid slug: {slug}")
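Since the same condition appears in three places, it can be read as a single predicate. The sketch below factors it into a hypothetical helper, is_path_safe_slug, and applies it to a made-up pages dictionary; the actual modules inline the check as shown above.

```python
def is_path_safe_slug(slug: str) -> bool:
    """Same condition as the checks above, expressed as a predicate."""
    return not (
        "/" in slug or "\\" in slug or slug.startswith(".") or ".." in slug
    )


pages = {
    "introduction": "# Introduction",
    "quick-start": "# Quick Start",
    "../secrets": "# rejected: contains a slash and '..'",
    "api\\v1": "# rejected: contains a backslash",
    ".env": "# rejected: leading dot",
    "guide..v2": "# rejected: contains '..'",
}

safe_pages = {
    slug: content for slug, content in pages.items() if is_path_safe_slug(slug)
}
print(sorted(safe_pages))
```

Note that the slug check rejects any occurrence of two consecutive dots, which makes it stricter on that point than the project-name pattern.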
Layer 4: Resolved Path Verification
The file-serving endpoint applies a final structural check using Python's Path.resolve() and relative_to(). This catches any traversal attempt that might bypass string-level validation:
# src/docsfy/main.py
@app.get("/docs/{project}/{path:path}")
async def serve_docs(project: str, path: str = "index.html") -> FileResponse:
    project = _validate_project_name(project)
    site_dir = get_project_site_dir(project)
    file_path = site_dir / path
    try:
        file_path.resolve().relative_to(site_dir.resolve())
    except ValueError:
        raise HTTPException(status_code=403, detail="Access denied")
This resolves symlinks and .. components in the actual filesystem, then verifies that the resulting absolute path is still within the expected site/ directory. If not, the request is rejected with 403 Access denied.
Warning: The resolve() + relative_to() check is the most critical security boundary. It operates on the real filesystem path after symlink resolution, making it resistant to encoding tricks and double-dot sequences that might slip through regex-based validation.
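The containment check can be exercised outside the web framework. The sketch below isolates it in a hypothetical helper, is_within, and tries it against a temporary directory laid out like a project site; the request paths are invented examples.

```python
import tempfile
from pathlib import Path


def is_within(base: Path, requested: str) -> bool:
    """True when base/requested resolves to a location inside base."""
    candidate = (base / requested).resolve()
    try:
        candidate.relative_to(base.resolve())
    except ValueError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    site_dir = Path(tmp) / "projects" / "demo" / "site"
    (site_dir / "assets").mkdir(parents=True)
    (site_dir / "index.html").write_text("<h1>ok</h1>", encoding="utf-8")

    results = {
        "index.html": is_within(site_dir, "index.html"),
        "assets/../index.html": is_within(site_dir, "assets/../index.html"),
        "../../../docsfy.db": is_within(site_dir, "../../../docsfy.db"),
        "/etc/passwd": is_within(site_dir, "/etc/passwd"),
    }

for path, allowed in results.items():
    print(f"{path:24} {'served' if allowed else '403'}")
```

The absolute-path case is worth noting: joining a Path with an absolute string discards the base, so without the containment check that request would escape the site directory entirely.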
Defense Summary
| Layer | Location | Mechanism | Response |
|---|---|---|---|
| 1 | storage._validate_name() | Alphanumeric regex | ValueError |
| 2 | main._validate_project_name() | Same regex at HTTP boundary | 400 Bad Request |
| 3 | generator.generate_page() | Slash/dot checks on slugs | ValueError or skip |
| 4 | main.serve_docs() | Path.resolve().relative_to() | 403 Access Denied |
SQL Injection Prevention
All database queries use parameterized placeholders (?), and the dynamic field construction in update_project_status uses only hardcoded column names:
# src/docsfy/storage.py
await db.execute("SELECT * FROM projects WHERE name = ?", (name,))
Note: The update_project_status function constructs a dynamic SET clause, but the field names in the fields list are always string literals ("status = ?", "last_commit_sha = ?", etc.), never derived from user input.
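To see why the placeholder matters, the sketch below sends the same hostile input through a parameterized query and through a string-formatted one. It uses synchronous sqlite3 with a reduced table; the hostile value and the unsafe variant are illustrative only and are not taken from the docsfy code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT PRIMARY KEY, repo_url TEXT NOT NULL)")
conn.execute(
    "INSERT INTO projects VALUES (?, ?)",
    ("demo", "https://example.com/demo.git"),
)

hostile = "nope' OR '1'='1"

# Parameterized: the whole string is compared as a literal value.
safe_rows = conn.execute(
    "SELECT name FROM projects WHERE name = ?", (hostile,)
).fetchall()

# String formatting (shown only for contrast): the quote ends the literal
# early and the OR clause becomes part of the SQL.
unsafe_rows = conn.execute(
    f"SELECT name FROM projects WHERE name = '{hostile}'"
).fetchall()

print(safe_rows, unsafe_rows)
```

The parameterized query matches nothing, while the formatted one returns every row in the table.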
Configuration Reference
Storage-related settings are managed through environment variables, loaded via Pydantic Settings:
# src/docsfy/config.py
class Settings(BaseSettings):
    data_dir: str = "/data"
| Variable | Default | Description |
|---|---|---|
| DATA_DIR | /data | Root directory for the SQLite database and all project files |
In Docker deployments, mount a host volume to persist data across container restarts:
# docker-compose.yaml
services:
  docsfy:
    volumes:
      - ./data:/data
Warning: If DATA_DIR is not backed by a persistent volume, all project data (database, cached pages, and rendered sites) will be lost when the container is removed.