File Downloads
Yosoi can surface files as contract fields. A file field is not a selector over HTML. It is an action that runs in the browser tab, downloads bytes, records provenance, and writes the resolved value into the extracted item.
Use it for PDFs, CSV exports, JSON files, and browser-only download buttons where plain HTTP is not enough.
The Safety Model
Downloads are default-deny.
| Gate | Where | What it does |
|---|---|---|
| Run opt-in | allow_downloads=True | Allows this scrape run to download files at all. |
| Type allowlist | allowed_download_types and ys.File(allowed_types=...) | Limits what file types can be accepted. |
| Size cap | max_download_bytes or ys.File(max_bytes=...) | Rejects files over the configured byte limit. |
| Provenance | DownloadRecord | Stores path, hash, size, content type, source URL, and timestamp. |
Both the run and the field must agree. A field that allows PDF cannot download one unless the run also allows PDF, and a run that allows PDF does not widen a field restricted to CSV.
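The two-sided agreement can be pictured as a set intersection plus a size check. The following is a minimal sketch of that logic, not Yosoi's actual implementation; the function download_permitted and all of its parameter names are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def download_permitted(
    *,
    run_allows_downloads: bool,
    run_types: set[str],
    field_types: set[str],
    file_type: str,
    size_bytes: int,
    max_bytes: int | None = None,
) -> bool:
    """Hypothetical default-deny check: every gate must pass."""
    if not run_allows_downloads:
        return False  # gate 1: the run never opted in
    if file_type not in (run_types & field_types):
        return False  # gate 2: run and field allowlists must both include the type
    if max_bytes is not None and size_bytes > max_bytes:
        return False  # gate 3: the file exceeds the configured byte limit
    return True


# The run allows CSV and PDF, but the field only allows CSV.
common = dict(
    run_allows_downloads=True,
    run_types={"csv", "pdf"},
    field_types={"csv"},
    size_bytes=1_024,
    max_bytes=10_000,
)
csv_ok = download_permitted(file_type="csv", **common)
pdf_ok = download_permitted(file_type="pdf", **common)
print(csv_ok, pdf_ok)
```

Because the check starts from "deny" and only returns True at the end, forgetting to configure any one gate fails closed rather than open.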
Basic CSV Example
    import asyncio

    import yosoi as ys


    class Dataset(ys.Contract):
        title: str = ys.Title(description="Dataset title")
        # File field: click the CSV link and accept only CSV files.
        rows: list[dict] = ys.File(
            trigger='a[href$=".csv"]',
            allowed_types=["csv"],
        )


    async def main() -> None:
        records = await ys.scrape(
            "https://example.com/data",
            Dataset,
            fetcher_type="headless",
            # Run-level gates: opt in to downloads and allow the CSV type.
            allow_downloads=True,
            allowed_download_types=["csv"],
        )
        print(records[0]["rows"][:3])


    asyncio.run(main())

rows: list[dict] is the important part. The annotation asks Yosoi to parse the downloaded CSV into rows. There is no separate parse= knob.
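To make the shape of rows concrete, here is roughly what parsing CSV bytes into list[dict] looks like with the standard library. This is an illustrative sketch, not Yosoi's parser: the helper csv_bytes_to_rows, the sample data, and the UTF-8 assumption are all made up for the example.

```python
from __future__ import annotations

import csv
import io


def csv_bytes_to_rows(payload: bytes, encoding: str = "utf-8") -> list[dict]:
    """Hypothetical helper: decode CSV bytes and return one dict per data row."""
    text = payload.decode(encoding)
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [dict(row) for row in reader]


# Made-up sample standing in for a downloaded CSV file.
sample = b"ticker,year,revenue\nACME,2023,100\nACME,2024,125\n"
rows = csv_bytes_to_rows(sample)
print(rows[:3])
```

With csv.DictReader every value comes back as a string. Whether Yosoi coerces column types is not stated here, so downstream code should not assume numeric values without checking.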
Output Shapes
The field annotation decides what you get back.
| Annotation | Returned value |
|---|---|
| ys.DownloadRecord | Provenance handle with path, hash, size, content type, and timestamp. |
| pathlib.Path | Local file path. |
| bytes | Raw bytes. |
| str | Decoded text. |
| dict or list[dict] | Parsed JSON or CSV data. |
| list[MyRow] | Parsed rows validated through a Pydantic model. |
| MyModel | Parsed JSON validated through that model. |
Yosoi always creates a DownloadRecord internally. The annotation controls which view is placed in the output record.
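One way to picture "one record, many views" is a small dispatch on the requested type. The sketch below is an assumption about the idea, not Yosoi's internals: SketchRecord, make_record, and view_as are hypothetical names, SHA-256 is an assumed hash algorithm, and the real DownloadRecord may carry different fields.

```python
from __future__ import annotations

import hashlib
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchRecord:
    """Hypothetical stand-in for a provenance record (not ys.DownloadRecord)."""

    path: Path
    sha256: str
    size: int


def make_record(path: Path) -> SketchRecord:
    """Always build the record first; views are derived from it."""
    data = path.read_bytes()
    return SketchRecord(path=path, sha256=hashlib.sha256(data).hexdigest(), size=len(data))


def view_as(record: SketchRecord, target: type):
    """Return the view of the file that the requested type implies."""
    if target is SketchRecord:
        return record
    if target is Path:
        return record.path
    if target is bytes:
        return record.path.read_bytes()
    if target is str:
        return record.path.read_text(encoding="utf-8")
    if target is dict:
        return json.loads(record.path.read_text(encoding="utf-8"))
    raise TypeError(f"unsupported view: {target!r}")


with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "report.json"
    file_path.write_bytes(b'{"title": "Annual report", "pages": 12}')
    record = make_record(file_path)
    as_bytes = view_as(record, bytes)
    as_text = view_as(record, str)
    as_dict = view_as(record, dict)

print(as_dict)
```

The point of the design is that provenance is never skipped: even when the caller only asks for a dict, the record with its hash and size already exists.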
Trigger Modes
Exactly one source is required per file field: trigger, href, url, or description.
    # Click a selector and capture the browser download.
    report: ys.DownloadRecord = ys.File(trigger="button.download", allowed_types=["pdf"])

    # Read href from a selector and download that URL.
    report: ys.DownloadRecord = ys.File(href='a[href$=".pdf"]', allowed_types=["pdf"])

    # Download a known URL.
    report: ys.DownloadRecord = ys.File(url="https://example.com/report.pdf", allowed_types=["pdf"])

    # Ask Yosoi to discover the trigger and cache it.
    report: ys.DownloadRecord = ys.File(description="Download the annual report PDF", allowed_types=["pdf"])

Prefer trigger for signed URLs, generated exports, Google Drive style buttons, or anything that needs the page’s current session. Prefer href or url only when the file URL is stable.
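The "exactly one source" rule is easy to state as a check. Below is a hedged sketch of such a validation; pick_source is a hypothetical helper and does not reflect how ys.File actually validates its arguments.

```python
from __future__ import annotations


def pick_source(
    *,
    trigger: str | None = None,
    href: str | None = None,
    url: str | None = None,
    description: str | None = None,
) -> tuple[str, str]:
    """Hypothetical check: return (mode, value) if exactly one source is set."""
    candidates = {
        "trigger": trigger,
        "href": href,
        "url": url,
        "description": description,
    }
    chosen = [(mode, value) for mode, value in candidates.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError(f"expected exactly one source, got {len(chosen)}")
    return chosen[0]


mode, value = pick_source(href='a[href$=".pdf"]')
print(mode, value)
```

Rejecting both "none" and "more than one" up front keeps the download path unambiguous: there is never a question of which source wins.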
EDGAR Pattern
The EDGAR example uses a browser tier, a CSV allowlist, and annotation-directed parsing:
    class EdgarDataset(ys.Contract):
        title: str = ys.Title(description="The page or dataset title")
        rows: list[dict] = ys.File(
            trigger='a[href$=".csv"]',
            allowed_types=["csv"],
        )

Run it with:

    uv run python examples/edgar_download_example.py

Yosoi stops at surfacing rows and provenance. Analysis belongs downstream, for example in DuckDB, SQLite, or an agent workflow.
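As an illustration of that downstream step, parsed rows can be loaded into an in-memory SQLite database using only the standard library. This is a sketch under assumptions: the rows list stands in for records[0]["rows"], and the table name, column names, and values are invented rather than taken from EDGAR.

```python
import sqlite3

# Stand-in for records[0]["rows"]; the columns and values are made up.
rows = [
    {"ticker": "ACME", "year": "2023", "revenue": "100"},
    {"ticker": "ACME", "year": "2024", "revenue": "125"},
    {"ticker": "BETA", "year": "2024", "revenue": "40"},
]

conn = sqlite3.connect(":memory:")
# Declared column types let SQLite convert numeric-looking CSV strings on insert.
conn.execute("CREATE TABLE filings (ticker TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO filings (ticker, year, revenue) VALUES (:ticker, :year, :revenue)",
    rows,
)

# Aggregate revenue per ticker.
totals = conn.execute(
    "SELECT ticker, SUM(revenue) FROM filings GROUP BY ticker ORDER BY ticker"
).fetchall()
conn.close()
print(totals)
```

Keeping analysis in a separate step like this means the scrape output stays a plain, reusable list of rows.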
FAQs
Are downloads enabled by default?
No. Pass allow_downloads=True and allow the file type. Without both gates, nothing downloads.
What does ys.File return?
The field annotation decides the value. Use ys.DownloadRecord for provenance, Path for a path, bytes for raw bytes, str for text, and dict or list[...] for parsed CSV or JSON data.
Should I use trigger, href, url, or description?
Use trigger for browser actions, href for a selector that points at a stable link, url for a direct file URL, and description when you want Yosoi to discover and cache the trigger.
Is a downloaded file trusted?
No. Treat it as quarantined input. Yosoi gives you provenance and safety gates, not a guarantee that the file is safe to execute or open blindly.
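In practice, treating a file as quarantined usually means re-checking it before any parser or viewer touches it. The sketch below shows two cheap checks under stated assumptions: that the provenance hash is a SHA-256 hex digest (the algorithm is not named here) and that the expected type is PDF. Both helper names are hypothetical.

```python
import hashlib


def matches_recorded_hash(data: bytes, recorded_sha256: str) -> bool:
    """Recompute the digest and compare it with the recorded one."""
    return hashlib.sha256(data).hexdigest() == recorded_sha256.lower()


def looks_like_pdf(data: bytes) -> bool:
    """PDF files begin with the marker %PDF- followed by a version number."""
    return data.startswith(b"%PDF-")


# Fake payload for illustration only; not a valid PDF document.
payload = b"%PDF-1.7\n% fake body for illustration only\n"
recorded = hashlib.sha256(payload).hexdigest()

passes_checks = matches_recorded_hash(payload, recorded) and looks_like_pdf(payload)
tampered_matches = matches_recorded_hash(payload + b"x", recorded)
print(passes_checks, tampered_matches)
```

Passing these checks only shows that the bytes are unchanged and carry the expected header. It says nothing about whether the content is benign.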
Why does this require a browser fetcher?
File fields run in a live tab. Use fetcher_type="headless", fetcher_type="headful", or fetcher_type="waterfall".