
Classes

Generated from yosoi v0.0.1a11. Only symbols in __all__ are listed.

Contract

Base class for user-defined scraping contracts.

define

define(name: str) -> ContractBuilder

Start a fluent ContractBuilder for the given contract name.

discovery_field_names

discovery_field_names() -> set[str]

Return the set of flattened field names used for discovery and cache keys.

Non-Contract fields keep their original name; nested Contract fields are expanded to {parent}_{child} keys. This matches the key format used by snapshots, field_descriptions(), and get_selector_overrides().

field_descriptions

field_descriptions() -> dict[str, str]

Return a mapping of field name to description, excluding selector overrides.

Nested Contract-typed fields are expanded to flat {parent}_{child} keys. When the child contract has a pinned root, the description includes a scoping hint. When the child has root = ys.discover(), a co-location hint is added.

field_hints

field_hints() -> dict[str, str | None]

Return yosoi_hint per (flat) field name, expanding nested contracts to {parent}_{child}.

generate_manifest

generate_manifest() -> str

Return a markdown table documenting all contract fields and their config.

get_root

get_root() -> SelectorEntry | None

Return the root selector if explicitly set on the contract class.

Returns: SelectorEntry | None — SelectorEntry for the repeating container element, or None.

get_selector_overrides

get_selector_overrides() -> dict[str, dict[str, str]]

Return selector overrides defined on fields via yosoi_selector.

Returns: dict[str, dict[str, str]] — Mapping of field name to selector dict (e.g. {"primary": "h1.title"}). Nested contract overrides use flat {parent}_{child} keys.

is_grouped

is_grouped() -> bool

Return True if the contract explicitly configures multi-item mode.

list_fields

list_fields() -> dict[str, type]

Return {field_name: inner_type} for fields annotated as list[T].
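A stand-alone sketch of this kind of annotation introspection using the standard `typing` helpers (yosoi's actual implementation may differ; the `Catalog` class is illustrative):

```python
from typing import get_args, get_origin, get_type_hints

# Hypothetical contract-like class with one scalar and two list[T] fields.
class Catalog:
    title: str
    tags: list[str]
    prices: list[float]

def list_fields(cls: type) -> dict[str, type]:
    # Keep only annotations whose origin is list, mapping name -> inner type T.
    return {
        name: get_args(ann)[0]
        for name, ann in get_type_hints(cls).items()
        if get_origin(ann) is list
    }

list_fields(Catalog)  # -> {"tags": str, "prices": float}
```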

nested_contracts

nested_contracts() -> dict[str, type[Contract]]

Return a mapping of field name → child Contract class for Contract-typed fields.

to_selector_model

to_selector_model() -> type[BaseModel]

Generate a Pydantic model mapping each contract field to FieldSelectors.

This ensures that the LLM agent knows exactly which fields to find selectors for, preserving any descriptions or hints provided in the contract. Fields with a yosoi_selector override are excluded — their selectors are provided directly and do not require AI discovery. Nested Contract-typed fields are expanded to flat {parent}_{child} entries.
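The exclusion logic can be sketched without Pydantic: given the flat field names and the override mapping, only the non-overridden fields need AI discovery (a hypothetical helper, not yosoi's code):

```python
# Hypothetical sketch: fields with a yosoi_selector override are excluded
# from the discovery model; nested fields arrive already flattened to
# {parent}_{child} keys.
def fields_needing_discovery(
    flat_fields: set[str], overrides: dict[str, dict[str, str]]
) -> set[str]:
    return flat_fields - overrides.keys()

fields_needing_discovery(
    {"title", "company_name", "company_url"},
    {"title": {"primary": "h1.title"}},
)
# -> {"company_name", "company_url"}
```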

DebugConfig

Configuration for debug output.

JobPosting

Contract for job listing pages.

LLMConfig

Base configuration for any LLM provider.

NewsArticle

Default contract matching the original 5-field behavior.

Pipeline

Main pipeline for discovering and saving CSS selectors with retry logic.

Fetches HTML, cleans it, runs LLM-based selector discovery, then verifies and stores the selectors.

normalize_url

normalize_url(url: str) -> str

Add a protocol to the URL, preferring https. Args:

  • url str — The URL to fetch

Returns: str — The complete URL
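A minimal stand-alone sketch of the described behavior (not yosoi's code): leave URLs with a scheme untouched and default the rest to https.

```python
# Sketch: add a scheme when the URL has none, preferring https.
def normalize_url(url: str) -> str:
    if url.startswith(("http://", "https://")):
        return url
    return f"https://{url}"

normalize_url("example.com/jobs")    # -> "https://example.com/jobs"
normalize_url("http://example.com")  # unchanged
```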

process_url

process_url(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'simple', output_format: str | list[str] | None = None) -> None

Process a single URL: discover, verify, and save selectors.

Thin wrapper around scrape() that drains the generator. Raises on failure — callers are responsible for error handling. Args:

  • url str — URL to process
  • force bool | None — Force re-discovery even if selectors exist. Defaults to False.
  • skip_verification bool — Skip verification step. Defaults to False.
  • fetcher_type str — Type of fetcher ('simple'). Defaults to 'simple'.
  • max_fetch_retries int — Maximum fetch retry attempts. Defaults to 2.
  • max_discovery_retries int — Maximum AI discovery retry attempts. Defaults to 3.
  • output_format str | list[str] | None — Format(s) for extracted content. Defaults to None (uses pipeline default).
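The "drains the generator" pattern can be sketched with a stub in place of Pipeline.scrape (the stub and counts are illustrative, not yosoi's behavior):

```python
import asyncio

# Stub async generator standing in for Pipeline.scrape.
async def scrape_stub(url: str):
    for item in ({"title": "a"}, {"title": "b"}):
        yield item

async def process_url(url: str) -> int:
    count = 0
    # Iterate purely for side effects (saving happens inside the generator);
    # any exception propagates to the caller.
    async for _ in scrape_stub(url):
        count += 1
    return count

asyncio.run(process_url("https://example.com"))  # -> 2
```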

process_urls

process_urls(urls: list[str], force: bool | None = None, skip_verification: bool = False, fetcher_type: str = 'simple', max_fetch_retries: int = 2, max_discovery_retries: int = 3, output_format: str | list[str] | None = None, workers: int = 1, on_complete: Callable[[str, bool, float], Awaitable[None]] | None = None, on_start: Callable[[str], Awaitable[None]] | None = None) -> dict[str, list[str]]

Process multiple URLs and collect results.

When workers > 1 and there are multiple URLs, processing runs concurrently via the taskiq broker. Otherwise URLs are processed sequentially. Args:

  • urls list[str] — List of URLs to process.
  • force bool | None — Force re-discovery even if selectors exist. Defaults to False.
  • skip_verification bool — Skip verification step. Defaults to False.
  • fetcher_type str — Type of fetcher ('simple'). Defaults to 'simple'.
  • max_fetch_retries int — Maximum fetch retry attempts. Defaults to 2.
  • max_discovery_retries int — Maximum AI discovery retry attempts. Defaults to 3.
  • output_format str | list[str] | None — Format(s) for extracted content. Defaults to None (uses pipeline default).
  • workers int — Number of concurrent workers. Defaults to 1 (sequential).
  • on_complete Callable[[str, bool, float], Awaitable[None]] | None — Optional async callback (url, success, elapsed) called after each URL finishes. Used by the CLI for live progress display.
  • on_start Callable[[str], Awaitable[None]] | None — Optional async callback (url) called just before each URL begins processing.

Returns: dict[str, list[str]] — Dictionary with keys:

  • 'successful': List of successfully processed URLs
  • 'failed': List of URLs that failed processing
  • 'skipped': List of URLs skipped (concurrent only)
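The shape of the returned dictionary can be illustrated with a hypothetical partitioning helper (not yosoi's code; 'skipped' is only populated in concurrent runs):

```python
# Sketch of filing each processed URL under its outcome key.
def partition_results(outcomes: dict[str, bool]) -> dict[str, list[str]]:
    results: dict[str, list[str]] = {"successful": [], "failed": [], "skipped": []}
    for url, ok in outcomes.items():
        results["successful" if ok else "failed"].append(url)
    return results

partition_results({"https://a.test": True, "https://b.test": False})
# -> {"successful": ["https://a.test"], "failed": ["https://b.test"], "skipped": []}
```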

scrape

scrape(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'simple', output_format: str | list[str] | None = None) -> AsyncIterator[ContentMap]

Async generator yielding individual content items from a URL.

Canonical entry point — handles both cached-selector replay and fresh AI discovery. For multi-item pages (catalogs, listings), yields one ContentMap per matched container element. For single-item pages, yields exactly one ContentMap. Args:

  • url str — URL to process
  • force bool | None — Force re-discovery even if selectors exist. Defaults to the pipeline-level force flag.
  • max_fetch_retries int — Maximum fetch retry attempts. Defaults to 2.
  • max_discovery_retries int — Maximum AI discovery retry attempts. Defaults to 3.
  • skip_verification bool — Skip verification step. Defaults to False.
  • fetcher_type str — Type of fetcher ('simple'). Defaults to 'simple'.
  • output_format str | list[str] | None — Format(s) for saving extracted content.

Yields: AsyncIterator[ContentMap] — ContentMap dicts — one per extracted item.
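Consuming a per-item async generator like scrape() follows the standard `async for` pattern; the stub below stands in for the real pipeline (titles and URL are illustrative):

```python
import asyncio

# Stub: a multi-item page yields one mapping per matched container element.
async def scrape_stub(url: str):
    for item in ({"title": "Post 1"}, {"title": "Post 2"}):
        yield item

async def collect_titles(url: str) -> list[str]:
    # Items stream in one at a time; collect a field from each.
    return [item["title"] async for item in scrape_stub(url)]

asyncio.run(collect_titles("https://example.com/blog"))
# -> ["Post 1", "Post 2"]
```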

show_llm_stats

show_llm_stats() -> None

Show LLM usage statistics.

show_summary

show_summary() -> None

Show summary of all saved selectors.

TelemetryConfig

Configuration for observability / telemetry.

Video

Contract for video pages (YouTube-style).

YosoiConfig

Top-level Yosoi configuration bundling LLM, debug, and telemetry settings.

Example::

    config = YosoiConfig(
        llm=ys.groq('llama-3.3-70b-versatile', api_key),
        debug=DebugConfig(save_html=False),
    )
    pipeline = Pipeline(config, contract=MyContract)

validate_api_key_env

validate_api_key_env() -> YosoiConfig

Resolve API key for the selected provider, falling back to others if needed.

Resolution order:

  1. Use api_key if already set on the LLMConfig.
  2. Try the environment variable for the selected provider.
  3. Walk PROVIDER_FALLBACK_ORDER and use the first provider with a key.
  4. Raise if nothing works.