Selectors
Selectors are the core unit of Yosoi’s caching model. Once discovered for a domain, they are stored locally and reused on every subsequent scrape with no LLM call required.
Primary, Fallback, and Tertiary
For each field in your contract, Yosoi asks the LLM for up to three selector candidates:
| Slot | Purpose |
|---|---|
| Primary | The most specific, reliable selector for the field |
| Fallback | A less specific alternative if the primary stops matching |
| Tertiary | A generic last resort |
During extraction, Yosoi tries the primary first. If it matches, the other slots are ignored. If it returns no elements, the fallback is tried, then the tertiary. This layered approach means a minor site redesign that breaks the primary selector often does not break extraction entirely.
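The slot order described above can be sketched in a few lines of plain Python. This is an illustration, not Yosoi's implementation: `pick_selector` and the `matches` callback are hypothetical names, and a dictionary lookup stands in for a real selector engine.

```python
from typing import Callable, Optional


def pick_selector(
    slots: dict[str, Optional[str]],
    matches: Callable[[str], list[str]],
) -> tuple[Optional[str], list[str]]:
    """Try primary, then fallback, then tertiary; return the first slot that matches."""
    for slot in ("primary", "fallback", "tertiary"):
        selector = slots.get(slot)
        if selector is None:
            continue  # this slot was never filled
        found = matches(selector)
        if found:
            return slot, found  # later slots are ignored
    return None, []


# Stand-in for a selector engine (assumption): maps a selector to matched text.
page = {"h1": ["Hello"]}
slots = {"primary": "h1.article-title", "fallback": "h1", "tertiary": None}

slot, found = pick_selector(slots, lambda s: page.get(s, []))
print(slot, found)  # fallback ['Hello']
```

Here the primary selector no longer matches, so the fallback supplies the value and the empty tertiary slot is never consulted.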
Selector Types
Yosoi supports four selector strategies, tried in escalating order if the preferred strategy fails inline verification:
| Level | Strategy | Example |
|---|---|---|
| 1 | CSS△ | article.product h2.title |
| 2 | XPath○ | //article[@class="product"]//h2 |
| 3 | Regex | <h2[^>]*>(.*?)</h2> |
| 4 | JSON-LD◑ | $.name |
CSS is the default and covers the vast majority of sites. Escalation to XPath or Regex only happens when CSS selectors fail to match during the inline verification step immediately after the LLM responds.
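As a rough sketch of the escalation order, the snippet below walks candidates from the lowest level upward and keeps the first one that passes a verification callback. The member names of `SelectorLevel`, the `escalate` helper, and the `verify` callback are assumptions made for illustration; only the regex strategy is actually evaluated, using the standard `re` module.

```python
import re
from enum import IntEnum
from typing import Callable, Optional


class SelectorLevel(IntEnum):
    # Member names are assumed; the numeric levels mirror the table above.
    CSS = 1
    XPATH = 2
    REGEX = 3
    JSON_LD = 4


def escalate(
    candidates: dict[SelectorLevel, str],
    verify: Callable[[SelectorLevel, str], bool],
) -> Optional[tuple[SelectorLevel, str]]:
    """Return the lowest-level candidate that passes verification, if any."""
    for level in sorted(candidates):
        if verify(level, candidates[level]):
            return level, candidates[level]
    return None


html = "<article class='product'><h2 class='title'>Blue Mug</h2></article>"


def verify(level: SelectorLevel, value: str) -> bool:
    # Only the regex strategy is really tested here; CSS and XPath are
    # treated as failing so that the escalation path is visible.
    if level is SelectorLevel.REGEX:
        return re.search(value, html) is not None
    return False


candidates = {
    SelectorLevel.CSS: "article.product h2.headline",
    SelectorLevel.XPATH: '//article[@class="item"]//h2',
    SelectorLevel.REGEX: r"<h2[^>]*>(.*?)</h2>",
}
chosen = escalate(candidates, verify)
print(chosen)
```

Because `SelectorLevel` is an `IntEnum`, sorting the keys yields the escalation order directly.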
The .yosoi/ directory
All Yosoi state lives in .yosoi/ in your project root. It is created automatically on first run.
.yosoi/
    selectors/     # One JSON file per domain
    logs/          # Run logs (run_YYYYMMDD_HHMMSS.log)
    debug_html/    # HTML snapshots when --debug is active
    content/       # Extracted output files (JSON, CSV, etc.)
    stats.json     # Cumulative LLM call and usage statistics

Selector Cache Files
Each domain gets its own file: selectors_example_com.json. The format is a snapshot per field:
{
  "title": {
    "primary": "h1.article-title",
    "fallback": "h1",
    "tertiary": null,
    "discovered_at": "2025-03-01T14:22:00Z",
    "last_verified_at": "2025-03-15T09:00:00Z",
    "last_failed_at": null,
    "failure_count": 0,
    "source": "discovered"
  }
}

The source field records how the selector was obtained:
| Value | Meaning |
|---|---|
| discovered | Generated by the LLM |
| pinned | Set explicitly in the contract with ys.css(...) |
| override | Manually edited in the cache file |
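To inspect a cache file outside Yosoi, the standard library is enough. The sketch below assumes the file name is derived by replacing dots in the domain with underscores (inferred from the single example `selectors_example_com.json`) and that the file has the per-field shape shown above; `cache_path` and `summarize` are hypothetical helpers, not part of the Yosoi API.

```python
import json
from pathlib import Path


def cache_path(domain: str, root: Path = Path(".yosoi")) -> Path:
    # Assumed naming rule: dots become underscores, as in selectors_example_com.json.
    return root / "selectors" / f"selectors_{domain.replace('.', '_')}.json"


def summarize(cache_text: str) -> dict[str, tuple[str, str]]:
    """Map each field name to its (primary selector, source) pair."""
    data = json.loads(cache_text)
    return {field: (snap["primary"], snap["source"]) for field, snap in data.items()}


sample = """
{
  "title": {
    "primary": "h1.article-title",
    "fallback": "h1",
    "tertiary": null,
    "discovered_at": "2025-03-01T14:22:00Z",
    "last_verified_at": "2025-03-15T09:00:00Z",
    "last_failed_at": null,
    "failure_count": 0,
    "source": "discovered"
  }
}
"""

print(cache_path("example.com").name)  # selectors_example_com.json
print(summarize(sample))
```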
Typed Selector Cache Models
Every selector cache structure is backed by a Pydantic model, so cache files are validated on read and write. The key types are:
| Type | Purpose |
|---|---|
| SelectorSnapshot | Per-field cache entry with audit metadata |
| SnapshotMap | Top-level cache file (domain + field snapshots) |
| CacheVerdict | Enum for cache validation results (FRESH, STALE, DEGRADED) |
| FieldSelectors | Primary / fallback / tertiary selector container |
| SelectorEntry | Single selector with strategy type and value |
| SelectorLevel | IntEnum for selector strategies (CSS → JSON-LD) |
All are importable from the top-level package (e.g. from yosoi import SelectorSnapshot, SnapshotMap). For full field-level documentation, see the Selector Cache Types reference.
Cache Verdicts
On each scrape, Yosoi checks the cached selectors against the live HTML before deciding whether to call the LLM:
| Verdict | Meaning | Action |
|---|---|---|
| FRESH | Selector still matches the page | Extract directly, no LLM call |
| STALE | Selector no longer matches | Re-discover this field |
| DEGRADED | Selector matches but quality has dropped | Flagged for future re-discovery |
When only some fields are stale, Yosoi runs partial rediscovery: it re-discovers only the stale fields and merges the results with the fresh cached selectors. This keeps LLM costs minimal even when a site partially redesigns.
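Partial rediscovery amounts to a filter followed by a merge. The sketch below is a simplified model of that idea, not Yosoi's code: the enum values, the `partial_rediscovery` helper, and the `discover` callback (which stands in for the LLM call) are all assumptions, and each field is reduced to a single selector string.

```python
from enum import Enum
from typing import Callable


class CacheVerdict(Enum):
    # The member names come from the table above; the values are assumed.
    FRESH = "fresh"
    STALE = "stale"
    DEGRADED = "degraded"


def partial_rediscovery(
    cached: dict[str, str],
    verdicts: dict[str, CacheVerdict],
    discover: Callable[[list[str]], dict[str, str]],
) -> dict[str, str]:
    """Re-discover only STALE fields and merge them over the cached selectors."""
    stale = [field for field, v in verdicts.items() if v is CacheVerdict.STALE]
    if not stale:
        return dict(cached)  # nothing stale, so no discovery call at all
    return {**cached, **discover(stale)}


calls: list[list[str]] = []


def fake_discover(fields: list[str]) -> dict[str, str]:
    # Stand-in for the LLM: records what it was asked and invents selectors.
    calls.append(fields)
    return {field: f"div.new-{field}" for field in fields}


cached = {"title": "h1.article-title", "author": "span.byline", "price": "span.cost"}
verdicts = {
    "title": CacheVerdict.FRESH,
    "author": CacheVerdict.STALE,
    "price": CacheVerdict.DEGRADED,
}
merged = partial_rediscovery(cached, verdicts, fake_discover)
print(merged)
print(calls)
```

Only `author` is sent to the discovery callback; the fresh and degraded fields keep their cached selectors.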
Pinning Selectors
If you already know the right selector, set it directly on the contract field. Pinned selectors are never sent to the LLM and never overwritten by discovery:
import yosoi as ys

class Article(ys.Contract):
    title: str = ys.Title(selector='h1.article-title')
    author: str = ys.Author(selector='span.byline')

Pinned selectors still go through verification and fallback logic at extraction time.
Forcing Re-Discovery
To discard the cache and run discovery from scratch:
pipeline = ys.Pipeline(config, contract=Article, force=True)

Or delete the domain file manually:

rm .yosoi/selectors/selectors_example_com.json

FAQs
When does Yosoi decide a cached selector is stale?
At the start of each scrape, Yosoi fetches the page and tests each cached selector against the live HTML. If no elements match, the selector is marked stale and re-discovery is queued for that field.
Is the .yosoi/ directory safe to commit?
The selectors/ subdirectory is safe to commit and can be useful to share across a team. The logs/, debug_html/, and content/ directories and the rest of .yosoi/ are noisy and should stay gitignored.
What happens if all three selector slots fail?
Yosoi returns None for that field. If the field is required in your contract, a ValidationError is raised. Annotate optional fields as T | None to handle this gracefully.
Can I edit the cache files manually?
Yes. Optionally, set source to "override" to signal the edit was intentional. Yosoi will use your selector and treat it like a pinned value.
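A manual edit can also be scripted. The following is a minimal sketch that assumes the per-field JSON shape shown earlier; `override_selector` is a hypothetical helper, and the demo writes to a temporary directory rather than a real .yosoi/ folder.

```python
import json
import tempfile
from pathlib import Path


def override_selector(path: Path, field: str, selector: str) -> None:
    """Replace a field's primary selector and mark the entry as a manual override."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data[field]["primary"] = selector
    data[field]["source"] = "override"
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "selectors_example_com.json"
    cache.write_text(
        json.dumps({"title": {"primary": "h1", "source": "discovered"}}),
        encoding="utf-8",
    )
    override_selector(cache, "title", "h1.headline")
    updated = json.loads(cache.read_text(encoding="utf-8"))

print(updated["title"])
```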
References
△ CSS Selectors. W3C. Specification for selecting HTML elements via CSS patterns. https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors
○ XPath. W3C. XML Path Language for navigating and querying document trees. https://developer.mozilla.org/en-US/docs/Web/XPath
◑ JSON-LD. W3C. JSON-based format for linked data, commonly embedded in web pages as structured metadata. https://json-ld.org/