Selectors

Selectors are the core unit of Yosoi’s caching model. Once discovered for a domain, they are stored locally and reused on every subsequent scrape with no LLM call required.

Primary, Fallback, and Tertiary

For each field in your contract, Yosoi asks the LLM for up to three selector candidates:

| Slot | Purpose |
| --- | --- |
| Primary | The most specific, reliable selector for the field |
| Fallback | A less specific alternative if the primary stops matching |
| Tertiary | A generic last resort |

During extraction, Yosoi tries the primary first. If it matches, the other slots are ignored. If it returns no elements, the fallback is tried, then the tertiary. This layered approach means a minor site redesign that breaks the primary selector often does not break extraction entirely.
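
The slot order described above can be sketched in a few lines. This is a minimal illustration rather than Yosoi's implementation: `extract_field`, `match`, and `PAGE` are hypothetical names, and the toy page is a plain dict standing in for a parsed document.

```python
from typing import Callable, Optional

# Hypothetical stand-in for a parsed page: selector -> matched texts.
PAGE = {
    "h1": ["Hello, world"],
}


def match(selector: str) -> list[str]:
    """Toy matcher: return the texts a selector 'matches' on PAGE."""
    return PAGE.get(selector, [])


def extract_field(
    slots: dict[str, Optional[str]],
    matcher: Callable[[str], list[str]],
) -> tuple[Optional[str], Optional[str]]:
    """Try primary, then fallback, then tertiary; stop at the first match.

    Returns (slot_name, first_matched_text), or (None, None) when every
    slot is empty or matches nothing.
    """
    for slot in ("primary", "fallback", "tertiary"):
        selector = slots.get(slot)
        if not selector:
            continue  # slot is null in the cache
        results = matcher(selector)
        if results:
            return slot, results[0]
    return None, None


# The primary no longer matches this page, so the fallback is used.
slots = {"primary": "h1.article-title", "fallback": "h1", "tertiary": None}
print(extract_field(slots, match))  # ('fallback', 'Hello, world')
```

Returning the slot name alongside the value makes it easy to notice when extraction has quietly dropped to a weaker selector.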

Selector Types

Yosoi supports four selector strategies, tried in escalating order if the preferred strategy fails inline verification:

| Level | Strategy | Example |
| --- | --- | --- |
| 1 | CSS | article.product h2.title |
| 2 | XPath | //article[@class="product"]//h2 |
| 3 | Regex | <h2[^>]*>(.*?)</h2> |
| 4 | JSON-LD | $.name |

CSS is the default and covers the vast majority of sites. Escalation to XPath or Regex only happens when CSS selectors fail to match during the inline verification step immediately after the LLM responds.
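
To make the escalation idea concrete, here is a rough sketch that runs the Regex and JSON-LD examples from the table using only the Python standard library. CSS and XPath are left out because they need a third-party parser. The `Level` enum, `escalate`, and the helper functions are hypothetical names for illustration, and the sample HTML is made up. How Yosoi performs inline verification internally is not shown here.

```python
import json
import re
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical mirror of the four levels in the table above."""
    CSS = 1
    XPATH = 2
    REGEX = 3
    JSON_LD = 4


HTML = """
<article class="product">
  <h2 class="title">Blue Widget</h2>
  <script type="application/ld+json">{"@type": "Product", "name": "Blue Widget"}</script>
</article>
"""


def by_regex(html: str, pattern: str) -> list[str]:
    """Level 3: return the first capture group of every match."""
    return re.findall(pattern, html, flags=re.DOTALL)


def by_json_ld(html: str, key: str) -> list[str]:
    """Level 4: read a top-level key (the `$.name` case) from JSON-LD blocks."""
    blocks = re.findall(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        flags=re.DOTALL,
    )
    values = []
    for block in blocks:
        data = json.loads(block)
        if isinstance(data, dict) and key in data:
            values.append(data[key])
    return values


def escalate(html, candidates):
    """Try candidates in ascending level order; return the first that matches."""
    for level, run in sorted(candidates, key=lambda c: c[0]):
        results = run(html)
        if results:
            return level, results
    return None, []


candidates = [
    (Level.JSON_LD, lambda h: by_json_ld(h, "name")),
    # Deliberately failing pattern: there is no <h3> on this page.
    (Level.REGEX, lambda h: by_regex(h, r"<h3[^>]*>(.*?)</h3>")),
]
level, results = escalate(HTML, candidates)
print(level.name, results)  # JSON_LD ['Blue Widget']
```

Because the levels are an `IntEnum`, sorting the candidates is enough to guarantee that a cheaper strategy is always tried before a more generic one.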

The .yosoi/ directory

All Yosoi state lives in .yosoi/ in your project root. It is created automatically on first run.

.yosoi/
    selectors/     # One JSON file per domain
    logs/          # Run logs (run_YYYYMMDD_HHMMSS.log)
    debug_html/    # HTML snapshots when --debug is active
    content/       # Extracted output files (JSON, CSV, etc.)
    stats.json     # Cumulative LLM call and usage statistics

Selector Cache Files

Each domain gets its own file, for example selectors_example_com.json. The file holds one snapshot per field:

{
  "title": {
    "primary": "h1.article-title",
    "fallback": "h1",
    "tertiary": null,
    "discovered_at": "2025-03-01T14:22:00Z",
    "last_verified_at": "2025-03-15T09:00:00Z",
    "last_failed_at": null,
    "failure_count": 0,
    "source": "discovered"
  }
}

The source field records how the selector was obtained:

| Value | Meaning |
| --- | --- |
| discovered | Generated by the LLM |
| pinned | Set explicitly in the contract with ys.css(...) |
| override | Manually edited in the cache file |
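
For a quick audit of a cache file, the snapshot format shown above can be read with the standard library alone. The sketch below assumes the file's top level is a mapping of field name to snapshot, exactly as in the sample. `summarize` and `parse_ts` are hypothetical helpers, not part of Yosoi.

```python
import json
from datetime import datetime

# Sample content in the format shown above.
RAW = """
{
  "title": {
    "primary": "h1.article-title",
    "fallback": "h1",
    "tertiary": null,
    "discovered_at": "2025-03-01T14:22:00Z",
    "last_verified_at": "2025-03-15T09:00:00Z",
    "last_failed_at": null,
    "failure_count": 0,
    "source": "discovered"
  }
}
"""


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing Z; None stays None."""
    if value is None:
        return None
    # Replace the Z suffix so this also works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize(cache: dict) -> list[dict]:
    """Build one summary row per field from a parsed cache file."""
    rows = []
    for field, snap in cache.items():
        discovered = parse_ts(snap["discovered_at"])
        verified = parse_ts(snap["last_verified_at"])
        rows.append({
            "field": field,
            # Keep only the slots that actually hold a selector.
            "selectors": [
                snap[k] for k in ("primary", "fallback", "tertiary") if snap.get(k)
            ],
            "source": snap["source"],
            "failure_count": snap["failure_count"],
            # Whole days between discovery and the last successful check.
            "verified_after_days": (
                (verified - discovered).days if verified and discovered else None
            ),
        })
    return rows


rows = summarize(json.loads(RAW))
print(rows[0]["selectors"], rows[0]["source"], rows[0]["verified_after_days"])
# ['h1.article-title', 'h1'] discovered 13
```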

Typed Selector Cache Models

Every selector cache structure is backed by a Pydantic model, so cache files are validated on read and write. The key types are:

| Type | Purpose |
| --- | --- |
| SelectorSnapshot | Per-field cache entry with audit metadata |
| SnapshotMap | Top-level cache file (domain + field snapshots) |
| CacheVerdict | Enum for cache validation results (FRESH, STALE, DEGRADED) |
| FieldSelectors | Primary / fallback / tertiary selector container |
| SelectorEntry | Single selector with strategy type and value |
| SelectorLevel | IntEnum for selector strategies (CSS → JSON-LD) |

All are importable from the top-level package (e.g. from yosoi import SelectorSnapshot, SnapshotMap). For full field-level documentation, see the Selector Cache Types reference.

Cache Verdicts

On each scrape, Yosoi checks the cached selectors against the live HTML before deciding whether to call the LLM:

| Verdict | Meaning | Action |
| --- | --- | --- |
| FRESH | Selector still matches the page | Extract directly, no LLM call |
| STALE | Selector no longer matches | Re-discover this field |
| DEGRADED | Selector matches but quality has dropped | Flagged for future re-discovery |

When only some fields are stale, Yosoi runs partial rediscovery: it re-discovers only the stale fields and merges the results with the fresh cached selectors. This keeps LLM costs minimal even when a site partially redesigns.
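
The merge step in partial rediscovery can be pictured roughly as follows. This is a sketch under stated assumptions: `Verdict` is a local stand-in for the real `CacheVerdict` enum (whose member values are not documented here), and `partial_rediscovery` and `fake_discover` are hypothetical. In a real run, the `discover` callable would be the LLM-backed discovery step.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    """Local stand-in for CacheVerdict; the values are assumptions."""
    FRESH = "fresh"
    STALE = "stale"
    DEGRADED = "degraded"


def partial_rediscovery(
    cached: dict[str, str],
    verdicts: dict[str, Verdict],
    discover: Callable[[list[str]], dict[str, str]],
) -> tuple[dict[str, str], list[str]]:
    """Re-discover only STALE fields and merge them over the cached ones.

    DEGRADED fields keep their cached selector for now; per the table
    above they are only flagged for a later re-discovery.
    """
    stale = [f for f, v in verdicts.items() if v is Verdict.STALE]
    if not stale:
        return dict(cached), []  # nothing to do, no discovery call
    kept = {f: s for f, s in cached.items() if f not in stale}
    rediscovered = discover(stale)  # one call covering only the stale fields
    return {**kept, **rediscovered}, stale


def fake_discover(fields: list[str]) -> dict[str, str]:
    """Pretend discovery step that invents a selector per field."""
    return {f: f"[data-field='{f}']" for f in fields}


cached = {
    "title": "h1.article-title",
    "author": "span.byline",
    "date": "time.published",
}
verdicts = {
    "title": Verdict.FRESH,
    "author": Verdict.STALE,
    "date": Verdict.DEGRADED,
}
merged, rediscovered_fields = partial_rediscovery(cached, verdicts, fake_discover)
print(rediscovered_fields)  # ['author']
print(merged["author"])     # [data-field='author']
```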

Pinning Selectors

If you already know the right selector, set it directly on the contract field. Pinned selectors are never sent to the LLM and never overwritten by discovery:

import yosoi as ys


class Article(ys.Contract):
    title: str = ys.Title(selector='h1.article-title')
    author: str = ys.Author(selector='span.byline')

Pinned selectors still go through verification and fallback logic at extraction time.

Forcing Re-Discovery

To discard the cache and run discovery from scratch:

pipeline = ys.Pipeline(config, contract=Article, force=True)

Or delete the domain file manually:

rm .yosoi/selectors/selectors_example_com.json

FAQs

When does Yosoi decide a cached selector is stale?

At the start of each scrape, Yosoi fetches the page and tests each cached selector against the live HTML. If no elements match, the selector is marked stale and re-discovery is queued for that field.

Is the .yosoi/ directory safe to commit?

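The selectors/ subdirectory is safe to commit and can be useful to share across a team. Git has no .gitinclude file, so to track it while ignoring the rest, ignore .yosoi/* in .gitignore and re-include the subdirectory with !.yosoi/selectors/. The logs/, debug_html/, and content/ directories, along with the rest of .yosoi/, are noisy and should stay gitignored.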

What happens if all three selector slots fail?

Yosoi returns None for that field. If the field is required in your contract, a ValidationError is raised. Annotate optional fields as T | None to handle this gracefully.

Can I edit the cache files manually?

Yes. Optionally, set source to "override" to signal the edit was intentional. Yosoi will use your selector and treat it like a pinned value.
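
A manual edit can also be scripted. The sketch below rewrites the primary selector for one field and marks the entry as an override. It assumes the cache layout shown in the Selector Cache Files section. `override_selector` is a hypothetical helper, and the demo writes to a temporary directory instead of a real .yosoi/ folder.

```python
import json
import tempfile
from pathlib import Path


def override_selector(cache_file: Path, field: str, selector: str) -> dict:
    """Set a field's primary selector and mark the entry as a manual override."""
    cache = json.loads(cache_file.read_text(encoding="utf-8"))
    entry = cache[field]  # raises KeyError if the field is not cached yet
    entry["primary"] = selector
    entry["source"] = "override"
    cache_file.write_text(json.dumps(cache, indent=2) + "\n", encoding="utf-8")
    return entry


# Demo against a throwaway file so no real cache is touched.
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "selectors_example_com.json"
    cache_file.write_text(
        json.dumps({
            "title": {
                "primary": "h1.article-title",
                "fallback": "h1",
                "tertiary": None,
                "source": "discovered",
            }
        }),
        encoding="utf-8",
    )
    updated = override_selector(cache_file, "title", "header h1.headline")
    reloaded = json.loads(cache_file.read_text(encoding="utf-8"))

print(updated["source"], reloaded["title"]["primary"])
# override header h1.headline
```

Only the primary slot and the source field are changed, so the existing fallback and audit metadata stay intact.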
