Understanding the Web

Not all websites are equally easy to scrape. Understanding why helps you choose the right approach and set realistic expectations before discovery runs.

HTML vs. the DOM

When you load a page in a browser, the browser parses the raw HTML and constructs a live tree of objects called the DOM^△ (Document Object Model). JavaScript can then modify that tree: adding nodes, removing them, rewriting text content, injecting entire components. What you see on screen may look nothing like the original HTML the server sent.

Yosoi fetches the raw HTML (for now). It sees what the server returns, not what JavaScript builds after the fact. For most news sites, blogs, and product catalogues this is fine, because the content is already in the HTML. For single-page apps that load data after the initial render, Yosoi will get a mostly empty shell.

If you can view the page source in your browser and see the text you want to extract, Yosoi can work with it. If the content only appears after JavaScript runs, render the page yourself with Playwright^○ or Puppeteer^◑ and pass the HTML to Yosoi directly. The distinction is between the page source (Ctrl+U on Windows and Linux; Cmd+Option+U in Chrome or Cmd+U in Firefox on macOS), which shows what the server sent, and the developer tools inspector (Ctrl+Shift+I, or Cmd+Option+I on macOS), which shows the live DOM.
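The "view source" check can also be automated. The following is a minimal sketch using only the Python standard library, not part of Yosoi: `text_in_raw_html` is a hypothetical helper that reports whether a piece of text appears as visible text in the raw HTML, ignoring anything that only occurs inside `<script>` or `<style>` blocks. The two sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_in_raw_html(raw_html: str, needle: str) -> bool:
    """Return True if `needle` occurs in the visible text of `raw_html`.

    Hypothetical helper: whitespace is collapsed and case is ignored.
    """
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    visible = " ".join(" ".join(parser.chunks).split()).lower()
    return " ".join(needle.split()).lower() in visible


# Made-up sample pages: one static, one SPA shell whose data lives in a script.
STATIC_PAGE = "<html><body><h1>Quarterly report</h1><p>Revenue rose.</p></body></html>"
SPA_SHELL = (
    '<html><body><div id="root"></div>'
    '<script>render({"title": "Quarterly report"})</script></body></html>'
)

print(text_in_raw_html(STATIC_PAGE, "Quarterly report"))  # True
print(text_in_raw_html(SPA_SHELL, "Quarterly report"))    # False
```

If the check returns False for text you can see in the browser, the page most likely needs rendering before Yosoi can work with it.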

A native DOM fetcher that handles this automatically is on the Roadmap, but is not available yet.

Difficulty Levels

Note

This is how we at Yosoi think about the web; this is not a standard grouping.

L1: Static HTML, CSS, and JS

Standard HTML pages with no frontend framework and no dynamic data loading. The full content is present in the server response. Selectors are stable and discovery is reliable.

These are the easiest targets. QScrape L1 sites simulate this category with realistic content and layouts.

L2: SPAs and data-driven dynamic sites

Single-page applications (SPAs) and sites that load content after the initial HTML render. The server sends a minimal shell; JavaScript fetches data and injects it into the DOM. The page looks complete in a browser, but the raw HTML Yosoi fetches may be largely empty.

For L2 sites, render the page yourself with a headless browser (Playwright^○, Puppeteer^◑, etc.) and pass the resulting HTML to Yosoi. Discovery works against the rendered output the same way it works against static HTML.
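One way to do the rendering step is sketched below with Playwright's synchronous Python API. This is an illustration under assumptions, not Yosoi's own fetcher: `render_html`, its parameters, and the default timeout are hypothetical names chosen for this example, and it requires the `playwright` package plus an installed Chromium build. How the returned string is then handed to Yosoi depends on your pipeline and is not shown.

```python
from typing import Optional


def render_html(url: str, wait_selector: Optional[str] = None,
                timeout_ms: int = 15_000) -> str:
    """Load `url` in headless Chromium and return the DOM serialized as HTML.

    Hypothetical helper. If `wait_selector` is given, wait until an element
    matching it exists; otherwise wait for the network to go idle.
    """
    # Imported lazily so this module can be imported without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            if wait_selector:
                page.wait_for_selector(wait_selector, timeout=timeout_ms)
            else:
                page.wait_for_load_state("networkidle", timeout=timeout_ms)
            # page.content() returns the current DOM, not the original response.
            return page.content()
        finally:
            browser.close()
```

Waiting for a specific selector is usually more dependable than waiting for network idle, because some apps keep connections open indefinitely. A call might look like `render_html("https://example.com/app", wait_selector="article")`, where both the URL and the selector are placeholders.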

Why QScrape tests four frameworks

QScrape L2 sites cover this category with Svelte^◇, React^★, Vue^⬡, and Solid^▽. We picked these because they represent different strategies for updating the DOM.

Svelte: compiled, meaning components become direct DOM operations with no virtual DOM, but generated class hashes (from Svelte's scoped styles or the build tool) still apply.
React: uses a virtual DOM, diffing each newly rendered virtual tree against the previous one and applying the differences to the real DOM, so the live markup can shift between renders.
Vue: pairs a reactive proxy system with a virtual DOM and template-based rendering, producing output similar to React but with different internal update patterns.
Solid: uses fine-grained signals with no virtual DOM; JSX compiles to direct DOM operations that update only the specific nodes affected by a state change, with no component re-renders.

Each approach leaves a different fingerprint in the rendered HTML, which is why a selector strategy that works on one framework may fail on another.

L3: Anti-Bot Sites

Sites that actively try to detect and block automated access. Techniques include rate limiting, fingerprinting request headers, requiring session cookies, obfuscating markup, and serving different content to suspected bots.

Yosoi does not attempt to bypass anti-bot measures (yet!). If a site blocks the HTTP request, discovery fails. You are responsible for handling authentication, headers, and session management before passing HTML to Yosoi (for now!).
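For sites you are permitted to access, handling headers and session cookies yourself can be as simple as the standard-library sketch below. It is ordinary HTTP client setup, not a bypass technique, and it will not get past active blocking. `build_opener_with_session` and `fetch_html` are hypothetical helper names, and the header values are placeholders.

```python
import urllib.request
from http.cookiejar import CookieJar
from typing import Dict


def build_opener_with_session(headers: Dict[str, str]) -> urllib.request.OpenerDirector:
    """Create an opener that keeps cookies and sends `headers` on every request."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replaces urllib's default User-agent header with the ones supplied.
    opener.addheaders = list(headers.items())
    return opener


def fetch_html(opener: urllib.request.OpenerDirector, url: str,
               timeout: float = 15.0) -> str:
    """Fetch `url` through the session opener and decode the body as text."""
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder values: identify your client honestly.
SESSION = build_opener_with_session({
    "User-Agent": "my-crawler/1.0 (contact@example.com)",
    "Accept-Language": "en",
})
```

Cookies set by one response are sent on later requests made through the same opener, so a login or consent request followed by `fetch_html(SESSION, url)` shares one session.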

QScrape L3 sites apply generic anti-bot patterns (excluding CAPTCHAs) for testing.

L4: CAPTCHAs and Traffic Monitoring

The hardest class of sites. In addition to L3 protections, these deploy active challenge puzzles (CAPTCHAs, proof-of-work) and monitor traffic patterns across sessions to flag and block suspicious behavior.

Bypassing L4 protections will be outside Yosoi’s scope for some time. Anti-CAPTCHA APIs and residential proxy networks exist for this purpose but are not integrated. QScrape does not include L4 sites.

QScrape

QScrape is a purpose-built evaluation suite of fictional websites covering L1 through L3. Use it to:

Validate that Yosoi’s selector discovery works correctly for a given site type
Regression-test your pipelines against a stable, unchanging source
Benchmark discovery accuracy across providers

Because QScrape sites are controlled and stable, they are ideal for CI. Point Yosoi at a QScrape URL and assert on the extracted field values.
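A CI assertion of this kind might be sketched as follows. The comparison helper `field_mismatches` is hypothetical and is not part of Yosoi; the step that actually runs Yosoi against a QScrape URL is deliberately left as a placeholder dictionary, since it depends on how your pipeline is wired. The field names and values are invented for illustration.

```python
from typing import Dict, List


def field_mismatches(extracted: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return one readable line per expected field that is missing or differs.

    Hypothetical helper: whitespace is collapsed before comparing, so layout
    changes in the source do not cause spurious failures.
    """
    problems = []
    for name, want in expected.items():
        if name not in extracted:
            problems.append(f"{name}: missing")
            continue
        got = " ".join(str(extracted[name]).split())
        if got != " ".join(want.split()):
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    return problems


# Invented expected values for a fictional page.
EXPECTED = {"title": "Harbor ferry schedule updated", "author": "A. Example"}


def test_article_fields():
    # Placeholder: replace this dict with the fields your pipeline extracted
    # from the QScrape page under test.
    extracted = {"title": "Harbor ferry  schedule updated\n", "author": "A. Example"}
    assert field_mismatches(extracted, EXPECTED) == []
```

Returning a list of mismatches rather than a boolean makes CI failures easier to read, because the assertion message shows which fields drifted.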

FAQs

Can Yosoi scrape single-page apps (L2 sites)?

Not directly right now. Yosoi fetches raw HTML, which is mostly empty for SPAs. Render the page yourself with a headless browser (Playwright, Puppeteer) and pass the rendered HTML to Yosoi. A native DOM fetcher is on the Roadmap but not available yet.

Why do class names change between deploys on framework-based sites?

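Build tools like webpack^◎ and Vite^✦, together with CSS scoping features such as CSS Modules and framework-scoped styles, generate hashed class names so that styles from different components do not collide. The hash is typically derived from the file path or contents, so a deploy that changes either can change the hash, which invalidates any selector that targets that class. Yosoi’s LLM-based discovery tries to avoid these by preferring stable attributes (IDs, data attributes, semantic tags), but it is not guaranteed.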
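To illustrate the idea of preferring stable names, here is a rough heuristic sketch that filters likely generated class names out of a `class` attribute. It is not how Yosoi's discovery works. The patterns are assumptions based on commonly seen tool output, they are neither exhaustive nor authoritative, and they can produce false positives (a hand-written class such as `css-grid` would be flagged).

```python
import re
from typing import List

# Assumed patterns for generated class names; adjust for the sites you target.
_GENERATED_PATTERNS = [
    re.compile(r"^svelte-[a-z0-9]+$"),   # Svelte scoped-style classes
    re.compile(r"^css-[a-z0-9]+$"),      # Emotion-style hashes
    re.compile(r"^sc-[A-Za-z0-9]+$"),    # styled-components-style ids
    # CSS Modules-style names of the form Component_class__hash
    re.compile(r"^[A-Za-z0-9-]+_[A-Za-z0-9-]+__[A-Za-z0-9_-]{5,}$"),
]


def looks_generated(class_name: str) -> bool:
    """Return True if `class_name` matches any assumed generated pattern."""
    return any(p.match(class_name) for p in _GENERATED_PATTERNS)


def stable_classes(class_attr: str) -> List[str]:
    """Split a class attribute and keep only names that look hand-written."""
    return [c for c in class_attr.split() if not looks_generated(c)]


print(stable_classes("card svelte-1a2b3c product-title"))
# ['card', 'product-title']
print(stable_classes("Button_root__3xY7z primary"))
# ['primary']
```

When no stable class remains, falling back to an ID, a data attribute, or a semantic tag follows the same preference order described above.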

What is the practical difference between L2 and L3 difficulty?

L2 adds frontend framework complexity but does not try to block you. L3 actively resists scraping. A tool that works on L2 will not necessarily work on L3.

Should I test against QScrape sites before hitting my real targets?

Yes. It is a fast way to verify your pipeline is wired up correctly before sending requests to sites you do not control.

References

△ DOM (Document Object Model). W3C / MDN. Programmatic interface for HTML and XML documents, representing the page as a tree of objects. https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model

○ Playwright. Microsoft. Browser automation library for Node.js, Python, Java, and .NET. https://playwright.dev/python/

◑ Puppeteer. Google Chrome DevTools. Node.js library providing a high-level API to control headless Chrome. https://pptr.dev/

◇ Svelte. Rich Harris. Compiler-based frontend framework. https://svelte.dev/

★ React. Meta. JavaScript library for building user interfaces. https://react.dev/

⬡ Vue. Evan You. Progressive JavaScript framework for building UIs. https://vuejs.org/

▽ Solid. Ryan Carniato. Fine-grained reactive UI library with no virtual DOM. https://www.solidjs.com/

◎ webpack. webpack contributors. Static module bundler for JavaScript applications. https://webpack.js.org/

✦ Vite. Evan You. Next generation frontend build tool. https://vite.dev/