
HTML vs. the DOM

When a server responds to an HTTP request, it sends back raw HTML — a string of tags and text. When a browser receives that string, it parses it into a live tree of objects called the DOM (Document Object Model). JavaScript can then modify that tree: adding nodes, removing them, rewriting text, injecting entire components.

What you see on screen may look nothing like the original HTML the server sent.

This distinction is why tools like curl or requests fall short on modern web apps. They fetch the raw HTML — the string the server sent. If the content is injected by JavaScript after the page loads, that string is a mostly-empty shell.

Server HTML:

```html
<body>
  <div id="root"></div>
  <script src="/static/js/main.c8f2a1d.js"></script>
</body>
```

The browser parses this, runs the script, and the DOM becomes:

```html
<body>
  <div id="root">
    <nav class="Nav sc-aXZVg kWIkMi">
      <a href="/">Home</a>
      <a href="/products">Products</a>
    </nav>
    <main>
      <ul class="sc-bdVTJa bKTGaB">
        <li class="sc-hKMtZM gUzRaI" data-id="42">
          <img src="/cdn/42.webp" alt="Widget Pro" />
          <span class="price">$29.99</span>
        </li>
      </ul>
    </main>
  </div>
</body>
```
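The shell-vs-rendered gap can be checked mechanically. A minimal stdlib sketch — the two HTML strings below are condensed versions of the example above, and `TagCounter` is an illustrative helper, not part of any library:

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count how many times each tag appears in an HTML string."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

def count_tags(html: str) -> dict:
    parser = TagCounter()
    parser.feed(html)
    return parser.counts

# Condensed versions of the server HTML and rendered DOM shown above.
server_html = (
    '<body><div id="root"></div>'
    '<script src="/static/js/main.c8f2a1d.js"></script></body>'
)
rendered_dom = (
    '<body><div id="root"><nav><a href="/">Home</a>'
    '<a href="/products">Products</a></nav><main><ul>'
    '<li data-id="42"><img src="/cdn/42.webp" alt="Widget Pro"/>'
    '<span class="price">$29.99</span></li></ul></main></div></body>'
)

print(count_tags(server_html).get("li", 0))   # → 0: no product items in the raw HTML
print(count_tags(rendered_dom).get("li", 0))  # → 1: items exist only after rendering
```

The product `<li>` simply does not exist in the server response — which is exactly what a static fetcher would (fail to) see.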

VoidCrawl gives you access to the live DOM, not the raw HTML. It controls a real Chrome instance that executes JavaScript, renders the page, and exposes the resulting DOM tree for you to query and interact with.

Why You Need a Real Browser

Static HTML fetching (requests, httpx, curl) works when the content is in the server response. But modern websites increasingly rely on:

  • Client-side rendering — React, Vue, Svelte, and Solid apps that build the page in JavaScript
  • Lazy loading — content that loads as you scroll or interact
  • Authentication flows — login forms, OAuth redirects, session cookies
  • Anti-bot protections — WAFs that check for real browser behavior before serving content
  • Shadow DOM — encapsulated components that aren’t visible in the outer HTML

For all of these, you need a real browser that executes JavaScript, manages cookies, and renders the page. VoidCrawl provides that browser.

The Tool Landscape

VoidCrawl isn’t the only browser automation tool. Here’s how it fits in:

| Tool | Language | Protocol | License | Approach |
|------|----------|----------|---------|----------|
| VoidCrawl | Rust + Python | CDP | Apache-2.0 | Rust core, async-native, tab pooling |
| Playwright | Node/Python/Java/.NET | CDP + custom | Apache-2.0 | Bundled browsers, high-level API |
| Selenium | Java/Python/JS/etc. | WebDriver | Apache-2.0 | Oldest, broadest browser support |
| Puppeteer | Node.js | CDP | Apache-2.0 | Google-maintained, Chrome-only |
| zendriver | Python | CDP | AGPL-3.0 | Async, stealth-focused, undetected-chromedriver successor |
| nodriver | Python | CDP | AGPL-3.0 | zendriver predecessor |
| undetected-chromedriver | Python | WebDriver | GPL-3.0 | Patched ChromeDriver to bypass detection |

Why VoidCrawl exists

Playwright and Selenium are excellent general-purpose tools. But for high-volume scraping where you need to keep browsers alive, reuse tabs, and minimize overhead:

  • Playwright launches a new browser context per session. There’s no built-in tab pooling or long-lived daemon mode.
  • Selenium uses the WebDriver protocol, which has higher overhead than CDP and doesn’t support tab-level recycling.
  • zendriver/nodriver are async and stealth-focused (VoidCrawl’s stealth approach is inspired by them), but they are pure Python with no compiled core, and they carry copyleft licenses.

VoidCrawl’s Rust core handles CDP I/O on a Tokio runtime. The BrowserPool keeps Chrome alive as a daemon and recycles tabs via a semaphore-bounded queue. The result: near-instant tab acquisition after warmup, with the stealth properties of zendriver.
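The pool mechanics can be sketched in pure Python. This is a simplified conceptual model of a semaphore-bounded tab pool — not VoidCrawl's actual implementation, which lives in the Rust core — and `Tab` here is a stand-in for a live browser tab:

```python
import asyncio

class Tab:
    """Stand-in for a warm browser tab held open by the daemon."""
    def __init__(self, tab_id: int):
        self.tab_id = tab_id

class TabPool:
    """Semaphore-bounded pool: at most `size` tabs checked out at once,
    recycled through a queue instead of being re-created per request."""
    def __init__(self, size: int):
        self._sem = asyncio.Semaphore(size)
        self._tabs = asyncio.Queue()
        for i in range(size):
            self._tabs.put_nowait(Tab(i))   # warmup: tabs exist up front

    async def acquire(self) -> Tab:
        await self._sem.acquire()           # block when all tabs are busy
        return await self._tabs.get()       # reuse a warm tab: no launch cost

    def release(self, tab: Tab) -> None:
        self._tabs.put_nowait(tab)          # recycle for the next caller
        self._sem.release()

async def scrape(pool: TabPool, url: str) -> str:
    tab = await pool.acquire()
    try:
        return f"tab {tab.tab_id} -> {url}"  # stand-in for navigate + extract
    finally:
        pool.release(tab)

async def main():
    pool = TabPool(size=2)
    urls = [f"https://example.com/{i}" for i in range(5)]
    results = await asyncio.gather(*(scrape(pool, u) for u in urls))
    print(len(results))  # → 5: five pages scraped through two recycled tabs

asyncio.run(main())
```

The key property is in `release`: a finished tab goes back into the queue rather than being torn down, so the next `acquire` skips browser/tab startup entirely.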

When to use what

  • Just need to render a page once? Playwright is simpler to set up.
  • Need long-running concurrent scraping with tab reuse? VoidCrawl’s pool is purpose-built for this.
  • Need to bypass WAFs? VoidCrawl’s stealth mode handles most cases. See Stealth Mode.
  • Need cross-browser support (Firefox, Safari)? Use Playwright or Selenium. VoidCrawl is Chrome-only.

Page Source vs. Inspect

A quick way to check whether a site needs browser automation:

  • View Source (Ctrl+U / Cmd+U) shows the raw HTML the server sent. If the content you want is here, static fetching works.
  • Inspect (Ctrl+Shift+I / Cmd+Shift+I) shows the live DOM after JavaScript has run. If the content only appears here, you need a browser.

VoidCrawl’s tab.content() returns the live DOM — equivalent to what Inspect shows, not View Source.

FAQs

Can VoidCrawl handle single-page apps (SPAs)?

Yes. VoidCrawl runs a real Chrome instance that executes JavaScript and builds the full DOM. SPAs render normally. Use wait_for_stable_dom() to ensure the page has finished rendering before extracting content.
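The idea behind waiting for a stable DOM can be sketched as polling until consecutive snapshots stop changing. This is a conceptual model, not VoidCrawl's implementation; `get_dom` is a hypothetical callable standing in for the tab's content getter:

```python
import asyncio
import hashlib
from typing import Awaitable, Callable

async def wait_for_stable_dom_sketch(
    get_dom: Callable[[], Awaitable[str]],
    interval: float = 0.05,
    required_stable_polls: int = 2,
    timeout: float = 10.0,
) -> str:
    """Poll the DOM until its hash is unchanged for N consecutive polls."""
    last_hash, stable, elapsed = None, 0, 0.0
    while elapsed < timeout:
        dom = await get_dom()
        digest = hashlib.sha256(dom.encode()).hexdigest()
        if digest == last_hash:
            stable += 1
            if stable >= required_stable_polls:
                return dom                    # DOM stopped changing: done
        else:
            last_hash, stable = digest, 0     # still rendering: reset counter
        await asyncio.sleep(interval)
        elapsed += interval
    raise TimeoutError("DOM never stabilized")

# Simulate an SPA that finishes rendering after a few frames.
frames = iter(["<div></div>", "<div>loading</div>", "<div>done</div>",
               "<div>done</div>", "<div>done</div>"])

async def fake_get_dom() -> str:
    return next(frames)

print(asyncio.run(wait_for_stable_dom_sketch(fake_get_dom)))  # → <div>done</div>
```

Requiring several identical polls in a row (rather than one) avoids declaring victory between two render batches.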

What about shadow DOM components?

VoidCrawl’s stealth layer forces all attachShadow calls to use mode: 'open', making shadow DOM content accessible to automation. This also enables interaction with WAF challenges like Cloudflare Turnstile.

Do I need VoidCrawl if I’m already using Yosoi?

Yosoi fetches raw HTML by default. For sites that require JavaScript rendering (L2+ difficulty), you can use VoidCrawl to render the page and pass the DOM to Yosoi. A native integration is planned.

References

DOM (Document Object Model). W3C / MDN. Programmatic interface for HTML and XML documents, representing the page as a tree of objects. https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model

Playwright. Microsoft. Browser automation library for Node.js, Python, Java, and .NET. https://playwright.dev/python/

Selenium. Selenium Contributors. Browser automation framework. https://www.selenium.dev/

Puppeteer. Google Chrome DevTools. Node.js library providing a high-level API to control headless Chrome. https://pptr.dev/

zendriver. cdpdriver. Async Chrome automation with stealth, successor to undetected-chromedriver. https://github.com/cdpdriver/zendriver

nodriver. ultrafunkamsterdam. Predecessor to zendriver. https://github.com/ultrafunkamsterdam/nodriver

undetected-chromedriver. ultrafunkamsterdam. Patched ChromeDriver to bypass bot detection. https://github.com/ultrafunkamsterdam/undetected-chromedriver