Giving Agents Eyes (Web Scraping)

Agents are blind. They can't see the web. Here is how to build a tool that lets them read, research, and browse.

Dwizi Team

Editorial

"Can you summarize the latest article on TechCrunch about the new iPhone?"

If you ask this question to a standard LLM, you will likely get a disappointing answer: "I'm sorry, but as an AI, I do not have the ability to browse the live web. My knowledge cutoff is..."

Or, if you are using a model with a built-in browser (like ChatGPT Plus), it might work, but it will be slow, clunky, and opaque. You can't control it. You can't automate it. You can't pipe the output into your own database.

Most AI agents are trapped in a library where no new books have been added for two years. They are brilliant historians, but terrible journalists. To make them truly useful in a fast-moving world, we need to give them eyes. We need to give them a browser.

The Problem: The Web is Hostile to Machines

You might think, "Why is this hard? Just curl the URL and feed the HTML to the LLM."

If you have ever looked at the raw HTML of a modern website, you know why this fails. The web is a mess.

  • JavaScript: Most content is loaded dynamically. curl just gives you an empty <div id="app"></div>.
  • Noise: For every byte of article text, there are 100 bytes of navigation menus, sidebar ads, "Recommended for You" widgets, and tracking scripts.
  • Safety: Feeding raw, unsanitized HTML to an LLM is a security risk (prompt injection via invisible text; see the sketch after this list).
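
That last point deserves a concrete sketch. Malicious pages can hide instructions ("ignore your previous instructions and...") in zero-width characters or invisible elements. A minimal, illustrative sanitizer (the function name and character list are ours, and this is a mitigation, not a complete defense) might look like:

function sanitizeForLLM(text: string): string {
  return text
    // Strip zero-width and invisible characters commonly used to hide
    // injected instructions from human reviewers.
    .replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "")
    // Collapse runs of blank lines that silently eat the context window.
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}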

An agent doesn't need a full browser. It needs a Distiller: a tool that goes out to the noisy, chaotic web and brings back clean, structured, meaningful text.

The Solution: The "Markdown Browser"

Markdown is the native language of LLMs. It represents structure (headers, lists, links) without the visual baggage of HTML/CSS.

We need to build a tool that:

  1. Visits a URL.
  2. Executes the JavaScript to render the page.
  3. Identifies the "main content" (the article).
  4. Strips away the ads and fluff.
  5. Converts the result into clean Markdown.

The Implementation

We could build this from scratch using Puppeteer or Playwright, but managing headless browsers is painful (they are heavy, crash often, and get blocked by Cloudflare).
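For reference, the DIY route looks roughly like this (a sketch, assuming playwright, jsdom, @mozilla/readability, and turndown are installed; retries, timeouts, and anti-bot evasion are omitted):

import { chromium } from "playwright";
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

async function readWebpageDIY(url: string): Promise<string> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so JS-rendered content exists.
    await page.goto(url, { waitUntil: "networkidle" });
    const html = await page.content();

    // Readability isolates the "main content" from nav bars and ads.
    const dom = new JSDOM(html, { url });
    const article = new Readability(dom.window.document).parse();

    // Turndown converts the cleaned HTML into Markdown.
    return new TurndownService().turndown(article?.content ?? "");
  } finally {
    await browser.close();
  }
}

It works, but now you own a browser fleet.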

Instead, we will wrap a specialized extraction API (like Jina, Firecrawl, or similar) into a Dwizi tool. This gives us the power of a headless browser with the simplicity of a fetch call.

/**
 * Reads a webpage and returns its content as Markdown.
 * 
 * Description for LLM: "Use this to read an article, documentation, or blog post from a URL. It returns the main content in Markdown format."
 */

type Input = {
  url: string;
};

export default async function readWebpage({ url }: Input) {
  // We use a hypothetical but realistic extraction service endpoint.
  // Services like 'r.jina.ai' allow you to append a URL to their domain
  // to get a markdown version.
  const target = `https://r.jina.ai/${url}`;

  // Optional headers: ask the service to also summarize the links
  // and images it finds on the page.
  const response = await fetch(target, {
    headers: {
      "X-With-Links-Summary": "true",
      "X-With-Images-Summary": "true"
    }
  });

  if (!response.ok) {
    // If the site is down or blocks us, we return a clear error.
    // The Agent can then decide to try a different URL or apologize to the user.
    return { error: `Failed to read URL: ${response.statusText}` };
  }

  const markdown = await response.text();

  // The Context Window Problem
  // Some webpages are massive (e.g., a Wikipedia page with 100k words).
  // If we return the whole thing, we might blow the Agent's token budget.
  // We impose a hard limit (e.g., 20,000 characters).
  // This is an engineering tradeoff: we sacrifice completeness for reliability.
  const limit = 20000;
  const content = markdown.slice(0, limit);
  const truncated = markdown.length > limit;

  return { 
    content,
    truncated,
    originalUrl: url,
    title: extractTitle(markdown) // Helper to grab the first # Header
  };
}

function extractTitle(md: string) {
  const match = md.match(/^# (.*$)/m);
  return match ? match[1] : "No Title";
}
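
A quick smoke test of the function itself (the URL is a placeholder):

const result = await readWebpage({ url: "https://example.com/blog/post" });

if ("error" in result) {
  console.error(result.error);               // e.g. "Failed to read URL: Forbidden"
} else {
  console.log(result.title);                 // first "# Heading" on the page, or "No Title"
  console.log(result.truncated);             // true if the page exceeded the 20,000-char cap
  console.log(result.content.slice(0, 200)); // preview of the Markdown
}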

The Execution Story

Imagine you are an analyst tracking a competitor's product. You need current facts, not a two-year-old training snapshot.

User: "What is the latest pricing for the Dwizi Pro plan? Check their website."

Agent Thought Process:

  1. Search: The agent might first search Google to find the URL (if it has a search tool). Let's assume it finds https://dwizi.com/pricing.
  2. Read: The agent calls read_webpage({ url: "https://dwizi.com/pricing" }) (the tool schema it sees is sketched after this list).
  3. Process: The tool (via the extraction service) fetches the rendered page, strips the nav bar and footer, and returns:
    # Pricing
    
    ## Hobby
    Free forever. Perfect for side projects.
    
    ## Pro
    $20/month. For serious developers.
    - Unlimited runners
    - Encrypted secrets
    ...
    
  4. Synthesize: The Agent reads this Markdown. There are no CSS classes or nav menus to wade through; it sees only the facts.
  5. Answer: "Dwizi offers two plans. The Hobby plan is free, and the Pro plan is $20/month."
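
How does the model know this tool exists and what arguments it takes? However your framework wires it up, the model ultimately sees a machine-readable description. Here it is sketched in the OpenAI function-calling format (illustrative only; Dwizi's actual registration mechanics may differ):

const readWebpageSchema = {
  type: "function",
  function: {
    name: "read_webpage",
    description:
      "Use this to read an article, documentation, or blog post from a URL. " +
      "It returns the main content in Markdown format.",
    parameters: {
      type: "object",
      properties: {
        url: { type: "string", description: "The full http(s) URL to read." },
      },
      required: ["url"],
    },
  },
};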

Why This Matters

By giving the agent "eyes," we transform it from a static knowledge base into a dynamic research assistant. It can read documentation, check competitor pricing, summarize news, and debug errors by reading StackOverflow.

The web is the world's largest database. This tool is the query interface your agent uses to read it.
