We're the team behind Webhound (https://webhound.ai), an AI agent that builds datasets from the web based on natural language prompts. You describe what you're trying to find. The agent figures out how to structure the data and where to look, then searches, extracts the results, and outputs everything in a CSV you can export.
We've set up a special no-signup version for the HN community at https://hn.webhound.ai - just click "Continue as Guest" to try it without signing up.
Here's a demo: https://youtu.be/fGaRfPdK1Sk
We started building it after getting tired of doing this kind of research manually. Open 50 tabs, copy everything into a spreadsheet, realize it's inconsistent, start over. It felt like something an LLM should be able to handle.
Some examples of how people have used it in the past month:
Competitor analysis: "Create a comparison table of internal tooling platforms (Retool, Appsmith, Superblocks, UI Bakery, BudiBase, etc) with their free plan limits, pricing tiers, onboarding experience, integrations, and how they position themselves on their landing pages." (https://www.webhound.ai/dataset/c67c96a6-9d17-4c91-b9a0-ff69...)
Lead generation: "Find Shopify stores launched recently that sell skincare products. I want the store URLs, founder names, emails, Instagram handles, and product categories." (https://www.webhound.ai/dataset/b63d148a-8895-4aab-ac34-455e...)
Pricing tracking: "Track how the free and paid plans of note-taking apps have changed over the past 6 months using official sites and changelogs. List each app with a timeline of changes and the source for each." (https://www.webhound.ai/dataset/c17e6033-5d00-4e54-baf6-8dea...)
Investor mapping: "Find VCs who led or participated in pre-seed or seed rounds for browser-based devtools startups in the past year. Include the VC name, relevant partners, contact info, and portfolio links for context." (https://www.webhound.ai/dataset/1480c053-d86b-40ce-a620-37fd...)
Research collection: "Get a list of recent arXiv papers on weak supervision in NLP. For each, include the abstract, citation count, publication date, and a GitHub repo if available." (https://www.webhound.ai/dataset/e274ca26-0513-4296-85a5-2b7b...)
Hypothesis testing: "Check if user complaints about Figma's performance on large files have increased in the last 3 months. Search forums like Hacker News, Reddit, and Figma's community site and show the most relevant posts with timestamps and engagement metrics." (https://www.webhound.ai/dataset/42b2de49-acbf-4851-bbb7-080b...)
The first version of Webhound was a single agent running on Claude 4 Sonnet. It worked, but sessions routinely cost over $1100 and it would often get lost in infinite loops. We knew that wasn't sustainable, so we started building around smaller models.
That meant adding more structure. We introduced a multi-agent system to keep it reliable and accurate. There's a main agent, a set of search agents that run subtasks in parallel, a critic agent that keeps things on track, and a validator that double-checks extracted data before saving it. We also gave it a notepad for long-term memory, which helps avoid duplicates and keeps track of what it's already seen.
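In very rough pseudocode, the loop looks something like this (a simplified Python sketch with placeholder functions, not our actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Notepad:
    """Long-term memory shared across the run; used to avoid duplicates."""
    seen: set = field(default_factory=set)

def search_agent(subtask: str) -> list[dict]:
    """Placeholder: a search agent runs one subtask and returns candidate rows."""
    return []  # would call the model + text browser here

def critic_ok(plan: str, rows: list[dict]) -> bool:
    """Placeholder critic: checks that the run is still on track."""
    return True

def validate(row: dict) -> bool:
    """Placeholder validator: double-checks a row before it is saved."""
    return bool(row)

def run(plan: str, subtasks: list[str]) -> list[dict]:
    notepad, dataset = Notepad(), []
    with ThreadPoolExecutor() as pool:              # search agents run in parallel
        for rows in pool.map(search_agent, subtasks):
            if not critic_ok(plan, rows):
                continue                            # critic steers the main agent back on track
            for row in rows:
                key = tuple(sorted(row.items()))
                if key not in notepad.seen and validate(row):
                    notepad.seen.add(key)           # notepad prevents duplicate rows
                    dataset.append(row)
    return dataset
```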
After switching to Gemini 2.5 Flash and layering in the agent system, we were able to cut costs by more than 30x while also improving speed and output quality.
The system runs in two phases. First is planning, where it decides the schema, how to search, what sources to use, and how to know when it's done. Then comes extraction, where it executes the plan and gathers the data.
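To make that concrete, the output of the planning phase can be thought of as something like this (illustrative field names only):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    schema: dict[str, str]      # column name -> type/description
    queries: list[str]          # search subtasks to fan out to search agents
    sources: list[str]          # sites or source types to prioritize
    done_when: str              # completion criterion

example = Plan(
    schema={"app": "string", "plan": "string", "price_usd": "number", "source_url": "url"},
    queries=["note-taking app pricing pages", "changelog entries mentioning pricing"],
    sources=["official sites", "changelogs"],
    done_when="every app in scope has a dated pricing timeline with sources",
)
```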
It uses a text-based browser we built that renders pages as markdown and extracts content directly. We tried full browser use but it was slower and less reliable. Plain text still works better for this kind of task.
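The core of it is just fetching a page and converting it to markdown before it ever reaches the model. A stripped-down sketch using the html2text library (ours does more on top, like chunking and navigation hints):

```python
import requests
import html2text  # pip install html2text

def fetch_as_markdown(url: str) -> str:
    html = requests.get(url, timeout=30).text
    converter = html2text.HTML2Text()
    converter.ignore_images = True   # keep the text compact for the model's context
    converter.body_width = 0         # don't hard-wrap lines
    return converter.handle(html)

print(fetch_as_markdown("https://example.com")[:500])
```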
We also built scheduled refreshes to keep datasets up to date and an API so you can integrate the data directly into your workflows.
Right now, everything stays in the agent's context during a run. It starts to break down around 1000-5000 rows depending on the number of attributes. We're working on a better architecture for scaling past that.
We'd love feedback, especially from anyone who's tried solving this problem or built similar tools. Happy to answer anything in the thread.
Thanks! Moe
It looks promising overall, and I can see where this will save a ton of time for some types of data gathering. But I'm also seeing quite a lot of queries and processing going on, and it has only found about 3% of the data that I know is publicly available, after 20 minutes.
It does say that extraction can take hours, but I was expecting it would be more of an 80/20 kind of thing, with a lot of data found quickly, then a long tail of searching to fill in gaps. Is my expectation wrong?
I worry for two related reasons. First, inefficient gathering of data is going to churn and burn more resources than necessary, both on your systems and on the sites being hit. Second, although this free opportunity is an amazing way to show off your tool, I fear the pricing of an actual run is going to be high.
Thanks for the feedback! From what we've seen it's actually the other way around: once it gets a sense of where the information lives, the latter stages of data collection go quicker, especially since it can deploy search agents in parallel and no longer has to do as much of the work manually. Having said that, it does sometimes forget to do this. The critic agent is there to remind it, but it can be inconsistent; usually, stepping in and asking it to deploy agents in parallel fixes it.
We use Gemini 2.5 Flash which is already pretty cheap, so inference costs are actually not as high as they would seem given the number of steps. Our architecture allows for small models like that to operate well enough, and we think those kinds of models will only get cheaper.
Having said all that, we are working on improving latency and allowing for more parallelization wherever possible, and hope to include that in future versions, especially for enrichment. One weakness of the product is mass collection: it's better at finding medium-sized datasets from siloed sources and less good at assembling large, comprehensive ones. We're also considering approaches that incorporate more traditional scraping tactics for those large datasets.
Congratulations on the launch. I'm currently using it to run a very specific task that I was thinking about just earlier today. Will let you know how it gets on.
When I read your description, I thought, "This is just like the Exa dataset, no?" but then I gave it a try, and I am genuinely impressed.
Great decision to make it without a login so people can test.
Here is what I liked:
- The agent told me exactly what's happening, which sources it is checking, and the schema.
- The agent correctly identified where to look and how to obtain the data.
- Managing expectations: "Webhound is extracting data. Extraction can take multiple hours. We'll send you an email when it's complete."
Minor point:
- There is no pricing on the main domain, only on the HN one: https://hn.webhound.ai/pricing
Good luck!
Thanks, glad to hear you had a good experience.
We were heavily inspired by tools like Cursor - basically tried to prioritize user control and visibility above everything else.
What we discovered during iteration was that our users are usually domain experts who know exactly what they want. The more we showed them what was happening under the hood and gave them control over the process, the better their results got.
Since it looks like you are built on FireCrawl, and FireCrawl itself has similar products like FireEnrich, how do you see yourselves maintaining differentiation and competing with FireCrawl directly if they just decide to copy you?
As an aside, we are about to launch something similar at rtrvr.ai, but with AI Web Agents navigating pages, filling forms, and retrieving data. We are able to get our costs down to negligible levels by using headless, serverless browsers and our own ground-up DOM construction/actuation (so no FireCrawl costs). https://www.youtube.com/watch?v=gIU3K4E8pyw
Good point. Our main differentiation is the shared workspace - users can step in and guide the agent mid-task, kind of like Cursor vs Claude (which can technically generate the same code Cursor does). Firecrawl (or any crawler we may use) is only part of the process; we want to make the collaborative user <> agent process as robust and user-controllable as possible.
Great job on the launch. This fills an exact need my company has, and your UI is fantastic. One nit: it would be nice to be able to manually edit the schema instead of only interacting with the LLM.
I am concerned about your pricing, as "unlimited" anything seems to be fading away from most LLM providers. Also, I don't think it makes sense for B2B clients who have no problem paying per usage. You are going to find customers that want to use this to poll for updates daily, for example.
Are you using proxies for your text-based browser? I am curious how you are circumventing web crawling blocking.
Thanks a lot regarding UI and good point on the schema editing.
We've been having similar thoughts about pricing and offering unlimited, but since it's feasible for us in the short term thanks to credits, we enjoy offering that option to early users, even if it may be a bit naive.
Having said that, we are currently working on a pilot with a company to whom we are offering live updates, and they are paying per usage since they don't want to set it up themselves, so we can definitely see the demand there. We also offer an API for companies that want to reliably query the same thing at a preset cadence, which is also usage-based.
For crawling we use Firecrawl. They handle most of the blocking issues and proxies.
On a simple dataset (that can be answered by a single table in a single Wikipedia page) it seems to overthink https://www.webhound.ai/dataset/57a5c745-909e-466d-bcac-1dd7....
Quickly hit your limits, but on a complex dataset requiring looking at a lot of unstructured data across a lot of different web pages, it seems to do really well! https://hn.webhound.ai/dataset/c6ca527e-1754-4171-9326-11cc8...
Yeah, we've noticed it overthinks simple tasks that could be solved with a single table scrape. The agent architecture is built for complex, multi-source problems so it overengineers straightforward queries.
Working on better task classification upfront to route simple requests more directly.
This is really cool, I could imagine a number of ways I'd want to incorporate this into my data pipeline. Your agent is actually doing a great job of finding useful sources, though I wish there was a way it could really dig into some of these sources and pull hard from them. For my query it actually found some cool sources I didn't know existed, but it only pulled a single query from them. I'm now thinking I should go and write a custom scraper for those sources to actually get the full data I want.
Thanks, we have noticed that it can tend to "give up" early on certain sources. Ideally the critic agent would guide it back toward continuing to go deeper, but if that doesn't work, adding something to the prompt or sending it a message later telling it to go deep on those sources usually does.
I found the ability to stop and clarify a task in "one-shot" mode impressive. In my original prompt it misunderstood MCP to stand for Medical Care Plan. I was worried I wasted a generation but being able to stop and clarify fixed it.
Oh, nevermind. It became confused and was unable to complete the task:
> I noticed you mentioned that "MCP stands for model context protocol." My current understanding, based on the initial problem description and the articles I've been reviewing, is that MCP refers to "Managed Care Plan." This is important because the entire schema and extraction plan are built around "Managed Care Plans."
Session ID: fcd1edb8-7b3c-480e-a352-ed6528556a63
Sorry about that. If you tell it to restructure the schema and search plan around MCP as model context protocol it should work. The agent can get stuck on its initial interpretation sometimes.
Congrats on the HN Launch!
It's probably the best research agent that uses live search. Are you using Firecrawl, I assume?
We're soon launching a similar tool (CatchALL by NewsCatcher) that does the same thing but on a much larger scale, because we already index and pre-process millions of pages daily (news, corporate, government files). We're seeing much better results than parallel.ai for queries like "find all new funding announcements for any kind of public transit in California State, US that took place in the past two weeks".
However, our tool will not perform live searches, so I think we're complementary.
I'd love to chat.
I like this approach better TBH - more reliable and robust. It probably satisfies 80% of most customer queries too as most want to query against the same sources
Oh, I totally see your point.
We’re optimising for large enterprises and government customers that we serve, not consumers.
Even the most motivated people, such as OSINT or KYC analysts, can only skim through tens, maybe hundreds of web pages. Our tool goes through 10,000+ pages per minute.
An LLM that has to open each web page to process the context isn’t much better than a human.
A perfect web search experience for an LLM would be to get just the answer, i.e. the valid tokens that can be fully loaded into context, with citations.
Many enterprises should leverage AI workflows, not AI agents.
Nice-to-have vs. must-have: existing AI implementations are failing because it's hard to rely on their results; therefore, they're used for nice-to-haves.
Most business departments know precisely what real-world events can impact their operations. Therefore, search is unnecessary; businesses would love to get notifications.
The best search is no search at all. We’re building monitors – a solution that transforms your catchALL query into a real-time updating feed.
So your customers just want to use this for their own internal data, not external data from the web. Is that correct?
No no, they want to use it on external data; we do not do any internal data.
I'll give a few examples of how they use the tool.
Example 1 -- a real estate PE firm that invests in multi-family residential buildings. Let's say they operate in Texas and want to get notifications about many different events. For example, they need to know about any new public transport infrastructure that will make a specific area more accessible -> prices will go up.
There are hundreds of valid records each month. However, to derive those records, we usually have to sift through tens of thousands of hyper-local news articles.
Example 2 -- Logistics & Supply Chain at an F100: tracking all the 3rd-party providers, any kind of instability in the main regions, disruptions at air and marine ports, political discussions around regulation that might affect them, etc. There are like 20-50 events, and all of them are multi-lingual at global scale.
Thousands of valid records each week, millions of web pages to derive those from.
Hey, would be happy to chat. Shoot us an email at team@webhound.ai and we can set up a time.
done
Nice, this looks interesting.
> It uses a text-based browser we built
Can you tell us more about this? How does it work?
We maintain a constant browser state that gets fed into the system prompt; it shows the most recent results, the current page, where you are in the content, what actions are available, etc. It's markdown by default but can switch to HTML if needed (for pagination or CSS selectors). The agent always has full context of its browsing session.
A few design decisions we made that turned out pretty interesting:
1. We gave it an analyze results function. When the agent is on a search results page, instead of visiting each page one by one, it can just ask "What are the pricing models?" and get answers from all search results in parallel.
2. Long web pages get broken into chunks with navigation hints so the agent always knows where it is and can jump around without overloading its context ("continue reading", "jump to middle", etc.).
3. For sites that are commonly visited but have messy layouts or spread-out information, we built custom tool calls that let the agent request specific info that might be scattered across different pages and consolidate it all into one clean text response.
4. We're adding DOM interaction via text in the next couple of days, so the agent can click buttons, fill forms, enter keys, but everything still comes back as structured text instead of screenshots.
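To give a rough idea of what the agent sees on each step, here's a toy sketch of the chunked page state we render into the prompt (simplified, with made-up action names):

```python
def chunk_page(markdown: str, chunk_chars: int = 4000) -> list[str]:
    """Split a long page into fixed-size chunks the agent can navigate."""
    return [markdown[i:i + chunk_chars] for i in range(0, len(markdown), chunk_chars)]

def render_state(url: str, chunks: list[str], pos: int) -> str:
    """Text block injected into the system prompt on every step."""
    return "\n".join([
        f"CURRENT PAGE: {url}",
        f"CHUNK {pos + 1} of {len(chunks)}",
        chunks[pos],
        "ACTIONS: continue_reading | jump_to_chunk <n> | back_to_results | analyze_results <question>",
    ])
```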
Thanks. If I am interpreting this correctly, what you have is not a browser but a translation layer. You are still using something that scrapes the data and then you translate it to be in the format that works best for your agent.
My original interpretation was that you had built a full-blown browser, something akin to a Chromium/Firefox fork.
AI crawlers have led to a big surge in scraping activity, and most of these bots don't respect any of the scraping best practices the industry has developed over the past two decades (robots.txt, rate limits, user agents, etc.).
This comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported here on HN (and experienced myself).
Does Webhound respect robots.txt directives and do you disclose the identity of your crawlers via user-agent header?
We currently use Firecrawl for our crawling infrastructure. Looking at their documentation, they claim to respect robots.txt, but based on user reports in their GitHub issues, the implementation seems inconsistent - particularly for one-off scrapes vs full crawls.
This is definitely something we need to address on our end. Site owners should have clear ways to opt out, and crawlers should be identifiable. We're looking into either working with Firecrawl to improve this or potentially switching to a solution that gives us more control over respecting these standards.
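For what it's worth, the baseline check we want in place is straightforward; a minimal sketch with Python's standard robotparser (the user-agent string here is a placeholder, not a crawler identity we currently announce):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot/0.1 (+https://example.com/bot)"  # placeholder identity

def allowed_by_robots(url: str) -> bool:
    root = urlparse(url)
    rp = RobotFileParser(f"{root.scheme}://{root.netloc}/robots.txt")
    rp.read()  # fetch and parse robots.txt
    return rp.can_fetch(USER_AGENT, url)

if allowed_by_robots("https://example.com/pricing"):
    pass  # proceed with the scrape, sending USER_AGENT in the request headers
```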
Appreciate you bringing this up.
Firecrawl is egregiously expensive
Used this in a "vibe coding hackathon" the other day. Works really well!
When it finds a page that has the results I'm looking for in a particular website, does it paginate through all of them? When I searched for "stores that are on the Faire marketplace" it seems to return just the first page of results without paginating through all of them.
Right now it can do that via URL params if that is how the website handles pagination, although we are pushing a feature in the next couple of days which allows it to take action on the DOM.
If it isn't doing that in your session, you can usually just step in and tell it to and it will follow your instructions.
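For URL-param pagination specifically, what it does is roughly equivalent to this (a sketch; the `page` parameter name obviously varies by site):

```python
import requests

def fetch_all_pages(base_url: str, max_pages: int = 50) -> list[str]:
    pages = []
    for page in range(1, max_pages + 1):
        resp = requests.get(base_url, params={"page": page}, timeout=30)
        if resp.status_code != 200 or not resp.text.strip():
            break  # stop when the site runs out of results
        pages.append(resp.text)
    return pages
```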
Is there an open source alternative to building this type of UI / job scheduling backend that works for grids?
Pretty cool, how does it compare to Parallel?
Thanks! Unlike a lot of our competitors who use search-inspired UX, we went with an agentic approach inspired by tools like Cursor - basically iterative user control.
Instead of just search query → final result (though you can do that too), you can step in and guide it. Tell it exactly where to look, what sources to check, how to dig deeper, how to use its notepad.
We've found this gets you way better results that actually match what you're looking for, as well as being a more satisfying user experience for people who already know how they would do the job themselves. Plus it lets you tap into niche datasets that wouldn't show up with just generic search queries.
Can I give it a list of products and have it provide data enrichment for each of them?
Yep, you can paste the list as text, or we also accept file uploads. Then, you can prompt it to enrich with certain attributes and it will do that for you.
This is how I use Perplexity. I'll have to give this a try; I'm always on the lookout for newer tools.