Extract Public Web Data via API
HasData is a cloud API that extracts structured public web data from search engines, e-commerce sites, and arbitrary URLs with three execution modes: sync Scraper APIs, the Web Scraping API, and async Scraper Jobs.
Why it matters
Leverage a unified API to extract structured data from public websites, including search engine results, e-commerce listings, and local business information.
Outcomes
What it gets done
Scrape arbitrary URLs with JS rendering and CSS/AI extraction.
Access pre-parsed JSON data for platforms like Google, Amazon, and Zillow.
Perform bulk data extraction and recursive crawling via asynchronous jobs.
Gather business contact details from Maps and other public sources.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/ag-hasdata | bash
Capabilities
What this skill does
Fetches and parses content from web pages.
Pulls structured data fields from unstructured text.
Searches the web and retrieves relevant sources.
Adds company, role, and contact data to lead records.
Overview
HasData
What it does
Cloud platform for extracting public web data with one API key and three execution modes: sync Scraper APIs for known platforms, Web Scraping API for arbitrary URLs with JavaScript rendering and CSS/AI extraction, and async Scraper Jobs for bulk crawling and webhook fan-out.
How it connects
Use when you need search engine results, structured data from e-commerce or travel sites, arbitrary URL scraping with rendering, or bulk extraction jobs. Default to Scraper APIs for supported platforms like Google, Amazon, and Zillow; use Web Scraping for other URLs; use Scraper Jobs for crawler, contacts, SEC EDGAR, or async workflows.
Source README
HasData
Cloud platform for extracting public web data. One API key, three execution modes. All endpoints sit under https://api.hasdata.com and authenticate with x-api-key.
curl -G 'https://api.hasdata.com/scrape/google/serp' \
--data-urlencode 'q=coffee' \
-H 'x-api-key: <your-api-key>'
Error statuses: 401 invalid key, 403 quota exhausted, 429 concurrency cap, 500 server error (retry).
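As a rough Python counterpart to the curl call above, the sketch below builds the same request with the standard library and maps the listed status codes to their meaning. The helper names (api_key_from_env, build_serp_request, describe_status) are illustrative and not part of any HasData SDK; the HASDATA_API_KEY variable name follows the convention this README uses for the key.

```python
import os
import urllib.parse
import urllib.request

BASE_URL = "https://api.hasdata.com"

# Status codes exactly as listed in the README; anything else is "unexpected".
STATUS_MEANINGS = {
    401: "invalid key",
    403: "quota exhausted",
    429: "concurrency cap",
    500: "server error (retry)",
}


def api_key_from_env() -> str:
    """Read the key from the environment instead of hardcoding it."""
    return os.environ["HASDATA_API_KEY"]


def build_serp_request(query: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request to /scrape/google/serp."""
    params = urllib.parse.urlencode({"q": query})
    url = f"{BASE_URL}/scrape/google/serp?{params}"
    return urllib.request.Request(url, headers={"x-api-key": api_key}, method="GET")


def describe_status(code: int) -> str:
    """Translate an HTTP status into the README's short description."""
    return STATUS_MEANINGS.get(code, "unexpected status")
```

To send the request, pass the result to urllib.request.urlopen with a generous timeout, for example urllib.request.urlopen(build_serp_request("coffee", api_key_from_env()), timeout=300).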
When to Use
Use this skill when:
- The user needs web scraping.
- The user needs search engine results.
- The user needs structured data extraction.
- The user needs ecommerce, travel, jobs, or local business data.
- The user explicitly asks about HasData.
Three execution modes
| Mode | Latency | When | Endpoint |
|---|---|---|---|
| Web Scraping API | seconds | Arbitrary URL - JS rendering, CSS/AI extraction, screenshots | POST /scrape/web |
| Scraper APIs (sync) | seconds | Pre-parsed JSON for known platforms (Google, Amazon, Zillow, …) | GET /scrape/<vertical>/<resource> |
| Scraper Jobs (async) | minutes-hours | Bulk extraction, recursive crawling, webhook fan-out | POST /scrapers/<slug>/jobs |
Decision rule. Default to a Scraper API when one exists for the platform (pre-parsed JSON, no selector maintenance). Use Web Scraping for arbitrary URLs not covered by an API. Reach for a Scraper Job only when no API equivalent exists - crawler, contacts, sec-edgar, amazon-bestsellers, amazon-product-reviews - or when async fan-out + webhooks save engineering time over a paginated client loop.
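One way to encode the decision rule is a small helper like the one below. It is only a sketch: choose_mode and its arguments are hypothetical names, and whether a platform has a Scraper API is left to the caller, since the README does not give the full list of supported platforms.

```python
from typing import Optional

# Job slugs the README names as having no synchronous API equivalent.
JOB_ONLY_SLUGS = {
    "crawler",
    "contacts",
    "sec-edgar",
    "amazon-bestsellers",
    "amazon-product-reviews",
}


def choose_mode(
    has_scraper_api: bool,
    job_slug: Optional[str] = None,
    wants_async_fanout: bool = False,
) -> str:
    """Return the execution mode the decision rule points to."""
    # Jobs only when nothing synchronous exists, or async fan-out is worth it.
    if job_slug in JOB_ONLY_SLUGS or wants_async_fanout:
        return "scraper-job"
    # Prefer pre-parsed JSON when the platform is covered.
    if has_scraper_api:
        return "scraper-api"
    # Everything else goes through the generic Web Scraping API.
    return "web-scraping"
```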
Always-true response shape
{ "requestMetadata": { "id": "…", "status": "ok", "url": "…" }, "...": "endpoint-specific" }
Treat data as valid only if requestMetadata.status === "ok". HTTP 200 alone isn't enough.
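A minimal sketch of that check in Python, assuming the body has already been read as a JSON string. The exception class and function name are made up for illustration.

```python
import json


class HasDataResponseError(Exception):
    """Raised when a response body does not report status "ok"."""


def parse_body(raw: str) -> dict:
    """Decode a response body and reject it unless requestMetadata.status is "ok"."""
    body = json.loads(raw)
    metadata = body.get("requestMetadata") or {}
    status = metadata.get("status")
    if status != "ok":
        # An HTTP 200 can still land here, which is why the status field is checked.
        raise HasDataResponseError(
            f"request {metadata.get('id')!r} returned status {status!r}"
        )
    return body
```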
High-leverage patterns
- SERP-first enrichment. Google SERP can surface public snippets for company and professional-profile lookup. Use it for business or authorized research, avoid unnecessary direct scraping, and treat personal email/phone lookup as allowed only with a legitimate purpose and user authorization.
- AI Mode + verify. /scrape/google/ai-mode for the answer + references → /scrape/web (markdown) on each reference URL → cited RAG context, no vector DB.
- Maps → leads. /scrape/google-maps/search returns business websites and phones; collect contact details only from public, permitted sources and apply opt-out, rate, and privacy-law constraints before any outreach use.
- Crawler → corpus. crawler Scraper Job with outputFormat: ["markdown"] + includePaths: "/docs/.+" produces an LLM-ready corpus in one submission.
- Pre-extracted via SERP rich snippets. knowledgeGraph, localResults, inlineShoppingResults, relatedQuestions carry pre-parsed public facts. Always check them before considering direct page access.
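The last pattern, checking rich snippets before fetching any page, could look roughly like this. The four key names come from the list above; the function names are hypothetical, and the sketch assumes those keys sit at the top level of the SERP response.

```python
# Rich-snippet keys named in the README.
RICH_SNIPPET_KEYS = (
    "knowledgeGraph",
    "localResults",
    "inlineShoppingResults",
    "relatedQuestions",
)


def available_rich_snippets(serp: dict) -> dict:
    """Return only the rich-snippet blocks that are present and non-empty."""
    return {key: serp[key] for key in RICH_SNIPPET_KEYS if serp.get(key)}


def needs_page_fetch(serp: dict, wanted: str) -> bool:
    """True when the wanted block is missing, so a direct page fetch may be needed."""
    return wanted not in available_rich_snippets(serp)
```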
When to call from code (the wiring)
- Auth: x-api-key header on every request. Read from HASDATA_API_KEY env. Never hardcode, never log.
- Timeouts: set client timeout ≥ 300 s. HasData's own deadline is 300 s; shorter clients produce phantom failures while still being billed on completion.
- Retries: 429 and 5xx only - exponential backoff, jitter. Never retry any other 4xx (auth, validation).
- Concurrency: cap at your plan limit. The free tier is 1; anything higher just generates 429s.
- Async jobs: the submit response handle is body.id (integer), not jobId. Persist it immediately. Poll GET /scrapers/jobs/<id> every 10-30 s with backoff; treat webhooks as best-effort and always pair with polling. Once the job is finished, the status response carries data: {csv, json, xlsx} as short-lived URLs - download immediately.
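The retry rule above can be sketched as follows. This is not the client shipped in references/code-recipes.md; the function names, the base and cap values, and the injected send callable (which is assumed to return a status code and a decoded body) are all choices made for illustration.

```python
import random
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    """429 and 5xx are retried; every other status goes straight back to the caller."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call send() up to max_attempts times, sleeping between retryable failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for attempt in range(max_attempts - 1):
        if not is_retryable(status):
            break
        sleep(backoff_delay(attempt))
        status, body = send()
    # Either a non-retryable result or the last retryable one after giving up.
    return status, body
```

Passing sleep as a parameter keeps the loop testable without real waiting; in production the default time.sleep applies.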
See references/code-recipes.md for ready-to-paste Python and TypeScript clients with retry, backoff, bounded concurrency, and the full job lifecycle.
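For orientation, the polling half of the job lifecycle might be sketched like this. It assumes the status response exposes a status field whose terminal success value is finished and a data object holding the result URLs; the field name status, the wait_for_job helper, and the fetch_status callable (expected to wrap GET /scrapers/jobs/<id>) are assumptions, and failure states are not handled because the README does not name them.

```python
import time
from typing import Callable


def wait_for_job(
    job_id: int,
    fetch_status: Callable[[int], dict],
    sleep: Callable[[float], None] = time.sleep,
    initial_delay: float = 10.0,
    max_delay: float = 30.0,
    max_wait: float = 3600.0,
) -> dict:
    """Poll until the job reports finished, then return its data URLs."""
    delay = initial_delay
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        if job.get("status") == "finished":
            # Short-lived URLs: the caller should download them right away.
            return job["data"]
        if waited + delay > max_wait:
            raise TimeoutError(f"job {job_id} not finished after {waited:.0f} s")
        sleep(delay)
        waited += delay
        # Back off from 10 s toward the 30 s ceiling.
        delay = min(max_delay, delay * 1.5)
```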
Common gotchas
- 300 s server deadline. Match client timeout.
- Disable jsRendering first, enable only if the page needs it - most static pages parse fine without a headless browser.
- No cookies parameter - cookies go through headers["Cookie"].
- includePaths regex is case-sensitive. /blog/.+ won't match /Blog/....
- Scraper Job data is double-wrapped. Each row is body.data[i].data; outer wraps with id, jobId, dataId, createdAt, updatedAt.
- requestMetadata.status === "ok" is the only success signal. HTTP 200 alone isn't enough.
- Webhooks are best-effort with 3 retries. Always have a polling fallback.
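The double-wrapping gotcha is easy to trip over, so here is a small sketch of unwrapping job results. It assumes the results file has already been downloaded and decoded into a dict with the body.data[i].data layout described above; unwrap_rows is a hypothetical helper name.

```python
def unwrap_rows(body: dict) -> list:
    """Return the inner data payload of each wrapped Scraper Job row.

    The outer wrapper fields (id, jobId, dataId, createdAt, updatedAt)
    are dropped; only the scraped row itself is kept.
    """
    return [item["data"] for item in body.get("data", [])]
```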
References
- references/web-scraping.md - POST /scrape/web parameters, JS scenarios, AI extraction, cookie auth.
- references/search.md - Google SERP / Light / AI Mode / News / Shopping / Bing / Trends + pagination.
- references/ecommerce.md - Amazon (product, search, seller, seller-products) and Shopify.
- references/real-estate.md - Zillow, Redfin (bracketed filters).
- references/travel.md - Airbnb, Booking, Google Flights (occupancy rules, token pagination, IATA codes).
- references/local-business.md - Maps (search/place/reviews/photos/posts), Yelp, YellowPages.
- references/jobs.md - Indeed and Glassdoor.
- references/youtube.md - YouTube search / video / channel / transcript.
- references/scraper-jobs.md - async submit/poll/results, Crawler, Contacts, SEC EDGAR, webhook receiver.
- references/code-recipes.md - Python / TypeScript clients with retry, backoff, concurrency, polling.
Resources
- Sitemap: https://docs.hasdata.com/llms.txt
- API status codes: https://docs.hasdata.com/api-codes
- Credits & concurrency: https://docs.hasdata.com/credits-and-concurrency
- Dashboard: https://app.hasdata.com
Limitations
- Requires access to HasData services and valid credentials.
- Data quality and available fields depend on the target website and extraction method used.
- JavaScript-heavy websites may require rendering, which can affect performance and cost.
- Use only for public data or content the user is authorized to access; respect site terms, robots/access controls, privacy law, and rate limits.
- Rate limits, quotas, and account restrictions may apply depending on the endpoint and subscription plan.