Agents

Ai2Bot-Dolma

Ai2Bot-Dolma is operated by Ai2, a non-profit AI research institute. It's used to download data to train open source AI models.

AI Data Scraper

Applebot-Extended

Applebot-Extended is used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.

AI Data Scraper

Bytespider

Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok. It's allegedly used to download training data for its LLMs (Large Language Models), including those powering ChatGPT competitor Doubao.

AI Data Scraper

CCBot

CCBot is Common Crawl's web crawler that creates an open repository of web data, making crawled content universally accessible for research, analysis, and AI training purposes.

AI Data Scraper

ChatGLM-Spider

ChatGLM-Spider is a web crawler operated by Zhipu AI, the Chinese company behind ChatGLM. It is used for collecting data to train and evaluate the company's large language models.

AI Data Scraper

ClaudeBot

ClaudeBot is a web crawler operated by Anthropic to download training data for its LLMs (Large Language Models) that power AI products like Claude.

AI Data Scraper

CloudVertexBot

CloudVertexBot is a Google-operated crawler available to site owners to request targeted crawls of their own sites for AI training purposes on the Vertex AI platform.

AI Data Scraper

cohere-training-data-crawler

cohere-training-data-crawler is a web crawler operated by Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products.

AI Data Scraper

Cotoyogi

Cotoyogi is a research crawler operated by Japan's Research Organization of Information and Systems that collects web content to build AI training datasets for research purposes.

AI Data Scraper

Datenbank Crawler

Datenbank Crawler is a web crawler operated by German company netEstate used for collecting and selling international website data.

AI Data Scraper

Diffbot

Diffbot is an intelligent web crawler used to understand, aggregate, and ultimately sell structured website data for real-time monitoring and training other AI models.

AI Data Scraper

FacebookBot

FacebookBot is a web crawler used by Meta to download training data for its AI speech recognition technology.

AI Data Scraper

Google-Extended

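Google-Extended is a robots.txt user-agent token that site owners can use to control whether content fetched by Google's existing crawlers is used as AI training content for its AI products like the Gemini assistant and its Vertex AI generative APIs. It does not send requests under a separate user-agent string of its own.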

AI Data Scraper

GoogleOther

GoogleOther is Google's generic crawler used by various product teams for fetching publicly accessible content, including one-off crawls for internal research and development.

AI Data Scraper

GPTBot

GPTBot is OpenAI's web crawler that collects data from publicly accessible web pages to improve AI models like ChatGPT, while respecting robots.txt and opt-out preferences.

AI Data Scraper
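
Since GPTBot is described as respecting robots.txt, a site owner who wants to opt out could publish rules along the following lines. This is a minimal sketch, not a definitive configuration: it assumes each agent's robots.txt token matches the name shown in this list, the token subset and the helper names `build_robots_txt` and `is_allowed` are hypothetical, and not every agent listed here is known to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the robots.txt token for each agent is the name used in this list.
# Only a small subset is shown for brevity.
AI_SCRAPER_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Bytespider"]


def build_robots_txt(tokens):
    """Return robots.txt text with one group per token, each disallowing all paths."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, user_agent, url):
    """Check whether the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(AI_SCRAPER_TOKENS)
print(robots_txt)
```

Agents without a matching group remain allowed under these rules, so ordinary search crawlers are unaffected unless they are added to the list explicitly.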

ICC-Crawler

ICC-Crawler is NICT's research crawler that automatically collects web pages from the Internet for academic research at Japan's National Institute of Information and Communications Technology.

AI Data Scraper

imageSpider

imageSpider is a web crawler operated by ByteDance, the company behind TikTok, Douyin, and other content platforms. The bot collects images from websites across the internet, likely to support ByteDance's various AI products.

AI Data Scraper

Kangaroo Bot

Kangaroo Bot is used by the company Kangaroo LLM to download data to train open source AI models tailored to Australian language and culture.

AI Data Scraper

laion-huggingface-processor

laion-huggingface-processor is a web crawler operated by LAION (Large-scale Artificial Intelligence Open Network), a non-profit organization that creates open datasets for AI research. This bot collects images and associated metadata from websites to build large-scale datasets like LAION-5B, which are used to train AI models including text-to-image generators.

AI Data Scraper

LCC

LCC is a web crawler operated by the University of Leipzig that collects text data from websites to build large-scale linguistic corpora for research purposes. The bot gathers multilingual text content to support the Wortschatz project, which creates comprehensive language resources and dictionaries for natural language processing and computational linguistics research.

AI Data Scraper

meta-externalagent

meta-externalagent crawls web content for training AI models and improving Meta's products by indexing content directly across the internet.

AI Data Scraper

netEstate Imprint Crawler

netEstate Imprint Crawler is an AI data scraper operated by netEstate. If you think this is incorrect or can provide additional detail about its purpose, please let us know.

AI Data Scraper

omgili

omgili is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models.

AI Data Scraper

PanguBot

PanguBot is a web crawler operated by the Chinese company Huawei. It's used to download training data for its multimodal LLM (Large Language Model) called PanGu.

AI Data Scraper

SBIntuitionsBot

SBIntuitionsBot is a web crawler operated by SB Intuitions, a Japanese company that develops generative language models optimized for the Japanese language and culture. This bot collects data from websites to train and improve their language models, with all collected data stored and managed within Japan.

AI Data Scraper

Spider

Spider is a web crawler designed for AI projects, including AI agents, LLMs, RAG systems, and data analysis. It collects and converts web data into multiple formats including markdown, HTML, and text for AI training and fine-tuning purposes.

AI Data Scraper

Timpibot

Timpibot is used by Timpi's decentralized network of independent node operators. The index they build can be used to train LLMs (Large Language Models).

AI Data Scraper

VelenPublicWebCrawler

VelenPublicWebCrawler is a web crawler developed by Velen for Hunter that analyzes millions of publicly accessible internet pages every month. The bot builds business datasets and machine learning models while crawling respectfully with a minimum 2-second delay between requests.

AI Data Scraper

webzio-extended

webzio-extended is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models.

AI Data Scraper

AI Data Scrapers