Diffbot

Learn how to use Diffbot with Composio

Overview

SLUG: DIFFBOT

Description

Diffbot provides AI-powered tools to extract and structure data from web pages, transforming unstructured web content into structured, linked data.

Authentication Details

generic_api_key: string, required

Connecting to Diffbot

Create an auth config

Use the dashboard to create an auth config for the Diffbot toolkit. This allows you to connect multiple Diffbot accounts to Composio for agents to use.

1. Select App

Navigate to the Diffbot toolkit page and click “Setup Integration”.

2. Configure Auth Config Settings

Select one of the auth schemes supported by Diffbot and configure it here.

3. Create and Get auth config ID

Click “Create Integration”. After creation, copy the displayed ID starting with ac_. This is your auth config ID. It is not sensitive, so you can store it in environment variables or a database. You will use this ID to create connections to the toolkit for a given user.
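
Since the auth config ID can be kept in an environment variable, one way to load it is sketched below. The variable name DIFFBOT_AUTH_CONFIG_ID and the helper load_auth_config_id are illustrative assumptions, not part of Composio; the only fact taken from the steps above is that the ID starts with ac_.

```python
import os

# Hypothetical variable name; choose whatever fits your deployment.
AUTH_CONFIG_ENV_VAR = "DIFFBOT_AUTH_CONFIG_ID"


def load_auth_config_id(env=None) -> str:
    """Read the auth config ID from the environment and sanity-check it."""
    env = os.environ if env is None else env
    value = env.get(AUTH_CONFIG_ENV_VAR, "").strip()
    if not value:
        raise RuntimeError(f"{AUTH_CONFIG_ENV_VAR} is not set")
    # The dashboard displays auth config IDs that start with "ac_".
    if not value.startswith("ac_"):
        raise ValueError(f"{value!r} does not look like an auth config ID")
    return value
```

Failing early on a missing or malformed ID keeps configuration mistakes from surfacing later as connection errors.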

Connect Your Account

Using API Key

from composio import Composio

# Replace these with your actual values
diffbot_auth_config_id = "ac_YOUR_DIFFBOT_CONFIG_ID"  # Auth config ID created above
user_id = "0000-0000-0000"  # UUID from database/app

composio = Composio()


def authenticate_toolkit(user_id: str, auth_config_id: str):
    # Replace this with a method to retrieve an API key from the user.
    # Or supply your own.
    user_api_key = input("[!] Enter API key")

    connection_request = composio.connected_accounts.initiate(
        user_id=user_id,
        auth_config_id=auth_config_id,
        config={"auth_scheme": "API_KEY", "val": user_api_key},
    )

    # API Key authentication is immediate - no redirect needed
    print(f"Successfully connected Diffbot for user {user_id}")
    print(f"Connection status: {connection_request.status}")

    return connection_request.id


connection_id = authenticate_toolkit(user_id, diffbot_auth_config_id)

# You can verify the connection using:
connected_account = composio.connected_accounts.get(connection_id)
print(f"Connected account: {connected_account}")
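
When the API key comes from user input, it may help to clean it before passing it on. The sketch below builds the same config dictionary that the snippet above passes to connected_accounts.initiate; the helper name build_api_key_config and the empty-key check are assumptions added for illustration.

```python
def build_api_key_config(api_key: str) -> dict:
    """Return the `config` dict for an API key connection.

    Mirrors the dict used in the connection snippet; the whitespace
    stripping and empty-key check are illustrative additions.
    """
    cleaned = api_key.strip()
    if not cleaned:
        raise ValueError("API key must not be empty")
    return {"auth_scheme": "API_KEY", "val": cleaned}
```

The returned dictionary can be passed as the config argument in place of the inline literal.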

Tools

Executing tools

To prototype, you can execute tools in the Diffbot toolkit’s playground and inspect their responses.

Python
from composio import Composio
from openai import OpenAI
import json

openai = OpenAI()
composio = Composio()

# User ID must be a valid UUID format
user_id = "0000-0000-0000"  # Replace with actual user UUID from your database

tools = composio.tools.get(user_id=user_id, toolkits=["DIFFBOT"])

print("[!] Tools:")
print(json.dumps(tools))


def invoke_llm(task="What can you do?"):
    completion = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": task,  # Your task here!
            },
        ],
        tools=tools,
    )

    # Handle Result from tool call
    result = composio.provider.handle_tool_calls(user_id=user_id, response=completion)
    print(f"[!] Completion: {completion}")
    print(f"[!] Tool call result: {result}")


invoke_llm()

Tool List

Tool Name: Get Diffbot Account Details

Description

Tool to retrieve account details, including plan information and usage statistics. Use after authenticating to verify subscription and daily quota status.

Action Parameters

Action Response

data: object, required
error: string
successful: boolean, required
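
Every tool in this list returns the same three response fields: data, error, and successful. A minimal sketch of unpacking that envelope follows, assuming the result is available as a plain dictionary; the names ToolExecutionError and unwrap_result are hypothetical and not part of the Composio SDK.

```python
class ToolExecutionError(Exception):
    """Raised when a tool response reports successful = False."""


def unwrap_result(result: dict) -> dict:
    """Return the `data` payload, or raise with the reported `error`."""
    if not result.get("successful", False):
        # `error` is optional, so fall back to a generic message.
        raise ToolExecutionError(result.get("error") or "unknown error")
    return result.get("data", {})
```

Centralizing this check means callers only ever handle the data payload or a single exception type.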

Tool Name: Diffbot Analyze

Description

Tool to automatically determine a page's content type and route it to the appropriate extraction API. Use when you have only a URL and need Diffbot to choose the right extractor.

Action Parameters

callback: string
fields: string
url: string, required

Action Response

data: object, required
error: string
successful: boolean, required
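
The parameter list above can be checked locally before a tool call is made. The sketch below encodes the three Diffbot Analyze parameters as a small spec and validates an arguments dictionary against it; ANALYZE_PARAMS and validate_arguments are hypothetical names, and Composio may perform its own validation as well.

```python
# Parameter name -> (expected Python type, required?), taken from the list above.
ANALYZE_PARAMS = {
    "url": (str, True),
    "fields": (str, False),
    "callback": (str, False),
}


def validate_arguments(arguments: dict, spec: dict) -> dict:
    """Check names, required parameters, and types; return a copy if valid."""
    unknown = set(arguments) - set(spec)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    for name, (expected_type, required) in spec.items():
        if name not in arguments:
            if required:
                raise ValueError(f"missing required parameter: {name}")
            continue
        if not isinstance(arguments[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")
    return dict(arguments)
```

The same function works for the other tools if a spec is written from their parameter lists.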

Tool Name: Get Article Data

Description

Tool to extract information from articles, including authors, publication dates, and images. Use when you need structured metadata from a web article URL.

Action Parameters

callback: string
discussion: boolean
fields: array
mode: string
paging: string
stats: boolean
timeout: integer
url: string, required

Action Response

data: object, required
error: string
successful: boolean, required

Tool Name: Get Discussion Thread

Description

Tool to extract threads of content from forums, comment sections, and review pages. Use when you need structured discussion data from a web page after identifying the discussion URL.

Action Parameters

discussion: boolean, defaults to True
fields: string
maxPages: integer, defaults to 1
norender: boolean
url: string, required

Action Response

data: object, required
error: string
successful: boolean, required
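
Two of the parameters above have documented defaults (discussion is True, maxPages is 1). One way to make the effective values explicit in your own code is sketched here; DISCUSSION_DEFAULTS and discussion_arguments are hypothetical helpers, and sending the defaults explicitly is a design choice rather than a requirement.

```python
# Defaults as documented in the parameter list above.
DISCUSSION_DEFAULTS = {"discussion": True, "maxPages": 1}


def discussion_arguments(url: str, **overrides) -> dict:
    """Merge caller overrides over the documented defaults and add the URL."""
    return {**DISCUSSION_DEFAULTS, **overrides, "url": url}
```

For example, discussion_arguments("https://example.com/thread", maxPages=3) keeps discussion at its default and raises only the page limit.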

Tool Name: Diffbot Get Event

Description

Tool to extract event details from web pages. Use when you need structured event data such as venue, date, and description.

Action Parameters

callback: string
fields: string
paging: boolean
timeout: integer
url: string, required

Action Response

data: object, required
error: string
successful: boolean, required

Tool Name: Diffbot Get Image

Description

Tool to extract detailed information about images, including dimensions and recognition data. Use after confirming the image URL is publicly accessible.

Action Parameters

fields: array
paging: boolean
timeout: integer
url: string, required

Action Response

data: object, required
error: string
successful: boolean, required

Tool Name: Diffbot Get Product

Description

Tool to extract product information such as specifications, prices, availability, and reviews. Use when you need structured product data from a product page.

Action Parameters

callback: string
discussion: boolean
fields: array
mode: string
paging: boolean
timeout: integer
url: string, required

Action Response

data: object, required
error: string
successful: boolean, required

Tool Name: Get Video Data

Description

Tool to extract information from videos, including titles, descriptions, and embedded HTML. Use when you need structured video metadata from any web page.

Action Parameters

callback: string
discussion: boolean
fallback: boolean
fields: array
mode: string
paging: boolean
timeout: integer
url: string, required

Action Response

data: object, required
error: string
successful: boolean, required

Tool Name: List Bulk Jobs

Description

Tool to list all bulk jobs associated with a specific token. Use after authenticating to retrieve the statuses of all jobs for the account.

Action Parameters

Action Response

data: object, required
error: string
successful: boolean, required

Tool Name: Resolve Lost ID

Description

Tool to resolve lost IDs in the Knowledge Graph. Use when you need to map a lost identifier to its canonical counterpart for data consistency.

Action Parameters

lostId: string, required
type: string

Action Response

data: object, required
error: string
successful: boolean, required

Tool Name: Start Bulk Job

Description

Tool to start a bulk extract job. Use when processing large numbers of URLs asynchronously.

Action Parameters

apiUrl: string, required
jobConfig: object
name: string
notifyEmail: string
notifyWebhook: string
urlList: string
urls: array

Action Response

data: object, required
error: string
successful: boolean, required
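
Bulk jobs take many URLs, so it can be worth cleaning the list before submitting it. The sketch below strips whitespace, drops blanks and duplicates, and builds an arguments dictionary using the apiUrl, urls, and name parameters listed above. The helper bulk_job_arguments is hypothetical, and the apiUrl value in the usage note is an assumed example of a Diffbot extraction endpoint, not something this page specifies.

```python
from typing import List, Optional


def bulk_job_arguments(api_url: str, urls: List[str], name: Optional[str] = None) -> dict:
    """Build Start Bulk Job arguments from a raw list of URLs."""
    seen = set()
    unique = []
    for raw in urls:
        url = raw.strip()
        # Skip blank entries and URLs already queued, preserving input order.
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    if not unique:
        raise ValueError("at least one URL is required")
    arguments = {"apiUrl": api_url, "urls": unique}
    if name:
        arguments["name"] = name
    return arguments
```

A call such as bulk_job_arguments("https://api.diffbot.com/v3/article", url_list, name="nightly-batch") yields a dictionary ready to pass as the tool's arguments.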

Tool Name: Start Crawl Job

Description

Tool to spider a site for links and process them with the Extract API into a single collection. Use when you have seed URLs and want to collect structured data across a site. Requires a Plus plan for Crawl API access.

Action Parameters

apiUrl: string
crawlDelay: number
customHeaders: object
maxToCrawl: integer
maxToProcess: integer
name: string, required
notifyEmail: string
obeyRobotsTxt: boolean, defaults to True
repeat: string
seeds: array, required
type: string, required

Action Response

data: object, required
error: string
successful: boolean, required
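
A crawl job has three required parameters (name, seeds, type) and several optional limits. One way to keep them together is a small dataclass, sketched below with a subset of the optional fields. CrawlJobArgs is a hypothetical helper, field names are camelCase only to match the parameter names above, and the "crawl" value shown for type in the usage note is a placeholder assumption because this page does not list the accepted values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CrawlJobArgs:
    # Required parameters from the list above.
    name: str
    seeds: List[str]
    type: str
    # Documented default: obeyRobotsTxt is True.
    obeyRobotsTxt: bool = True
    # Optional limits; None means "do not send".
    maxToCrawl: Optional[int] = None
    maxToProcess: Optional[int] = None
    crawlDelay: Optional[float] = None

    def to_arguments(self) -> dict:
        """Validate required fields and drop unset optional ones."""
        if not self.name:
            raise ValueError("name is required")
        if not self.seeds:
            raise ValueError("at least one seed URL is required")
        return {key: value for key, value in vars(self).items() if value is not None}
```

For example, CrawlJobArgs(name="docs", seeds=["https://example.com"], type="crawl", maxToCrawl=100).to_arguments() produces a dictionary containing only the fields that were set.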

Tool Name: Stop Bulk Job

Description

Tool to stop a running bulk job. Use when you need to halt further processing of URLs in a job in progress. Invoke only after confirming the jobId to avoid accidental stoppage.

Action Parameters

jobId: string, required

Action Response

data: object, required
error: string
successful: boolean, required
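
Because stopping a job should happen only after the jobId is confirmed, a simple guard can check the ID against a set of known jobs first. In the sketch below, stop_job_arguments is a hypothetical helper and known_job_ids is assumed to be a collection of IDs that your code has already extracted, for instance from a List Bulk Jobs response, whose exact data shape this page does not document.

```python
def stop_job_arguments(job_id: str, known_job_ids) -> dict:
    """Return Stop Bulk Job arguments only for a job ID that is known."""
    if job_id not in set(known_job_ids):
        raise LookupError(f"unknown job ID: {job_id!r}")
    return {"jobId": job_id}
```

Raising on an unknown ID stops a typo from halting the wrong job or triggering a failed call.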