User script portability regarding drivers like *CrawlDriver classes #1
Looking back, I realize that the userscript couldn't be ported that easily because the underlying APIs are actually totally different. As a result, the user would have to rewrite the code depending on the EDIT: Maybe it would be a |
Yeah, maybe recommending CDP before puppeteer/playwright is a good idea, for exactly the reason you're saying. I think the order in which plugins should implement scripts is something like
If I understand correctly, CDP is an event-driven API anyway, so it may be easy to expose a common interface for plugins to send CDP events even if they're using puppeteer/playwright. |
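As a rough sketch of what such a common interface could look like: both Puppeteer and Playwright can hand out a raw CDP session whose `send(method, params)` call accepts plain protocol method names, so a plugin could be given nothing more than a `send` function. The `makeCdpSender` helper and the fake session below are hypothetical and exist only to show the call shape.

```javascript
// Hypothetical helper: wraps any object exposing send(method, params)
// (for example a Puppeteer or Playwright CDP session) behind one function.
function makeCdpSender(session) {
  if (!session || typeof session.send !== 'function') {
    throw new TypeError('expected a CDP session with a send() method');
  }
  return (method, params = {}) => session.send(method, params);
}

// Stand-in session used only for this demonstration; a real session
// would forward each command to the browser.
const fakeSession = {
  calls: [],
  send(method, params) {
    this.calls.push({ method, params });
    return Promise.resolve({});
  },
};

const sendCdp = makeCdpSender(fakeSession);
sendCdp('Network.enable');
sendCdp('Page.navigate', { url: 'https://example.com' });
console.log(fakeSession.calls.map((call) => call.method));
```

With Puppeteer the session would come from `page.createCDPSession()`, and with Playwright (Chromium only) from `context.newCDPSession(page)`; a plugin written against `sendCdp` would not need to know which one is underneath.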
Can you provide an example of how you're using the CDP APIs currently? I dug through the I can help show how the CDP stuff could be written as a |
You can find the code using the CDP API here: https://github.com/gildas-lormeau/single-file-cli/blob/a5dc004949b4a8b5180ffb53461a6305b6b4d07a/lib/cdp-client.js (you were searching in the wrong repository). I have a more general question, |
Ok so for simplecdp a behavior might look like this:

    const AdDetectorBehavior = {
      name: 'AdDetectorBehavior',
      schema: 'BehaviorSchema@0.1.0',
      version: '0.1.0',

      // known ad network domains/patterns (matched as plain substrings of the URL)
      AD_PATTERNS: [
        'doubleclick.net',
        'googlesyndication.com',
        'adnxs.com',
        '/ads/',
        '/adserve/',
        'analytics',
        'tracker',
      ],

      hooks: {
        simplecdp: {
          PAGE_SETUP: async (event, BehaviorBus, cdp) => {
            await cdp.Network.enable();
            // pause every request so the handler below can decide what to do with it
            await cdp.Network.setRequestInterception({ patterns: [{ urlPattern: '*' }] });

            cdp.Network.requestIntercepted(async ({ interceptionId, request }) => {
              const isAd = AdDetectorBehavior.AD_PATTERNS.some(pattern => request.url.includes(pattern));

              if (isAd) {
                BehaviorBus.emit({
                  type: 'DETECTED_AD',
                  url: request.url,
                  timestamp: Date.now(),
                  requestData: {
                    method: request.method,
                    headers: request.headers,
                  },
                });

                // either block the request or let it continue
                await cdp.Network.continueInterceptedRequest({
                  interceptionId,
                  // must be a valid CDP Network.ErrorReason; remove this to let ads load
                  errorReason: 'BlockedByClient'
                });
              } else {
                await cdp.Network.continueInterceptedRequest({ interceptionId });
              }
            });
          },
        }
      }
    };

    export default AdDetectorBehavior;
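The URL check inside the hook above can be tried on its own. This is a small sketch; `isAdUrl` is a hypothetical helper name and the sample URLs are made up for illustration.

```javascript
// Same pattern list as in the behavior above.
const AD_PATTERNS = [
  'doubleclick.net',
  'googlesyndication.com',
  'adnxs.com',
  '/ads/',
  '/adserve/',
  'analytics',
  'tracker',
];

// Plain substring test, mirroring the check done in the hook.
function isAdUrl(url, patterns = AD_PATTERNS) {
  return patterns.some((pattern) => url.includes(pattern));
}

console.log(isAdUrl('https://ad.doubleclick.net/pixel.gif'));     // true
console.log(isAdUrl('https://example.com/article.html'));         // false
// Broad substrings such as 'analytics' also match unrelated pages:
console.log(isAdUrl('https://example.com/docs/analytics-guide')); // true
```

Because the match is a substring test, short patterns like 'analytics' or 'tracker' will also flag pages that merely mention those words in their path.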
So to use behaviors you'd add something like this to your existing

    async function getPageData(options) {
      ...
      const cdp = new CDP(targetInfo);
      const { Browser, Security, Page, Emulation, Fetch, Network, Runtime, Debugger, Console } = cdp;
      ...
      // named `bus` so the instance does not shadow the BehaviorBus class
      const bus = new BehaviorBus();
      bus.attachContext(cdp);
      bus.attachBehaviors([AdDetectorBehavior]);

      // note: JSON.stringify drops function-valued properties, so the hooks
      // themselves are not serialized into the page here
      await Page.addScriptToEvaluateOnNewDocument({
        source: `
          window.BEHAVIORS = [${JSON.stringify(AdDetectorBehavior)}];
          ${fs.readFileSync('behaviors.js', 'utf8')};
          window.BehaviorBus.addEventListener('*', (event) => {
            if (!event.detail.metadata.path.includes('SimpleCDPBehaviorBus')) {
              dispatchEventToCDPBehaviorBus(JSON.stringify(event.detail));
            }
          });
        `,
        runImmediately: true,
      });

      // set up forwarding from WindowBehaviorBus -> SimpleCDPBehaviorBus
      await Runtime.addBinding({ name: 'dispatchEventToCDPBehaviorBus' });
      Runtime.bindingCalled(({ name, payload }) => {
        if (name === 'dispatchEventToCDPBehaviorBus') {
          bus.dispatchEvent(JSON.parse(payload));
        }
      });

      // set up forwarding from SimpleCDPBehaviorBus -> WindowBehaviorBus
      bus.addEventListener('*', (event) => {
        event = new BehaviorEvent(event);
        if (!event.detail.metadata.path.includes('WindowBehaviorBus')) {
          cdp.Runtime.evaluate({
            // wrapped in a block so `const event` is not redeclared at the
            // top level on every evaluation
            expression: `{
              const event = new BehaviorEvent(${JSON.stringify(event.detail)});
              window.BehaviorBus.dispatchEvent(event);
            }`
          });
        }
      });
      ...
      bus.emit({ type: 'PAGE_SETUP', url });

      // start loading the URL to capture
      const [contextId] = await Promise.all([
        loadPage({ Page, Runtime }, options, debugMessages),
        options.browserDebug ? waitForDebuggerReady({ Debugger }) : Promise.resolve()
      ]);
      bus.emit({ type: 'PAGE_LOAD', url });
      ...
      bus.emit({ type: 'PAGE_CAPTURE', url });
      ...
    }
|
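To make the dispatch step concrete, here is a minimal sketch of how a bus could route an emitted event to the matching per-driver hook and to wildcard listeners. The real BehaviorBus implementation is not shown in this thread, so `SimpleBehaviorBus`, its constructor argument, and `LoggingBehavior` are all hypothetical; only the method names (`attachContext`, `attachBehaviors`, `addEventListener`, `emit`) and the `hooks.simplecdp.PAGE_SETUP` layout are taken from the snippets above.

```javascript
// Hypothetical minimal bus, for illustration only.
class SimpleBehaviorBus {
  constructor(driverName) {
    this.driverName = driverName; // e.g. 'simplecdp'
    this.context = null;
    this.behaviors = [];
    this.listeners = [];
  }

  attachContext(context) { this.context = context; }
  attachBehaviors(behaviors) { this.behaviors.push(...behaviors); }
  addEventListener(type, listener) { this.listeners.push({ type, listener }); }

  // Runs every matching hook first, then every matching listener, in order.
  async emit(event) {
    for (const behavior of this.behaviors) {
      const hook = behavior.hooks?.[this.driverName]?.[event.type];
      if (hook) await hook(event, this, this.context);
    }
    for (const { type, listener } of this.listeners) {
      if (type === '*' || type === event.type) await listener(event);
    }
  }
}

// Usage with a stand-in behavior and a fake driver context.
const seen = [];
const LoggingBehavior = {
  name: 'LoggingBehavior',
  hooks: {
    simplecdp: {
      PAGE_SETUP: async (event, bus, cdp) => {
        seen.push(`setup:${event.url}:${cdp.label}`);
        await bus.emit({ type: 'SETUP_DONE', url: event.url });
      },
    },
  },
};

const bus = new SimpleBehaviorBus('simplecdp');
bus.attachContext({ label: 'fake-cdp' });
bus.attachBehaviors([LoggingBehavior]);
bus.addEventListener('*', (event) => seen.push(`event:${event.type}`));
const done = bus.emit({ type: 'PAGE_SETUP', url: 'https://example.com' });
done.then(() => console.log(seen));
```

Keying hooks by driver name means the same behavior object can carry implementations for several drivers, and the bus only ever calls the ones written for the driver it was created with.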
Thanks for the info! I haven't tested the code but I understand the principle and it sounds good to me. This pattern would probably help to better organize the code in |
Ok cool, don't do any big changes to your code just yet! I'm still discussing the design with webrecorder / not convinced it's good enough yet. I'll keep you posted! Let me know if you have any ideas on other approaches or how to improve it. |
What are your thoughts on https://w3c.github.io/webdriver-bidi/ ? It seems like CDP is slowly going away in favor of it, so I'm considering removing the playwright/puppeteer/cdp contexts from the spec in favor of forcing BiDi to be the common spec for browser-layer commands. Unfortunately it's not as clean as your nice proxy model solution and there's a lot of common utilities that are missing (e.g. Scripts would look something like this:

    // Using raw WebSocket from browser or Node for BiDi connection
    import WebSocket from 'ws';

    // this would be built into the spec / utility library
    class WebDriverBiDi {
      constructor(websocketUrl) {
        this.ws = new WebSocket(websocketUrl);
        this.messageId = 0;
        this.subscribers = new Map();   // command id -> { resolve, reject }
        this.eventHandlers = new Map(); // event name -> [handler]
        // resolves once the socket is open, so send() never writes too early
        this.ready = new Promise((resolve, reject) => {
          this.ws.once('open', resolve);
          this.ws.once('error', reject);
        });
        this.ws.on('message', (data) => {
          const message = JSON.parse(data);
          if (message.type === 'event') {
            // events carry a method name instead of a command id
            for (const handler of this.eventHandlers.get(message.method) ?? []) {
              handler(message.params);
            }
            return;
          }
          const subscriber = this.subscribers.get(message.id);
          if (subscriber) {
            this.subscribers.delete(message.id);
            if (message.type === 'error') {
              subscriber.reject(new Error(`${message.error}: ${message.message}`));
            } else {
              subscriber.resolve(message);
            }
          }
        });
      }

      on(eventName, handler) {
        const handlers = this.eventHandlers.get(eventName) ?? [];
        handlers.push(handler);
        this.eventHandlers.set(eventName, handlers);
      }

      async send(method, params = {}) {
        await this.ready;
        const id = ++this.messageId;
        return new Promise((resolve, reject) => {
          this.subscribers.set(id, { resolve, reject });
          this.ws.send(JSON.stringify({ id, method, params }));
        });
      }
    }

    async function example() {
      // Connect to Chrome's BiDi endpoint
      // Chrome should be started with: --enable-bidi-protocol
      const bidi = new WebDriverBiDi('ws://localhost:9222/session');

      // a session has to be created before other commands are accepted
      await bidi.send('session.new', { capabilities: {} });

      // Create a new context (tab)
      const { result: { context: contextId } } = await bidi.send('browsingContext.create', {
        type: 'tab'
      });

      // the code below here is what would be implemented inside a behavior...

      // Set up network interception. BiDi has no callback-style command, so
      // subscribe to the event and answer each paused request explicitly.
      bidi.on('network.beforeRequestSent', (params) => {
        if (!params.isBlocked) return; // request was not paused by an intercept
        const request = params.request.request; // id of the paused request
        const command = params.request.url.includes('example.com')
          ? 'network.failRequest'      // block
          : 'network.continueRequest'; // let it through
        bidi.send(command, { request }).catch(console.error);
      });
      await bidi.send('session.subscribe', { events: ['network.beforeRequestSent'] });
      await bidi.send('network.addIntercept', {
        phases: ['beforeRequestSent'],
        urlPatterns: [{ type: 'pattern', hostname: 'example.com' }]
      });

      // Navigate to a URL
      await bidi.send('browsingContext.navigate', {
        context: contextId,
        url: 'https://google.com'
      });

      // Wait for element to appear
      const script = `
        new Promise((resolve) => {
          const checkElement = () => {
            const element = document.querySelector('input[name="q"]');
            if (element) {
              resolve(true);
            } else {
              requestAnimationFrame(checkElement);
            }
          };
          checkElement();
        });
      `;
      await bidi.send('script.evaluate', {
        expression: script,
        target: { context: contextId },
        awaitPromise: true
      });
      console.log('Search input found!');
    }

    // Run the example
    example().catch(console.error);

A few potential benefits:
|
I think the WebDriver BiDi standard is a very good initiative. I'd had a look but hadn't noticed the existence of the script.addPreloadScript command, that's the point that blocked me in the past with WebDriver. I'll have to do some testing but I'm interested in replacing the CDP client with a BiDi client. My basic need was to be able to provide executables that weren't too heavy. That's why I went down this road. In the short term, I think I'll try to implement a library based on the |
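For reference, a rough sketch of what a script.addPreloadScript command could look like on the wire. The `functionDeclaration` parameter name comes from the WebDriver BiDi specification; the `buildPreloadScriptCommand` helper and the injected function are hypothetical, and sending the message is left to whatever BiDi client is in use.

```javascript
// Hypothetical helper that only builds the JSON command.
function buildPreloadScriptCommand(id, fn) {
  return {
    id,
    method: 'script.addPreloadScript',
    params: {
      // BiDi expects the source text of a function, not a function object.
      functionDeclaration: fn.toString(),
    },
  };
}

// The function body runs in the page before its own scripts, so it may
// refer to `window` even though this file never executes it.
const command = buildPreloadScriptCommand(1, () => {
  window.__preloaded = true;
});

console.log(JSON.stringify(command, null, 2));
```

This plays the same role as `Page.addScriptToEvaluateOnNewDocument` in CDP: the script is registered once and then runs in each new document before the page's own scripts.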
Overall, I find the idea interesting! For my part, I think I could implement a CDPCrawlDriver class (using the Chrome DevTools Protocol under the hood) in single-file-cli. Now, let's imagine a userscript written by a user for ArchiveBox that depends on PuppeteerCrawlDriver. Assuming the APIs of the two CrawlDriver classes are identical, if he wanted to run it in single-file-cli, would he be responsible for replacing the "puppeteer" occurrences with "cdp" in the userscript?
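To make the question concrete, here is a small sketch of the situation, assuming a userscript whose hooks are keyed by driver name (the layout used in the behavior examples in this thread). The `userscript` object, the driver names, and the `resolveHook` lookup are hypothetical.

```javascript
// Hypothetical userscript that was written only against a puppeteer driver.
const userscript = {
  name: 'ExampleUserscript',
  hooks: {
    puppeteer: { PAGE_LOAD: () => 'ran with puppeteer' },
  },
};

// Hypothetical lookup a runner might perform; returns null when the
// userscript has no hook for the driver currently in use.
function resolveHook(script, driverName, eventType) {
  return script.hooks?.[driverName]?.[eventType] ?? null;
}

console.log(resolveHook(userscript, 'puppeteer', 'PAGE_LOAD') !== null); // true
console.log(resolveHook(userscript, 'cdp', 'PAGE_LOAD') !== null);       // false
```

Under this layout, a runner that only offers a "cdp" driver finds nothing to call, which is the porting step the question asks about.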