feat: performance evaluation by jirispilka · Pull Request #61 · apify/rag-web-browser

jirispilka · 2025-03-18T21:42:29Z

Based on @matyascimbulka's suggestion, I refactored the code and moved preNavigationHooks to a separate function so that selecting blockMedia: true/false does not create a new instance of the crawler.
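
For readers following along, here is a minimal sketch of what a single shared hook driven by a per-request flag could look like. It is not the code from this PR: `RouteLike`, `PageLike`, and `HookContext` are hypothetical stand-ins for the Playwright and Crawlee types so the snippet runs on its own, and the list of blocked resource types is an assumption.

```typescript
// Structural stand-ins for Playwright's Route/Page and the Crawlee crawling
// context. These are assumptions for this sketch, not the real imports.
interface RouteLike {
    request(): { resourceType(): string };
    abort(): Promise<void>;
    continue(): Promise<void>;
}

interface PageLike {
    route(pattern: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

interface HookContext {
    page: PageLike;
    request: { userData: { blockMedia?: boolean } };
}

// Hypothetical list of resource types to drop; the PR's actual list may differ.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font']);

function shouldBlock(resourceType: string): boolean {
    return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// One hook serves every request; the flag travels in request.userData,
// so the crawler never has to be rebuilt when the setting changes.
async function blockMediaResourcesHook({ page, request }: HookContext): Promise<void> {
    // Requests that did not opt in get no route handler at all.
    if (!request.userData.blockMedia) return;

    await page.route('**/*', async (route) => {
        if (shouldBlock(route.request().resourceType())) {
            await route.abort();
        } else {
            await route.continue();
        }
    });
}
```

Because the decision is read from `request.userData` at navigation time, the same crawler instance can serve requests with `blockMedia: true` and `blockMedia: false` side by side.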

There may be a better way to block media, such as Crawlee's blockRequests helper shown below, but it didn’t work for me. Perhaps @metalwarrior665 can help here?

preNavigationHooks: [
    async ({ blockRequests }) => {
        // Block all requests to URLs that include `adsbygoogle.js` and also all defaults.
        await blockRequests({
            extraUrlPatterns: ['adsbygoogle.js'],
        });
    },
],

Another issue (#60) in standby mode causes multiple crawlers to be created without reason. I’ll leave this for a separate PR.

The numbers are not as good as I had hoped, but they still show an improvement.

matyascimbulka

Thank you for implementing the changes. There was no need to start a new crawler for blocking media.

I'm not sure why the blockRequests function doesn't work, but the page.route function seems to be the way to go for this use case (outside of Crawlee).

MQ37

LGTM 👍 And thank you for fixing this, I hadn't noticed that it spawns another crawler instance.

metalwarrior665

Let's test the perf a bit more

metalwarrior665 · 2025-03-19T16:07:10Z

src/crawlers.ts

+ * Only blocks resources if blockMedia is true.
+ */
+async function blockMediaResourcesHook({ page, request }: PlaywrightCrawlingContext<ContentCrawlerUserData>) {
+    await page.route('**/*', async (route) => {


page.route disables the native browser cache, which is why blockRequests is normally recommended (it is a native Chromium CDP call). Disabling the cache only hurts if you make multiple requests to the same site. I would run a perf test on more URLs from the same site, and on more sites, because this could slow us down as well.
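
One way to run the kind of perf test suggested above is to time several URLs per site and compare the first hit against the repeat hits, since repeat hits are where a working browser cache should help. The sketch below is only an illustration: `Loader`, `SiteSummary`, and `measureBySite` are hypothetical names, and the loader is expected to wrap whatever configuration is under test (for example page.route blocking in one run and blockRequests in another).

```typescript
// Hypothetical loader: navigates to `url` using the configuration under test
// and resolves once the page has finished loading.
type Loader = (url: string) => Promise<void>;

interface SiteSummary {
    host: string;
    firstMs: number;        // first request to the site
    medianRepeatMs: number; // later requests to the same site (NaN if there were none)
    samples: number;
}

function median(values: number[]): number {
    if (values.length === 0) return NaN;
    const sorted = [...values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1 ? sorted[mid]! : (sorted[mid - 1]! + sorted[mid]!) / 2;
}

// Loads the URLs sequentially and groups the elapsed times by hostname.
// The clock is injectable so the function can be exercised without a browser.
async function measureBySite(
    urls: string[],
    load: Loader,
    now: () => number = () => Date.now(),
): Promise<SiteSummary[]> {
    const durations = new Map<string, number[]>();
    for (const url of urls) {
        const host = new URL(url).hostname;
        const start = now();
        await load(url);
        const elapsed = now() - start;
        const list = durations.get(host) ?? [];
        list.push(elapsed);
        durations.set(host, list);
    }
    return Array.from(durations.entries()).map(([host, ms]) => ({
        host,
        firstMs: ms[0]!,
        medianRepeatMs: median(ms.slice(1)),
        samples: ms.length,
    }));
}
```

Running this once per configuration, each time with a fresh browser, and comparing `medianRepeatMs` across runs would show whether the disabled cache costs more than the blocked media saves.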

jirispilka added 8 commits March 18, 2025 11:25

fix: sort defaults

13e1501

fix preNavigationHooks

e708227

fix: input

8b38952

fix: input and headless

9c2ab32

fix: false positive issue

5268750

fix: add blocking into a function, pass blockMedia in userData

888a714

fix: add blocking into a function, pass blockMedia in userData

78934b9

fix: update README.md

4e88619

jirispilka requested review from MQ37 and matyascimbulka March 18, 2025 21:42

matyascimbulka approved these changes Mar 19, 2025

MQ37 approved these changes Mar 19, 2025

metalwarrior665 requested changes Mar 19, 2025

fnesveda added the t-ai label (Issues owned by the AI team) Jun 16, 2025

jirispilka wants to merge 8 commits into feat/block-media from feat/perf-eval
