Conversation
matyascimbulka
left a comment
There was a problem hiding this comment.
Thank you for implementing the changes. There was no need to start new crawler for blocking media.
I'm not sure why the blockRequests function doesn't work. But the page.route function seems to be the way to go for this use case (outside of Crawlee).
MQ37
left a comment
There was a problem hiding this comment.
LGTM 👍 And thank you for fixing this, I haven't noticed that it spawns another crawler instance.
metalwarrior665
left a comment
There was a problem hiding this comment.
Let's test the perf a bit more
| * Only blocks resources if blockMedia is true. | ||
| */ | ||
| async function blockMediaResourcesHook({ page, request }: PlaywrightCrawlingContext<ContentCrawlerUserData>) { | ||
| await page.route('**/*', async (route) => { |
There was a problem hiding this comment.
page.route disables native browser cache which is why blockRequests is normally recommended (that is a native Chromium CDP call). The cache disabling is only bad if you do more requests for the same site. I would do a perf test on more URLs of the same site and test more sites because this could slow us down as well.
Based on @matyascimbulka's suggestion, I refactored the code and moved
preNavigationHooksto a separate function so that selectingblockMedia: true/falsedoes not create a new instance of the crawler.There may be a better way to block media, but it didn’t work for me—perhaps @metalwarrior665 can help here?
Another issue (#60) in standby mode causes multiple crawlers to be created without reason. I’ll leave this for a separate PR.
And some number not as good as I hoped for but still it is an improvement
