-
Notifications
You must be signed in to change notification settings - Fork 44
Add smart URL transformation support for search-replace command #209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hello! 👋 Thanks for opening this pull request! Please check out our contributing guidelines. We appreciate you taking the initiative to contribute to this project. Contributing isn't limited to just code. We encourage you to contribute in the way that best fits your abilities, by writing tutorials, giving a demo at your local meetup, helping other users with their support questions, or revising our documentation. Here are some useful Composer commands to get you started:
To run a single Behat test, you can use the following command: # Run all tests in a single file
composer behat features/some-feature.feature
# Run only a specific scenario (where 123 is the line number of the "Scenario:" title)
composer behat features/some-feature.feature:123You can find a list of all available Behat steps in our handbook. |
Summary of ChangesHello @MarkBerube, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces intelligent URL transformation capabilities to the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable enhancement to the search-replace command by adding smart URL transformation support. The new --smart-url and --analyze-tables flags, along with automatic URL detection, significantly improve performance for URL replacements by intelligently skipping non-URL columns. The implementation is robust, covering static core column lists, dynamic datatype analysis, and pattern matching for column names. Comprehensive test cases have been added to ensure the new functionality works as expected across various scenarios, including error handling for invalid URLs. The documentation has also been updated to reflect these new options and their usage.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
| ); | ||
|
|
||
| if ( empty( $columns ) ) { | ||
| continue; // @codeCoverageIgnore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To explain the ignore - this is a extremely rare edge case where the wpdb reports no columns for a DB table. This makes this feature effectively not usable so we must skip the operation. The ignore makes more sense to me than mocking a complex DB edge case in the .feature that should never be hit to begin with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds smart URL transformation support to the search-replace command to improve performance when replacing URLs in WordPress databases. The feature automatically skips columns that cannot contain URLs, resulting in ~34-42% performance improvement on large databases.
Changes:
- Adds
--smart-urlflag with auto-detection for URLs starting with http:// or https:// - Adds
--analyze-tablesflag for advanced MySQL datatype analysis to skip additional non-text columns - Implements comprehensive test coverage with 42 new test scenarios
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/WP_CLI/SearchReplace/Non_URL_Columns.php | New class providing static list of WordPress core non-URL columns and dynamic analysis methods for identifying non-text datatypes and naming patterns |
| src/Search_Replace_Command.php | Adds URL auto-detection, smart-url mode implementation, table analysis, and validation logic for the new flags |
| features/search-replace.feature | Updates existing test scenarios to expect "Detected URL replacement" message and adjusted output format |
| features/search-replace-url.feature | New comprehensive test file with 42 scenarios covering smart-url and analyze-tables functionality |
| README.md | Documents the new --smart-url and --analyze-tables flags with usage examples |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| 'to_ping', | ||
| 'pinged', |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The columns to_ping and pinged in wp_posts can contain URLs. According to WordPress Codex, to_ping stores a list of URLs to ping when the post is published, and pinged stores URLs that have already been pinged. These columns should be removed from the skip list as they are specifically designed to contain URLs.
| 'to_ping', | |
| 'pinged', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, these cols are going to contain URLs, this is true. However they aren't going contain URLs we would like to replace on. We pingback to other WP sites(not the one we are on) so if we're cloning a DB using search-replace to replace the URLs in these columns aren't going to be for the current site.
| 'comment_id', | ||
|
|
||
| // wp_users table - User metadata and status | ||
| 'user_login', |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user_login column could theoretically contain a URL-like string if someone uses an email address or URL-formatted username. While uncommon, excluding this from search-replace in URL mode could miss legitimate use cases. Consider whether this exclusion is too aggressive.
| 'user_login', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if user admin at admin@blah.com search and replaces blah.com to foobar.com they are expected to know their login name has changed? Can't say I agree.
| 'link_rating', | ||
| 'link_updated', | ||
| 'link_rel', | ||
| 'link_rss', |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link_rss column in wp_links is specifically designed to store RSS feed URLs. This should not be in the skip list as it can contain URLs. The link_rel column typically contains relationship values like 'nofollow' rather than URLs, so that one is correctly excluded.
| 'link_rss', |
| if ( ! filter_var( $old, FILTER_VALIDATE_URL ) ) { | ||
| WP_CLI::error( | ||
| sprintf( | ||
| 'The --smart-url flag is designed for URL replacements, but "%s" is not a valid URL. ' . | ||
| 'Please use a full URL (e.g., http://example.com) or remove the --smart-url flag.', | ||
| $old | ||
| ) | ||
| ); | ||
| } |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PHP's FILTER_VALIDATE_URL accepts URLs without schemes (e.g., 'example.com'), but the auto-detection only triggers for URLs starting with 'http://' or 'https://'. This creates an inconsistency where the validation would pass for 'example.com' if someone manually uses --smart-url, but it wouldn't auto-detect. The validation should either require a scheme (http:// or https://) or accept URLs without schemes consistently.
| that cannot contain URLs (like post_type, post_status, user_pass, etc.), | ||
| significantly improving performance for URL replacements. This is | ||
| particularly useful when migrating sites or changing domain names. | ||
| Performance: ~34% faster on large databases. |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR description shows test results indicating 41.6% performance improvement, but the README states ~34%. While both might be valid depending on the database, it would be better to use consistent numbers or provide a range (e.g., '30-40% faster' or '~34-42% faster') to set accurate expectations.
| Performance: ~34% faster on large databases. | |
| Performance: ~34–42% faster on large databases (depending on the database). |
| | replacement | flags | | ||
| | {SITEURL}/subdir | | | ||
| | newdomain.com | | | ||
| | newdomain.com | --dry-run | |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test uses 'newdomain.com' as a replacement value, which is not a valid URL (missing scheme). This will cause FILTER_VALIDATE_URL to fail in the validation logic at line 311 of Search_Replace_Command.php when smart-url mode is auto-enabled. The test should use 'http://newdomain.com' or 'https://newdomain.com' instead.
| | replacement | flags | | |
| | {SITEURL}/subdir | | | |
| | newdomain.com | | | |
| | newdomain.com | --dry-run | | |
| | replacement | flags | | |
| | {SITEURL}/subdir | | | |
| | http://newdomain.com | | | |
| | http://newdomain.com | --dry-run | |
|
Thanks a lot! I think this goes in the right direction. This is very much related to #186. A separate command might be interesting for this as it allows for future optimizations more easily, such as ones mentioned at https://make.wordpress.org/core/2025/11/27/wordpress-importer-can-now-migrate-urls-in-your-content/ (cc @adamziel) |
|
At a glance, I feel like the real feature here is an option that says "ignore all non text columns" and it feels like it would be useful beyond URLs (though URLs is probably the most popular use case for this). Id be curious of a breakdown of performance and if there is a lot to gain by adding anything beyond the single feature that ignores based on column types. Feels like a lot of the hardcoded table names are already of a type that would be ignored anyway. The regex pattern for potential column names seems fragile and prone to false positives (and again, the positives are already likely to have a type that would be ignored). In other words, can we get most of the benefits here with just the single feature to ignore Maybe another command could build on top of that, but to me that seems like a small general change that could maybe have a big impact if a lot of time is spent on those types of columns. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@mrsdrizzle thank you for posting your thoughts here. That is how search-replace already operates via I agree that these skips can be quite aggressive and fragile as well - and that's why I exactly limited it to URLs only and made the regex and opt-in only thing via Take for example the core table There are another 9, nearly half, that I can say without doubt aren't worth scanning over for URLs but are typed as text columns: Now I'm obviously picking out one of the worst offenders in the WP DB to prove my point here, but we really wouldn't be discussing this if the core WP DB had better typing in general. Here's my benchmarks from scratch with a bit more data like you requested. Now this isn't exactly equivalent to the 100+ GB real world instances nor is the data distribution probably accurate, but is enough to show how much we're saving query overhead on. Database Size: ~15.72GB Final Results (3-run average)
Summary
|
|
@swissspidy I am not opposed to moving this to a separate command(although I don't know where to start on that), but this change will probably stay under the radar of folks that use |
|
Interesting, I supposed while technically being a type that holds text I would also consider I guess a better version of my question then is "Do you get most of these savings by only looking at TEXT/MEDIUMTEXT/LONGTEXT types -- ignoring anything else including For example, in your initial post when I look at the tables/columns that have been removed from the current output and the "Smart URL Mode Output" those are almost all just varchar right? and you'd presumably get the same savings by just allowing to skip This is quick/by hand (sorry for any errors) base on details here, but these seem to be the differences between the current output and your modified example -- pretty much all of which would also get ignored by just skipping
So you'd get mostly the same results I assume. |
|
no worries, let's see how it goes, mainline code with Final Results (3-run average)
Summary
Your guess is right - but it does skip these columns that exist that need to be processed for URLs: if the DB typing was a bit more strict that would be a great simplification of our problem here, but that's not the case in our current state of the WP DB. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
hmm yea, though that but annoying for the other two! (and thank you for taking the time to run those). Interesting to see where most of the savings comes from though, and still feels worth a feature to be able to target column types based on this. I don't think I really agree with the URL specific features here though or how it is designed. I've probably done a thousand of these type of change the URL for the site and I never use a string that starts with http/https because there are many ways that a domain name/"url" get stored in the database that don't look like that. lots of just Also agree that this should be a separate command if it can't be something more generic to general search-replace (like filtering out column types). I don't actually believe its possible to replace a url in a real world meaningful way with a single search-replace command |
Why
When search-replace-command for WP CLI scans for URLs and other bits of text to replace, it searches the whole SQL database for that value. Even tables in the WP core ecosystem that will never have a URL, columns like:
post_statusonwp_postsmeta_keyonwp_postmetaWhile not a big deal for most WordPress sites, this becomes a persistent speed bump to cloning environments when you start dealing with databases with a large amount of rows or databases with multisite setups with a huge amount of tables.
What this does
What this change does is add two parameters to alleviate this pain that a WP-CLI user can call.
--smart-url- skip columns automactically that exist in WP Core that will NEVER have a URL as a value. These columns are statically fixed and exist insrc/WP_CLI/SearchReplace/Non_URL_Columns.php.--analyze-tables- can only be executed if--smart-urlis also present in parameters. This parameter will tell WP-CLI to scan the database for columns that are non text datatypes in SQL (binary,datetime, etc.) and for column names that match the core WP pattern that would also not have a URL in it (*_order,*_quantity, etc.) that will be skipped additionally before search-replace runs. Will be a bit slower, but will capture more columns to skip for custom WP DB setups.You must opt into these performance skips via the parameters above and it is only recommended that you do so if you are replacing URLs in the WP DB.
Performance Gains
I've tested this in multiple local setups. In smaller setups (less than 1-2gb DB size) there was no noticeable difference between the scan speeds. However when the DB grew larger in my benchmarks (10GB) where there were a large amount of rows in
wp_postmetaandwp_poststhere was an average 30-40% savings over a normal search-replace scan speed.Test Run
wp search-replace(standard)wp search-replace --smart-urlTest Details
Standard Command Output
Smart URL Mode Output