Pagination fix #19

Conversation
Pull request overview
This pull request adds pagination support to fix incomplete data retrieval when querying the Cloudsmith API. The changes enable fetching all pages of results instead of just the first page, and add support for client-side filtering when API-level filtering is insufficient.
Changes:
- Added pagination logic to handle Link and X-Pagination headers from the Cloudsmith API
- Implemented client-side filtering for download criteria when the API doesn't support aggregated stats
- Increased the page size from the default to 100 for better efficiency
```python
scheme, netloc, path, params, query, fragment = urlparse(start_url)
query_params = parse_qs(query)
query_params['page'] = [str(current_page + 1)]

new_query = urlencode(query_params, doseq=True)
next_url = urlunparse((scheme, netloc, path, params, new_query, fragment))
```
Copilot AI (Jan 27, 2026)
Using parse_qs followed by urlencode with doseq=True may not preserve the original query parameter order or format. If parse_qs returns list values and some parameters appeared multiple times in the original URL, this could alter the query string structure. Consider using urllib.parse.parse_qsl and urlencode without doseq if you want to preserve the original structure more faithfully, or verify that the API accepts query parameters in any order with list-style encoding.
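For illustration, a minimal sketch of the parse_qsl round trip; the build_next_url wrapper and the drop-and-append handling of the page parameter are assumptions, not code from this PR:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def build_next_url(start_url, current_page):
    # build_next_url is a hypothetical name for this sketch
    scheme, netloc, path, params, query, fragment = urlparse(start_url)
    # parse_qsl returns an ordered list of (key, value) pairs, so repeated
    # keys and their relative order survive the round trip unchanged
    pairs = [(k, v) for k, v in parse_qsl(query) if k != 'page']
    pairs.append(('page', str(current_page + 1)))
    return urlunparse((scheme, netloc, path, params, urlencode(pairs), fragment))
```

Note that urlencode on a list of pairs needs no doseq flag: each pair is emitted exactly once, in order.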
```python
if isinstance(data, list):
    results.extend(data)
```
Copilot AI (Jan 27, 2026)
When the response is not a list (line 610) or not a dict with 'repositories' (line 1096), the function silently skips adding results but continues pagination. This could mask API response format changes or errors. Consider adding a warning log when the response format is unexpected to aid debugging.
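A minimal sketch of that warning, assuming a module-level logger; collect_page is a hypothetical stand-in for the inline loop body:

```python
import logging

logger = logging.getLogger(__name__)  # assumed module-level logger

def collect_page(data, results, page_url):
    # collect_page is a hypothetical helper for illustration
    if isinstance(data, list):
        results.extend(data)
    else:
        # surface unexpected response shapes instead of skipping them silently
        logger.warning("Unexpected response format from %s: got %s, expected list",
                       page_url, type(data).__name__)
```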
```python
except ValueError:
    pass
```
Copilot AI (Jan 27, 2026)
When parse_qs or int() conversion fails in the X-Pagination header parsing (lines 1118-1131), the exception is silently caught with a bare except ValueError: pass. This could hide legitimate parsing errors. Consider logging the error at debug level to help diagnose pagination issues in production.
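Sketched with a hypothetical parse helper (the function and argument names are illustrative, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)

def parse_page_count(raw_value):
    # parse_page_count and raw_value are illustrative names
    try:
        return int(raw_value)
    except ValueError as exc:
        # keep the skip-on-failure behaviour, but leave a trace for debugging
        logger.debug("Could not parse X-Pagination value %r: %s", raw_value, exc)
        return None
```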
```python
def fetch_paginated_results(start_url, is_catalog=False):
    results = []
    next_url = start_url
    while next_url:
```
Copilot AI (Jan 27, 2026)
The pagination loop lacks a safety limit on the number of pages fetched. If the API returns a circular Link header or if there's a bug in the next URL construction, this could result in an infinite loop. Consider adding a maximum page limit (e.g., 1000 pages) or a safety counter to prevent runaway pagination.
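As a rough sketch of that guard (the MAX_PAGES value and the set-based cycle check are assumptions, not part of this PR):

```python
MAX_PAGES = 1000  # assumed ceiling; tune to the largest expected result set

def fetch_paginated_results(start_url, is_catalog=False):
    results = []
    next_url = start_url
    seen_urls = set()
    pages_fetched = 0
    while next_url:
        # stop on a circular Link header or runaway pagination
        if next_url in seen_urls or pages_fetched >= MAX_PAGES:
            break
        seen_urls.add(next_url)
        pages_fetched += 1
        # ...fetch the page and compute next_url exactly as the PR does...
        next_url = None  # placeholder so this sketch terminates on its own
    return results
```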
```python
def fetch_all_pages(start_url):
    results = []
    next_url = start_url
    while next_url:
```
Copilot AI (Jan 27, 2026)
The pagination loop lacks a safety limit on the number of pages fetched. If the API returns a circular Link header, this could result in an infinite loop. Consider adding a maximum page limit or detecting if the same URL is visited twice.
```python
# Helper to handle pagination
def fetch_all_pages(start_url):
    results = []
    next_url = start_url
    while next_url:
        resp = make_request(next_url, {"Cache-Control": "no-cache"}, return_headers=True)
        if not resp:
            break

        data, headers = resp
        if isinstance(data, list):
            results.extend(data)

        # Check Link header for next page
        # Format: <url>; rel="next", <url>; rel="last"
        next_url = None
        link_header = headers.get('Link')
        if link_header:
            for link in link_header.split(','):
                parts = link.split(';')
                if len(parts) >= 2 and 'rel="next"' in parts[1]:
                    # Extract URL from <url>
                    next_url = parts[0].strip().strip('<>')
                    break
    return results
```
Copilot AI (Jan 27, 2026)
This is a duplicate of the fetch_all_pages function defined at lines 601-624. The pagination logic should be extracted to a single module-level helper function to avoid code duplication and ensure consistent behavior across all API calls.
Suggested change:

```diff
-# Helper to handle pagination
-def fetch_all_pages(start_url):
-    results = []
-    next_url = start_url
-    while next_url:
-        resp = make_request(next_url, {"Cache-Control": "no-cache"}, return_headers=True)
-        if not resp:
-            break
-        data, headers = resp
-        if isinstance(data, list):
-            results.extend(data)
-        # Check Link header for next page
-        # Format: <url>; rel="next", <url>; rel="last"
-        next_url = None
-        link_header = headers.get('Link')
-        if link_header:
-            for link in link_header.split(','):
-                parts = link.split(';')
-                if len(parts) >= 2 and 'rel="next"' in parts[1]:
-                    # Extract URL from <url>
-                    next_url = parts[0].strip().strip('<>')
-                    break
-    return results
```
```python
stripped, criteria = parse_filter_criteria(args.filter)
if criteria:
    if args.output == 'table':
        console.print("[yellow]No direct matches. Expanding search to check aggregated download stats...[/yellow]")
    sq_stripped = f"{stripped} AND format:docker" if stripped else "format:docker"
    fq = urlencode({'query': sq_stripped, 'page_size': 100})
    su = f"https://api.cloudsmith.io/v1/packages/{args.org}/{args.repo}/?{fq}"
    found_packages = fetch_paginated_results(su)
```
Copilot AI (Jan 27, 2026)
Inconsistent indentation detected. This line uses an extra space at the beginning. All other lines in this block use consistent indentation, but this line appears to have an additional space character.
Suggested change:

```python
stripped, criteria = parse_filter_criteria(args.filter)
if criteria:
    if args.output == 'table':
        console.print("[yellow]No direct matches. Expanding search to check aggregated download stats...[/yellow]")
    sq_stripped = f"{stripped} AND format:docker" if stripped else "format:docker"
    fq = urlencode({'query': sq_stripped, 'page_size': 100})
    su = f"https://api.cloudsmith.io/v1/packages/{args.org}/{args.repo}/?{fq}"
    found_packages = fetch_paginated_results(su)
```
```python
for link in link_header.split(','):
    parts = link.split(';')
    if len(parts) >= 2 and 'rel="next"' in parts[1]:
        # Extract URL from <url>
        next_url = parts[0].strip().strip('<>')
```
Copilot AI (Jan 27, 2026)
The Link header parsing is fragile and may fail if the header format varies slightly. For example, if there are spaces around the angle brackets or the semicolon, the parsing could break. Consider using a more robust parsing approach, such as splitting on commas first, then using regex to extract the URL and rel value, or using a library designed for parsing Link headers.
Suggested change:

```python
# Split on commas to get individual link-values
for link in link_header.split(','):
    # Robustly extract URL and rel using regex to tolerate whitespace
    match = re.search(r'\s*<\s*([^>]+?)\s*>\s*;\s*rel\s*=\s*"(.*?)"', link)
    if not match:
        continue
    url_part, rel_value = match.groups()
    if rel_value == "next":
        next_url = url_part
```
```python
while next_url:
    resp = make_request(next_url, {"Cache-Control": "no-cache"}, return_headers=True)
    if not resp:
        break

    data, headers = resp
    if isinstance(data, list):
        results.extend(data)
```
Copilot AI (Jan 27, 2026)
The pagination loop lacks a safety limit on the number of pages fetched. If the API returns a circular Link header, this could result in an infinite loop. Consider adding a maximum page limit or detecting if the same URL is visited twice.
Suggested change:

```python
visited_urls = set()
max_pages = 1000  # safety limit to avoid infinite pagination loops
page_count = 0
while next_url:
    # break if we detect a pagination cycle
    if next_url in visited_urls:
        break
    visited_urls.add(next_url)
    # break if we exceed a reasonable maximum number of pages
    if page_count >= max_pages:
        break
    page_count += 1
    resp = make_request(next_url, {"Cache-Control": "no-cache"}, return_headers=True)
    if not resp:
        break
    data, headers = resp
    if isinstance(data, list):
        results.extend(data)
```
```python
if not filter_str: return "", []
criteria = []
# Matches: downloads:5, downloads>5, downloads<=10
pattern = re.compile(r'\bdownloads\s*([:><]=?)\s*(\d+)', re.IGNORECASE)
```
Copilot AI (Jan 27, 2026)
The regex pattern does not handle the '!=' operator which might be expected for 'not equals' comparisons. If this is intentional, consider documenting the supported operators. If '!=' should be supported, add it to the pattern and handle it in the matches_criteria function.
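If '!=' should be supported, one possible extension; matches_criteria's real signature is not shown in this diff, so the (operator, value) pair format is an assumption:

```python
import re

# '!=' added alongside the existing :, >, >=, <, <= operators
pattern = re.compile(r'\bdownloads\s*(!=|[:><]=?)\s*(\d+)', re.IGNORECASE)

def matches_criteria(download_count, criteria):
    # assumes criteria is a list of (operator, value) pairs from the pattern
    ops = {
        ':': lambda a, b: a == b,
        ':=': lambda a, b: a == b,  # the pattern also admits ':='
        '>': lambda a, b: a > b,
        '>=': lambda a, b: a >= b,
        '<': lambda a, b: a < b,
        '<=': lambda a, b: a <= b,
        '!=': lambda a, b: a != b,
    }
    return all(ops[op](download_count, int(value)) for op, value in criteria)
```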
📄 Summary
Briefly describe the purpose of this pull request. What problem does it solve or feature does it add?
🔍 Related Issues
Link to any related GitHub issues (e.g., Fixes #12, Closes #34):

🧪 Type of Change
Please check the relevant type tag for this PR title:
- [FIX] Bug fix
- [NEW] New thing
- [REFACTOR] Internal changes such as code restructuring or optimization that does not alter functionality
- [DOC] Documentation-only changes
- [CHORE] Maintenance, cleanup, or CI configuration

🧪 How Has This Been Tested?
Describe how you tested your changes. Include CI runs, local tests, manual verification, or screenshots if applicable.
📸 Screenshots (if applicable)
If UI or logs are affected, include before/after screenshots or output.
✅ Checklist