Skip to content

Performance: Replace slow ORDER BY RAND() with COUNT+OFFSET#1065

Merged
audiodude merged 2 commits intoopenzim:mainfrom
ARCoder181105:QueryOptimize
Feb 21, 2026
Merged

Performance: Replace slow ORDER BY RAND() with COUNT+OFFSET#1065
audiodude merged 2 commits intoopenzim:mainfrom
ARCoder181105:QueryOptimize

Conversation

@ARCoder181105
Copy link
Copy Markdown
Contributor

This PR addresses the performance concerns regarding ORDER BY RAND() discussed in #1057.

Fixes #1058

Context:
As noted in the issue, ORDER BY RAND() forces a sort of all matching rows, which is inefficient. While this hasn't been critical due to low traffic, it is a known SQL anti-pattern.

Solution:
I have replaced the query with a COUNT() + OFFSET approach. This utilizes the existing indexes to count and fetch a single row without sorting the entire result set, effectively resolving the performance bottleneck.

Verification:
Ran wp1/logic/rating_test.py locally and all 28 tests passed.

image

@ARCoder181105
Copy link
Copy Markdown
Contributor Author

I replaced ORDER BY RAND() with a COUNT + OFFSET approach. ORDER BY RAND() is inefficient because it forces the database to sort every matching row before picking one. The new method skips the sort entirely, making it significantly faster and more scalable for large datasets.

@ARCoder181105
Copy link
Copy Markdown
Contributor Author

Hi @audiodude, @rao107

Please Review my PR,
If there is any improvement then please guide me , I would be really happy to implement that😉

Comment thread wp1/logic/rating.py
@ARCoder181105
Copy link
Copy Markdown
Contributor Author

Response to Index Concern

I've verified that the COUNT query does NOT do a table scan. Here's the proof:

Current Index Structure:

The ratings table has a PRIMARY KEY on (r_project, r_namespace, r_article).

EXPLAIN Analysis:

Query 1: COUNT with project filter

EXPLAIN SELECT COUNT(*) FROM ratings WHERE r_project = 'en.wikipedia.org';

Result:

  • type: ref (efficient index lookup, NOT a table scan)
  • key: PRIMARY (using the primary key index)
  • rows: 1 (only scanning 1 row)
  • Extra: Using where; Using index (using index-only scan)

Query 2: SELECT with OFFSET

EXPLAIN SELECT * FROM ratings WHERE r_project = 'en.wikipedia.org' LIMIT 1 OFFSET 100;

Result:

  • type: ref (efficient)
  • key: PRIMARY
  • rows: 1

The PRIMARY KEY's first column (r_project) allows MySQL to efficiently filter rows without scanning the entire table. The COUNT + OFFSET approach leverages this existing index structure.

Performance Comparison:

  • Old approach (ORDER BY RAND()): Forces sorting of ALL matching rows
  • New approach (COUNT + OFFSET): Uses index for COUNT, then uses index again for OFFSET

No additional indices are needed.

@ARCoder181105
Copy link
Copy Markdown
Contributor Author

Update: No Migration Needed ✅

I've verified using EXPLAIN that the queries do NOT perform table scans.

Evidence:

The ratings table already has a PRIMARY KEY on (r_project, r_namespace, r_article).

COUNT query EXPLAIN output:

+------+-------------+---------+------+---------------+---------+---------+-------+------+--------------------------+
| id   | select_type | table   | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+------+-------------+---------+------+---------------+---------+---------+-------+------+--------------------------+
|    1 | SIMPLE      | ratings | ref  | PRIMARY       | PRIMARY | 65      | const | 1    | Using where; Using index |
+------+-------------+---------+------+---------------+---------+---------+-------+------+--------------------------+

SELECT with OFFSET EXPLAIN output:

+------+-------------+---------+------+---------------+---------+---------+-------+------+-------------+
| id   | select_type | table   | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+------+-------------+---------+------+---------------+---------+---------+-------+------+-------------+
|    1 | SIMPLE      | ratings | ref  | PRIMARY       | PRIMARY | 65      | const | 1    | Using where |
+------+-------------+---------+------+---------------+---------+---------+-------+------+-------------+

Key observations:

  • type: ref - Uses index for efficient lookup (not a full table scan)
  • key: PRIMARY - Uses the existing primary key index
  • rows: 1 - Only scans 1 row instead of all matching rows
  • Extra: Using index - Index-only scan (even more efficient)

The PRIMARY KEY's first column (r_project) allows the database to efficiently use the index without any additional indices needed.

Performance comparison:

  • OLD (ORDER BY RAND()): Sorts ALL matching rows before selecting one
  • NEW (COUNT + OFFSET): Uses index for both COUNT and OFFSET operations

No migration or new indices are required. The existing PRIMARY KEY provides optimal performance.

@audiodude
Copy link
Copy Markdown
Member

We can merge this, but I encourage you to review our https://github.com/openzim/wp1/blob/main/CONTRIBUTING.md, especially the sections on the use of LLMs/AI.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.78%. Comparing base (814eb46) to head (12a5b82).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1065      +/-   ##
==========================================
- Coverage   92.78%   92.78%   -0.01%     
==========================================
  Files          74       74              
  Lines        4297     4308      +11     
==========================================
+ Hits         3987     3997      +10     
- Misses        310      311       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ARCoder181105
Copy link
Copy Markdown
Contributor Author

@audiodude
Yes I already read the LLM policy, I used AI to optimize the query that what I did

@audiodude
Copy link
Copy Markdown
Member

@ARCoder181105 thanks for being transparent about your LLM use. However, it seems your responses/explanations to my questions where directly copy/pasted from an LLM. Please try to use your own reasoning and voice in the future.

@ARCoder181105
Copy link
Copy Markdown
Contributor Author

Yeah, Understood👍👍
@audiodude

@audiodude audiodude merged commit 7b7af91 into openzim:main Feb 21, 2026
6 checks passed
@ARCoder181105 ARCoder181105 deleted the QueryOptimize branch February 21, 2026 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SQL query to get random articles is inefficient

2 participants