Performance: Replace slow ORDER BY RAND() with COUNT+OFFSET by ARCoder181105 · Pull Request #1065 · openzim/wp1

ARCoder181105 · 2026-01-19T18:58:55Z

This PR addresses the performance concerns regarding ORDER BY RAND() discussed in #1057.

Context:
As noted in the issue, ORDER BY RAND() forces a sort of all matching rows, which is inefficient. While this hasn't been critical due to low traffic, it is a known SQL anti-pattern.

Solution:
I have replaced the query with a COUNT() + OFFSET approach. This utilizes the existing indexes to count and fetch a single row without sorting the entire result set, effectively resolving the performance bottleneck.

Verification:
Ran wp1/logic/rating_test.py locally and all 28 tests passed.

ARCoder181105 · 2026-01-19T18:59:19Z

I replaced ORDER BY RAND() with a COUNT + OFFSET approach. ORDER BY RAND() is inefficient because it forces the database to sort every matching row before picking one. The new method skips the sort entirely, making it significantly faster and more scalable for large datasets.

ARCoder181105 · 2026-01-19T19:01:12Z

Hi @audiodude, @rao107

Please Review my PR,
If there is any improvement then please guide me , I would be really happy to implement that😉

ARCoder181105 · 2026-01-20T14:08:22Z

Response to Index Concern

I've verified that the COUNT query does NOT do a table scan. Here's the proof:

Current Index Structure:

The ratings table has a PRIMARY KEY on (r_project, r_namespace, r_article).

EXPLAIN Analysis:

Query 1: COUNT with project filter

EXPLAIN SELECT COUNT(*) FROM ratings WHERE r_project = 'en.wikipedia.org';

Result:

type: ref (efficient index lookup, NOT a table scan)
key: PRIMARY (using the primary key index)
rows: 1 (only scanning 1 row)
Extra: Using where; Using index (using index-only scan)

Query 2: SELECT with OFFSET

EXPLAIN SELECT * FROM ratings WHERE r_project = 'en.wikipedia.org' LIMIT 1 OFFSET 100;

Result:

type: ref (efficient)
key: PRIMARY
rows: 1

The PRIMARY KEY's first column (r_project) allows MySQL to efficiently filter rows without scanning the entire table. The COUNT + OFFSET approach leverages this existing index structure.

Performance Comparison:

Old approach (ORDER BY RAND()): Forces sorting of ALL matching rows
New approach (COUNT + OFFSET): Uses index for COUNT, then uses index again for OFFSET

No additional indices are needed.

ARCoder181105 · 2026-01-20T14:09:34Z

Update: No Migration Needed ✅

I've verified using EXPLAIN that the queries do NOT perform table scans.

Evidence:

The ratings table already has a PRIMARY KEY on (r_project, r_namespace, r_article).

COUNT query EXPLAIN output:

+------+-------------+---------+------+---------------+---------+---------+-------+------+--------------------------+
| id   | select_type | table   | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+------+-------------+---------+------+---------------+---------+---------+-------+------+--------------------------+
|    1 | SIMPLE      | ratings | ref  | PRIMARY       | PRIMARY | 65      | const | 1    | Using where; Using index |
+------+-------------+---------+------+---------------+---------+---------+-------+------+--------------------------+

SELECT with OFFSET EXPLAIN output:

+------+-------------+---------+------+---------------+---------+---------+-------+------+-------------+
| id   | select_type | table   | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+------+-------------+---------+------+---------------+---------+---------+-------+------+-------------+
|    1 | SIMPLE      | ratings | ref  | PRIMARY       | PRIMARY | 65      | const | 1    | Using where |
+------+-------------+---------+------+---------------+---------+---------+-------+------+-------------+

Key observations:

✅ type: ref - Uses index for efficient lookup (not a full table scan)
✅ key: PRIMARY - Uses the existing primary key index
✅ rows: 1 - Only scans 1 row instead of all matching rows
✅ Extra: Using index - Index-only scan (even more efficient)

The PRIMARY KEY's first column (r_project) allows the database to efficiently use the index without any additional indices needed.

Performance comparison:

OLD (ORDER BY RAND()): Sorts ALL matching rows before selecting one
NEW (COUNT + OFFSET): Uses index for both COUNT and OFFSET operations

No migration or new indices are required. The existing PRIMARY KEY provides optimal performance.

audiodude · 2026-01-23T14:54:01Z

We can merge this, but I encourage you to review our https://github.com/openzim/wp1/blob/main/CONTRIBUTING.md, especially the sections on the use of LLMs/AI.

codecov · 2026-01-23T14:56:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.78%. Comparing base (814eb46) to head (12a5b82).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1065      +/-   ##
==========================================
- Coverage   92.78%   92.78%   -0.01%     
==========================================
  Files          74       74              
  Lines        4297     4308      +11     
==========================================
+ Hits         3987     3997      +10     
- Misses        310      311       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

ARCoder181105 · 2026-01-25T14:15:01Z

@audiodude
Yes I already read the LLM policy, I used AI to optimize the query that what I did

audiodude · 2026-01-25T20:57:18Z

@ARCoder181105 thanks for being transparent about your LLM use. However, it seems your responses/explanations to my questions where directly copy/pasted from an LLM. Please try to use your own reasoning and voice in the future.

ARCoder181105 · 2026-02-04T14:28:04Z

Yeah, Understood👍👍
@audiodude

audiodude requested changes Jan 19, 2026

View reviewed changes

Comment thread wp1/logic/rating.py

ARCoder181105 requested a review from audiodude January 20, 2026 14:22

audiodude approved these changes Jan 23, 2026

View reviewed changes

ARCoder181105 added 2 commits February 21, 2026 12:10

Performance: Replace slow ORDER BY RAND() with COUNT+OFFSET

7c9e356

Fixed

12a5b82

audiodude force-pushed the QueryOptimize branch from e8575de to 12a5b82 Compare February 21, 2026 20:10

audiodude merged commit 7b7af91 into openzim:main Feb 21, 2026
6 checks passed

ARCoder181105 deleted the QueryOptimize branch February 21, 2026 20:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance: Replace slow ORDER BY RAND() with COUNT+OFFSET#1065

Performance: Replace slow ORDER BY RAND() with COUNT+OFFSET#1065
audiodude merged 2 commits intoopenzim:mainfrom
ARCoder181105:QueryOptimize

ARCoder181105 commented Jan 19, 2026

Uh oh!

ARCoder181105 commented Jan 19, 2026

Uh oh!

ARCoder181105 commented Jan 19, 2026

Uh oh!

Uh oh!

ARCoder181105 commented Jan 20, 2026

Uh oh!

ARCoder181105 commented Jan 20, 2026

Uh oh!

audiodude commented Jan 23, 2026

Uh oh!

codecov Bot commented Jan 23, 2026 •

edited

Loading

Uh oh!

ARCoder181105 commented Jan 25, 2026

Uh oh!

audiodude commented Jan 25, 2026

Uh oh!

ARCoder181105 commented Feb 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ARCoder181105 commented Jan 19, 2026

Uh oh!

ARCoder181105 commented Jan 19, 2026

Uh oh!

ARCoder181105 commented Jan 19, 2026

Uh oh!

Uh oh!

ARCoder181105 commented Jan 20, 2026

Response to Index Concern

Current Index Structure:

EXPLAIN Analysis:

Performance Comparison:

Uh oh!

ARCoder181105 commented Jan 20, 2026

Update: No Migration Needed ✅

Evidence:

Uh oh!

audiodude commented Jan 23, 2026

Uh oh!

codecov Bot commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ARCoder181105 commented Jan 25, 2026

Uh oh!

audiodude commented Jan 25, 2026

Uh oh!

ARCoder181105 commented Feb 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jan 23, 2026 •

edited

Loading