Optimized noclvag for the RX 7900 XTX OC.
This code was created by the Gleasonator himself.
I am not claiming any of his work as my own; I am only using this repo to A/B test hardware hacks for faster output.
Use this software at your own risk. It was already as-is software before I played with it.
After upgrading my GPU and 2.5 hours of agentic cooking, I went from 186 Mkey/s to almost 900 Mkey/s.
# Choose your name
023456789acdefghjklmnpqrstuvwxyz
# Not available
1, b, i, o
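
A quick way to check a candidate name before starting a search is to test it against the character list above. This is a minimal sketch and not part of noclvag; the function name `is_valid_vanity` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of noclvag): succeeds only if every
# character of $1 appears in the allowed list shown above.
is_valid_vanity() {
  case $1 in
    '' | *[!023456789acdefghjklmnpqrstuvwxyz]*) return 1 ;;  # empty, or has a disallowed char
    *) return 0 ;;
  esac
}

for name in danksw00ps bob; do
  if is_valid_vanity "$name"; then
    echo "$name: usable"
  else
    echo "$name: contains unavailable characters"
  fi
done
```

Here `danksw00ps` passes because it uses zeros rather than the letter o, while `bob` fails on both b and o.
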
# Open a terminal in the root of the project
cd ~/Documents/Code/noclvag
# Compile if you haven't
make
# Verify both platforms show up
./noclvag-search -l
# Clear the previous run's files if you already ran a search
rm -f noclvag-secrets.txt noclvag-request.txt noclvag-response.txt
# Generate the search request
./noclvag-tool --step1 npub1danksw00ps
# The "npub1" prefix is required; whatever follows it is your vanity name.
# Run the search on ROCm at ~898 Mkey/s
./noclvag-search -p 1 -d 0 -w 1024 -2
# Finalize the key once a match is found
./noclvag-tool --step3
As you can see below, it is a privilege to have a 10-character name. Given Moore's law, this will become more common in time. Anything beyond 10 is either a FAANG-level company getting into Nostr for brand reasons, or absolutely SPOONED RNG where you hit 11 and maybe 12 while looking for 10. Very few will farm beyond 10, and client devs should not encourage npub vanity beyond 10 chars if they want reasonable engagement from all users. 10 seems to be the magic number: something to age into gracefully without being too exclusive today.
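50% probability = 0.693 x 32^N keys, where N is the number of vanity characters and 0.693 is ln 2. GPU counts are based on ~900 Mkey/s per RX 7900 XTX.

The rates in this first table work out to reaching that 50% point in about 10 days (864,000 seconds):
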
| Chars | key/s needed | Spec range | Price range |
|---|---|---|---|
| 1 | instant | Any device | $0 |
| 2 | instant | Any device | $0 |
| 3 | instant | Any device | $0 |
| 4 | ~1 /s | Any device | $0 |
| 5 | ~27 /s | Any device | $0 |
| 6 | ~861 /s | Any CPU | $0 |
| 7 | ~28 Kkey/s | Any CPU | $0 |
| 8 | ~882 Kkey/s | Low-end GPU | $200-300 |
| 9 | ~28 Mkey/s | Budget GPU (RX 6600) | $200-350 |
| 10 | ~903 Mkey/s | High-end GPU (RX 7900 XTX) | $800-1500 |
| 11 | ~29 Gkey/s | ~32 high-end GPUs | $25K-50K |
| 12 | ~925 Gkey/s | ~1,000 GPUs | Not feasible |
| 13 | ~30 Tkey/s | ~33,000 GPUs | Not feasible |
| 14 | ~947 Tkey/s | ~1,000,000 GPUs | Not feasible |

The rates in this second table work out to reaching the 50% point in about 1 year:

| Chars | key/s needed | Spec range | Price range |
|---|---|---|---|
| 1 | instant | Any device | $0 |
| 2 | instant | Any device | $0 |
| 3 | instant | Any device | $0 |
| 4 | instant | Any device | $0 |
| 5 | ~1 /s | Any device | $0 |
| 6 | ~24 /s | Any device | $0 |
| 7 | ~755 /s | Any CPU | $0 |
| 8 | ~24 Kkey/s | Multi-core CPU | $0 |
| 9 | ~773 Kkey/s | Low-end GPU | $200-300 |
| 10 | ~25 Mkey/s | Budget GPU (RX 6600) | $200-350 |
| 11 | ~792 Mkey/s | High-end GPU (RX 7900 XTX) | $800-1500 |
| 12 | ~25 Gkey/s | ~28 high-end GPUs | $22K-42K |
| 13 | ~811 Gkey/s | ~900 GPUs | Not feasible |
| 14 | ~26 Tkey/s | ~29,000 GPUs | Not feasible |
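
The 50% formula can be checked with a few lines of shell and awk. This is a sketch that assumes only the formula stated above; the function names are invented for illustration, and 900 Mkey/s is the per-GPU rate quoted in the text.

```shell
#!/bin/sh
# Keys to try for a 50% chance of matching an N-character vanity name:
# ln(2) * 32^N, per the formula in the text.
keys_for_half() {
  awk -v n="$1" 'BEGIN { printf "%.0f\n", log(2) * 32^n }'
}

# Days needed to try that many keys at a given rate in Mkey/s.
days_for_half() {
  awk -v n="$1" -v mkeys="$2" \
    'BEGIN { printf "%.1f\n", log(2) * 32^n / (mkeys * 1e6) / 86400 }'
}

for n in 8 9 10 11; do
  printf '%2d chars: %s keys, %s days at 900 Mkey/s\n' \
    "$n" "$(keys_for_half "$n")" "$(days_for_half "$n" 900)"
done
```

At 900 Mkey/s, 10 characters comes out to roughly 10 days, and each extra character multiplies the time by 32.
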
- GPU: Asus TUF RX 7900 XTX OC 24GB (RDNA 3, navi31, gfx1100, 96 CUs, wave32)
- CPU: AMD Ryzen 7 3700X, RAM: 32GB, Mobo: Asus TUF B450M (PCIe 3.0 x16)
- Baseline: 595 Mkey/s (rusticl, unmodified code)
- Current: 898 Mkey/s (ROCm, optimized code) — +51%
- Rusticl fallback: 717 Mkey/s (+20%)
# Step 1: Generate search request
./noclvag-tool --step1 npub1<pattern>
# Step 2: Search (ROCm, 898 Mkey/s)
./noclvag-search -p 1 -d 0 -w 1024 -2
# Step 3: Finalize key
./noclvag-tool --step3
# Rusticl fallback (no ROCm needed)
./noclvag-search -p 0 -d 0 -2
# List available platforms
./noclvag-search -l
VLIW-style 3-pass add/subtract was designed for pre-GCN TeraScale (HD 2000-6000). RDNA 3 is scalar, so VLIW codegen is actively harmful. Detection checks the device name for navi/gfx9/gfx10/gfx11/Vega/Polaris/RX 5/6/7 strings.
Full unrolling of Montgomery multiplication outer loop causes register spilling on wave32 SIMD, killing occupancy. Same detection logic as above.
Lets auto-tuner pick larger grids before hitting the worksize cap.
memsize /= 2 changed to memsize = memsize * 3 / 4. Uses ~18GB of 24GB.
Root cause: rekey_max was hardcoded to 100M. When grid round size
(rows * cols) exceeded it, the condition (npoints + round) < rekey_at
was never true, creating an infinite CPU-only rekey loop with 0% GPU work.
ROCm auto-selected 12288x8192 = 100.7M which crossed this threshold.
Fix: Base increased to 1B, plus safety: if (rekey_max < round * 2) rekey_max = round * 2.
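
The arithmetic behind the hang and the fix can be reproduced in the shell. This only mirrors the numbers quoted above and is not the project's C code; `effective_rekey_max` is a made-up name, and the second grid in the loop is hypothetical, chosen large enough to trigger the safety clamp.

```shell
#!/bin/sh
OLD_REKEY_MAX=100000000    # old hardcoded cap (100M)
NEW_REKEY_BASE=1000000000  # new base (1B)

# Print the rekey_max that results from the new base plus the
# "at least two rounds" safety clamp, for a rows x cols grid.
# The function body runs in a subshell so its variables do not leak.
effective_rekey_max() (
  round=$(( $1 * $2 ))
  rekey_max=$NEW_REKEY_BASE
  if [ "$rekey_max" -lt $(( round * 2 )) ]; then
    rekey_max=$(( round * 2 ))
  fi
  echo "$rekey_max"
)

# 12288x8192 is the grid from the text; 65536x16384 is a hypothetical larger one.
for grid in "12288 8192" "65536 16384"; do
  set -- $grid
  round=$(( $1 * $2 ))
  if [ "$round" -gt "$OLD_REKEY_MAX" ]; then
    echo "${1}x${2}: round=$round exceeds the old 100M cap"
  fi
  echo "${1}x${2}: rekey_max=$(effective_rekey_max "$1" "$2")"
done
```

For 12288x8192 the 1B base already leaves room for several rounds; the clamp only matters once a single round approaches half the base.
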
- ROCm is 25% faster than rusticl for this kernel on RDNA 3
- ROCm prefers small worksize (`-w 1024` = 898), rusticl prefers large (`-w 65536` = 717)
- ROCm reports 48 CUs (WGPs) while rusticl reports 96 CUs, so grid auto-selection differs
- Auto-selected 6144x4096 is near-optimal for both platforms
- Smaller grids (4096x4096) lose ~8% throughput
- Larger grids hit rusticl's 2GB CL_DEVICE_MAX_MEM_ALLOC_SIZE limit and crash
- Never override `-i` (iteration count): defaults are optimal

- `RUSTICL_FEATURES=fp64`: no measurable improvement
- `ACCESS_STRIDE` changes: current 1024/128 already coalesces for wave32
- Manual `-i` override: auto-selected values beat all manual settings
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/noble/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
sudo apt update
sudo amdgpu-install --usecase=opencl --no-dkms
sudo usermod -a -G render,video $LOGNAME
# Reboot required. Platforms: 0=rusticl, 1=ROCm
- Progress output uses `\r` not `\n`: pipe through `tr '\r' '\n'` to parse
- `-v` flag spams "GPU idle" lines that hide Mkey/s; omit for clean output
- .oclbin files are kernel caches keyed by MD5(platform+device+flags+source) and are safe to delete
- First run after deleting .oclbin takes minutes for kernel JIT compilation
- If GPU hangs after failed CL_INVALID_BUFFER_SIZE, `pkill -9 noclvag-search` and retry
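
The carriage-return tip above can be wrapped in a small filter. This is a sketch: `last_progress_update` is an invented name, and the sample input is placeholder text, not real noclvag-search output.

```shell
#!/bin/sh
# Hypothetical filter: read a \r-delimited progress stream on stdin
# and print only the last non-empty update.
last_progress_update() {
  tr '\r' '\n' | awk 'NF { last = $0 } END { if (last != "") print last }'
}

# Placeholder input standing in for a progress stream.
printf 'update 1\rupdate 2\rupdate 3\r' | last_progress_update
```

In practice the search command would sit on the left of the pipe. Whether progress goes to stdout or stderr is not stated here, so a `2>&1` may be needed.
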