* w/o replacement is currently implemented in R
* w/ replacement uses either probabilistic sampling or the alias method
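For readers unfamiliar with the alias method named in the list above, here is a minimal base R sketch of Vose's variant. It is not the C++ code from this PR; `build_alias` and `sample_alias` are hypothetical names chosen for illustration, and the sketch favors readability over speed (the textbook construction is linear in the number of weights, whereas the vector copying here is not).

```r
# Build the alias tables for non-negative weights `prob`
# (the weights do not need to sum to 1).
build_alias <- function(prob) {
  m <- length(prob)
  p <- prob / sum(prob) * m        # rescale so that the mean is 1
  alias <- seq_len(m)              # default: every column aliases itself
  small <- which(p < 1)            # columns that need topping up
  large <- which(p >= 1)           # columns that can donate mass
  while (length(small) > 0 && length(large) > 0) {
    s <- small[length(small)]
    small <- small[-length(small)]
    l <- large[length(large)]
    alias[s] <- l                  # column s is filled up by item l
    p[l] <- p[l] - (1 - p[s])      # item l gives away the missing mass
    if (p[l] < 1) {                # donor dropped below the mean
      large <- large[-length(large)]
      small <- c(small, l)
    }
  }
  p[c(small, large)] <- 1          # leftovers equal 1 up to rounding
  list(p = p, alias = alias)
}

# Draw n indices with replacement: pick a column uniformly, then keep it
# with probability p[col] or switch to its alias otherwise.
sample_alias <- function(n, tab) {
  m <- length(tab$p)
  col <- sample.int(m, n, replace = TRUE)
  ifelse(runif(n) < tab$p[col], col, tab$alias[col])
}
```

After the one-time table construction, each draw needs only one uniform column choice and one comparison, regardless of how uneven the weights are.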
|
Problematic benchmark from #52 looks much better now:

library(dqrng)
m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 22.42ms 25.5ms 38.3
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 7.96ms 8.78ms 114.
m <- 1e1
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 227µs 245µs 3976.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 113µs 125µs 7508.

Created on 2023-10-07 with reprex v2.0.2

However, there is still some potential for improvement in the case of uneven weight distribution:

library(dqrng)
m <- 1e6
n <- 1e4
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 18.3ms 20.5ms 47.5
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 21.7ms 22.5ms 43.0
m <- 1e1
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 161µs 189µs 4914.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 122µs 135µs 7011.

Created on 2023-10-07 with reprex v2.0.2
|
Similar to the unweighted case: two variants, one using stochastic acceptance (fast for even weight distributions) and one using the alias method. These methods seem to be worthwhile for selection ratios < 0.5 (also similar to the unweighted case).
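To make the stochastic-acceptance variant concrete, here is a minimal base R sketch of weighted sampling without replacement that rejects already selected items. It only illustrates the idea and is not the C++ code from this commit; the function names are hypothetical, and a logical vector stands in for the set.

```r
# Draw one index with probability proportional to prob[i]:
# propose an index uniformly, accept it with probability prob[i] / max(prob).
draw_stochastic_acceptance <- function(prob, max_prob = max(prob)) {
  repeat {
    i <- sample.int(length(prob), 1L)
    if (runif(1) * max_prob < prob[i]) return(i)
  }
}

# Weighted sampling without replacement by rejection:
# keep drawing and discard indices that were already selected.
sample_weighted_no_replace <- function(prob, size) {
  stopifnot(size <= sum(prob > 0))     # otherwise the loop cannot terminate
  max_prob <- max(prob)
  selected <- logical(length(prob))    # stand-in for a bitset or hashset
  result <- integer(size)
  k <- 0L
  while (k < size) {
    i <- draw_stochastic_acceptance(prob, max_prob)
    if (!selected[i]) {
      selected[i] <- TRUE
      k <- k + 1L
      result[k] <- i
    }
  }
  result
}
```

Two kinds of rejection show up here. Proposals fail often when a single weight dominates `max(prob)`, which matches the remark that stochastic acceptance is fast for even weight distributions. Draws that hit an already selected item become more frequent as the selection ratio grows, which is presumably why these methods pay off mainly for ratios below 0.5.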
|
Interestingly the methods doing set-based rejection sampling from the last commit have better performance than the exponential rank. At least when |
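For comparison, the exponential-rank approach can be sketched in a few lines of base R. This assumes "exponential rank" refers to the technique of giving each item an exponential variate divided by its weight and keeping the items with the smallest keys; `sample_exp_rank` is a hypothetical name, and the PR's own implementation may differ in detail.

```r
# Weighted sampling without replacement via exponential ranks:
# item i gets the key Exp(1) / prob[i]; the `size` smallest keys win.
sample_exp_rank <- function(prob, size) {
  stopifnot(size <= sum(prob > 0))
  keys <- rexp(length(prob)) / prob    # zero weights yield Inf keys
  order(keys)[seq_len(size)]           # a partial sort would be enough here
}
```

This variant always generates one random variate per population element, regardless of `size`, whereas set-based rejection only does work proportional to the number of draws it attempts.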
|
For unweighted sampling the |
Recreate RcppExports.cpp with current development version of Rcpp to fix WARN on CRAN
Merge branch 'master' into feature/weighted-sampling-2
# Conflicts:
#	DESCRIPTION
#	NEWS.md
|
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718d is merged into main: |
|
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 128a3cd is merged into main: |
|
This is how benchmark results would change (along with a 95% confidence interval in relative change) if c5c07e5 is merged into main: |
|
Something to consider here as well: https://notstatschat.rbind.io/2024/08/26/another-way-to-not-sample-with-replacement/ |
w/o replacement is currently implemented in R
fixes #18
fixes #45
fixes #52
n < 1000 * size
cut-over point between bitset and hashset