diff --git a/CMakeLists.txt b/CMakeLists.txt
index 96a4782..44c6729 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -9,6 +9,7 @@ set(CMAKE_CXX_STANDARD 11)
 list(APPEND CUDA_NVCC_FLAGS_DEBUG -G -g)
 list(APPEND CUDA_NVCC_FLAGS_RELWITHDEBUGINFO -lineinfo)
+list(APPEND CUDA_NVCC_FLAGS "-std=c++11") # Crucial magic for CUDA linking

 find_package(Threads REQUIRED)
diff --git a/README.md b/README.md
index b71c458..ef0cbdc 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,468 @@ CUDA Stream Compaction

 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Ruoyu Fan
+* Tested on: Windows 10 x64, i7-6700K @ 4.00GHz 16GB, GTX 970 4096MB (girlfriend's machine)
+  * compiled with Visual Studio 2013 and CUDA 7.5

-### (TODO: Your README)
+NOTE: if the program crashes when it reaches the naive scan test, try reducing the array sizes in `main.cpp` (`SIZE` and `SORT_SIZE`).

-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![preview](/screenshots/preview_optimized.gif)

### Things I have done

* Implemented __CPU scan and compaction__, __GPU naive scan__, __GPU work-efficient scan__, __GPU work-efficient compaction__, and __GPU radix sort (extra)__, and compared my scan algorithms against the Thrust implementation.

* I optimized my work-efficient scan, raising its __speed to 270%__ of my original implementation; please refer to the __Optimization__ section.

* I also wrote an __inclusive version__ of the __work-efficient scan__ - because I misunderstood the requirement at first! The inclusive method creates a buffer that is 1 element larger and swaps the last element (0) with the second-to-last element before down-sweeping. Although I corrected my implementation to an exclusive scan, the inclusive scan can still be invoked by passing `ScanType::inclusive` to the `scan_implemention` method in `efficient.cu`.

* __Radix sort__ assumes inputs are in the range `[0, a_given_maximum)`. I compared my radix sort with `std::sort` and Thrust's unstable and stable sorts.

* I added a helper class `PerformanceTimer` in `common.h` which is used for performance measurement.


### Original Questions
```
* Roughly optimize the block sizes of each of your implementations for minimal
  run time on your GPU.
  * (You shouldn't compare unoptimized implementations to each other!)

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
  Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
  (with array size on the independent axis).
  * You should use CUDA events for timing GPU code. Be sure **not** to include
    any *initial/final* memory operations (`cudaMalloc`, `cudaMemcpy`) in your
    performance measurements, for comparability. Note that CUDA events cannot
    time CPU code.
  * You can use the C++11 `std::chrono` API for timing CPU code. See this
    [Stack Overflow answer](http://stackoverflow.com/a/23000049) for an example.
    Note that `std::chrono` may not provide high-precision timing. If it does
    not, you can either use it to time many iterations, or use another method.
  * To guess at what might be happening inside the Thrust implementation (e.g.
    allocation, memory copy), take a look at the Nsight timeline for its
    execution. Your analysis here doesn't have to be detailed, since you aren't
    even looking at the code for the implementation.
```

* Please refer to the __Performance__ section.

```
* Write a brief explanation of the phenomena you see here.
  * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
    it different for each implementation?
```

* ~~I noticed that I couldn't get a good measurement for __Thrust__'s scan and sort. I had trouble measuring `thrust::exclusive_scan` with `std::chrono`, and while I could use `std::chrono` to measure the scan, the results from CUDA events seemed off.~~ I had passed a host array to Thrust, so in my original tests Thrust was using a CPU sort algorithm; this is fixed, and the updated results are in `result_radix_max_100.txt` and `result_radix_max_100000000.txt`.

* When the maximum possible value (and therefore the MSB position) is increased, the run time of `std::sort` also increases significantly along with that of radix sort, which I did not expect. Does bit length also affect the cost of reads/writes or other operations, or does the `std::sort` implementation use bit-level information?

* I think the bottleneck at small block sizes is the GPU's warp size: blocks smaller than a warp (32 threads) leave lanes idle.

* My original work-efficient scan implementation was slower than the CPU scan. I then optimized it by __minimizing wasted threads__, and it now runs __faster than the CPU scan__; please refer to the __Optimization__ section below.

```
* Paste the output of the test program into a triple-backtick block in your
  README.
  * If you add your own tests (e.g. for radix sort or to test additional corner
    cases), be sure to mention it explicitly.

These questions should help guide you in performance analysis on future
assignments, as well.
```

* The tests are done with arrays of lengths `2^26` (67108864) and `2^26-3` (67108861). The array generation uses the current time as the random seed.

* I added tests for __radix sort__, which compare against `std::sort` as well as __Thrust__'s `thrust::sort`.


### Performance

#### Blocksize

When the block size is smaller than 16, my application suffers a performance drop, which is recorded in the `test_results` folder. I decided to just use `cudaOccupancyMaxPotentialBlockSize` for each kernel, which comes out to 1024 for every kernel on my machine (a minimal sketch of the query follows the table below).

![chart_blocksize](/screenshots/chart_blocksize.png)

| Block size | Naïve Scan (ms) | Work-efficient Scan (ms) | Reference CPU Scan (ms) |
|------------|-----------------|--------------------------|-------------------------|
| 4          | 755.433         | 76.5717                  | 134.408                 |
| 8          | 379.897         | 52.1942                  | 134.408                 |
| 16         | 212.542         | 44.3224                  | 134.408                 |
| 32         | 133.116         | 43.4925                  | 134.408                 |
| 64         | 114.7           | 44.7172                  | 134.408                 |
| 128        | 113.802         | 44.7841                  | 134.408                 |
| 256        | 114.575         | 45.3902                  | 134.408                 |
| 512        | 114.593         | 44.3084                  | 134.408                 |
| 1024       | 113.867         | 44.2941                  | 134.408                 |
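
These block sizes are not hand-tuned constants; they come from CUDA's occupancy calculator. As a minimal sketch (roughly what `calculateBlockSizeForDeviceFunction` in `common.h` does when no fixed block size is forced; the helper name `pickBlockSize` is just for illustration):

```c++
// Query the block size that maximizes occupancy for a given kernel.
// On the GPUs used for these tests this returns 1024 for every kernel.
template <typename Kernel>
int pickBlockSize(Kernel kernel)
{
    int min_grid_size = 0;  // smallest grid that still reaches full occupancy (unused here)
    int block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, kernel);
    return block_size;
}
```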
#### Array length

All the tests are done with a block size of 1024. The maximum possible value for sorting is 100.

__Scan__: this work-efficient scan implementation is faster than the CPU scan on large inputs, but slower than Thrust's.

![chart_scan](/screenshots/chart_scan.png)

| Input array length (power of 2) | CPU Scan (ms) | Naïve Scan (ms) | Work-efficient scan (ms) | Thrust scan (ms, CUDA event) |
|---------------------------------|---------------|-----------------|--------------------------|------------------------------|
| 12                              | 0             | 0.047872        | 0.110784                 | 0.028192                     |
| 16                              | 0             | 0.09216         | 0.180064                 | 0.166432                     |
| 20                              | 0             | 1.29402         | 0.818528                 | 0.30768                      |
| 24                              | 23.5625       | 25.7094         | 11.3379                  | 2.00646                      |
| 26                              | 159.925       | 133.61          | 44.4511                  | 7.7283                       |

__Compaction__: this work-efficient compaction implementation is faster than the CPU versions.

![chart_compact](/screenshots/chart_compact.png)

| Input array length (power of 2) | CPU compact (ms) | CPU scan compact (ms) | Work-efficient compact (ms) |
|---------------------------------|------------------|-----------------------|-----------------------------|
| 12                              | 0                | 0                     | 0.118368                    |
| 16                              | 0                | 0                     | 0.193408                    |
| 20                              | 0                | 0                     | 0.966944                    |
| 24                              | 39.0743          | 38.6241               | 13.3034                     |
| 26                              | 155.406          | 422.524               | 53.2659                     |

__Sort__: radix sort on the GPU is faster than `std::sort`.

![chart_sort](/screenshots/chart_sort.png)

| Input array length (power of 2) | std::sort (ms) | Radix sort (ms) |
|---------------------------------|----------------|-----------------|
| 12                              | 0              | 0.883328        |
| 16                              | 0              | 1.44288         |
| 20                              | 22.1649        | 7.61686         |
| 24                              | 378.025        | 105.222         |
| 26                              | 1481.8         | 419.345         |

#### Data maximum value and radix sort

__NOTE: THIS TEST RUNS ON A DIFFERENT MACHINE: Windows 7, Xeon(R) E5-1630 @ 3.70GHz 32GB, GTX 1070 8192MB (Moore 103 SigLab)__

All the tests are done with a block size of 1024; the array length is ~~67108864~~ 33554432.

| Max value  | std::sort (ms) | Radix sort (ms) | thrust::sort (ms) |
|------------|----------------|-----------------|-------------------|
| 100        | 917            | 143.992         | 20.2813           |
| 1000000000 | 2173           | 1023.97         | 20.6702           |

When the maximum possible value is increased, the run times of `std::sort` and Thrust's sort increase along with that of radix sort.

__I peeked at Thrust's internal function calls and found that `thrust::sort` uses a radix sort algorithm.__


### Sample Output

```
* THIS TEST RAN ON A DIFFERENT MACHINE:
  Windows 7, Xeon(R) E5-1630 @ 3.70GHz 32GB, GTX 1070 8192MB (Moore 103 SigLab)

CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan)
  Block size for naive scan: 1024
  Block size for up-sweep: 1024
  Block size for down-sweep: 1024
  Block size for boolean mapping: 1024
  Block size for scattering: 1024
  Block sizes for radix sort: 1024 1024 1024 1024

****************
** SCAN TESTS **
****************
Array size (power of two): 33554432
Array size (non-power of two): 33554429
    [ 13 29 47 7 2 28 44 49 35 46 2 49 9 ... 37 0 ]
==== cpu scan, power-of-two ====
    elapsed time: 93ms (std::chrono Measured)
    [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ]
==== cpu scan, non-power-of-two ====
    [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ]
    elapsed time: 87ms (std::chrono Measured)
    passed
==== naive scan, power-of-two ====
    [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ]
    elapsed time: 38.1036ms (CUDA Measured)
    passed
==== naive scan, non-power-of-two ====
    [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ...
821752661 821752701 ] + elapsed time: 38.112ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 15.0276ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 15.0576ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 8.90237ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 2.74368ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 1 1 3 1 1 3 0 2 3 2 0 1 0 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 83ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 3 2 ] + elapsed time: 84ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 237ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 18.2364ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 3 2 ] + elapsed time: 18.2344ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 +Max value: 1000000000 + [ 21520 17257 9407 8648 31232 11282 11169 22994 15890 9350 22656 25538 29919 ... 23658 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 2173ms (std::chrono Measured) +==== thrust::sort (which calls Thrust's radix sort), power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 20.6702ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 628.72ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 2153ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 617.616ms (CUDA Measured) + passed +``` + + +### Optimization + +#### Run less threads for work-efficient scan + +For work-efficient scan, my original implementation was using the same of amount of threads for every up sweep and down sweeps. Then I optimized it by using only necessary amount of threads for each iteration. + +The performance for scanning an array of length 67108861 using work-efficient approach boosted __from ~120.5ms to ~44.4ms__, which is __270% speed__ of my original approach. 
You can see the data in the files under the __test_results/__ folder.

![chart_scan_optimization](/screenshots/chart_scan_optimization.png)

Original implementation (in which `(index + 1) % (add_distance * 2) == 0` is `false` for most threads, so __those threads were wasted__):

```c++
// launches one thread per element, but most threads fail the modulo test and do nothing
__global__ void kernScanUpSweepPass(int N, int add_distance, int* buffer)
{
    auto index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= N) { return; }

    if ((index + 1) % (add_distance * 2) == 0)
    {
        buffer[index] = buffer[index] + buffer[index - add_distance];
    }
}

__global__ void kernScanDownSweepPass(int N, int distance, int* buffer)
{
    auto index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= N) { return; }

    if ((index + 1) % (distance * 2) == 0)
    {
        auto temp = buffer[index - distance];
        buffer[index - distance] = buffer[index];
        buffer[index] = temp + buffer[index];
    }
}
```

New implementation:

```c++
// optimized: the host code launches only the necessary number of threads
__global__ void kernScanUpSweepPass(int max_thread_index, int add_distance, int* buffer)
{
    auto tindex = threadIdx.x + blockIdx.x * blockDim.x;

    if (tindex >= max_thread_index) { return; }

    // I ran into an overflow problem with an `index < N` guard here,
    // so the thread index is bounded by max_thread_index instead
    size_t index = (add_distance * 2) * (1 + tindex) - 1;

    buffer[index] = buffer[index] + buffer[index - add_distance];
}

// optimized: the host code launches only the necessary number of threads
__global__ void kernScanDownSweepPass(int max_thread_index, int distance, int* buffer)
{
    auto tindex = threadIdx.x + blockIdx.x * blockDim.x;

    if (tindex >= max_thread_index) { return; }

    size_t index = (distance * 2) * (1 + tindex) - 1;

    auto temp = buffer[index - distance];
    buffer[index - distance] = buffer[index];
    buffer[index] = temp + buffer[index];
}
```

In the host code I calculate the number of threads needed, and hence the maximum thread index, for every up-sweep and down-sweep pass (a condensed sketch of this dispatch loop appears at the end of this README).

Originally I still passed the buffer length as the first parameter and guarded each thread with the condition `(distance * 2) * (1 + tindex) - 1 > N`. That multiplication can go out of range (it did even with `size_t`) and produce weird results; it took me two hours to debug. Lesson learned: in the future I will write such guards as `n > b/a` rather than `a*n > b`.

#### Helper class for performance measurement

I created a RAII `PerformanceTimer` class for performance measurement.
It looks like this:

```c++
/**
* This class is used for timing performance.
* It is uncopyable and unmovable.
*/
class PerformanceTimer
{
public:
    PerformanceTimer()
    {
        cudaEventCreate(&event_start);
        cudaEventCreate(&event_end);
    }

    ~PerformanceTimer()
    {
        cudaEventDestroy(event_start);
        cudaEventDestroy(event_end);
    }

    void startCpuTimer()
    {
        if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
        cpu_timer_started = true;

        time_start_cpu = std::chrono::high_resolution_clock::now();
    }

    void endCpuTimer()
    {
        time_end_cpu = std::chrono::high_resolution_clock::now();

        if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }

        std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
        prev_elapsed_time_cpu_milliseconds =
            static_cast<float>(duro.count());

        cpu_timer_started = false;
    }

    void startGpuTimer()
    {
        if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
        gpu_timer_started = true;

        cudaEventRecord(event_start);
    }

    void endGpuTimer()
    {
        cudaEventRecord(event_end);
        cudaEventSynchronize(event_end);

        if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }

        cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
        gpu_timer_started = false;
    }

    float getCpuElapsedTimeForPreviousOperation() noexcept
    {
        return prev_elapsed_time_cpu_milliseconds;
    }

    float getGpuElapsedTimeForPreviousOperation() noexcept
    {
        return prev_elapsed_time_gpu_milliseconds;
    }

    // remove copy and move functions
    PerformanceTimer(const PerformanceTimer&) = delete;
    PerformanceTimer(PerformanceTimer&&) = delete;
    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
    PerformanceTimer& operator=(PerformanceTimer&&) = delete;

private:
    cudaEvent_t event_start = nullptr;
    cudaEvent_t event_end = nullptr;

    using time_point_t = std::chrono::high_resolution_clock::time_point;
    time_point_t time_start_cpu;
    time_point_t time_end_cpu;

    bool cpu_timer_started = false;
    bool gpu_timer_started = false;

    float prev_elapsed_time_cpu_milliseconds = 0.f;
    float prev_elapsed_time_gpu_milliseconds = 0.f;
};
```

And inside a module I have:

```c++
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
    // not thread-safe
    static PerformanceTimer timer;
    return timer;
}
```

Therefore, I can use

```c++
void someFunc()
{
    allocateYourBuffers();

    timer().startGpuTimer();

    doYourGpuScan();

    timer().endGpuTimer();

    endYourJob();
}
```

and

```c++
timer().getGpuElapsedTimeForPreviousOperation();
```

to get the measured elapsed time for the operation.
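
#### Recap: host-side dispatch for the optimized scan

To make the thread-count optimization described above concrete, here is a condensed sketch of the host-side loop (essentially what `scanInPlaceDevice` in `efficient.cu` does). It assumes `dev_buffer` holds `extended_n` elements on the device and that `extended_n` has already been rounded up to a power of two; for brevity it uses a single block size for both sweeps, whereas the real code queries one per kernel, and the wrapper name is just for illustration.

```c++
// Work-efficient exclusive scan over a device buffer (condensed sketch).
void scanInPlaceDeviceSketch(int extended_n, int* dev_buffer)
{
    const int block_size = getUpSweepBlockSize();   // from cudaOccupancyMaxPotentialBlockSize
    const int pass_count = ilog2ceil(extended_n) - 1;

    // Up-sweep: pass d only touches extended_n >> (d + 1) nodes, so launch exactly that many threads.
    for (int d = 0; d <= pass_count; d++)
    {
        int max_thread_index = extended_n >> (d + 1);
        int grid = fullBlocksPerGrid(max_thread_index, block_size);
        kernScanUpSweepPass<<<grid, block_size>>>(max_thread_index, 1 << d, dev_buffer);
    }

    // The root of the tree becomes the identity element for an exclusive scan.
    kernSetZero<<<1, 1>>>(extended_n - 1, dev_buffer);

    // Down-sweep mirrors the up-sweep, with the active thread count growing back out.
    for (int d = pass_count; d >= 0; d--)
    {
        int max_thread_index = extended_n >> (d + 1);
        int grid = fullBlocksPerGrid(max_thread_index, block_size);
        kernScanDownSweepPass<<<grid, block_size>>>(max_thread_index, 1 << d, dev_buffer);
    }
}
```

The whole optimization lives in the `extended_n >> (d + 1)` term: instead of launching `extended_n` threads every pass and masking most of them off with a modulo test, each pass launches only the threads that actually touch the tree at that level.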
diff --git a/screenshots/chart_blocksize.png b/screenshots/chart_blocksize.png new file mode 100644 index 0000000..5135747 Binary files /dev/null and b/screenshots/chart_blocksize.png differ diff --git a/screenshots/chart_compact.png b/screenshots/chart_compact.png new file mode 100644 index 0000000..3530d46 Binary files /dev/null and b/screenshots/chart_compact.png differ diff --git a/screenshots/chart_scan.png b/screenshots/chart_scan.png new file mode 100644 index 0000000..349f583 Binary files /dev/null and b/screenshots/chart_scan.png differ diff --git a/screenshots/chart_scan_optimization.png b/screenshots/chart_scan_optimization.png new file mode 100644 index 0000000..efc8e32 Binary files /dev/null and b/screenshots/chart_scan_optimization.png differ diff --git a/screenshots/chart_sort.png b/screenshots/chart_sort.png new file mode 100644 index 0000000..4c4fe78 Binary files /dev/null and b/screenshots/chart_sort.png differ diff --git a/screenshots/preview.gif b/screenshots/preview.gif new file mode 100644 index 0000000..4941094 Binary files /dev/null and b/screenshots/preview.gif differ diff --git a/screenshots/preview_optimized.gif b/screenshots/preview_optimized.gif new file mode 100644 index 0000000..a120522 Binary files /dev/null and b/screenshots/preview_optimized.gif differ diff --git a/src/main.cpp b/src/main.cpp index 675da35..0170cd4 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -1,89 +1,131 @@ /** * @file main.cpp * @brief Stream compaction test program - * @authors Kai Ninomiya - * @date 2015 + * @authors Kai Ninomiya, Ruoyu Fan + * @date 2015, 2016 * @copyright University of Pennsylvania */ -#include #include #include #include #include +#include + #include "testing_helpers.hpp" +#include +#include +#include + +// size of 1 << 26 could on a 970 desktop +// but crashed on my laptop (970m) so I reduced array size +// change the array size if there is still problem +const int SIZE = 1 << 25; //constexpr +//const int SIZE = 1 << 24; +const int NPOT = SIZE - 3; +const int SCAN_MAX = 50; +const int COMPACTION_MAX = 4; + +const int SORT_SIZE = 1 << 25; +//const int SORT_SIZE = 1 << 24; +const int SORT_NPOT = SORT_SIZE - 3; +const int SORT_MAX = 1000000000; + + +int a[SIZE], b[SIZE], c[SIZE], d[SORT_SIZE], e[SORT_SIZE], f[SORT_SIZE]; int main(int argc, char* argv[]) { - const int SIZE = 1 << 8; - const int NPOT = SIZE - 3; - int a[SIZE], b[SIZE], c[SIZE]; // Scan tests + + std::cout << "CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan)"; + std::cout << std::endl; + std::cout << " Block size for naive scan: " << StreamCompaction::Naive::getNaiveScanBlockSize() << std::endl; + std::cout << " Block size for up-sweep: " << StreamCompaction::Efficient::getUpSweepBlockSize() << std::endl; + std::cout << " Block size for down-sweep: " << StreamCompaction::Efficient::getDownSweepBlockSize() << std::endl; + std::cout << " Block size for boolean mapping: " << StreamCompaction::Common::getMapToBooleanBlockSize() << std::endl; + std::cout << " Block size for scattering: " << StreamCompaction::Common::getScatterBlocksize() << std::endl; + std::cout << " Block sizes for radix sort: " + << StreamCompaction::RadixSort::getComputeBArrayBlockSize() << " " + << StreamCompaction::RadixSort::getComputeDArrayBlockSize() << " " + << StreamCompaction::RadixSort::getComputeEArrayBlockSize() << " " + << StreamCompaction::RadixSort::getReshuffleBlockSize() << std::endl; printf("\n"); printf("****************\n"); printf("** SCAN TESTS **\n"); printf("****************\n"); + std::cout << "Array size 
(power of two): " << SIZE << std::endl; + std::cout << "Array size (non-power of two): " << NPOT << std::endl; - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + genArray(SIZE - 1, a, SCAN_MAX); // result for edge case of 0 looks fine a[SIZE - 1] = 0; printArray(SIZE, a, true); zeroArray(SIZE, b); printDesc("cpu scan, power-of-two"); StreamCompaction::CPU::scan(SIZE, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(SIZE, b, true); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); StreamCompaction::CPU::scan(NPOT, c, a); printArray(NPOT, b, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); - //printArray(SIZE, c, true); + printArray(NPOT, c, true); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(NPOT, b, c); printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); printf("*****************************\n"); - + std::cout << "Array size (power of two): " << SIZE << std::endl; + std::cout << "Array size (non-power of two): " << NPOT << std::endl; // Compaction tests - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + genArray(SIZE - 1, a, COMPACTION_MAX); // result for edge case of 0 looks fine a[SIZE - 1] = 0; printArray(SIZE, a, true); @@ -94,6 +136,7 @@ int main(int argc, char* argv[]) { count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); expectedCount = count; printArray(count, b, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printCmpLenResult(count, expectedCount, b, b); zeroArray(SIZE, c); @@ 
-101,23 +144,74 @@ int main(int argc, char* argv[]) { count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); expectedNPOT = count; printArray(count, c, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printCmpLenResult(count, expectedNPOT, b, c); zeroArray(SIZE, c); printDesc("cpu compact with scan"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); printArray(count, c, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); - //printArray(count, c, true); + printArray(count, c, true); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); - //printArray(count, c, true); + printArray(count, c, true); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpLenResult(count, expectedNPOT, b, c); + + printf("\n"); + printf("*****************************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("*****************************\n"); + std::cout << "Array size (power of two): " << SORT_SIZE << std::endl; + std::cout << "Array size (non-power of two): " << SORT_NPOT << std::endl; + std::cout << "Max value: " << SORT_MAX << std::endl; + + genArray(SORT_SIZE - 1, d, SORT_MAX); + d[SORT_SIZE - 1] = 0; + printArray(SORT_SIZE, d, true); + + printDesc("std::sort, power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(e)); + StreamCompaction::CPU::stdSort(std::begin(e), std::end(e)); + printArray(SORT_SIZE, e, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + + printDesc("thrust::sort (which calls Thrust's radix sort), power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(f)); + StreamCompaction::Thrust::sort(std::begin(f), std::end(f)); + printArray(SORT_SIZE, f, true); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printCmpResult(SORT_SIZE, e, f); + // I wanted to compare with thrust's unstable and stable sort, but it uses radix sort! 
+ + printDesc("radix sort, power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(f)); + StreamCompaction::RadixSort::radixSort(std::begin(f), std::end(f), SORT_MAX); + printArray(SORT_SIZE, f, true); + printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printCmpResult(SORT_SIZE, e, f); + + // must be after all power-of-two sorts since it is standard value + printDesc("std::sort, non power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(e)); + StreamCompaction::CPU::stdSort(std::begin(e), std::begin(e) + SORT_NPOT); + printArray(SORT_NPOT, e, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + + printDesc("radix sort, non power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(f)); + StreamCompaction::RadixSort::radixSort(std::begin(f), std::begin(f) + SORT_NPOT, SORT_MAX); + printArray(SORT_NPOT, f, true); + printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printCmpResult(SORT_NPOT, e, f); } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index f6b572f..87b5a6b 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -1,6 +1,9 @@ #pragma once #include +#include +#include +#include template int cmpArrays(int n, T *a, T *b) { @@ -17,6 +20,12 @@ void printDesc(const char *desc) { printf("==== %s ====\n", desc); } +template +void printElapsedTime(T time, std::string note = "") +{ + std::cout << " elapsed time: " << time << "ms " << note << std::endl; +} + template void printCmpResult(int n, T *a, T *b) { printf(" %s \n", @@ -40,10 +49,19 @@ void zeroArray(int n, int *a) { } void genArray(int n, int *a, int maxval) { - srand(0); + //srand(0); + srand(time(nullptr)); - for (int i = 0; i < n; i++) { - a[i] = rand() % maxval; + for (int i = 0; i < n; i++) + { + if (maxval == 0) + { + a[i] = 0; + } + else + { + a[i] = rand() % maxval; + } } } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..bc6bdbc 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -9,6 +9,8 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "radix_sort.h" + "radix_sort.cu" ) cuda_add_library(stream_compaction diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index fe872d4..6d49e5d 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,5 +1,8 @@ #include "common.h" +#include +#include + void checkCUDAErrorFn(const char *msg, const char *file, int line) { cudaError_t err = cudaGetLastError(); if (cudaSuccess == err) { @@ -22,17 +25,61 @@ namespace Common { * Maps an array to an array of 0s and 1s for stream compaction. Elements * which map to 0 will be removed, and elements which map to 1 will be kept. */ -__global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO +__global__ void kernMapToBoolean(int n, int *bools, const int *idata) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) { return; } + + if (idata[index]) + { + bools[index] = 1; + } + else + { + bools[index] = 0; + } } + +//__global__ void kernScatter(int n, int *odata, +// const int *idata, const int *bools, const int *indices) + /** - * Performs scatter on an array. That is, for each element in idata, - * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. - */ +* Performs scatter on an array. 
That is, for each element in idata, +* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. +*/ __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { - // TODO + const int *idata, const int *indices) +{ + // use one less buffer to save space + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) { return; } + + if (idata[index]) + { + odata[indices[index]] = idata[index]; + } +} + +int getMapToBooleanBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernMapToBoolean); + } + return block_size; +} +int getScatterBlocksize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernScatter); + } + return block_size; } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 4f52663..5a93691 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -1,8 +1,14 @@ #pragma once +#include +#include + #include #include #include +#include +#include +#include #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) @@ -24,12 +30,139 @@ inline int ilog2ceil(int x) { return ilog2(x - 1) + 1; } +inline int fullBlocksPerGrid(int n, int block_size) +{ + return (n + block_size - 1) / block_size; +} + +// use fixed block size or block size from cudaOccupancyMaxPotentialBlockSize +const int FIXED_BLOCK_SIZE = -1; +//const int FIXED_BLOCK_SIZE = 128; + +template +int calculateBlockSizeForDeviceFunction(T func) +{ + if (FIXED_BLOCK_SIZE <= 0) + { + int block_size; + int min_grid_size; + cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, func); + return block_size; + } + else + { + return FIXED_BLOCK_SIZE; + } +} + +#undef IS_FIXED_BLOCK_SIZE +#undef FIXED_BLOCK_SIZE + namespace StreamCompaction { namespace Common { __global__ void kernMapToBoolean(int n, int *bools, const int *idata); - + + //__global__ void kernScatter(int n, int *odata, + // const int *idata, const int *bools, const int *indices); __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices); + const int *idata, const int *indices); // use one less buffer to save space + + int getMapToBooleanBlockSize(); + int getScatterBlocksize(); + + /** + * This class is used for timing the performance + * Uncopyable and unmovable + */ + class PerformanceTimer + { + public: + PerformanceTimer() + { + cudaEventCreate(&event_start); + cudaEventCreate(&event_end); + } + + ~PerformanceTimer() + { + cudaEventDestroy(event_start); + cudaEventDestroy(event_end); + } + + void startCpuTimer() + { + if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } + cpu_timer_started = true; + + time_start_cpu = std::chrono::high_resolution_clock::now(); + } + + void endCpuTimer() + { + time_end_cpu = std::chrono::high_resolution_clock::now(); + + if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } + + std::chrono::duration duro = time_end_cpu - time_start_cpu; + prev_elapsed_time_cpu_milliseconds = + static_cast(duro.count()); + + cpu_timer_started = false; + } + + void startGpuTimer() + { + if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } + gpu_timer_started = true; + + cudaEventRecord(event_start); + } + + void endGpuTimer() + { + 
cudaEventRecord(event_end); + cudaEventSynchronize(event_end); + + if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } + + cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); + gpu_timer_started = false; + } + + float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 + { + return prev_elapsed_time_cpu_milliseconds; + } + + float getGpuElapsedTimeForPreviousOperation() //noexcept + { + return prev_elapsed_time_gpu_milliseconds; + } + + // remove copy and move functions + PerformanceTimer(const PerformanceTimer&) = delete; + PerformanceTimer(PerformanceTimer&&) = delete; + PerformanceTimer& operator=(const PerformanceTimer&) = delete; + PerformanceTimer& operator=(PerformanceTimer&&) = delete; + + private: + cudaEvent_t event_start = nullptr; + cudaEvent_t event_end = nullptr; + + using time_point_t = std::chrono::high_resolution_clock::time_point; + time_point_t time_start_cpu; + time_point_t time_end_cpu; + + bool cpu_timer_started = false; + bool gpu_timer_started = false; + + float prev_elapsed_time_cpu_milliseconds = 0.f; + float prev_elapsed_time_gpu_milliseconds = 0.f; + }; + } } + + + diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..867b7b4 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,45 @@ -#include #include "cpu.h" +#include "common.h" + +#include +#include +#include + namespace StreamCompaction { namespace CPU { +using StreamCompaction::Common::PerformanceTimer; +PerformanceTimer& timer() +{ + static PerformanceTimer timer; + return timer; +} + +void scan_implementation(int n, int *odata, const int *idata) +{ + if (n <= 0) { return; } + + odata[0] = 0; + + using std::size_t; + for (size_t i = 1; i < n; i++) + { + odata[i] = odata[i - 1] + idata[i - 1]; + } +} + /** * CPU scan (prefix sum). */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); +void scan(int n, int *odata, const int *idata) +{ + // DONE + timer().startCpuTimer(); + + scan_implementation(n, odata, idata); + + timer().endCpuTimer(); } /** @@ -17,9 +47,25 @@ void scan(int n, int *odata, const int *idata) { * * @returns the number of elements remaining after compaction. */ -int compactWithoutScan(int n, int *odata, const int *idata) { - // TODO - return -1; +int compactWithoutScan(int n, int *odata, const int *idata) +{ + //DONE + timer().startCpuTimer(); + + using std::size_t; + size_t olength = 0; + for (size_t i = 0; i < n; i++) + { + if (idata[i]) + { + odata[olength] = idata[i]; + olength++; + } + } + + timer().endCpuTimer(); + + return static_cast(olength); } /** @@ -27,9 +73,49 @@ int compactWithoutScan(int n, int *odata, const int *idata) { * * @returns the number of elements remaining after compaction. 
*/ -int compactWithScan(int n, int *odata, const int *idata) { - // TODO - return -1; +int compactWithScan(int n, int *odata, const int *idata) +{ + // DONE + // Run CPU scan + using std::size_t; + std::vector scan_result(n, 0); + + timer().startCpuTimer(); + + for (size_t i = 0; i < n; i++) + { + if (idata[i]) + { + // also use odata as a temp buffer to save space + odata[i] = 1; + } + } + + scan_implementation(n, scan_result.data(), odata); + + size_t olength = 0; + for (size_t i = 0; i < n; i++) + { + if (idata[i]) + { + odata[scan_result[i]] = idata[i]; + olength++; + } + } + + timer().endCpuTimer(); + + return static_cast(olength); +} + +/** +* This just calls std::sort +*/ +void stdSort(int* start, int* end) +{ + timer().startCpuTimer(); + std::sort(start, end); + timer().endCpuTimer(); } } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 6348bf3..f83c49a 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -1,11 +1,17 @@ #pragma once +#include "common.h" + namespace StreamCompaction { namespace CPU { + StreamCompaction::Common::PerformanceTimer& timer(); + void scan(int n, int *odata, const int *idata); int compactWithoutScan(int n, int *odata, const int *idata); int compactWithScan(int n, int *odata, const int *idata); + + void stdSort(int* start, int* end); } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index b2f739b..5a359f1 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -1,19 +1,235 @@ +#include "efficient.h" + +#include "common.h" #include #include -#include "common.h" -#include "efficient.h" + +#include namespace StreamCompaction { namespace Efficient { -// TODO: __global__ +using StreamCompaction::Common::PerformanceTimer; +PerformanceTimer& timer() +{ + // not thread-safe + static PerformanceTimer timer; + return timer; +} + +// DONE: __global__ + +// optimized: only launch necessary amount of threads in host code +__global__ void kernScanUpSweepPass(int max_thread_index, int add_distance, int* buffer) +{ + auto tindex = threadIdx.x + blockIdx.x * blockDim.x; + + if (tindex >= max_thread_index) { return; } + + // I encountered overflow problem with index < N here so I changed to tindex < max_thread_index + size_t index = (add_distance * 2) * (1 + tindex) - 1; + + buffer[index] = buffer[index] + buffer[index - add_distance]; +} + +// optimized: only launch necessary amount of threads in host code +__global__ void kernScanDownSweepPass(int max_thread_index, int distance, int* buffer) +{ + auto tindex = threadIdx.x + blockIdx.x * blockDim.x; + + if (tindex >= max_thread_index) { return; } + + size_t index = (distance * 2) * (1 + tindex) - 1; + + auto temp = buffer[index - distance]; + buffer[index - distance] = buffer[index]; + buffer[index] = temp + buffer[index]; +} + +// // This old functions launch the same number of threads for all passes, which is unnecessary +//__global__ void kernScanUpSweepPass(int N, int add_distance, int* buffer) +//{ +// auto index = threadIdx.x + blockIdx.x * blockDim.x; +// if (index >= N) { return; } +// +// if ((index + 1) % (add_distance * 2) == 0) +// { +// buffer[index] = buffer[index] + buffer[index - add_distance]; +// } +//} +// +//__global__ void kernScanDownSweepPass(int N, int distance, int* buffer) +//{ +// auto index = threadIdx.x + blockIdx.x * blockDim.x; +// if (index >= N) { return; } +// +// if ((index + 1) % (distance * 2) == 0) +// { +// auto temp = buffer[index - distance]; +// buffer[index - distance] = buffer[index]; 
+// buffer[index] = temp + buffer[index]; +// } +//} + +/** +* Swap value of two array members +* Used for inclusive scan as I misunderstood the requirement +*/ +__global__ void kernSwap(int index1, int index2, int* buffer) +{ + auto thread_index = threadIdx.x + blockIdx.x * blockDim.x; + if (thread_index == 0) + { + buffer[index1] ^= buffer[index2]; + buffer[index2] ^= buffer[index1]; + buffer[index1] ^= buffer[index2]; + } +} + +/** +* Set value of an array member to zero +*/ +__global__ void kernSetZero(int index, int* buffer) +{ + auto thread_index = threadIdx.x + blockIdx.x * blockDim.x; + if (thread_index == 0) + { + buffer[index] = 0; + } +} + + +int getUpSweepBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernScanUpSweepPass); + } + return block_size; +} + +int getDownSweepBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernScanDownSweepPass); + } + return block_size; +} + +enum class ScanType{inclusive, exclusive}; + +void scanInPlaceDevice(int extended_n, int* dev_buffer) +{ + auto block_size_up = getUpSweepBlockSize(); + auto full_blocks_per_grid_up = fullBlocksPerGrid(extended_n, block_size_up); + auto block_size_down = getDownSweepBlockSize(); + auto full_blocks_per_grid_down = fullBlocksPerGrid(extended_n, block_size_down); + + // up sweep + auto pass_count = ilog2ceil(extended_n) - 1; + auto max_thread_index = extended_n; + for (int d = 0; d <= pass_count; d++) + { + max_thread_index = extended_n >> (d + 1); + full_blocks_per_grid_up = fullBlocksPerGrid(max_thread_index, block_size_up); // only launch needed threads + kernScanUpSweepPass << > >(max_thread_index, 1 << d, dev_buffer); + } + + // set the last element to zero + kernSetZero <<<1, 1 >>>(extended_n - 1, dev_buffer); + + + // down sweep + for (int d = pass_count; d >= 0; d--) + { + max_thread_index = extended_n >> (d + 1); + full_blocks_per_grid_down = fullBlocksPerGrid(max_thread_index, block_size_down); // only launch needed threads + kernScanDownSweepPass <<>>(max_thread_index, 1 << d, dev_buffer); + } +} + +void scan_implemention(int n, int *odata, const int *idata, ScanType scan_type) +{ + if (n <= 0) { return; } + + // DONE + // round n up to power of two + auto extended_n = std::size_t(1) << ilog2ceil(n); + // plus one for + auto final_buffer_length = + (scan_type == ScanType::inclusive) ? 
extended_n + 1: extended_n; + + int* dev_buffer; + cudaMalloc((void**)&dev_buffer, final_buffer_length * sizeof(*dev_buffer)); + checkCUDAError("cudaMalloc dev_buffer failed!"); + + // fill zero and copy to device buffer + cudaMemset(dev_buffer, 0, final_buffer_length * sizeof(*idata)); + cudaMemcpy(dev_buffer, idata, n * sizeof(*idata), cudaMemcpyHostToDevice); + + auto block_size_up = getUpSweepBlockSize(); + auto full_blocks_per_grid_up = fullBlocksPerGrid(extended_n, block_size_up); + auto block_size_down = getDownSweepBlockSize(); + auto full_blocks_per_grid_down = fullBlocksPerGrid(extended_n, block_size_down); + + timer().startGpuTimer(); + + // up sweep + auto pass_count = ilog2ceil(extended_n) - 1; + auto max_thread_index = extended_n; + for (int d = 0; d <= pass_count; d++) + { + max_thread_index = extended_n >> (d + 1); + full_blocks_per_grid_up = fullBlocksPerGrid(max_thread_index, block_size_up); // only launch needed threads + kernScanUpSweepPass << > >(max_thread_index, 1 << d, dev_buffer); + } + + if (scan_type == ScanType::inclusive) + { + // swap the last element of up sweep result and the real last element (0) + kernSwap <<<1, 1>>>(extended_n - 1, final_buffer_length - 1, dev_buffer); + } + else + { + // set the last element to zero + kernSetZero <<<1, 1>>>(extended_n - 1, dev_buffer); + } + + // down sweep + for (int d = pass_count; d >= 0; d--) + { + max_thread_index = extended_n >> (d + 1); + full_blocks_per_grid_down = fullBlocksPerGrid(max_thread_index, block_size_down); // only launch needed threads + kernScanDownSweepPass <<>>(max_thread_index, 1 << d, dev_buffer); + } + + timer().endGpuTimer(); + + if (scan_type == ScanType::inclusive) + { + // copy with offset to make it an inclusive scan + cudaMemcpy(odata, dev_buffer + 1, n * sizeof(*odata), cudaMemcpyDeviceToHost); + } + else + { + cudaMemcpy(odata, dev_buffer, n * sizeof(*odata), cudaMemcpyDeviceToHost); + } + + cudaFree(dev_buffer); +} /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. + * This just call scan_implementaion */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); +void scan(int n, int *odata, const int *idata) +{ + scan_implemention(n, odata, idata, ScanType::exclusive); } /** @@ -26,8 +242,63 @@ void scan(int n, int *odata, const int *idata) { * @returns The number of elements remaining after compaction. 
*/ int compact(int n, int *odata, const int *idata) { - // TODO - return -1; + if (n <= 0) { return 0; } + + // DONE + int* dev_idata; + cudaMalloc((void**)&dev_idata, n * sizeof(*dev_idata)); + checkCUDAError("cudaMalloc dev_idata failed!"); + cudaMemcpy(dev_idata, idata, n * sizeof(*idata), cudaMemcpyHostToDevice); + + auto extended_n = std::size_t(1) << ilog2ceil(n); // round up to power of two for scanning + + int* dev_bools; // TODO: I could use size_t* (but which will result in a lot of rewrite for the kernel functions) + cudaMalloc((void**)&dev_bools, extended_n * sizeof(*dev_bools)); + checkCUDAError("cudaMalloc dev_bools failed!"); + // fill zero and copy to boolean buffer + cudaMemset(dev_bools, 0, extended_n * sizeof(*dev_bools)); + + // reuse bool buffer as indices buffer + auto dev_indices = dev_bools; + + int* dev_odata; + cudaMalloc((void**)&dev_odata, n * sizeof(*dev_odata)); + checkCUDAError("cudaMalloc dev_odata failed!"); + + auto block_size_booleanize = Common::getMapToBooleanBlockSize(); + auto full_blocks_per_grid_booleanize = fullBlocksPerGrid(n, block_size_booleanize); + auto block_size_scatter = Common::getScatterBlocksize(); + auto full_blocks_per_grid_scatter = fullBlocksPerGrid(n, block_size_scatter); + + timer().startGpuTimer(); + + // map to boolean + Common::kernMapToBoolean <<>>(n, dev_bools, dev_idata); + + // exclusively scan the dev_bools buffer + scanInPlaceDevice(extended_n, dev_bools); + + // scatter + Common::kernScatter <<>>(n, dev_odata, dev_idata, dev_indices); + timer().endGpuTimer(); + + // calculate compacted length + using dev_indices_t = std::remove_reference::type; + dev_indices_t result_length; + cudaMemcpy(&result_length, dev_indices + n - 1, sizeof(result_length), cudaMemcpyDeviceToHost); + if (idata[n - 1]) + { + result_length += 1; + } + + // get compacted result + cudaMemcpy(odata, dev_odata, result_length * sizeof(*odata), cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_bools); // the same buffer as dev_indices + cudaFree(dev_odata); + + return result_length; } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 395ba10..e9c6b60 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -1,9 +1,19 @@ #pragma once +#include "common.h" + namespace StreamCompaction { namespace Efficient { + + StreamCompaction::Common::PerformanceTimer& timer(); + void scan(int n, int *odata, const int *idata); int compact(int n, int *odata, const int *idata); + + int getUpSweepBlockSize(); + int getDownSweepBlockSize(); + + void scanInPlaceDevice(int extended_n, int* dev_buffer); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 3d86b60..b7ea4de 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -1,19 +1,88 @@ +#include "naive.h" + +#include "common.h" #include #include -#include "common.h" -#include "naive.h" +#include namespace StreamCompaction { namespace Naive { -// TODO: __global__ + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + +// DONE: __global__ +__global__ void kernNaiveScanPass(int N, int offset, int* in_buffer, int* out_buffer) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= N) { return; } + + if (index >= offset) + { + out_buffer[index] = in_buffer[index - offset] + in_buffer[index]; + } + else + { + out_buffer[index] = in_buffer[index]; + } +} + +int getNaiveScanBlockSize() +{ + // not thread-safe + static 
int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernNaiveScanPass); + } + return block_size; +} + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); +void scan(int n, int *odata, const int *idata) +{ + if (n <= 0) { return; } + + auto block_size = getNaiveScanBlockSize(); + auto full_blocks_per_grid = (n + block_size - 1) / block_size; + + // DONE + int* dev_in_buffer; + cudaMalloc((void**)&dev_in_buffer, n * sizeof(*dev_in_buffer)); + checkCUDAError("cudaMalloc dev_in_buffer failed!"); + int* dev_out_buffer; + cudaMalloc((void**)&dev_out_buffer, n * sizeof(*dev_out_buffer)); + checkCUDAError("cudaMalloc dev_out_buffer failed!"); + + cudaMemcpy(dev_in_buffer, idata, n * sizeof(*idata), cudaMemcpyHostToDevice); + + timer().startGpuTimer(); + + auto cap = ilog2ceil(n); + int offset; + for (int d = 1; d <= cap; d++) + { + offset = 1 << (d - 1); + kernNaiveScanPass <<< full_blocks_per_grid, block_size >>>(n, offset, dev_in_buffer, dev_out_buffer); + std::swap(dev_in_buffer, dev_out_buffer); + } + std::swap(dev_in_buffer, dev_out_buffer); + + timer().endGpuTimer(); + + // defered copy because of exclusive scan + cudaMemcpy(odata + 1, dev_out_buffer, (n - 1) * sizeof(*odata), cudaMemcpyDeviceToHost); + odata[0] = 0; + + cudaFree(dev_in_buffer); + cudaFree(dev_out_buffer); } } diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 21152d6..2795ea6 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -1,7 +1,13 @@ #pragma once +#include "common.h" + namespace StreamCompaction { namespace Naive { + + StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + + int getNaiveScanBlockSize(); } } diff --git a/stream_compaction/radix_sort.cu b/stream_compaction/radix_sort.cu new file mode 100644 index 0000000..0e54fc1 --- /dev/null +++ b/stream_compaction/radix_sort.cu @@ -0,0 +1,187 @@ +#include "radix_sort.h" + +#include "efficient.h" +#include "common.h" +#include +#include +#include + +namespace StreamCompaction { +namespace RadixSort { + + +using StreamCompaction::Common::PerformanceTimer; +PerformanceTimer& timer() +{ + static PerformanceTimer timer; + return timer; +} + + +__global__ void kernComputeBArray(const int N, int bit_mask, bool *b, const int *idata) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= N) { return; } + + b[index] = ((idata[index] & bit_mask) != 0); +} + +__global__ void kernComputeEArray(const int N, int* e, const bool *b, const int extended_n) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= extended_n) { return; } + else if (index >= N) + { + // the length of e (or later f) needs to be power of 2 + e[index] = 0; + } + else + { + // e is int array since directly performing scan on it + e[index] = !b[index]; + } +} + +__global__ void kernComputeDArray(const int N, int* d, const int* f, const bool* b) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= N) { return; } + + // t[i] = i-f[i] + total_falses + // total_falses = e[n-1] + f[n-1] = !b[n-1] + f[n-1] + // d[i] = b[i]? 
t[i] : f[i] + if (b[index]) + { + d[index] = index - f[index] + !b[N - 1] + f[N - 1]; + } + else + { + d[index] = f[index]; + } +} + +__global__ void kernReshuffle(const int N, int* to_buffer, const int* from_buffer, const int* indices) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= N) { return; } + + to_buffer[indices[index]] = from_buffer[index]; +} + +int getComputeBArrayBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernComputeBArray); + } + return block_size; +} + +int getComputeEArrayBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernComputeEArray); + } + return block_size; +} + +int getComputeDArrayBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernComputeDArray); + } + return block_size; +} + +int getReshuffleBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernReshuffle); + } + return block_size; +} + +/** +* performs radix sort, which assumes all data is int type and between [0, max_value) +*/ +void radixSort(int* start, int* end, int max_value) +{ + auto n = static_cast(end - start); + + auto block_size_compute_b = getComputeBArrayBlockSize(); + auto full_blocks_per_grid_compute_b = fullBlocksPerGrid(n, block_size_compute_b); + + auto block_size_compute_e = getComputeBArrayBlockSize(); + auto full_blocks_per_grid_compute_e = fullBlocksPerGrid(n, block_size_compute_e); + + auto block_size_compute_d = getComputeBArrayBlockSize(); + auto full_blocks_per_grid_compute_d = fullBlocksPerGrid(n, block_size_compute_d); + + auto block_size_reshuffle = getReshuffleBlockSize(); + auto full_blocks_per_grid_reshuffle = fullBlocksPerGrid(n, block_size_compute_d); + + auto extended_n = std::size_t(1) << ilog2ceil(n); // round up to power of two for scanning + + int* dev_array; + cudaMalloc((void**)&dev_array, n * sizeof(*dev_array)); + checkCUDAError("cudaMalloc dev_array failed!"); + cudaMemcpy(dev_array, start, n * sizeof(*start), cudaMemcpyHostToDevice); + + int* dev_temp; + cudaMalloc((void**)&dev_temp, n * sizeof(*dev_temp)); + checkCUDAError("cudaMalloc dev_temp failed!"); + + bool* dev_b; // buffer which holds values of b + cudaMalloc((void**)&dev_b, n * sizeof(*dev_b)); + checkCUDAError("cudaMalloc dev_b failed!"); + + int* dev_ef; // buffer which holds values of e and f + cudaMalloc((void**)&dev_ef, extended_n * sizeof(*dev_ef)); + checkCUDAError("cudaMalloc dev_ef failed!"); + + int* dev_d; // buffer which holds values of d + cudaMalloc((void**)&dev_d, n * sizeof(*dev_d)); + checkCUDAError("cudaMalloc dev_d failed!"); + + timer().startGpuTimer(); + // input betweem [0, max_value) + // auto lsb_offset = 0; + auto msb_offset = ilog2ceil(max_value); + for (int offset = 0; offset < msb_offset; offset++) + { + auto bit_mask = 1 << offset; + kernComputeBArray <<>>(n, bit_mask, dev_b, dev_array); + kernComputeEArray <<>>(n, dev_ef, dev_b, extended_n); + + StreamCompaction::Efficient::scanInPlaceDevice(extended_n, dev_ef); + + kernComputeDArray <<>>(n, dev_d, dev_ef, dev_b); + + kernReshuffle <<>>(n, dev_temp, dev_array, dev_d); + std::swap(dev_temp, dev_array); + + } + timer().endGpuTimer(); + + cudaMemcpy(start, dev_array, n * sizeof(*start), cudaMemcpyDeviceToHost); + + 
cudaFree(dev_array); + cudaFree(dev_temp); + cudaFree(dev_b); + cudaFree(dev_ef); + cudaFree(dev_d); +} + +} +} diff --git a/stream_compaction/radix_sort.h b/stream_compaction/radix_sort.h new file mode 100644 index 0000000..c1e8c29 --- /dev/null +++ b/stream_compaction/radix_sort.h @@ -0,0 +1,15 @@ +#pragma once + +#include "common.h" + +namespace StreamCompaction { +namespace RadixSort { + StreamCompaction::Common::PerformanceTimer& timer(); + void radixSort(int* start, int* end, int max_value); + + int getComputeBArrayBlockSize(); + int getComputeEArrayBlockSize(); + int getComputeDArrayBlockSize(); + int getReshuffleBlockSize(); +} +} diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index d8dbb32..5890e01 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -3,20 +3,67 @@ #include #include #include +#include #include "common.h" #include "thrust.h" namespace StreamCompaction { namespace Thrust { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - // TODO use `thrust::exclusive_scan` + // DONE use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::device_vector thrust_idata(idata, idata + n); + thrust::device_vector thrust_odata(odata, odata + n); + + timer().startGpuTimer(); + + thrust::exclusive_scan(thrust_idata.begin(), thrust_idata.end(), thrust_odata.begin()); + + timer().endGpuTimer(); + + //thrust::host_vector thrust_host_odata = thrust_odata; + thrust::copy(thrust_odata.begin(), thrust_odata.end(), odata); } +void sort(int* start, int* end) +{ + thrust::device_vector thrust_data(start, end); + + timer().startGpuTimer(); + + // I found it calls thrust's radix sort. 
+ thrust::sort(thrust_data.begin(), thrust_data.end()); + + timer().endGpuTimer(); + + thrust::copy(thrust_data.begin(), thrust_data.end(), start); +} + +//void stableSort(int* start, int* end) +//{ +// thrust::device_vector thrust_data(start, end); +// +// timer().startGpuTimer(); +// +// thrust::sort(thrust_data.begin(), thrust_data.end()); +// +// timer().endGpuTimer(); +// +// thrust::copy(thrust_data.begin(), thrust_data.end(), start); +//} + } } diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h index 06707f3..6311ce7 100644 --- a/stream_compaction/thrust.h +++ b/stream_compaction/thrust.h @@ -1,7 +1,11 @@ #pragma once +#include "common.h" + namespace StreamCompaction { namespace Thrust { + StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + void sort(int* start, int* end); } } diff --git a/test_results/charts.xlsx b/test_results/charts.xlsx new file mode 100644 index 0000000..0beab61 Binary files /dev/null and b/test_results/charts.xlsx differ diff --git a/test_results/result_1.txt b/test_results/result_1.txt new file mode 100644 index 0000000..7d9051a --- /dev/null +++ b/test_results/result_1.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 17 27 18 32 43 0 40 49 25 4 22 14 39 ... 47 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 124.311ms (std::chrono Measured) + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657908 1643657955 ] +==== cpu scan, non-power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657825 1643657826 ] + elapesd time: 97.2587ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657908 1643657955 ] + elapesd time: 113.743ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657825 1643657826 ] + elapesd time: 113.622ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657908 1643657955 ] + elapesd time: 44.3556ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657825 1643657826 ] + elapesd time: 44.3666ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657908 1643657955 ] + elapesd time: 7.61741ms (CUDA Measured) + elapesd time: 15.6531ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657825 1643657826 ] + elapesd time: 7.6697ms (CUDA Measured) + elapesd time: 21.6846ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 0 1 0 1 2 0 3 1 2 0 2 2 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 
1 3 ] + elapesd time: 162.924ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 3 1 ] + elapesd time: 162.398ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 1 3 ] + elapesd time: 413.098ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 1 3 ] + elapesd time: 53.2036ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 3 1 ] + elapesd time: 53.1786ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 33 33 85 26 30 16 85 80 28 82 32 35 79 ... 47 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1519.96ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 415.289ms (std::chrono Measured) + elapesd time: 0.001536ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 412.907ms (std::chrono Measured) + elapesd time: 0.001408ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 417.083ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1520.99ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.361ms (CUDA Measured) + passed diff --git a/test_results/result_2.txt b/test_results/result_2.txt new file mode 100644 index 0000000..02cf932 --- /dev/null +++ b/test_results/result_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 29 21 40 33 12 4 41 16 15 13 11 1 44 ... 26 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 127.84ms (std::chrono Measured) + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643379015 1643379041 ] +==== cpu scan, non-power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643378970 1643378986 ] + elapesd time: 97.2591ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643379015 1643379041 ] + elapesd time: 113.736ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643378970 1643378986 ] + elapesd time: 113.642ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643379015 1643379041 ] + elapesd time: 44.4066ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643378970 1643378986 ] + elapesd time: 44.2758ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 
1643379015 1643379041 ] + elapesd time: 7.58675ms (CUDA Measured) + elapesd time: 15.629ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643378970 1643378986 ] + elapesd time: 7.5575ms (CUDA Measured) + elapesd time: 15.6058ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 0 0 3 1 2 0 1 2 1 2 3 0 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 1 ] + elapesd time: 159.916ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 2 ] + elapesd time: 158.894ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 1 ] + elapesd time: 398.104ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 1 ] + elapesd time: 53.2946ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 2 ] + elapesd time: 53.2884ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 98 25 21 41 13 15 98 3 80 92 74 15 5 ... 75 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1521.05ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 417.277ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 427.603ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.437ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1509.88ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.197ms (CUDA Measured) + passed diff --git a/test_results/result_3.txt b/test_results/result_3.txt new file mode 100644 index 0000000..e11f67c --- /dev/null +++ b/test_results/result_3.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 37 34 33 38 16 49 29 43 11 2 49 44 27 ... 5 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 115.903ms (std::chrono Measured) + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541232 1643541237 ] +==== cpu scan, non-power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541123 1643541155 ] + elapesd time: 84.6896ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 
1643541232 1643541237 ] + elapesd time: 113.689ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541123 1643541155 ] + elapesd time: 113.611ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541232 1643541237 ] + elapesd time: 44.3716ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541123 1643541155 ] + elapesd time: 44.3758ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541232 1643541237 ] + elapesd time: 7.61901ms (CUDA Measured) + elapesd time: 15.6295ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541123 1643541155 ] + elapesd time: 7.69434ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 3 3 3 2 0 0 0 2 2 3 3 0 0 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 3 ] + elapesd time: 163.452ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 2 ] + elapesd time: 163.926ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 3 ] + elapesd time: 415.607ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 3 ] + elapesd time: 53.2367ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 2 ] + elapesd time: 53.1997ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 54 8 0 33 21 33 41 75 33 79 60 64 18 ... 37 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1523.91ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 424.964ms (std::chrono Measured) + elapesd time: 0.001248ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 424.985ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.306ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1496.82ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 417.934ms (CUDA Measured) + passed diff --git a/test_results/result_before_optimization_1.txt b/test_results/result_before_optimization_1.txt new file mode 100644 index 0000000..f3484a4 --- /dev/null +++ b/test_results/result_before_optimization_1.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 38 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 163.936ms (std::chrono Measured) + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794086 1643794124 ] +==== cpu scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794020 1643794034 ] + elapesd time: 139.391ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794086 1643794124 ] + elapesd time: 113.742ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794020 1643794034 ] + elapesd time: 113.551ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794086 1643794124 ] + elapesd time: 120.501ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794020 1643794034 ] + elapesd time: 120.565ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794086 1643794124 ] + elapesd time: 7.72432ms (CUDA Measured) + elapesd time: 12.0047ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794020 1643794034 ] + elapesd time: 7.37165ms (CUDA Measured) + elapesd time: 8.0206ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 165.408ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 160.963ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 411.79ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 129.465ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 129.529ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 38 19 38 37 55 97 65 85 50 12 53 0 42 ... 88 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 1500.15ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 414.16ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.68ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 950.529ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1527.59ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 950.95ms (CUDA Measured) + passed diff --git a/test_results/result_before_optimization_2.txt b/test_results/result_before_optimization_2.txt new file mode 100644 index 0000000..25ee039 --- /dev/null +++ b/test_results/result_before_optimization_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 48 23 33 8 35 33 9 27 24 30 4 30 44 ... 44 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 160.422ms (std::chrono Measured) + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396363 1643396407 ] +==== cpu scan, non-power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396286 1643396311 ] + elapesd time: 97.76ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396363 1643396407 ] + elapesd time: 113.653ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396286 1643396311 ] + elapesd time: 113.603ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396363 1643396407 ] + elapesd time: 120.645ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396286 1643396311 ] + elapesd time: 120.59ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396363 1643396407 ] + elapesd time: 7.7217ms (CUDA Measured) + elapesd time: 11.9449ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396286 1643396311 ] + elapesd time: 7.7703ms (CUDA Measured) + elapesd time: 8.6162ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 2 3 0 3 2 0 1 1 2 1 0 1 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 161.489ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 
3 1 ] + elapesd time: 159.964ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 410.573ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 129.561ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 130.928ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 17 27 96 98 87 94 33 63 40 9 17 44 5 ... 75 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1515.05ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 415.197ms (std::chrono Measured) + elapesd time: 0.001056ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 419.66ms (std::chrono Measured) + elapesd time: 0.001056ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 951.73ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1534.72ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 960.985ms (CUDA Measured) + passed diff --git a/test_results/result_before_optimization_3.txt b/test_results/result_before_optimization_3.txt new file mode 100644 index 0000000..bc2a911 --- /dev/null +++ b/test_results/result_before_optimization_3.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 32 23 0 14 49 39 47 12 21 3 40 36 34 ... 17 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 162.452ms (std::chrono Measured) + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620618 1643620635 ] +==== cpu scan, non-power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620565 1643620567 ] + elapesd time: 139.879ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620618 1643620635 ] + elapesd time: 113.688ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620565 1643620567 ] + elapesd time: 113.833ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620618 1643620635 ] + elapesd time: 120.611ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620565 1643620567 ] + elapesd time: 120.695ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 
1643620618 1643620635 ] + elapesd time: 7.94493ms (CUDA Measured) + elapesd time: 12.4531ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620565 1643620567 ] + elapesd time: 8.14739ms (CUDA Measured) + elapesd time: 9.2116ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 2 2 0 0 1 3 3 0 3 0 2 3 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 165.433ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 161.903ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 411.094ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 129.599ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 129.598ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 48 79 34 59 54 23 60 44 43 31 1 6 24 ... 99 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1523.59ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 427.786ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 415.179ms (std::chrono Measured) + elapesd time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 954.781ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1529.83ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 963.745ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_1024.txt b/test_results/result_blocksize_1024.txt new file mode 100644 index 0000000..256ab91 --- /dev/null +++ b/test_results/result_blocksize_1024.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 8 18 37 41 15 25 27 8 36 28 13 40 24 ... 35 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 134.408ms (std::chrono Measured) + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625502 1643625537 ] +==== cpu scan, non-power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625408 1643625440 ] + elapesd time: 149.901ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 
1643625502 1643625537 ] + elapesd time: 113.867ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625408 1643625440 ] + elapesd time: 113.687ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625502 1643625537 ] + elapesd time: 44.2491ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625408 1643625440 ] + elapesd time: 44.3104ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625502 1643625537 ] + elapesd time: 7.73741ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625408 1643625440 ] + elapesd time: 7.74371ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 1 0 3 3 1 2 1 1 2 1 0 3 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 3 ] + elapesd time: 155.403ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 2 ] + elapesd time: 154.901ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 3 ] + elapesd time: 421.621ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 3 ] + elapesd time: 54.2043ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 2 ] + elapesd time: 54.1137ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 78 22 68 49 66 85 83 63 52 58 25 5 35 ... 84 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1522.95ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 429.389ms (std::chrono Measured) + elapesd time: 0.001184ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.915ms (std::chrono Measured) + elapesd time: 0.001216ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 419.691ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1516.39ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 416.676ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_128.txt b/test_results/result_blocksize_128.txt new file mode 100644 index 0000000..a483fc1 --- /dev/null +++ b/test_results/result_blocksize_128.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 128 + Block size for up-sweep: 128 + Block size for down-sweep: 128 + Block size for boolean mapping: 128 + Block size for scattering: 128 + Block sizes for radix sort: 128 128 128 128 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 46 41 34 49 43 33 6 10 32 34 49 28 4 ... 5 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 158.005ms (std::chrono Measured) + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517674 1643517679 ] +==== cpu scan, non-power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517602 1643517613 ] + elapesd time: 96.7737ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517674 1643517679 ] + elapesd time: 113.802ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517602 1643517613 ] + elapesd time: 113.643ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517674 1643517679 ] + elapesd time: 44.7409ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517602 1643517613 ] + elapesd time: 44.7841ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517674 1643517679 ] + elapesd time: 7.69293ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517602 1643517613 ] + elapesd time: 7.59293ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 0 3 3 2 1 1 3 3 2 0 0 3 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 2 ] + elapesd time: 163.926ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 1 ] + elapesd time: 163.929ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 2 ] + elapesd time: 412.474ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 2 ] + elapesd time: 53.6893ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 1 ] + elapesd time: 53.726ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 15 46 15 57 62 25 63 97 98 14 12 92 16 ... 4 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1503.52ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 414.828ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 417.44ms (std::chrono Measured) + elapesd time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.162ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1518.31ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.752ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_16.txt b/test_results/result_blocksize_16.txt new file mode 100644 index 0000000..047696c --- /dev/null +++ b/test_results/result_blocksize_16.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 16 + Block size for up-sweep: 16 + Block size for down-sweep: 16 + Block size for boolean mapping: 16 + Block size for scattering: 16 + Block sizes for radix sort: 16 16 16 16 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 43 42 6 30 14 3 5 12 45 14 24 29 33 ... 40 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 120.874ms (std::chrono Measured) + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602997 1643603037 ] +==== cpu scan, non-power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602942 1643602945 ] + elapesd time: 94.2564ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602997 1643603037 ] + elapesd time: 212.542ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602942 1643602945 ] + elapesd time: 197.078ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602997 1643603037 ] + elapesd time: 44.3224ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602942 1643602945 ] + elapesd time: 44.2977ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602997 1643603037 ] + elapesd time: 7.68762ms (CUDA Measured) + elapesd time: 15.6526ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602942 1643602945 ] + elapesd time: 7.66259ms (CUDA Measured) + elapesd time: 15.6029ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 1 3 1 0 0 3 3 3 2 3 3 3 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 3 2 ] + elapesd time: 162.406ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 2 3 ] + elapesd time: 161.925ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 3 2 ] + elapesd time: 407.585ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 
3 2 ] + elapesd time: 62.2979ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 2 3 ] + elapesd time: 62.2192ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 66 26 83 83 12 40 92 89 54 46 39 88 15 ... 88 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1498.49ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 426.322ms (std::chrono Measured) + elapesd time: 0.001472ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 411.847ms (std::chrono Measured) + elapesd time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 545.908ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1494.83ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 554.777ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_256.txt b/test_results/result_blocksize_256.txt new file mode 100644 index 0000000..7ae2b41 --- /dev/null +++ b/test_results/result_blocksize_256.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 256 + Block size for up-sweep: 256 + Block size for down-sweep: 256 + Block size for boolean mapping: 256 + Block size for scattering: 256 + Block sizes for radix sort: 256 256 256 256 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 13 11 23 13 16 28 2 9 25 46 3 0 5 ... 14 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 162.003ms (std::chrono Measured) + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295327 1643295341 ] +==== cpu scan, non-power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295226 1643295260 ] + elapesd time: 94.2996ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295327 1643295341 ] + elapesd time: 114.575ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295226 1643295260 ] + elapesd time: 113.791ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295327 1643295341 ] + elapesd time: 45.3902ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295226 1643295260 ] + elapesd time: 45.3006ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295327 1643295341 ] + elapesd time: 7.74496ms (CUDA Measured) + elapesd time: 12.5349ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 
1643295226 1643295260 ] + elapesd time: 7.92486ms (CUDA Measured) + elapesd time: 15.6367ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 0 0 0 3 0 2 2 3 3 1 0 1 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 3 ] + elapesd time: 163.934ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 2 ] + elapesd time: 162.403ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 3 ] + elapesd time: 411.574ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 3 ] + elapesd time: 53.4933ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 2 ] + elapesd time: 53.47ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 32 65 36 71 68 70 59 63 90 75 65 65 66 ... 14 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1495.48ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 432.425ms (std::chrono Measured) + elapesd time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 431.506ms (std::chrono Measured) + elapesd time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 417.745ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1519.54ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.848ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_32.txt b/test_results/result_blocksize_32.txt new file mode 100644 index 0000000..e6047aa --- /dev/null +++ b/test_results/result_blocksize_32.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 32 + Block size for up-sweep: 32 + Block size for down-sweep: 32 + Block size for boolean mapping: 32 + Block size for scattering: 32 + Block sizes for radix sort: 32 32 32 32 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 10 18 28 43 38 19 1 46 6 26 42 28 46 ... 22 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 162.953ms (std::chrono Measured) + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743873 1643743895 ] +==== cpu scan, non-power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743749 1643743780 ] + elapesd time: 96.7571ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743873 1643743895 ] + elapesd time: 133.116ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 
1643743749 1643743780 ] + elapesd time: 128.137ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743873 1643743895 ] + elapesd time: 43.4925ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743749 1643743780 ] + elapesd time: 43.3876ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743873 1643743895 ] + elapesd time: 7.7208ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743749 1643743780 ] + elapesd time: 7.51955ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 3 3 3 2 0 3 2 1 0 0 2 3 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 2 2 ] + elapesd time: 164.934ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 3 2 ] + elapesd time: 163.927ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 2 2 ] + elapesd time: 414.604ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 2 2 ] + elapesd time: 54.2754ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 3 2 ] + elapesd time: 54.2756ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 30 72 59 33 89 80 8 32 72 56 4 42 8 ... 21 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1493.97ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 425.906ms (std::chrono Measured) + elapesd time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 426.963ms (std::chrono Measured) + elapesd time: 0.001184ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 450.604ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1505.41ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 442.399ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_4.txt b/test_results/result_blocksize_4.txt new file mode 100644 index 0000000..4283668 --- /dev/null +++ b/test_results/result_blocksize_4.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 4 + Block size for up-sweep: 4 + Block size for down-sweep: 4 + Block size for boolean mapping: 4 + Block size for scattering: 4 + Block sizes for radix sort: 4 4 4 4 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 10 40 34 29 3 18 43 40 36 6 41 24 29 ... 
27 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 131.091ms (std::chrono Measured) + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543450 1643543477 ] +==== cpu scan, non-power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543353 1643543374 ] + elapesd time: 97.2582ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543450 1643543477 ] + elapesd time: 755.433ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543353 1643543374 ] + elapesd time: 756.067ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543450 1643543477 ] + elapesd time: 76.5717ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543353 1643543374 ] + elapesd time: 76.4466ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543450 1643543477 ] + elapesd time: 7.60525ms (CUDA Measured) + elapesd time: 15.6265ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543353 1643543374 ] + elapesd time: 7.60605ms (CUDA Measured) + elapesd time: 8.9781ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 3 2 1 3 3 2 2 1 0 0 0 0 0 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 164.4ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 1 3 ] + elapesd time: 152.59ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 411.575ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 143.747ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 1 3 ] + elapesd time: 143.446ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 33 43 61 82 69 55 63 67 95 87 56 32 61 ... 76 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1503.12ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.707ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 421.296ms (std::chrono Measured) + elapesd time: 0.001248ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1430.52ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1512.72ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 1424.39ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_512.txt b/test_results/result_blocksize_512.txt new file mode 100644 index 0000000..d2f91f2 --- /dev/null +++ b/test_results/result_blocksize_512.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 512 + Block size for up-sweep: 512 + Block size for down-sweep: 512 + Block size for boolean mapping: 512 + Block size for scattering: 512 + Block sizes for radix sort: 512 512 512 512 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 8 48 0 22 46 42 42 6 29 43 24 12 33 ... 3 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 113.379ms (std::chrono Measured) + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320954 1643320957 ] +==== cpu scan, non-power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320916 1643320916 ] + elapesd time: 94.7561ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320954 1643320957 ] + elapesd time: 114.593ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320916 1643320916 ] + elapesd time: 113.794ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320954 1643320957 ] + elapesd time: 44.3084ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320916 1643320916 ] + elapesd time: 44.2844ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320954 1643320957 ] + elapesd time: 7.65802ms (CUDA Measured) + elapesd time: 15.6002ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320916 1643320916 ] + elapesd time: 7.88176ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 1 1 0 1 2 3 0 1 1 3 0 2 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 164.929ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 164.423ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 412.597ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 53.2704ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 53.2295ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 31 51 77 75 62 11 62 83 56 75 89 70 65 ... 51 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1523.42ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 421.785ms (std::chrono Measured) + elapesd time: 0.001632ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 425.961ms (std::chrono Measured) + elapesd time: 0.001184ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.947ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1494.14ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.928ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_64.txt b/test_results/result_blocksize_64.txt new file mode 100644 index 0000000..33d7ec0 --- /dev/null +++ b/test_results/result_blocksize_64.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 64 + Block size for up-sweep: 64 + Block size for down-sweep: 64 + Block size for boolean mapping: 64 + Block size for scattering: 64 + Block sizes for radix sort: 64 64 64 64 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 30 36 13 2 8 41 42 35 42 40 47 44 18 ... 49 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 118.412ms (std::chrono Measured) + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522704 1643522753 ] +==== cpu scan, non-power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522605 1643522641 ] + elapesd time: 94.2541ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522704 1643522753 ] + elapesd time: 114.7ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522605 1643522641 ] + elapesd time: 113.626ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522704 1643522753 ] + elapesd time: 44.7172ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522605 1643522641 ] + elapesd time: 44.5198ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522704 1643522753 ] + elapesd time: 7.42698ms (CUDA Measured) + elapesd time: 15.6355ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522605 1643522641 ] + elapesd time: 7.6359ms (CUDA Measured) + elapesd time: 15.6267ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 3 3 0 3 1 2 3 2 1 1 1 3 0 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 1 2 ] + elapesd time: 162.432ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 1 2 ] + elapesd time: 162.447ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 1 2 ] + elapesd time: 409.609ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 
1 2 ] + elapsed time: 53.5306ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 1 2 ] + elapsed time: 53.686ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 50 40 76 42 10 83 16 71 8 69 9 59 29 ... 30 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1528.95ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 425.366ms (std::chrono Measured) + elapsed time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 426.134ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 418.522ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1516.8ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 418.378ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_8.txt b/test_results/result_blocksize_8.txt new file mode 100644 index 0000000..74dbe54 --- /dev/null +++ b/test_results/result_blocksize_8.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 8 + Block size for up-sweep: 8 + Block size for down-sweep: 8 + Block size for boolean mapping: 8 + Block size for scattering: 8 + Block sizes for radix sort: 8 8 8 8 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 8 14 36 6 10 19 34 1 39 40 27 32 18 ... 35 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 107.325ms (std::chrono Measured) + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593142 1643593177 ] +==== cpu scan, non-power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593071 1643593071 ] + elapsed time: 100.297ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593142 1643593177 ] + elapsed time: 379.897ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593071 1643593071 ] + elapsed time: 379.789ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593142 1643593177 ] + elapsed time: 52.1942ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593071 1643593071 ] + elapsed time: 52.1319ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593142 1643593177 ] + elapsed time: 7.71994ms (CUDA Measured) + elapsed time: 15.6528ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 
1643593071 1643593071 ] + elapsed time: 7.41206ms (CUDA Measured) + elapsed time: 21.3408ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 3 1 2 2 0 2 1 2 0 2 0 2 3 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 2 2 ] + elapsed time: 165.426ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 1 2 ] + elapsed time: 165.433ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 2 2 ] + elapsed time: 411.595ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 2 2 ] + elapsed time: 86.2233ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 1 2 ] + elapsed time: 86.2171ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 81 67 63 60 8 38 71 60 66 22 92 90 68 ... 83 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1485.43ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 400.919ms (std::chrono Measured) + elapsed time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 418.473ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 818.39ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1515.04ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 818.379ms (CUDA Measured) + passed diff --git a/test_results/result_radix_max_100.txt b/test_results/result_radix_max_100.txt new file mode 100644 index 0000000..2124dec --- /dev/null +++ b/test_results/result_radix_max_100.txt @@ -0,0 +1,101 @@ +* THIS TEST RAN ON A DIFFERENT MACHINE: + Windows 7, Xeon(R) E5-1630 @ 3.70GHz 32GB, GTX 1070 8192MB (Moore 103 SigLab) + +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 19 37 1 47 44 49 35 24 1 15 49 9 28 ... 37 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 94ms (std::chrono Measured) + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783866 821783903 ] +==== cpu scan, non-power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783806 821783812 ] + elapsed time: 87ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783866 821783903 ] + elapsed time: 38.0587ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 
821783806 821783812 ] + elapsed time: 38.0423ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783866 821783903 ] + elapsed time: 15.0559ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783806 821783812 ] + elapsed time: 15.0186ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783866 821783903 ] + elapsed time: 9.03926ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783806 821783812 ] + elapsed time: 2.79245ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 1 2 0 2 3 0 3 3 2 3 3 3 1 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 86ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 87ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 236ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 18.2723ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 18.2446ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 +Max value: 100 + [ 76 34 12 37 5 35 59 88 6 87 3 47 20 ... 8 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 917ms (std::chrono Measured) +==== thrust::sort (which calls Thrust's radix sort), power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 20.2813ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 143.992ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 923ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 143.374ms (CUDA Measured) + passed diff --git a/test_results/result_radix_max_1000000000.txt b/test_results/result_radix_max_1000000000.txt new file mode 100644 index 0000000..0efa4b5 --- /dev/null +++ b/test_results/result_radix_max_1000000000.txt @@ -0,0 +1,101 @@ +* THIS TEST RAN ON A DIFFERENT MACHINE: + Windows 7, Xeon(R) E5-1630 @ 3.70GHz 32GB, GTX 1070 8192MB (Moore 103 SigLab) + +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 13 29 47 7 2 28 44 49 35 46 2 49 9 ... 37 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 93ms (std::chrono Measured) + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 
821752752 821752789 ] +==== cpu scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 87ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 38.1036ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 38.112ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 15.0276ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 15.0576ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 8.90237ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 2.74368ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 1 1 3 1 1 3 0 2 3 2 0 1 0 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 83ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 3 2 ] + elapsed time: 84ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 237ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 18.2364ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 3 2 ] + elapsed time: 18.2344ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 +Max value: 1000000000 + [ 21520 17257 9407 8648 31232 11282 11169 22994 15890 9350 22656 25538 29919 ... 23658 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 2173ms (std::chrono Measured) +==== thrust::sort (which calls Thrust's radix sort), power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 20.6702ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 628.72ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 2153ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
32767 32767 ] + elapsed time: 617.616ms (CUDA Measured) + passed diff --git a/test_results/result_size_12_power_of_2.txt b/test_results/result_size_12_power_of_2.txt new file mode 100644 index 0000000..715d4a7 --- /dev/null +++ b/test_results/result_size_12_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 4096 +Array size (non-power of two): 4093 + [ 33 37 5 39 24 31 22 36 22 21 32 43 4 ... 9 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 101088 101097 ] +==== cpu scan, non-power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 100984 101004 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 101088 101097 ] + elapsed time: 0.047872ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 100984 101004 ] + elapsed time: 0.046368ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 101088 101097 ] + elapsed time: 0.110784ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 100984 101004 ] + elapsed time: 0.1096ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 101088 101097 ] + elapsed time: 0.028192ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 100984 101004 ] + elapsed time: 0.019264ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 4096 +Array size (non-power of two): 4093 + [ 3 1 3 1 2 1 0 2 0 1 0 1 2 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 1 1 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 2 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 1 1 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 1 1 ] + elapsed time: 0.118368ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 2 3 ] + elapsed time: 0.117536ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 4096 +Array size (non-power of two): 4093 +Max value: 100 + [ 83 37 55 89 74 81 72 86 72 21 32 93 54 ... 9 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0.883328ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0.881664ms (CUDA Measured) + passed diff --git a/test_results/result_size_16_power_of_2.txt b/test_results/result_size_16_power_of_2.txt new file mode 100644 index 0000000..6bce6c6 --- /dev/null +++ b/test_results/result_size_16_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 65536 +Array size (non-power of two): 65533 + [ 48 48 40 36 40 29 35 9 23 32 24 24 29 ... 19 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607161 1607180 ] +==== cpu scan, non-power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607103 1607131 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607161 1607180 ] + elapsed time: 0.09216ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607103 1607131 ] + elapsed time: 0.08976ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607161 1607180 ] + elapsed time: 0.180064ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607103 1607131 ] + elapsed time: 0.179712ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607161 1607180 ] + elapsed time: 0.166432ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607103 1607131 ] + elapsed time: 0.170848ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 65536 +Array size (non-power of two): 65533 + [ 0 2 2 0 2 3 1 1 3 0 2 2 3 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 2 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 3 2 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 2 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 
2 3 ] + elapsed time: 0.193408ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 3 2 ] + elapsed time: 0.193824ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 65536 +Array size (non-power of two): 65533 +Max value: 100 + [ 48 98 90 36 90 79 85 9 23 32 74 74 79 ... 19 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1.44288ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1.44154ms (CUDA Measured) + passed diff --git a/test_results/result_size_20_power_of_2.txt b/test_results/result_size_20_power_of_2.txt new file mode 100644 index 0000000..33945cf --- /dev/null +++ b/test_results/result_size_20_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 1048576 +Array size (non-power of two): 1048573 + [ 46 20 9 39 15 9 5 45 33 48 37 45 35 ... 16 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669615 25669631 ] +==== cpu scan, non-power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669544 25669584 ] + elapsed time: 1.5056ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669615 25669631 ] + elapsed time: 1.29402ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669544 25669584 ] + elapsed time: 1.29389ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669615 25669631 ] + elapsed time: 0.818528ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669544 25669584 ] + elapsed time: 0.816128ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669615 25669631 ] + elapsed time: 0.30768ms (CUDA Measured) + elapsed time: 15.6263ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 
25669544 25669584 ] + elapsed time: 0.304096ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 1048576 +Array size (non-power of two): 1048573 + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 1 2 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 3 2 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 1 2 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 1 2 ] + elapsed time: 0.966944ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 3 2 ] + elapsed time: 0.963168ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 1048576 +Array size (non-power of two): 1048573 +Max value: 100 + [ 46 70 59 39 15 9 55 95 83 98 37 95 85 ... 66 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 22.1649ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001248ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 7.61686ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 15.6456ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 7.64022ms (CUDA Measured) + passed diff --git a/test_results/result_size_24_power_of_2.txt b/test_results/result_size_24_power_of_2.txt new file mode 100644 index 0000000..314dfdf --- /dev/null +++ b/test_results/result_size_24_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 16777216 +Array size (non-power of two): 16777213 + [ 18 9 5 44 10 43 43 30 40 12 33 14 21 ... 6 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 23.5625ms (std::chrono Measured) + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025443 411025449 ] +==== cpu scan, non-power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025387 411025398 ] + elapsed time: 41.6108ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025443 411025449 ] + elapsed time: 25.7094ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 
411025387 411025398 ] + elapsed time: 25.6795ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025443 411025449 ] + elapsed time: 11.3379ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025387 411025398 ] + elapsed time: 11.165ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025443 411025449 ] + elapsed time: 2.00646ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025387 411025398 ] + elapsed time: 2.12765ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 16777216 +Array size (non-power of two): 16777213 + [ 1 3 3 3 2 0 1 3 2 0 3 0 3 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 3 ] + elapsed time: 39.0743ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 1 ] + elapsed time: 38.6241ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 3 ] + elapsed time: 105.781ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 3 ] + elapsed time: 13.3034ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 1 ] + elapsed time: 13.3877ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 16777216 +Array size (non-power of two): 16777213 +Max value: 100 + [ 21 39 19 39 74 20 5 3 2 64 35 8 59 ... 51 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 378.025ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 109.329ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 100.279ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 105.222ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 385.305ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapsed time: 104.685ms (CUDA Measured) + passed diff --git a/test_results/result_size_26_power_of_2.txt b/test_results/result_size_26_power_of_2.txt new file mode 100644 index 0000000..e9bf9c2 --- /dev/null +++ b/test_results/result_size_26_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 16 31 42 47 35 41 12 16 18 28 46 35 27 ... 16 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 159.925ms (std::chrono Measured) + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670879 1643670895 ] +==== cpu scan, non-power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670804 1643670832 ] + elapsed time: 133.354ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670879 1643670895 ] + elapsed time: 113.61ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670804 1643670832 ] + elapsed time: 113.677ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670879 1643670895 ] + elapsed time: 44.4511ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670804 1643670832 ] + elapsed time: 44.4368ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670879 1643670895 ] + elapsed time: 7.72387ms (CUDA Measured) + elapsed time: 15.6024ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670804 1643670832 ] + elapsed time: 7.77056ms (CUDA Measured) + elapsed time: 15.6503ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 0 1 1 1 3 0 2 1 2 2 2 0 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 1 1 ] + elapsed time: 155.406ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 3 2 ] + elapsed time: 153.902ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 1 1 ] + elapsed time: 422.524ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 1 1 ] + elapsed time: 53.2659ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 3 2 ] + elapsed time: 53.1911ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 35 85 73 87 36 84 69 53 83 58 58 50 88 ... 15 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapsed time: 1481.8ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 434.399ms (std::chrono Measured) + elapsed time: 0.001184ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 423.933ms (std::chrono Measured) + elapsed time: 0.001248ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 419.345ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1519.37ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 416.354ms (CUDA Measured) + passed diff --git a/test_results/result_size_8_power_of_2.txt b/test_results/result_size_8_power_of_2.txt new file mode 100644 index 0000000..81dcd4d --- /dev/null +++ b/test_results/result_size_8_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 256 +Array size (non-power of two): 253 + [ 4 13 47 25 0 4 9 20 40 2 18 9 28 ... 43 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 6036 6079 ] +==== cpu scan, non-power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 5984 5984 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 6036 6079 ] + elapsed time: 0.031232ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 5984 5984 ] + elapsed time: 0.030816ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 6036 6079 ] + elapsed time: 0.064288ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 5984 5984 ] + elapsed time: 0.06384ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 6036 6079 ] + elapsed time: 0.02128ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 5984 5984 ] + elapsed time: 0.013152ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 256 +Array size (non-power of two): 253 + [ 2 1 3 1 2 2 1 0 0 2 2 1 0 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 3 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 2 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 3 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 
3 3 ] + elapsed time: 0.072192ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 2 3 ] + elapsed time: 0.071296ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 256 +Array size (non-power of two): 253 +Max value: 100 + [ 54 13 47 25 50 54 9 20 40 2 18 9 28 ... 43 0 ] +==== std::sort, power-of-two ==== + [ 0 0 1 1 1 2 2 2 2 2 3 3 4 ... 98 98 ] + elapsed time: 0ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 1 1 1 2 2 2 2 2 3 3 4 ... 98 98 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 1 1 1 2 2 2 2 2 3 3 4 ... 98 98 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 1 1 1 2 2 2 2 2 3 3 4 ... 98 98 ] + elapsed time: 0.603968ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 1 1 1 2 2 2 2 2 3 3 4 4 ... 98 98 ] + elapsed time: 0ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 1 1 1 2 2 2 2 2 3 3 4 4 ... 98 98 ] + elapsed time: 0.605184ms (CUDA Measured) + passed