diff --git a/CMakeLists.txt b/CMakeLists.txt
index 96a4782..44c6729 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -9,6 +9,7 @@ set(CMAKE_CXX_STANDARD 11)
 list(APPEND CUDA_NVCC_FLAGS_DEBUG -G -g)
 list(APPEND CUDA_NVCC_FLAGS_RELWITHDEBUGINFO -lineinfo)
+list(APPEND CUDA_NVCC_FLAGS "-std=c++11") # Crucial magic for CUDA linking

 find_package(Threads REQUIRED)
diff --git a/README.md b/README.md
index b71c458..ef0cbdc 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,468 @@ CUDA Stream Compaction

 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Ruoyu Fan
+* Tested on: Windows 10 x64, i7-6700K @ 4.00GHz 16GB, GTX 970 4096MB (girlfriend's machine)
+  * compiled with Visual Studio 2013 and CUDA 7.5

-### (TODO: Your README)
+NOTE: if the program crashes when it reaches the naive scan test, try reducing the array sizes in `main.cpp` (`SIZE` and `SORT_SIZE`).

-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![preview](/screenshots/preview_optimized.gif)

### Things I have done

* Implemented __CPU scan and compaction__, __GPU naive scan__, __GPU work-efficient scan__, __GPU work-efficient compaction__, and __GPU radix sort (extra)__, and compared my scan algorithms against the Thrust implementation.

* I optimized my work-efficient scan, raising its __speed to 270%__ of my original implementation; please refer to the __Optimization__ section.

* I also wrote an __inclusive version__ of the __work-efficient scan__ - because I misunderstood the requirement at first! The inclusive method creates a buffer that is 1 element larger and swaps the last element (0) with the second-to-last element before down-sweeping. Although I corrected my implementation to an exclusive scan, the inclusive scan can still be invoked by passing `ScanType::inclusive` to the `scan_implemention` method in `efficient.cu`.

* __Radix sort__ assumes inputs are in the range `[0, a_given_maximum)`. I compared my radix sort with `std::sort` and Thrust's unstable and stable sorts.

* I added a helper class `PerformanceTimer` in `common.h` which is used for performance measurement.


### Original Questions
```
* Roughly optimize the block sizes of each of your implementations for minimal
  run time on your GPU.
  * (You shouldn't compare unoptimized implementations to each other!)

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
  Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
  (with array size on the independent axis).
  * You should use CUDA events for timing GPU code. Be sure **not** to include
    any *initial/final* memory operations (`cudaMalloc`, `cudaMemcpy`) in your
    performance measurements, for comparability. Note that CUDA events cannot
    time CPU code.
  * You can use the C++11 `std::chrono` API for timing CPU code. See this
    [Stack Overflow answer](http://stackoverflow.com/a/23000049) for an example.
    Note that `std::chrono` may not provide high-precision timing. If it does
    not, you can either use it to time many iterations, or use another method.
  * To guess at what might be happening inside the Thrust implementation (e.g.
    allocation, memory copy), take a look at the Nsight timeline for its
    execution. Your analysis here doesn't have to be detailed, since you aren't
    even looking at the code for the implementation.
```

* Please refer to the __Performance__ section.

```
* Write a brief explanation of the phenomena you see here.
  * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
    it different for each implementation?
```

* ~~I noticed that I couldn't get a good measurement for __Thrust__'s scan and sort. I had trouble measuring `thrust::exclusive_scan` with `std::chrono`, and while I could use `std::chrono` to measure the scan, the results from CUDA events seemed off.~~ I had passed a host array to Thrust, so in my original tests Thrust was using a CPU sort algorithm; this is fixed, and the updated results are in `result_radix_max_100.txt` and `result_radix_max_100000000.txt`.

* When the maximum possible value (and therefore the MSB position) is increased, the run time of `std::sort` also increases significantly along with that of radix sort, which I did not expect. Does bit length also affect the cost of reads/writes or other operations, or does the `std::sort` implementation use bit-level information?

* I think the bottleneck at small block sizes is the GPU's warp size: blocks smaller than a warp (32 threads) leave lanes idle.

* My original work-efficient scan implementation was slower than the CPU scan. I then optimized it by __minimizing wasted threads__, and it now runs __faster than the CPU scan__; please refer to the __Optimization__ section below.

```
* Paste the output of the test program into a triple-backtick block in your
  README.
  * If you add your own tests (e.g. for radix sort or to test additional corner
    cases), be sure to mention it explicitly.

These questions should help guide you in performance analysis on future
assignments, as well.
```

* The tests are done with arrays of lengths `2^26` (67108864) and `2^26-3` (67108861). The array generation uses the current time as the random seed.

* I added tests for __radix sort__, which compare against `std::sort` as well as __Thrust__'s `thrust::sort`.


### Performance

#### Blocksize

When the block size is smaller than 16, my application suffers a performance drop, which is recorded in the `test_results` folder. I decided to just use `cudaOccupancyMaxPotentialBlockSize` for each kernel, which comes out to 1024 for every kernel on my machine (a minimal sketch of the query follows the table below).

![chart_blocksize](/screenshots/chart_blocksize.png)

| Block size | Naïve Scan (ms) | Work-efficient Scan (ms) | Reference CPU Scan (ms) |
|------------|-----------------|--------------------------|-------------------------|
| 4          | 755.433         | 76.5717                  | 134.408                 |
| 8          | 379.897         | 52.1942                  | 134.408                 |
| 16         | 212.542         | 44.3224                  | 134.408                 |
| 32         | 133.116         | 43.4925                  | 134.408                 |
| 64         | 114.7           | 44.7172                  | 134.408                 |
| 128        | 113.802         | 44.7841                  | 134.408                 |
| 256        | 114.575         | 45.3902                  | 134.408                 |
| 512        | 114.593         | 44.3084                  | 134.408                 |
| 1024       | 113.867         | 44.2941                  | 134.408                 |
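
These block sizes are not hand-tuned constants; they come from CUDA's occupancy calculator. As a minimal sketch (roughly what `calculateBlockSizeForDeviceFunction` in `common.h` does when no fixed block size is forced; the helper name `pickBlockSize` is just for illustration):

```c++
// Query the block size that maximizes occupancy for a given kernel.
// On the GPUs used for these tests this returns 1024 for every kernel.
template <typename Kernel>
int pickBlockSize(Kernel kernel)
{
    int min_grid_size = 0;  // smallest grid that still reaches full occupancy (unused here)
    int block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, kernel);
    return block_size;
}
```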
#### Array length

All the tests are done with a block size of 1024. The maximum possible value for sorting is 100.

__Scan__: this work-efficient scan implementation is faster than the CPU scan on large inputs, but slower than Thrust's.

![chart_scan](/screenshots/chart_scan.png)

| Input array length (power of 2) | CPU Scan (ms) | Naïve Scan (ms) | Work-efficient scan (ms) | Thrust scan (ms, CUDA event) |
|---------------------------------|---------------|-----------------|--------------------------|------------------------------|
| 12                              | 0             | 0.047872        | 0.110784                 | 0.028192                     |
| 16                              | 0             | 0.09216         | 0.180064                 | 0.166432                     |
| 20                              | 0             | 1.29402         | 0.818528                 | 0.30768                      |
| 24                              | 23.5625       | 25.7094         | 11.3379                  | 2.00646                      |
| 26                              | 159.925       | 133.61          | 44.4511                  | 7.7283                       |

__Compaction__: this work-efficient compaction implementation is faster than the CPU versions.

![chart_compact](/screenshots/chart_compact.png)

| Input array length (power of 2) | CPU compact (ms) | CPU scan compact (ms) | Work-efficient compact (ms) |
|---------------------------------|------------------|-----------------------|-----------------------------|
| 12                              | 0                | 0                     | 0.118368                    |
| 16                              | 0                | 0                     | 0.193408                    |
| 20                              | 0                | 0                     | 0.966944                    |
| 24                              | 39.0743          | 38.6241               | 13.3034                     |
| 26                              | 155.406          | 422.524               | 53.2659                     |

__Sort__: radix sort on the GPU is faster than `std::sort`.

![chart_sort](/screenshots/chart_sort.png)

| Input array length (power of 2) | std::sort (ms) | Radix sort (ms) |
|---------------------------------|----------------|-----------------|
| 12                              | 0              | 0.883328        |
| 16                              | 0              | 1.44288         |
| 20                              | 22.1649        | 7.61686         |
| 24                              | 378.025        | 105.222         |
| 26                              | 1481.8         | 419.345         |

#### Data maximum value and radix sort

__NOTE: THIS TEST RUNS ON A DIFFERENT MACHINE: Windows 7, Xeon(R) E5-1630 @ 3.70GHz 32GB, GTX 1070 8192MB (Moore 103 SigLab)__

All the tests are done with a block size of 1024; the array length is ~~67108864~~ 33554432.

| Max value  | std::sort (ms) | Radix sort (ms) | thrust::sort (ms) |
|------------|----------------|-----------------|-------------------|
| 100        | 917            | 143.992         | 20.2813           |
| 1000000000 | 2173           | 1023.97         | 20.6702           |

When the maximum possible value is increased, the run times of `std::sort` and Thrust's sort increase along with that of radix sort.

__I peeked at Thrust's internal function calls and found that `thrust::sort` uses a radix sort algorithm.__


### Sample Output

```
* THIS TEST RAN ON A DIFFERENT MACHINE:
  Windows 7, Xeon(R) E5-1630 @ 3.70GHz 32GB, GTX 1070 8192MB (Moore 103 SigLab)

CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan)
  Block size for naive scan: 1024
  Block size for up-sweep: 1024
  Block size for down-sweep: 1024
  Block size for boolean mapping: 1024
  Block size for scattering: 1024
  Block sizes for radix sort: 1024 1024 1024 1024

****************
** SCAN TESTS **
****************
Array size (power of two): 33554432
Array size (non-power of two): 33554429
    [ 13 29 47 7 2 28 44 49 35 46 2 49 9 ... 37 0 ]
==== cpu scan, power-of-two ====
    elapsed time: 93ms (std::chrono Measured)
    [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ]
==== cpu scan, non-power-of-two ====
    [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ]
    elapsed time: 87ms (std::chrono Measured)
    passed
==== naive scan, power-of-two ====
    [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ]
    elapsed time: 38.1036ms (CUDA Measured)
    passed
==== naive scan, non-power-of-two ====
    [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ...
821752661 821752701 ] + elapsed time: 38.112ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 15.0276ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 15.0576ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 8.90237ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 2.74368ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 1 1 3 1 1 3 0 2 3 2 0 1 0 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 83ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 3 2 ] + elapsed time: 84ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 237ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 18.2364ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 3 2 ] + elapsed time: 18.2344ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 +Max value: 1000000000 + [ 21520 17257 9407 8648 31232 11282 11169 22994 15890 9350 22656 25538 29919 ... 23658 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 2173ms (std::chrono Measured) +==== thrust::sort (which calls Thrust's radix sort), power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 20.6702ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 628.72ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 2153ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 617.616ms (CUDA Measured) + passed +``` + + +### Optimization + +#### Run less threads for work-efficient scan + +For work-efficient scan, my original implementation was using the same of amount of threads for every up sweep and down sweeps. Then I optimized it by using only necessary amount of threads for each iteration. + +The performance for scanning an array of length 67108861 using work-efficient approach boosted __from ~120.5ms to ~44.4ms__, which is __270% speed__ of my original approach. 
You can see the data in the files under the __test_results/__ folder.

![chart_scan_optimization](/screenshots/chart_scan_optimization.png)

Original implementation (in which `(index + 1) % (add_distance * 2) == 0` is `false` for most threads, so __those threads were wasted__):

```c++
// launches one thread per element, but most threads fail the modulo test and do nothing
__global__ void kernScanUpSweepPass(int N, int add_distance, int* buffer)
{
    auto index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= N) { return; }

    if ((index + 1) % (add_distance * 2) == 0)
    {
        buffer[index] = buffer[index] + buffer[index - add_distance];
    }
}

__global__ void kernScanDownSweepPass(int N, int distance, int* buffer)
{
    auto index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= N) { return; }

    if ((index + 1) % (distance * 2) == 0)
    {
        auto temp = buffer[index - distance];
        buffer[index - distance] = buffer[index];
        buffer[index] = temp + buffer[index];
    }
}
```

New implementation:

```c++
// optimized: the host code launches only the necessary number of threads
__global__ void kernScanUpSweepPass(int max_thread_index, int add_distance, int* buffer)
{
    auto tindex = threadIdx.x + blockIdx.x * blockDim.x;

    if (tindex >= max_thread_index) { return; }

    // I ran into an overflow problem with an `index < N` guard here,
    // so the thread index is bounded by max_thread_index instead
    size_t index = (add_distance * 2) * (1 + tindex) - 1;

    buffer[index] = buffer[index] + buffer[index - add_distance];
}

// optimized: the host code launches only the necessary number of threads
__global__ void kernScanDownSweepPass(int max_thread_index, int distance, int* buffer)
{
    auto tindex = threadIdx.x + blockIdx.x * blockDim.x;

    if (tindex >= max_thread_index) { return; }

    size_t index = (distance * 2) * (1 + tindex) - 1;

    auto temp = buffer[index - distance];
    buffer[index - distance] = buffer[index];
    buffer[index] = temp + buffer[index];
}
```

In the host code I calculate the number of threads needed, and hence the maximum thread index, for every up-sweep and down-sweep pass (a condensed sketch of this dispatch loop appears at the end of this README).

Originally I still passed the buffer length as the first parameter and guarded each thread with the condition `(distance * 2) * (1 + tindex) - 1 > N`. That multiplication can go out of range (it did even with `size_t`) and produce weird results; it took me two hours to debug. Lesson learned: in the future I will write such guards as `n > b/a` rather than `a*n > b`.

#### Helper class for performance measurement

I created a RAII `PerformanceTimer` class for performance measurement.
It looks like this:

```c++
/**
* This class is used for timing performance.
* It is uncopyable and unmovable.
*/
class PerformanceTimer
{
public:
    PerformanceTimer()
    {
        cudaEventCreate(&event_start);
        cudaEventCreate(&event_end);
    }

    ~PerformanceTimer()
    {
        cudaEventDestroy(event_start);
        cudaEventDestroy(event_end);
    }

    void startCpuTimer()
    {
        if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
        cpu_timer_started = true;

        time_start_cpu = std::chrono::high_resolution_clock::now();
    }

    void endCpuTimer()
    {
        time_end_cpu = std::chrono::high_resolution_clock::now();

        if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }

        std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
        prev_elapsed_time_cpu_milliseconds =
            static_cast<float>(duro.count());

        cpu_timer_started = false;
    }

    void startGpuTimer()
    {
        if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
        gpu_timer_started = true;

        cudaEventRecord(event_start);
    }

    void endGpuTimer()
    {
        cudaEventRecord(event_end);
        cudaEventSynchronize(event_end);

        if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }

        cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
        gpu_timer_started = false;
    }

    float getCpuElapsedTimeForPreviousOperation() noexcept
    {
        return prev_elapsed_time_cpu_milliseconds;
    }

    float getGpuElapsedTimeForPreviousOperation() noexcept
    {
        return prev_elapsed_time_gpu_milliseconds;
    }

    // remove copy and move functions
    PerformanceTimer(const PerformanceTimer&) = delete;
    PerformanceTimer(PerformanceTimer&&) = delete;
    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
    PerformanceTimer& operator=(PerformanceTimer&&) = delete;

private:
    cudaEvent_t event_start = nullptr;
    cudaEvent_t event_end = nullptr;

    using time_point_t = std::chrono::high_resolution_clock::time_point;
    time_point_t time_start_cpu;
    time_point_t time_end_cpu;

    bool cpu_timer_started = false;
    bool gpu_timer_started = false;

    float prev_elapsed_time_cpu_milliseconds = 0.f;
    float prev_elapsed_time_gpu_milliseconds = 0.f;
};
```

And inside a module I have:

```c++
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
    // not thread-safe
    static PerformanceTimer timer;
    return timer;
}
```

Therefore, I can use

```c++
void someFunc()
{
    allocateYourBuffers();

    timer().startGpuTimer();

    doYourGpuScan();

    timer().endGpuTimer();

    endYourJob();
}
```

and

```c++
timer().getGpuElapsedTimeForPreviousOperation();
```

to get the measured elapsed time for the operation.
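
#### Recap: host-side dispatch for the optimized scan

To make the thread-count optimization described above concrete, here is a condensed sketch of the host-side loop (essentially what `scanInPlaceDevice` in `efficient.cu` does). It assumes `dev_buffer` holds `extended_n` elements on the device and that `extended_n` has already been rounded up to a power of two; for brevity it uses a single block size for both sweeps, whereas the real code queries one per kernel, and the wrapper name is just for illustration.

```c++
// Work-efficient exclusive scan over a device buffer (condensed sketch).
void scanInPlaceDeviceSketch(int extended_n, int* dev_buffer)
{
    const int block_size = getUpSweepBlockSize();   // from cudaOccupancyMaxPotentialBlockSize
    const int pass_count = ilog2ceil(extended_n) - 1;

    // Up-sweep: pass d only touches extended_n >> (d + 1) nodes, so launch exactly that many threads.
    for (int d = 0; d <= pass_count; d++)
    {
        int max_thread_index = extended_n >> (d + 1);
        int grid = fullBlocksPerGrid(max_thread_index, block_size);
        kernScanUpSweepPass<<<grid, block_size>>>(max_thread_index, 1 << d, dev_buffer);
    }

    // The root of the tree becomes the identity element for an exclusive scan.
    kernSetZero<<<1, 1>>>(extended_n - 1, dev_buffer);

    // Down-sweep mirrors the up-sweep, with the active thread count growing back out.
    for (int d = pass_count; d >= 0; d--)
    {
        int max_thread_index = extended_n >> (d + 1);
        int grid = fullBlocksPerGrid(max_thread_index, block_size);
        kernScanDownSweepPass<<<grid, block_size>>>(max_thread_index, 1 << d, dev_buffer);
    }
}
```

The whole optimization lives in the `extended_n >> (d + 1)` term: instead of launching `extended_n` threads every pass and masking most of them off with a modulo test, each pass launches only the threads that actually touch the tree at that level.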
diff --git a/screenshots/chart_blocksize.png b/screenshots/chart_blocksize.png new file mode 100644 index 0000000..5135747 Binary files /dev/null and b/screenshots/chart_blocksize.png differ diff --git a/screenshots/chart_compact.png b/screenshots/chart_compact.png new file mode 100644 index 0000000..3530d46 Binary files /dev/null and b/screenshots/chart_compact.png differ diff --git a/screenshots/chart_scan.png b/screenshots/chart_scan.png new file mode 100644 index 0000000..349f583 Binary files /dev/null and b/screenshots/chart_scan.png differ diff --git a/screenshots/chart_scan_optimization.png b/screenshots/chart_scan_optimization.png new file mode 100644 index 0000000..efc8e32 Binary files /dev/null and b/screenshots/chart_scan_optimization.png differ diff --git a/screenshots/chart_sort.png b/screenshots/chart_sort.png new file mode 100644 index 0000000..4c4fe78 Binary files /dev/null and b/screenshots/chart_sort.png differ diff --git a/screenshots/preview.gif b/screenshots/preview.gif new file mode 100644 index 0000000..4941094 Binary files /dev/null and b/screenshots/preview.gif differ diff --git a/screenshots/preview_optimized.gif b/screenshots/preview_optimized.gif new file mode 100644 index 0000000..a120522 Binary files /dev/null and b/screenshots/preview_optimized.gif differ diff --git a/src/main.cpp b/src/main.cpp index 675da35..0170cd4 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -1,89 +1,131 @@ /** * @file main.cpp * @brief Stream compaction test program - * @authors Kai Ninomiya - * @date 2015 + * @authors Kai Ninomiya, Ruoyu Fan + * @date 2015, 2016 * @copyright University of Pennsylvania */ -#include #include #include #include #include +#include + #include "testing_helpers.hpp" +#include +#include +#include + +// size of 1 << 26 could on a 970 desktop +// but crashed on my laptop (970m) so I reduced array size +// change the array size if there is still problem +const int SIZE = 1 << 25; //constexpr +//const int SIZE = 1 << 24; +const int NPOT = SIZE - 3; +const int SCAN_MAX = 50; +const int COMPACTION_MAX = 4; + +const int SORT_SIZE = 1 << 25; +//const int SORT_SIZE = 1 << 24; +const int SORT_NPOT = SORT_SIZE - 3; +const int SORT_MAX = 1000000000; + + +int a[SIZE], b[SIZE], c[SIZE], d[SORT_SIZE], e[SORT_SIZE], f[SORT_SIZE]; int main(int argc, char* argv[]) { - const int SIZE = 1 << 8; - const int NPOT = SIZE - 3; - int a[SIZE], b[SIZE], c[SIZE]; // Scan tests + + std::cout << "CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan)"; + std::cout << std::endl; + std::cout << " Block size for naive scan: " << StreamCompaction::Naive::getNaiveScanBlockSize() << std::endl; + std::cout << " Block size for up-sweep: " << StreamCompaction::Efficient::getUpSweepBlockSize() << std::endl; + std::cout << " Block size for down-sweep: " << StreamCompaction::Efficient::getDownSweepBlockSize() << std::endl; + std::cout << " Block size for boolean mapping: " << StreamCompaction::Common::getMapToBooleanBlockSize() << std::endl; + std::cout << " Block size for scattering: " << StreamCompaction::Common::getScatterBlocksize() << std::endl; + std::cout << " Block sizes for radix sort: " + << StreamCompaction::RadixSort::getComputeBArrayBlockSize() << " " + << StreamCompaction::RadixSort::getComputeDArrayBlockSize() << " " + << StreamCompaction::RadixSort::getComputeEArrayBlockSize() << " " + << StreamCompaction::RadixSort::getReshuffleBlockSize() << std::endl; printf("\n"); printf("****************\n"); printf("** SCAN TESTS **\n"); printf("****************\n"); + std::cout << "Array size 
(power of two): " << SIZE << std::endl; + std::cout << "Array size (non-power of two): " << NPOT << std::endl; - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + genArray(SIZE - 1, a, SCAN_MAX); // result for edge case of 0 looks fine a[SIZE - 1] = 0; printArray(SIZE, a, true); zeroArray(SIZE, b); printDesc("cpu scan, power-of-two"); StreamCompaction::CPU::scan(SIZE, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(SIZE, b, true); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); StreamCompaction::CPU::scan(NPOT, c, a); printArray(NPOT, b, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); - //printArray(SIZE, c, true); + printArray(NPOT, c, true); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpResult(NPOT, b, c); printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); printf("*****************************\n"); - + std::cout << "Array size (power of two): " << SIZE << std::endl; + std::cout << "Array size (non-power of two): " << NPOT << std::endl; // Compaction tests - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + genArray(SIZE - 1, a, COMPACTION_MAX); // result for edge case of 0 looks fine a[SIZE - 1] = 0; printArray(SIZE, a, true); @@ -94,6 +136,7 @@ int main(int argc, char* argv[]) { count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); expectedCount = count; printArray(count, b, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printCmpLenResult(count, expectedCount, b, b); zeroArray(SIZE, c); @@ 
-101,23 +144,74 @@ int main(int argc, char* argv[]) { count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); expectedNPOT = count; printArray(count, c, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printCmpLenResult(count, expectedNPOT, b, c); zeroArray(SIZE, c); printDesc("cpu compact with scan"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); printArray(count, c, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); - //printArray(count, c, true); + printArray(count, c, true); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); - //printArray(count, c, true); + printArray(count, c, true); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); printCmpLenResult(count, expectedNPOT, b, c); + + printf("\n"); + printf("*****************************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("*****************************\n"); + std::cout << "Array size (power of two): " << SORT_SIZE << std::endl; + std::cout << "Array size (non-power of two): " << SORT_NPOT << std::endl; + std::cout << "Max value: " << SORT_MAX << std::endl; + + genArray(SORT_SIZE - 1, d, SORT_MAX); + d[SORT_SIZE - 1] = 0; + printArray(SORT_SIZE, d, true); + + printDesc("std::sort, power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(e)); + StreamCompaction::CPU::stdSort(std::begin(e), std::end(e)); + printArray(SORT_SIZE, e, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + + printDesc("thrust::sort (which calls Thrust's radix sort), power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(f)); + StreamCompaction::Thrust::sort(std::begin(f), std::end(f)); + printArray(SORT_SIZE, f, true); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printCmpResult(SORT_SIZE, e, f); + // I wanted to compare with thrust's unstable and stable sort, but it uses radix sort! 
+ + printDesc("radix sort, power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(f)); + StreamCompaction::RadixSort::radixSort(std::begin(f), std::end(f), SORT_MAX); + printArray(SORT_SIZE, f, true); + printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printCmpResult(SORT_SIZE, e, f); + + // must be after all power-of-two sorts since it is standard value + printDesc("std::sort, non power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(e)); + StreamCompaction::CPU::stdSort(std::begin(e), std::begin(e) + SORT_NPOT); + printArray(SORT_NPOT, e, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + + printDesc("radix sort, non power-of-two"); + std::copy(std::begin(d), std::end(d), std::begin(f)); + StreamCompaction::RadixSort::radixSort(std::begin(f), std::begin(f) + SORT_NPOT, SORT_MAX); + printArray(SORT_NPOT, f, true); + printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printCmpResult(SORT_NPOT, e, f); } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index f6b572f..87b5a6b 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -1,6 +1,9 @@ #pragma once #include +#include +#include +#include template int cmpArrays(int n, T *a, T *b) { @@ -17,6 +20,12 @@ void printDesc(const char *desc) { printf("==== %s ====\n", desc); } +template +void printElapsedTime(T time, std::string note = "") +{ + std::cout << " elapsed time: " << time << "ms " << note << std::endl; +} + template void printCmpResult(int n, T *a, T *b) { printf(" %s \n", @@ -40,10 +49,19 @@ void zeroArray(int n, int *a) { } void genArray(int n, int *a, int maxval) { - srand(0); + //srand(0); + srand(time(nullptr)); - for (int i = 0; i < n; i++) { - a[i] = rand() % maxval; + for (int i = 0; i < n; i++) + { + if (maxval == 0) + { + a[i] = 0; + } + else + { + a[i] = rand() % maxval; + } } } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..bc6bdbc 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -9,6 +9,8 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "radix_sort.h" + "radix_sort.cu" ) cuda_add_library(stream_compaction diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index fe872d4..6d49e5d 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,5 +1,8 @@ #include "common.h" +#include +#include + void checkCUDAErrorFn(const char *msg, const char *file, int line) { cudaError_t err = cudaGetLastError(); if (cudaSuccess == err) { @@ -22,17 +25,61 @@ namespace Common { * Maps an array to an array of 0s and 1s for stream compaction. Elements * which map to 0 will be removed, and elements which map to 1 will be kept. */ -__global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO +__global__ void kernMapToBoolean(int n, int *bools, const int *idata) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) { return; } + + if (idata[index]) + { + bools[index] = 1; + } + else + { + bools[index] = 0; + } } + +//__global__ void kernScatter(int n, int *odata, +// const int *idata, const int *bools, const int *indices) + /** - * Performs scatter on an array. That is, for each element in idata, - * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. - */ +* Performs scatter on an array. 
That is, for each element in idata, +* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. +*/ __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { - // TODO + const int *idata, const int *indices) +{ + // use one less buffer to save space + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) { return; } + + if (idata[index]) + { + odata[indices[index]] = idata[index]; + } +} + +int getMapToBooleanBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernMapToBoolean); + } + return block_size; +} +int getScatterBlocksize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernScatter); + } + return block_size; } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 4f52663..5a93691 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -1,8 +1,14 @@ #pragma once +#include +#include + #include #include #include +#include +#include +#include #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) @@ -24,12 +30,139 @@ inline int ilog2ceil(int x) { return ilog2(x - 1) + 1; } +inline int fullBlocksPerGrid(int n, int block_size) +{ + return (n + block_size - 1) / block_size; +} + +// use fixed block size or block size from cudaOccupancyMaxPotentialBlockSize +const int FIXED_BLOCK_SIZE = -1; +//const int FIXED_BLOCK_SIZE = 128; + +template +int calculateBlockSizeForDeviceFunction(T func) +{ + if (FIXED_BLOCK_SIZE <= 0) + { + int block_size; + int min_grid_size; + cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, func); + return block_size; + } + else + { + return FIXED_BLOCK_SIZE; + } +} + +#undef IS_FIXED_BLOCK_SIZE +#undef FIXED_BLOCK_SIZE + namespace StreamCompaction { namespace Common { __global__ void kernMapToBoolean(int n, int *bools, const int *idata); - + + //__global__ void kernScatter(int n, int *odata, + // const int *idata, const int *bools, const int *indices); __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices); + const int *idata, const int *indices); // use one less buffer to save space + + int getMapToBooleanBlockSize(); + int getScatterBlocksize(); + + /** + * This class is used for timing the performance + * Uncopyable and unmovable + */ + class PerformanceTimer + { + public: + PerformanceTimer() + { + cudaEventCreate(&event_start); + cudaEventCreate(&event_end); + } + + ~PerformanceTimer() + { + cudaEventDestroy(event_start); + cudaEventDestroy(event_end); + } + + void startCpuTimer() + { + if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } + cpu_timer_started = true; + + time_start_cpu = std::chrono::high_resolution_clock::now(); + } + + void endCpuTimer() + { + time_end_cpu = std::chrono::high_resolution_clock::now(); + + if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } + + std::chrono::duration duro = time_end_cpu - time_start_cpu; + prev_elapsed_time_cpu_milliseconds = + static_cast(duro.count()); + + cpu_timer_started = false; + } + + void startGpuTimer() + { + if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } + gpu_timer_started = true; + + cudaEventRecord(event_start); + } + + void endGpuTimer() + { + 
cudaEventRecord(event_end); + cudaEventSynchronize(event_end); + + if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } + + cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); + gpu_timer_started = false; + } + + float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 + { + return prev_elapsed_time_cpu_milliseconds; + } + + float getGpuElapsedTimeForPreviousOperation() //noexcept + { + return prev_elapsed_time_gpu_milliseconds; + } + + // remove copy and move functions + PerformanceTimer(const PerformanceTimer&) = delete; + PerformanceTimer(PerformanceTimer&&) = delete; + PerformanceTimer& operator=(const PerformanceTimer&) = delete; + PerformanceTimer& operator=(PerformanceTimer&&) = delete; + + private: + cudaEvent_t event_start = nullptr; + cudaEvent_t event_end = nullptr; + + using time_point_t = std::chrono::high_resolution_clock::time_point; + time_point_t time_start_cpu; + time_point_t time_end_cpu; + + bool cpu_timer_started = false; + bool gpu_timer_started = false; + + float prev_elapsed_time_cpu_milliseconds = 0.f; + float prev_elapsed_time_gpu_milliseconds = 0.f; + }; + } } + + + diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..867b7b4 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,45 @@ -#include #include "cpu.h" +#include "common.h" + +#include +#include +#include + namespace StreamCompaction { namespace CPU { +using StreamCompaction::Common::PerformanceTimer; +PerformanceTimer& timer() +{ + static PerformanceTimer timer; + return timer; +} + +void scan_implementation(int n, int *odata, const int *idata) +{ + if (n <= 0) { return; } + + odata[0] = 0; + + using std::size_t; + for (size_t i = 1; i < n; i++) + { + odata[i] = odata[i - 1] + idata[i - 1]; + } +} + /** * CPU scan (prefix sum). */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); +void scan(int n, int *odata, const int *idata) +{ + // DONE + timer().startCpuTimer(); + + scan_implementation(n, odata, idata); + + timer().endCpuTimer(); } /** @@ -17,9 +47,25 @@ void scan(int n, int *odata, const int *idata) { * * @returns the number of elements remaining after compaction. */ -int compactWithoutScan(int n, int *odata, const int *idata) { - // TODO - return -1; +int compactWithoutScan(int n, int *odata, const int *idata) +{ + //DONE + timer().startCpuTimer(); + + using std::size_t; + size_t olength = 0; + for (size_t i = 0; i < n; i++) + { + if (idata[i]) + { + odata[olength] = idata[i]; + olength++; + } + } + + timer().endCpuTimer(); + + return static_cast(olength); } /** @@ -27,9 +73,49 @@ int compactWithoutScan(int n, int *odata, const int *idata) { * * @returns the number of elements remaining after compaction. 
*/ -int compactWithScan(int n, int *odata, const int *idata) { - // TODO - return -1; +int compactWithScan(int n, int *odata, const int *idata) +{ + // DONE + // Run CPU scan + using std::size_t; + std::vector scan_result(n, 0); + + timer().startCpuTimer(); + + for (size_t i = 0; i < n; i++) + { + if (idata[i]) + { + // also use odata as a temp buffer to save space + odata[i] = 1; + } + } + + scan_implementation(n, scan_result.data(), odata); + + size_t olength = 0; + for (size_t i = 0; i < n; i++) + { + if (idata[i]) + { + odata[scan_result[i]] = idata[i]; + olength++; + } + } + + timer().endCpuTimer(); + + return static_cast(olength); +} + +/** +* This just calls std::sort +*/ +void stdSort(int* start, int* end) +{ + timer().startCpuTimer(); + std::sort(start, end); + timer().endCpuTimer(); } } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 6348bf3..f83c49a 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -1,11 +1,17 @@ #pragma once +#include "common.h" + namespace StreamCompaction { namespace CPU { + StreamCompaction::Common::PerformanceTimer& timer(); + void scan(int n, int *odata, const int *idata); int compactWithoutScan(int n, int *odata, const int *idata); int compactWithScan(int n, int *odata, const int *idata); + + void stdSort(int* start, int* end); } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index b2f739b..5a359f1 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -1,19 +1,235 @@ +#include "efficient.h" + +#include "common.h" #include #include -#include "common.h" -#include "efficient.h" + +#include namespace StreamCompaction { namespace Efficient { -// TODO: __global__ +using StreamCompaction::Common::PerformanceTimer; +PerformanceTimer& timer() +{ + // not thread-safe + static PerformanceTimer timer; + return timer; +} + +// DONE: __global__ + +// optimized: only launch necessary amount of threads in host code +__global__ void kernScanUpSweepPass(int max_thread_index, int add_distance, int* buffer) +{ + auto tindex = threadIdx.x + blockIdx.x * blockDim.x; + + if (tindex >= max_thread_index) { return; } + + // I encountered overflow problem with index < N here so I changed to tindex < max_thread_index + size_t index = (add_distance * 2) * (1 + tindex) - 1; + + buffer[index] = buffer[index] + buffer[index - add_distance]; +} + +// optimized: only launch necessary amount of threads in host code +__global__ void kernScanDownSweepPass(int max_thread_index, int distance, int* buffer) +{ + auto tindex = threadIdx.x + blockIdx.x * blockDim.x; + + if (tindex >= max_thread_index) { return; } + + size_t index = (distance * 2) * (1 + tindex) - 1; + + auto temp = buffer[index - distance]; + buffer[index - distance] = buffer[index]; + buffer[index] = temp + buffer[index]; +} + +// // This old functions launch the same number of threads for all passes, which is unnecessary +//__global__ void kernScanUpSweepPass(int N, int add_distance, int* buffer) +//{ +// auto index = threadIdx.x + blockIdx.x * blockDim.x; +// if (index >= N) { return; } +// +// if ((index + 1) % (add_distance * 2) == 0) +// { +// buffer[index] = buffer[index] + buffer[index - add_distance]; +// } +//} +// +//__global__ void kernScanDownSweepPass(int N, int distance, int* buffer) +//{ +// auto index = threadIdx.x + blockIdx.x * blockDim.x; +// if (index >= N) { return; } +// +// if ((index + 1) % (distance * 2) == 0) +// { +// auto temp = buffer[index - distance]; +// buffer[index - distance] = buffer[index]; 
+// buffer[index] = temp + buffer[index]; +// } +//} + +/** +* Swap value of two array members +* Used for inclusive scan as I misunderstood the requirement +*/ +__global__ void kernSwap(int index1, int index2, int* buffer) +{ + auto thread_index = threadIdx.x + blockIdx.x * blockDim.x; + if (thread_index == 0) + { + buffer[index1] ^= buffer[index2]; + buffer[index2] ^= buffer[index1]; + buffer[index1] ^= buffer[index2]; + } +} + +/** +* Set value of an array member to zero +*/ +__global__ void kernSetZero(int index, int* buffer) +{ + auto thread_index = threadIdx.x + blockIdx.x * blockDim.x; + if (thread_index == 0) + { + buffer[index] = 0; + } +} + + +int getUpSweepBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernScanUpSweepPass); + } + return block_size; +} + +int getDownSweepBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernScanDownSweepPass); + } + return block_size; +} + +enum class ScanType{inclusive, exclusive}; + +void scanInPlaceDevice(int extended_n, int* dev_buffer) +{ + auto block_size_up = getUpSweepBlockSize(); + auto full_blocks_per_grid_up = fullBlocksPerGrid(extended_n, block_size_up); + auto block_size_down = getDownSweepBlockSize(); + auto full_blocks_per_grid_down = fullBlocksPerGrid(extended_n, block_size_down); + + // up sweep + auto pass_count = ilog2ceil(extended_n) - 1; + auto max_thread_index = extended_n; + for (int d = 0; d <= pass_count; d++) + { + max_thread_index = extended_n >> (d + 1); + full_blocks_per_grid_up = fullBlocksPerGrid(max_thread_index, block_size_up); // only launch needed threads + kernScanUpSweepPass << > >(max_thread_index, 1 << d, dev_buffer); + } + + // set the last element to zero + kernSetZero <<<1, 1 >>>(extended_n - 1, dev_buffer); + + + // down sweep + for (int d = pass_count; d >= 0; d--) + { + max_thread_index = extended_n >> (d + 1); + full_blocks_per_grid_down = fullBlocksPerGrid(max_thread_index, block_size_down); // only launch needed threads + kernScanDownSweepPass <<>>(max_thread_index, 1 << d, dev_buffer); + } +} + +void scan_implemention(int n, int *odata, const int *idata, ScanType scan_type) +{ + if (n <= 0) { return; } + + // DONE + // round n up to power of two + auto extended_n = std::size_t(1) << ilog2ceil(n); + // plus one for + auto final_buffer_length = + (scan_type == ScanType::inclusive) ? 
extended_n + 1: extended_n; + + int* dev_buffer; + cudaMalloc((void**)&dev_buffer, final_buffer_length * sizeof(*dev_buffer)); + checkCUDAError("cudaMalloc dev_buffer failed!"); + + // fill zero and copy to device buffer + cudaMemset(dev_buffer, 0, final_buffer_length * sizeof(*idata)); + cudaMemcpy(dev_buffer, idata, n * sizeof(*idata), cudaMemcpyHostToDevice); + + auto block_size_up = getUpSweepBlockSize(); + auto full_blocks_per_grid_up = fullBlocksPerGrid(extended_n, block_size_up); + auto block_size_down = getDownSweepBlockSize(); + auto full_blocks_per_grid_down = fullBlocksPerGrid(extended_n, block_size_down); + + timer().startGpuTimer(); + + // up sweep + auto pass_count = ilog2ceil(extended_n) - 1; + auto max_thread_index = extended_n; + for (int d = 0; d <= pass_count; d++) + { + max_thread_index = extended_n >> (d + 1); + full_blocks_per_grid_up = fullBlocksPerGrid(max_thread_index, block_size_up); // only launch needed threads + kernScanUpSweepPass << > >(max_thread_index, 1 << d, dev_buffer); + } + + if (scan_type == ScanType::inclusive) + { + // swap the last element of up sweep result and the real last element (0) + kernSwap <<<1, 1>>>(extended_n - 1, final_buffer_length - 1, dev_buffer); + } + else + { + // set the last element to zero + kernSetZero <<<1, 1>>>(extended_n - 1, dev_buffer); + } + + // down sweep + for (int d = pass_count; d >= 0; d--) + { + max_thread_index = extended_n >> (d + 1); + full_blocks_per_grid_down = fullBlocksPerGrid(max_thread_index, block_size_down); // only launch needed threads + kernScanDownSweepPass <<>>(max_thread_index, 1 << d, dev_buffer); + } + + timer().endGpuTimer(); + + if (scan_type == ScanType::inclusive) + { + // copy with offset to make it an inclusive scan + cudaMemcpy(odata, dev_buffer + 1, n * sizeof(*odata), cudaMemcpyDeviceToHost); + } + else + { + cudaMemcpy(odata, dev_buffer, n * sizeof(*odata), cudaMemcpyDeviceToHost); + } + + cudaFree(dev_buffer); +} /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. + * This just call scan_implementaion */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); +void scan(int n, int *odata, const int *idata) +{ + scan_implemention(n, odata, idata, ScanType::exclusive); } /** @@ -26,8 +242,63 @@ void scan(int n, int *odata, const int *idata) { * @returns The number of elements remaining after compaction. 
*/ int compact(int n, int *odata, const int *idata) { - // TODO - return -1; + if (n <= 0) { return 0; } + + // DONE + int* dev_idata; + cudaMalloc((void**)&dev_idata, n * sizeof(*dev_idata)); + checkCUDAError("cudaMalloc dev_idata failed!"); + cudaMemcpy(dev_idata, idata, n * sizeof(*idata), cudaMemcpyHostToDevice); + + auto extended_n = std::size_t(1) << ilog2ceil(n); // round up to power of two for scanning + + int* dev_bools; // TODO: I could use size_t* (but which will result in a lot of rewrite for the kernel functions) + cudaMalloc((void**)&dev_bools, extended_n * sizeof(*dev_bools)); + checkCUDAError("cudaMalloc dev_bools failed!"); + // fill zero and copy to boolean buffer + cudaMemset(dev_bools, 0, extended_n * sizeof(*dev_bools)); + + // reuse bool buffer as indices buffer + auto dev_indices = dev_bools; + + int* dev_odata; + cudaMalloc((void**)&dev_odata, n * sizeof(*dev_odata)); + checkCUDAError("cudaMalloc dev_odata failed!"); + + auto block_size_booleanize = Common::getMapToBooleanBlockSize(); + auto full_blocks_per_grid_booleanize = fullBlocksPerGrid(n, block_size_booleanize); + auto block_size_scatter = Common::getScatterBlocksize(); + auto full_blocks_per_grid_scatter = fullBlocksPerGrid(n, block_size_scatter); + + timer().startGpuTimer(); + + // map to boolean + Common::kernMapToBoolean <<>>(n, dev_bools, dev_idata); + + // exclusively scan the dev_bools buffer + scanInPlaceDevice(extended_n, dev_bools); + + // scatter + Common::kernScatter <<>>(n, dev_odata, dev_idata, dev_indices); + timer().endGpuTimer(); + + // calculate compacted length + using dev_indices_t = std::remove_reference::type; + dev_indices_t result_length; + cudaMemcpy(&result_length, dev_indices + n - 1, sizeof(result_length), cudaMemcpyDeviceToHost); + if (idata[n - 1]) + { + result_length += 1; + } + + // get compacted result + cudaMemcpy(odata, dev_odata, result_length * sizeof(*odata), cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_bools); // the same buffer as dev_indices + cudaFree(dev_odata); + + return result_length; } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 395ba10..e9c6b60 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -1,9 +1,19 @@ #pragma once +#include "common.h" + namespace StreamCompaction { namespace Efficient { + + StreamCompaction::Common::PerformanceTimer& timer(); + void scan(int n, int *odata, const int *idata); int compact(int n, int *odata, const int *idata); + + int getUpSweepBlockSize(); + int getDownSweepBlockSize(); + + void scanInPlaceDevice(int extended_n, int* dev_buffer); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 3d86b60..b7ea4de 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -1,19 +1,88 @@ +#include "naive.h" + +#include "common.h" #include #include -#include "common.h" -#include "naive.h" +#include namespace StreamCompaction { namespace Naive { -// TODO: __global__ + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + +// DONE: __global__ +__global__ void kernNaiveScanPass(int N, int offset, int* in_buffer, int* out_buffer) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= N) { return; } + + if (index >= offset) + { + out_buffer[index] = in_buffer[index - offset] + in_buffer[index]; + } + else + { + out_buffer[index] = in_buffer[index]; + } +} + +int getNaiveScanBlockSize() +{ + // not thread-safe + static 
int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernNaiveScanPass); + } + return block_size; +} + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); +void scan(int n, int *odata, const int *idata) +{ + if (n <= 0) { return; } + + auto block_size = getNaiveScanBlockSize(); + auto full_blocks_per_grid = (n + block_size - 1) / block_size; + + // DONE + int* dev_in_buffer; + cudaMalloc((void**)&dev_in_buffer, n * sizeof(*dev_in_buffer)); + checkCUDAError("cudaMalloc dev_in_buffer failed!"); + int* dev_out_buffer; + cudaMalloc((void**)&dev_out_buffer, n * sizeof(*dev_out_buffer)); + checkCUDAError("cudaMalloc dev_out_buffer failed!"); + + cudaMemcpy(dev_in_buffer, idata, n * sizeof(*idata), cudaMemcpyHostToDevice); + + timer().startGpuTimer(); + + auto cap = ilog2ceil(n); + int offset; + for (int d = 1; d <= cap; d++) + { + offset = 1 << (d - 1); + kernNaiveScanPass <<< full_blocks_per_grid, block_size >>>(n, offset, dev_in_buffer, dev_out_buffer); + std::swap(dev_in_buffer, dev_out_buffer); + } + std::swap(dev_in_buffer, dev_out_buffer); + + timer().endGpuTimer(); + + // defered copy because of exclusive scan + cudaMemcpy(odata + 1, dev_out_buffer, (n - 1) * sizeof(*odata), cudaMemcpyDeviceToHost); + odata[0] = 0; + + cudaFree(dev_in_buffer); + cudaFree(dev_out_buffer); } } diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 21152d6..2795ea6 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -1,7 +1,13 @@ #pragma once +#include "common.h" + namespace StreamCompaction { namespace Naive { + + StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + + int getNaiveScanBlockSize(); } } diff --git a/stream_compaction/radix_sort.cu b/stream_compaction/radix_sort.cu new file mode 100644 index 0000000..0e54fc1 --- /dev/null +++ b/stream_compaction/radix_sort.cu @@ -0,0 +1,187 @@ +#include "radix_sort.h" + +#include "efficient.h" +#include "common.h" +#include +#include +#include + +namespace StreamCompaction { +namespace RadixSort { + + +using StreamCompaction::Common::PerformanceTimer; +PerformanceTimer& timer() +{ + static PerformanceTimer timer; + return timer; +} + + +__global__ void kernComputeBArray(const int N, int bit_mask, bool *b, const int *idata) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= N) { return; } + + b[index] = ((idata[index] & bit_mask) != 0); +} + +__global__ void kernComputeEArray(const int N, int* e, const bool *b, const int extended_n) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= extended_n) { return; } + else if (index >= N) + { + // the length of e (or later f) needs to be power of 2 + e[index] = 0; + } + else + { + // e is int array since directly performing scan on it + e[index] = !b[index]; + } +} + +__global__ void kernComputeDArray(const int N, int* d, const int* f, const bool* b) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= N) { return; } + + // t[i] = i-f[i] + total_falses + // total_falses = e[n-1] + f[n-1] = !b[n-1] + f[n-1] + // d[i] = b[i]? 
t[i] : f[i] + if (b[index]) + { + d[index] = index - f[index] + !b[N - 1] + f[N - 1]; + } + else + { + d[index] = f[index]; + } +} + +__global__ void kernReshuffle(const int N, int* to_buffer, const int* from_buffer, const int* indices) +{ + auto index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= N) { return; } + + to_buffer[indices[index]] = from_buffer[index]; +} + +int getComputeBArrayBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernComputeBArray); + } + return block_size; +} + +int getComputeEArrayBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernComputeEArray); + } + return block_size; +} + +int getComputeDArrayBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernComputeDArray); + } + return block_size; +} + +int getReshuffleBlockSize() +{ + // not thread-safe + static int block_size = -1; + if (block_size == -1) + { + block_size = calculateBlockSizeForDeviceFunction(kernReshuffle); + } + return block_size; +} + +/** +* performs radix sort, which assumes all data is int type and between [0, max_value) +*/ +void radixSort(int* start, int* end, int max_value) +{ + auto n = static_cast(end - start); + + auto block_size_compute_b = getComputeBArrayBlockSize(); + auto full_blocks_per_grid_compute_b = fullBlocksPerGrid(n, block_size_compute_b); + + auto block_size_compute_e = getComputeBArrayBlockSize(); + auto full_blocks_per_grid_compute_e = fullBlocksPerGrid(n, block_size_compute_e); + + auto block_size_compute_d = getComputeBArrayBlockSize(); + auto full_blocks_per_grid_compute_d = fullBlocksPerGrid(n, block_size_compute_d); + + auto block_size_reshuffle = getReshuffleBlockSize(); + auto full_blocks_per_grid_reshuffle = fullBlocksPerGrid(n, block_size_compute_d); + + auto extended_n = std::size_t(1) << ilog2ceil(n); // round up to power of two for scanning + + int* dev_array; + cudaMalloc((void**)&dev_array, n * sizeof(*dev_array)); + checkCUDAError("cudaMalloc dev_array failed!"); + cudaMemcpy(dev_array, start, n * sizeof(*start), cudaMemcpyHostToDevice); + + int* dev_temp; + cudaMalloc((void**)&dev_temp, n * sizeof(*dev_temp)); + checkCUDAError("cudaMalloc dev_temp failed!"); + + bool* dev_b; // buffer which holds values of b + cudaMalloc((void**)&dev_b, n * sizeof(*dev_b)); + checkCUDAError("cudaMalloc dev_b failed!"); + + int* dev_ef; // buffer which holds values of e and f + cudaMalloc((void**)&dev_ef, extended_n * sizeof(*dev_ef)); + checkCUDAError("cudaMalloc dev_ef failed!"); + + int* dev_d; // buffer which holds values of d + cudaMalloc((void**)&dev_d, n * sizeof(*dev_d)); + checkCUDAError("cudaMalloc dev_d failed!"); + + timer().startGpuTimer(); + // input betweem [0, max_value) + // auto lsb_offset = 0; + auto msb_offset = ilog2ceil(max_value); + for (int offset = 0; offset < msb_offset; offset++) + { + auto bit_mask = 1 << offset; + kernComputeBArray <<>>(n, bit_mask, dev_b, dev_array); + kernComputeEArray <<>>(n, dev_ef, dev_b, extended_n); + + StreamCompaction::Efficient::scanInPlaceDevice(extended_n, dev_ef); + + kernComputeDArray <<>>(n, dev_d, dev_ef, dev_b); + + kernReshuffle <<>>(n, dev_temp, dev_array, dev_d); + std::swap(dev_temp, dev_array); + + } + timer().endGpuTimer(); + + cudaMemcpy(start, dev_array, n * sizeof(*start), cudaMemcpyDeviceToHost); + + 
cudaFree(dev_array); + cudaFree(dev_temp); + cudaFree(dev_b); + cudaFree(dev_ef); + cudaFree(dev_d); +} + +} +} diff --git a/stream_compaction/radix_sort.h b/stream_compaction/radix_sort.h new file mode 100644 index 0000000..c1e8c29 --- /dev/null +++ b/stream_compaction/radix_sort.h @@ -0,0 +1,15 @@ +#pragma once + +#include "common.h" + +namespace StreamCompaction { +namespace RadixSort { + StreamCompaction::Common::PerformanceTimer& timer(); + void radixSort(int* start, int* end, int max_value); + + int getComputeBArrayBlockSize(); + int getComputeEArrayBlockSize(); + int getComputeDArrayBlockSize(); + int getReshuffleBlockSize(); +} +} diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index d8dbb32..5890e01 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -3,20 +3,67 @@ #include #include #include +#include #include "common.h" #include "thrust.h" namespace StreamCompaction { namespace Thrust { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - // TODO use `thrust::exclusive_scan` + // DONE use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::device_vector thrust_idata(idata, idata + n); + thrust::device_vector thrust_odata(odata, odata + n); + + timer().startGpuTimer(); + + thrust::exclusive_scan(thrust_idata.begin(), thrust_idata.end(), thrust_odata.begin()); + + timer().endGpuTimer(); + + //thrust::host_vector thrust_host_odata = thrust_odata; + thrust::copy(thrust_odata.begin(), thrust_odata.end(), odata); } +void sort(int* start, int* end) +{ + thrust::device_vector thrust_data(start, end); + + timer().startGpuTimer(); + + // I found it calls thrust's radix sort. 
+ thrust::sort(thrust_data.begin(), thrust_data.end()); + + timer().endGpuTimer(); + + thrust::copy(thrust_data.begin(), thrust_data.end(), start); +} + +//void stableSort(int* start, int* end) +//{ +// thrust::device_vector thrust_data(start, end); +// +// timer().startGpuTimer(); +// +// thrust::sort(thrust_data.begin(), thrust_data.end()); +// +// timer().endGpuTimer(); +// +// thrust::copy(thrust_data.begin(), thrust_data.end(), start); +//} + } } diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h index 06707f3..6311ce7 100644 --- a/stream_compaction/thrust.h +++ b/stream_compaction/thrust.h @@ -1,7 +1,11 @@ #pragma once +#include "common.h" + namespace StreamCompaction { namespace Thrust { + StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + void sort(int* start, int* end); } } diff --git a/test_results/charts.xlsx b/test_results/charts.xlsx new file mode 100644 index 0000000..0beab61 Binary files /dev/null and b/test_results/charts.xlsx differ diff --git a/test_results/result_1.txt b/test_results/result_1.txt new file mode 100644 index 0000000..7d9051a --- /dev/null +++ b/test_results/result_1.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 17 27 18 32 43 0 40 49 25 4 22 14 39 ... 47 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 124.311ms (std::chrono Measured) + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657908 1643657955 ] +==== cpu scan, non-power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657825 1643657826 ] + elapesd time: 97.2587ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657908 1643657955 ] + elapesd time: 113.743ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657825 1643657826 ] + elapesd time: 113.622ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657908 1643657955 ] + elapesd time: 44.3556ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657825 1643657826 ] + elapesd time: 44.3666ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657908 1643657955 ] + elapesd time: 7.61741ms (CUDA Measured) + elapesd time: 15.6531ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 17 44 62 94 137 137 177 226 251 255 277 291 ... 1643657825 1643657826 ] + elapesd time: 7.6697ms (CUDA Measured) + elapesd time: 21.6846ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 0 1 0 1 2 0 3 1 2 0 2 2 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 
1 3 ] + elapesd time: 162.924ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 3 1 ] + elapesd time: 162.398ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 1 3 ] + elapesd time: 413.098ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 1 3 ] + elapesd time: 53.2036ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 1 1 2 3 1 2 2 2 1 3 1 3 ... 3 1 ] + elapesd time: 53.1786ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 33 33 85 26 30 16 85 80 28 82 32 35 79 ... 47 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1519.96ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 415.289ms (std::chrono Measured) + elapesd time: 0.001536ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 412.907ms (std::chrono Measured) + elapesd time: 0.001408ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 417.083ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1520.99ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.361ms (CUDA Measured) + passed diff --git a/test_results/result_2.txt b/test_results/result_2.txt new file mode 100644 index 0000000..02cf932 --- /dev/null +++ b/test_results/result_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 29 21 40 33 12 4 41 16 15 13 11 1 44 ... 26 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 127.84ms (std::chrono Measured) + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643379015 1643379041 ] +==== cpu scan, non-power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643378970 1643378986 ] + elapesd time: 97.2591ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643379015 1643379041 ] + elapesd time: 113.736ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643378970 1643378986 ] + elapesd time: 113.642ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643379015 1643379041 ] + elapesd time: 44.4066ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643378970 1643378986 ] + elapesd time: 44.2758ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 
1643379015 1643379041 ] + elapesd time: 7.58675ms (CUDA Measured) + elapesd time: 15.629ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 29 50 90 123 135 139 180 196 211 224 235 236 ... 1643378970 1643378986 ] + elapesd time: 7.5575ms (CUDA Measured) + elapesd time: 15.6058ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 0 0 3 1 2 0 1 2 1 2 3 0 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 1 ] + elapesd time: 159.916ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 2 ] + elapesd time: 158.894ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 1 ] + elapesd time: 398.104ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 1 ] + elapesd time: 53.2946ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 1 2 1 2 1 2 3 2 3 3 2 1 ... 2 2 ] + elapesd time: 53.2884ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 98 25 21 41 13 15 98 3 80 92 74 15 5 ... 75 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1521.05ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 417.277ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 427.603ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.437ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1509.88ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.197ms (CUDA Measured) + passed diff --git a/test_results/result_3.txt b/test_results/result_3.txt new file mode 100644 index 0000000..e11f67c --- /dev/null +++ b/test_results/result_3.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 37 34 33 38 16 49 29 43 11 2 49 44 27 ... 5 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 115.903ms (std::chrono Measured) + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541232 1643541237 ] +==== cpu scan, non-power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541123 1643541155 ] + elapesd time: 84.6896ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 
1643541232 1643541237 ] + elapesd time: 113.689ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541123 1643541155 ] + elapesd time: 113.611ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541232 1643541237 ] + elapesd time: 44.3716ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541123 1643541155 ] + elapesd time: 44.3758ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541232 1643541237 ] + elapesd time: 7.61901ms (CUDA Measured) + elapesd time: 15.6295ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 37 71 104 142 158 207 236 279 290 292 341 385 ... 1643541123 1643541155 ] + elapesd time: 7.69434ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 3 3 3 2 0 0 0 2 2 3 3 0 0 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 3 ] + elapesd time: 163.452ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 2 ] + elapesd time: 163.926ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 3 ] + elapesd time: 415.607ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 3 ] + elapesd time: 53.2367ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 3 3 2 2 2 3 3 2 3 2 2 1 ... 2 2 ] + elapesd time: 53.1997ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 54 8 0 33 21 33 41 75 33 79 60 64 18 ... 37 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1523.91ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 424.964ms (std::chrono Measured) + elapesd time: 0.001248ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 424.985ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.306ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1496.82ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 417.934ms (CUDA Measured) + passed diff --git a/test_results/result_before_optimization_1.txt b/test_results/result_before_optimization_1.txt new file mode 100644 index 0000000..f3484a4 --- /dev/null +++ b/test_results/result_before_optimization_1.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 38 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 163.936ms (std::chrono Measured) + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794086 1643794124 ] +==== cpu scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794020 1643794034 ] + elapesd time: 139.391ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794086 1643794124 ] + elapesd time: 113.742ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794020 1643794034 ] + elapesd time: 113.551ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794086 1643794124 ] + elapesd time: 120.501ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794020 1643794034 ] + elapesd time: 120.565ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794086 1643794124 ] + elapesd time: 7.72432ms (CUDA Measured) + elapesd time: 12.0047ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1643794020 1643794034 ] + elapesd time: 7.37165ms (CUDA Measured) + elapesd time: 8.0206ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 165.408ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 160.963ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 411.79ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 129.465ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ] + elapesd time: 129.529ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 38 19 38 37 55 97 65 85 50 12 53 0 42 ... 88 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 1500.15ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 414.16ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.68ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 950.529ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1527.59ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 950.95ms (CUDA Measured) + passed diff --git a/test_results/result_before_optimization_2.txt b/test_results/result_before_optimization_2.txt new file mode 100644 index 0000000..25ee039 --- /dev/null +++ b/test_results/result_before_optimization_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 48 23 33 8 35 33 9 27 24 30 4 30 44 ... 44 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 160.422ms (std::chrono Measured) + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396363 1643396407 ] +==== cpu scan, non-power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396286 1643396311 ] + elapesd time: 97.76ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396363 1643396407 ] + elapesd time: 113.653ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396286 1643396311 ] + elapesd time: 113.603ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396363 1643396407 ] + elapesd time: 120.645ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396286 1643396311 ] + elapesd time: 120.59ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396363 1643396407 ] + elapesd time: 7.7217ms (CUDA Measured) + elapesd time: 11.9449ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 48 71 104 112 147 180 189 216 240 270 274 304 ... 1643396286 1643396311 ] + elapesd time: 7.7703ms (CUDA Measured) + elapesd time: 8.6162ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 2 3 0 3 2 0 1 1 2 1 0 1 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 161.489ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 
3 1 ] + elapesd time: 159.964ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 410.573ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 129.561ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 3 3 2 1 1 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 130.928ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 17 27 96 98 87 94 33 63 40 9 17 44 5 ... 75 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1515.05ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 415.197ms (std::chrono Measured) + elapesd time: 0.001056ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 419.66ms (std::chrono Measured) + elapesd time: 0.001056ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 951.73ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1534.72ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 960.985ms (CUDA Measured) + passed diff --git a/test_results/result_before_optimization_3.txt b/test_results/result_before_optimization_3.txt new file mode 100644 index 0000000..bc2a911 --- /dev/null +++ b/test_results/result_before_optimization_3.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 32 23 0 14 49 39 47 12 21 3 40 36 34 ... 17 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 162.452ms (std::chrono Measured) + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620618 1643620635 ] +==== cpu scan, non-power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620565 1643620567 ] + elapesd time: 139.879ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620618 1643620635 ] + elapesd time: 113.688ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620565 1643620567 ] + elapesd time: 113.833ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620618 1643620635 ] + elapesd time: 120.611ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620565 1643620567 ] + elapesd time: 120.695ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 
1643620618 1643620635 ] + elapesd time: 7.94493ms (CUDA Measured) + elapesd time: 12.4531ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 32 55 55 69 118 157 204 216 237 240 280 316 ... 1643620565 1643620567 ] + elapesd time: 8.14739ms (CUDA Measured) + elapesd time: 9.2116ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 2 2 0 0 1 3 3 0 3 0 2 3 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 165.433ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 161.903ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 411.094ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 129.599ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 2 2 1 3 3 3 2 3 1 2 3 2 ... 2 3 ] + elapesd time: 129.598ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 48 79 34 59 54 23 60 44 43 31 1 6 24 ... 99 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1523.59ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 427.786ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 415.179ms (std::chrono Measured) + elapesd time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 954.781ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1529.83ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 963.745ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_1024.txt b/test_results/result_blocksize_1024.txt new file mode 100644 index 0000000..256ab91 --- /dev/null +++ b/test_results/result_blocksize_1024.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 8 18 37 41 15 25 27 8 36 28 13 40 24 ... 35 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 134.408ms (std::chrono Measured) + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625502 1643625537 ] +==== cpu scan, non-power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625408 1643625440 ] + elapesd time: 149.901ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 
1643625502 1643625537 ] + elapesd time: 113.867ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625408 1643625440 ] + elapesd time: 113.687ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625502 1643625537 ] + elapesd time: 44.2491ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625408 1643625440 ] + elapesd time: 44.3104ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625502 1643625537 ] + elapesd time: 7.73741ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 8 26 63 104 119 144 171 179 215 243 256 296 ... 1643625408 1643625440 ] + elapesd time: 7.74371ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 1 0 3 3 1 2 1 1 2 1 0 3 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 3 ] + elapesd time: 155.403ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 2 ] + elapesd time: 154.901ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 3 ] + elapesd time: 421.621ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 3 ] + elapesd time: 54.2043ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 3 3 1 2 1 1 2 1 3 3 1 3 ... 2 2 ] + elapesd time: 54.1137ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 78 22 68 49 66 85 83 63 52 58 25 5 35 ... 84 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1522.95ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 429.389ms (std::chrono Measured) + elapesd time: 0.001184ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.915ms (std::chrono Measured) + elapesd time: 0.001216ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 419.691ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1516.39ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 416.676ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_128.txt b/test_results/result_blocksize_128.txt new file mode 100644 index 0000000..a483fc1 --- /dev/null +++ b/test_results/result_blocksize_128.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 128 + Block size for up-sweep: 128 + Block size for down-sweep: 128 + Block size for boolean mapping: 128 + Block size for scattering: 128 + Block sizes for radix sort: 128 128 128 128 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 46 41 34 49 43 33 6 10 32 34 49 28 4 ... 5 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 158.005ms (std::chrono Measured) + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517674 1643517679 ] +==== cpu scan, non-power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517602 1643517613 ] + elapesd time: 96.7737ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517674 1643517679 ] + elapesd time: 113.802ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517602 1643517613 ] + elapesd time: 113.643ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517674 1643517679 ] + elapesd time: 44.7409ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517602 1643517613 ] + elapesd time: 44.7841ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517674 1643517679 ] + elapesd time: 7.69293ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 46 87 121 170 213 246 252 262 294 328 377 405 ... 1643517602 1643517613 ] + elapesd time: 7.59293ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 0 3 3 2 1 1 3 3 2 0 0 3 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 2 ] + elapesd time: 163.926ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 1 ] + elapesd time: 163.929ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 2 ] + elapesd time: 412.474ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 2 ] + elapesd time: 53.6893ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 3 3 2 1 1 3 3 2 3 3 2 1 ... 1 1 ] + elapesd time: 53.726ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 15 46 15 57 62 25 63 97 98 14 12 92 16 ... 4 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1503.52ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 414.828ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 417.44ms (std::chrono Measured) + elapesd time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.162ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1518.31ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.752ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_16.txt b/test_results/result_blocksize_16.txt new file mode 100644 index 0000000..047696c --- /dev/null +++ b/test_results/result_blocksize_16.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 16 + Block size for up-sweep: 16 + Block size for down-sweep: 16 + Block size for boolean mapping: 16 + Block size for scattering: 16 + Block sizes for radix sort: 16 16 16 16 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 43 42 6 30 14 3 5 12 45 14 24 29 33 ... 40 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 120.874ms (std::chrono Measured) + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602997 1643603037 ] +==== cpu scan, non-power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602942 1643602945 ] + elapesd time: 94.2564ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602997 1643603037 ] + elapesd time: 212.542ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602942 1643602945 ] + elapesd time: 197.078ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602997 1643603037 ] + elapesd time: 44.3224ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602942 1643602945 ] + elapesd time: 44.2977ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602997 1643603037 ] + elapesd time: 7.68762ms (CUDA Measured) + elapesd time: 15.6526ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 43 85 91 121 135 138 143 155 200 214 238 267 ... 1643602942 1643602945 ] + elapesd time: 7.66259ms (CUDA Measured) + elapesd time: 15.6029ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 1 3 1 0 0 3 3 3 2 3 3 3 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 3 2 ] + elapesd time: 162.406ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 2 3 ] + elapesd time: 161.925ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 3 2 ] + elapesd time: 407.585ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 
3 2 ] + elapesd time: 62.2979ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 3 1 3 3 3 2 3 3 3 1 1 2 ... 2 3 ] + elapesd time: 62.2192ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 66 26 83 83 12 40 92 89 54 46 39 88 15 ... 88 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1498.49ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 426.322ms (std::chrono Measured) + elapesd time: 0.001472ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 411.847ms (std::chrono Measured) + elapesd time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 545.908ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1494.83ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 554.777ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_256.txt b/test_results/result_blocksize_256.txt new file mode 100644 index 0000000..7ae2b41 --- /dev/null +++ b/test_results/result_blocksize_256.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 256 + Block size for up-sweep: 256 + Block size for down-sweep: 256 + Block size for boolean mapping: 256 + Block size for scattering: 256 + Block sizes for radix sort: 256 256 256 256 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 13 11 23 13 16 28 2 9 25 46 3 0 5 ... 14 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 162.003ms (std::chrono Measured) + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295327 1643295341 ] +==== cpu scan, non-power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295226 1643295260 ] + elapesd time: 94.2996ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295327 1643295341 ] + elapesd time: 114.575ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295226 1643295260 ] + elapesd time: 113.791ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295327 1643295341 ] + elapesd time: 45.3902ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295226 1643295260 ] + elapesd time: 45.3006ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 1643295327 1643295341 ] + elapesd time: 7.74496ms (CUDA Measured) + elapesd time: 12.5349ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 13 24 47 60 76 104 106 115 140 186 189 189 ... 
1643295226 1643295260 ] + elapesd time: 7.92486ms (CUDA Measured) + elapesd time: 15.6367ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 0 0 0 3 0 2 2 3 3 1 0 1 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 3 ] + elapesd time: 163.934ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 2 ] + elapesd time: 162.403ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 3 ] + elapesd time: 411.574ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 3 ] + elapesd time: 53.4933ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 3 2 2 3 3 1 1 3 2 1 1 2 ... 2 2 ] + elapesd time: 53.47ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 32 65 36 71 68 70 59 63 90 75 65 65 66 ... 14 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1495.48ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 432.425ms (std::chrono Measured) + elapesd time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 431.506ms (std::chrono Measured) + elapesd time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 417.745ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1519.54ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.848ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_32.txt b/test_results/result_blocksize_32.txt new file mode 100644 index 0000000..e6047aa --- /dev/null +++ b/test_results/result_blocksize_32.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 32 + Block size for up-sweep: 32 + Block size for down-sweep: 32 + Block size for boolean mapping: 32 + Block size for scattering: 32 + Block sizes for radix sort: 32 32 32 32 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 10 18 28 43 38 19 1 46 6 26 42 28 46 ... 22 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 162.953ms (std::chrono Measured) + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743873 1643743895 ] +==== cpu scan, non-power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743749 1643743780 ] + elapesd time: 96.7571ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743873 1643743895 ] + elapesd time: 133.116ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 
1643743749 1643743780 ] + elapesd time: 128.137ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743873 1643743895 ] + elapesd time: 43.4925ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743749 1643743780 ] + elapesd time: 43.3876ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743873 1643743895 ] + elapesd time: 7.7208ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 10 28 56 99 137 156 157 203 209 235 277 305 ... 1643743749 1643743780 ] + elapesd time: 7.51955ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 0 3 3 3 2 0 3 2 1 0 0 2 3 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 2 2 ] + elapesd time: 164.934ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 3 2 ] + elapesd time: 163.927ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 2 2 ] + elapesd time: 414.604ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 2 2 ] + elapesd time: 54.2754ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 3 3 2 3 2 1 2 3 3 1 3 3 ... 3 2 ] + elapesd time: 54.2756ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 30 72 59 33 89 80 8 32 72 56 4 42 8 ... 21 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1493.97ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 425.906ms (std::chrono Measured) + elapesd time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 426.963ms (std::chrono Measured) + elapesd time: 0.001184ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 450.604ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1505.41ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 442.399ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_4.txt b/test_results/result_blocksize_4.txt new file mode 100644 index 0000000..4283668 --- /dev/null +++ b/test_results/result_blocksize_4.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 4 + Block size for up-sweep: 4 + Block size for down-sweep: 4 + Block size for boolean mapping: 4 + Block size for scattering: 4 + Block sizes for radix sort: 4 4 4 4 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 10 40 34 29 3 18 43 40 36 6 41 24 29 ... 
27 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 131.091ms (std::chrono Measured) + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543450 1643543477 ] +==== cpu scan, non-power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543353 1643543374 ] + elapesd time: 97.2582ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543450 1643543477 ] + elapesd time: 755.433ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543353 1643543374 ] + elapesd time: 756.067ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543450 1643543477 ] + elapesd time: 76.5717ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543353 1643543374 ] + elapesd time: 76.4466ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543450 1643543477 ] + elapesd time: 7.60525ms (CUDA Measured) + elapesd time: 15.6265ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 10 50 84 113 116 134 177 217 253 259 300 324 ... 1643543353 1643543374 ] + elapesd time: 7.60605ms (CUDA Measured) + elapesd time: 8.9781ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 3 2 1 3 3 2 2 1 0 0 0 0 0 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 164.4ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 1 3 ] + elapesd time: 152.59ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 411.575ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 3 1 ] + elapesd time: 143.747ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 2 1 3 3 2 2 1 1 2 3 2 2 ... 1 3 ] + elapesd time: 143.446ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 33 43 61 82 69 55 63 67 95 87 56 32 61 ... 76 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1503.12ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.707ms (std::chrono Measured) + elapesd time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 421.296ms (std::chrono Measured) + elapesd time: 0.001248ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1430.52ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1512.72ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 1424.39ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_512.txt b/test_results/result_blocksize_512.txt new file mode 100644 index 0000000..d2f91f2 --- /dev/null +++ b/test_results/result_blocksize_512.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 512 + Block size for up-sweep: 512 + Block size for down-sweep: 512 + Block size for boolean mapping: 512 + Block size for scattering: 512 + Block sizes for radix sort: 512 512 512 512 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 8 48 0 22 46 42 42 6 29 43 24 12 33 ... 3 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 113.379ms (std::chrono Measured) + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320954 1643320957 ] +==== cpu scan, non-power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320916 1643320916 ] + elapesd time: 94.7561ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320954 1643320957 ] + elapesd time: 114.593ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320916 1643320916 ] + elapesd time: 113.794ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320954 1643320957 ] + elapesd time: 44.3084ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320916 1643320916 ] + elapesd time: 44.2844ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320954 1643320957 ] + elapesd time: 7.65802ms (CUDA Measured) + elapesd time: 15.6002ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 8 56 56 78 124 166 208 214 243 286 310 322 ... 1643320916 1643320916 ] + elapesd time: 7.88176ms (CUDA Measured) + elapesd time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 1 1 0 1 2 3 0 1 1 3 0 2 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 164.929ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 164.423ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 412.597ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 53.2704ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 1 1 2 3 1 1 3 2 1 2 2 2 ... 2 2 ] + elapesd time: 53.2295ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 31 51 77 75 62 11 62 83 56 75 89 70 65 ... 51 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1523.42ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapesd time: 421.785ms (std::chrono Measured) + elapesd time: 0.001632ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 425.961ms (std::chrono Measured) + elapesd time: 0.001184ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 418.947ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 1494.14ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapesd time: 416.928ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_64.txt b/test_results/result_blocksize_64.txt new file mode 100644 index 0000000..33d7ec0 --- /dev/null +++ b/test_results/result_blocksize_64.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 64 + Block size for up-sweep: 64 + Block size for down-sweep: 64 + Block size for boolean mapping: 64 + Block size for scattering: 64 + Block sizes for radix sort: 64 64 64 64 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 30 36 13 2 8 41 42 35 42 40 47 44 18 ... 49 0 ] +==== cpu scan, power-of-two ==== + elapesd time: 118.412ms (std::chrono Measured) + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522704 1643522753 ] +==== cpu scan, non-power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522605 1643522641 ] + elapesd time: 94.2541ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522704 1643522753 ] + elapesd time: 114.7ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522605 1643522641 ] + elapesd time: 113.626ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522704 1643522753 ] + elapesd time: 44.7172ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522605 1643522641 ] + elapesd time: 44.5198ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522704 1643522753 ] + elapesd time: 7.42698ms (CUDA Measured) + elapesd time: 15.6355ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 30 66 79 81 89 130 172 207 249 289 336 380 ... 1643522605 1643522641 ] + elapesd time: 7.6359ms (CUDA Measured) + elapesd time: 15.6267ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 3 3 0 3 1 2 3 2 1 1 1 3 0 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 1 2 ] + elapesd time: 162.432ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 1 2 ] + elapesd time: 162.447ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 1 2 ] + elapesd time: 409.609ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 
1 2 ] + elapsed time: 53.5306ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 3 3 1 2 3 2 1 1 1 3 1 1 ... 1 2 ] + elapsed time: 53.686ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 50 40 76 42 10 83 16 71 8 69 9 59 29 ... 30 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1528.95ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 425.366ms (std::chrono Measured) + elapsed time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 426.134ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 418.522ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1516.8ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 418.378ms (CUDA Measured) + passed diff --git a/test_results/result_blocksize_8.txt b/test_results/result_blocksize_8.txt new file mode 100644 index 0000000..74dbe54 --- /dev/null +++ b/test_results/result_blocksize_8.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 8 + Block size for up-sweep: 8 + Block size for down-sweep: 8 + Block size for boolean mapping: 8 + Block size for scattering: 8 + Block sizes for radix sort: 8 8 8 8 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 8 14 36 6 10 19 34 1 39 40 27 32 18 ... 35 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 107.325ms (std::chrono Measured) + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593142 1643593177 ] +==== cpu scan, non-power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593071 1643593071 ] + elapsed time: 100.297ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593142 1643593177 ] + elapsed time: 379.897ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593071 1643593071 ] + elapsed time: 379.789ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593142 1643593177 ] + elapsed time: 52.1942ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593071 1643593071 ] + elapsed time: 52.1319ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 1643593142 1643593177 ] + elapsed time: 7.71994ms (CUDA Measured) + elapsed time: 15.6528ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 8 22 58 64 74 93 127 128 167 207 234 266 ... 
1643593071 1643593071 ] + elapsed time: 7.41206ms (CUDA Measured) + elapsed time: 21.3408ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 3 1 2 2 0 2 1 2 0 2 0 2 3 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 2 2 ] + elapsed time: 165.426ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 1 2 ] + elapsed time: 165.433ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 2 2 ] + elapsed time: 411.595ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 2 2 ] + elapsed time: 86.2233ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 1 2 2 2 1 2 2 2 3 2 2 2 ... 1 2 ] + elapsed time: 86.2171ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 81 67 63 60 8 38 71 60 66 22 92 90 68 ... 83 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1485.43ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 400.919ms (std::chrono Measured) + elapsed time: 0.001088ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 418.473ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 818.39ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1515.04ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 818.379ms (CUDA Measured) + passed diff --git a/test_results/result_radix_max_100.txt b/test_results/result_radix_max_100.txt new file mode 100644 index 0000000..2124dec --- /dev/null +++ b/test_results/result_radix_max_100.txt @@ -0,0 +1,101 @@ +* THIS TEST RAN ON A DIFFERENT MACHINE: + Windows 7, Xeon(R) E5-1630 @ 3.70GHz 32GB, GTX 1070 8192MB (Moore 103 SigLab) + +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 19 37 1 47 44 49 35 24 1 15 49 9 28 ... 37 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 94ms (std::chrono Measured) + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783866 821783903 ] +==== cpu scan, non-power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783806 821783812 ] + elapsed time: 87ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783866 821783903 ] + elapsed time: 38.0587ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 
821783806 821783812 ] + elapsed time: 38.0423ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783866 821783903 ] + elapsed time: 15.0559ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783806 821783812 ] + elapsed time: 15.0186ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783866 821783903 ] + elapsed time: 9.03926ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 19 56 57 104 148 197 232 256 257 272 321 330 ... 821783806 821783812 ] + elapsed time: 2.79245ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 1 2 0 2 3 0 3 3 2 3 3 3 1 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 86ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 87ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 236ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 18.2723ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 2 2 3 3 3 2 3 3 3 1 3 1 ... 2 2 ] + elapsed time: 18.2446ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 +Max value: 100 + [ 76 34 12 37 5 35 59 88 6 87 3 47 20 ... 8 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 917ms (std::chrono Measured) +==== thrust::sort (which calls Thrust's radix sort), power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 20.2813ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 143.992ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 923ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 143.374ms (CUDA Measured) + passed diff --git a/test_results/result_radix_max_1000000000.txt b/test_results/result_radix_max_1000000000.txt new file mode 100644 index 0000000..0efa4b5 --- /dev/null +++ b/test_results/result_radix_max_1000000000.txt @@ -0,0 +1,101 @@ +* THIS TEST RAN ON A DIFFERENT MACHINE: + Windows 7, Xeon(R) E5-1630 @ 3.70GHz 32GB, GTX 1070 8192MB (Moore 103 SigLab) + +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 13 29 47 7 2 28 44 49 35 46 2 49 9 ... 37 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 93ms (std::chrono Measured) + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 
821752752 821752789 ] +==== cpu scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 87ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 38.1036ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 38.112ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 15.0276ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 15.0576ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752752 821752789 ] + elapsed time: 8.90237ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 13 42 89 96 98 126 170 219 254 300 302 351 ... 821752661 821752701 ] + elapsed time: 2.74368ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 + [ 1 1 3 1 1 3 0 2 3 2 0 1 0 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 83ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 3 2 ] + elapsed time: 84ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 237ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 2 3 ] + elapsed time: 18.2364ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 1 3 1 1 3 2 3 2 1 3 2 2 ... 3 2 ] + elapsed time: 18.2344ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 33554432 +Array size (non-power of two): 33554429 +Max value: 1000000000 + [ 21520 17257 9407 8648 31232 11282 11169 22994 15890 9350 22656 25538 29919 ... 23658 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 2173ms (std::chrono Measured) +==== thrust::sort (which calls Thrust's radix sort), power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 20.6702ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 628.72ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 32767 32767 ] + elapsed time: 2153ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
32767 32767 ] + elapsed time: 617.616ms (CUDA Measured) + passed diff --git a/test_results/result_size_12_power_of_2.txt b/test_results/result_size_12_power_of_2.txt new file mode 100644 index 0000000..715d4a7 --- /dev/null +++ b/test_results/result_size_12_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 4096 +Array size (non-power of two): 4093 + [ 33 37 5 39 24 31 22 36 22 21 32 43 4 ... 9 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 101088 101097 ] +==== cpu scan, non-power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 100984 101004 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 101088 101097 ] + elapsed time: 0.047872ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 100984 101004 ] + elapsed time: 0.046368ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 101088 101097 ] + elapsed time: 0.110784ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 100984 101004 ] + elapsed time: 0.1096ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 101088 101097 ] + elapsed time: 0.028192ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 33 70 75 114 138 169 191 227 249 270 302 345 ... 100984 101004 ] + elapsed time: 0.019264ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 4096 +Array size (non-power of two): 4093 + [ 3 1 3 1 2 1 0 2 0 1 0 1 2 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 1 1 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 2 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 1 1 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 1 1 ] + elapsed time: 0.118368ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 3 1 3 1 2 1 2 1 1 2 1 3 2 ... 2 3 ] + elapsed time: 0.117536ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 4096 +Array size (non-power of two): 4093 +Max value: 100 + [ 83 37 55 89 74 81 72 86 72 21 32 93 54 ... 9 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0.883328ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0.881664ms (CUDA Measured) + passed diff --git a/test_results/result_size_16_power_of_2.txt b/test_results/result_size_16_power_of_2.txt new file mode 100644 index 0000000..6bce6c6 --- /dev/null +++ b/test_results/result_size_16_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 65536 +Array size (non-power of two): 65533 + [ 48 48 40 36 40 29 35 9 23 32 24 24 29 ... 19 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607161 1607180 ] +==== cpu scan, non-power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607103 1607131 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607161 1607180 ] + elapsed time: 0.09216ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607103 1607131 ] + elapsed time: 0.08976ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607161 1607180 ] + elapsed time: 0.180064ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607103 1607131 ] + elapsed time: 0.179712ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607161 1607180 ] + elapsed time: 0.166432ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 48 96 136 172 212 241 276 285 308 340 364 388 ... 1607103 1607131 ] + elapsed time: 0.170848ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 65536 +Array size (non-power of two): 65533 + [ 0 2 2 0 2 3 1 1 3 0 2 2 3 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 2 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 3 2 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 2 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 
2 3 ] + elapsed time: 0.193408ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 2 2 3 1 1 3 2 2 3 3 2 2 ... 3 2 ] + elapsed time: 0.193824ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 65536 +Array size (non-power of two): 65533 +Max value: 100 + [ 48 98 90 36 90 79 85 9 23 32 74 74 79 ... 19 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1.44288ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1.44154ms (CUDA Measured) + passed diff --git a/test_results/result_size_20_power_of_2.txt b/test_results/result_size_20_power_of_2.txt new file mode 100644 index 0000000..33945cf --- /dev/null +++ b/test_results/result_size_20_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 1048576 +Array size (non-power of two): 1048573 + [ 46 20 9 39 15 9 5 45 33 48 37 45 35 ... 16 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669615 25669631 ] +==== cpu scan, non-power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669544 25669584 ] + elapsed time: 1.5056ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669615 25669631 ] + elapsed time: 1.29402ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669544 25669584 ] + elapsed time: 1.29389ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669615 25669631 ] + elapsed time: 0.818528ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669544 25669584 ] + elapsed time: 0.816128ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 25669615 25669631 ] + elapsed time: 0.30768ms (CUDA Measured) + elapsed time: 15.6263ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 46 66 75 114 129 138 143 188 221 269 306 351 ... 
25669544 25669584 ] + elapsed time: 0.304096ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 1048576 +Array size (non-power of two): 1048573 + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 1 2 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 3 2 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 1 2 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 1 2 ] + elapsed time: 0.966944ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 2 3 3 3 1 3 3 3 2 1 3 1 ... 3 2 ] + elapsed time: 0.963168ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 1048576 +Array size (non-power of two): 1048573 +Max value: 100 + [ 46 70 59 39 15 9 55 95 83 98 37 95 85 ... 66 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 22.1649ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001248ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 7.61686ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 15.6456ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 7.64022ms (CUDA Measured) + passed diff --git a/test_results/result_size_24_power_of_2.txt b/test_results/result_size_24_power_of_2.txt new file mode 100644 index 0000000..314dfdf --- /dev/null +++ b/test_results/result_size_24_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 16777216 +Array size (non-power of two): 16777213 + [ 18 9 5 44 10 43 43 30 40 12 33 14 21 ... 6 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 23.5625ms (std::chrono Measured) + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025443 411025449 ] +==== cpu scan, non-power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025387 411025398 ] + elapsed time: 41.6108ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025443 411025449 ] + elapsed time: 25.7094ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 
411025387 411025398 ] + elapsed time: 25.6795ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025443 411025449 ] + elapsed time: 11.3379ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025387 411025398 ] + elapsed time: 11.165ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025443 411025449 ] + elapsed time: 2.00646ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 18 27 32 76 86 129 172 202 242 254 287 301 ... 411025387 411025398 ] + elapsed time: 2.12765ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 16777216 +Array size (non-power of two): 16777213 + [ 1 3 3 3 2 0 1 3 2 0 3 0 3 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 3 ] + elapsed time: 39.0743ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 1 ] + elapsed time: 38.6241ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 3 ] + elapsed time: 105.781ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 3 ] + elapsed time: 13.3034ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 1 3 3 3 2 1 3 2 3 3 2 1 3 ... 1 1 ] + elapsed time: 13.3877ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 16777216 +Array size (non-power of two): 16777213 +Max value: 100 + [ 21 39 19 39 74 20 5 3 2 64 35 8 59 ... 51 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 378.025ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 109.329ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 100.279ms (std::chrono Measured) + elapsed time: 0.00112ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 105.222ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 385.305ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapsed time: 104.685ms (CUDA Measured) + passed diff --git a/test_results/result_size_26_power_of_2.txt b/test_results/result_size_26_power_of_2.txt new file mode 100644 index 0000000..e9bf9c2 --- /dev/null +++ b/test_results/result_size_26_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 16 31 42 47 35 41 12 16 18 28 46 35 27 ... 16 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 159.925ms (std::chrono Measured) + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670879 1643670895 ] +==== cpu scan, non-power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670804 1643670832 ] + elapsed time: 133.354ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670879 1643670895 ] + elapsed time: 113.61ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670804 1643670832 ] + elapsed time: 113.677ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670879 1643670895 ] + elapsed time: 44.4511ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670804 1643670832 ] + elapsed time: 44.4368ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670879 1643670895 ] + elapsed time: 7.72387ms (CUDA Measured) + elapsed time: 15.6024ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 16 47 89 136 171 212 224 240 258 286 332 367 ... 1643670804 1643670832 ] + elapsed time: 7.77056ms (CUDA Measured) + elapsed time: 15.6503ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 + [ 2 0 1 1 1 3 0 2 1 2 2 2 0 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 1 1 ] + elapsed time: 155.406ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 3 2 ] + elapsed time: 153.902ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 1 1 ] + elapsed time: 422.524ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 1 1 ] + elapsed time: 53.2659ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 1 1 1 3 2 1 2 2 2 2 2 3 ... 3 2 ] + elapsed time: 53.1911ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 67108864 +Array size (non-power of two): 67108861 +Max value: 100 + [ 35 85 73 87 36 84 69 53 83 58 58 50 88 ... 15 0 ] +==== std::sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
99 99 ] + elapsed time: 1481.8ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 434.399ms (std::chrono Measured) + elapsed time: 0.001184ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 423.933ms (std::chrono Measured) + elapsed time: 0.001248ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 419.345ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 1519.37ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ] + elapsed time: 416.354ms (CUDA Measured) + passed diff --git a/test_results/result_size_8_power_of_2.txt b/test_results/result_size_8_power_of_2.txt new file mode 100644 index 0000000..81dcd4d --- /dev/null +++ b/test_results/result_size_8_power_of_2.txt @@ -0,0 +1,106 @@ +CIS-565 HW2 CUDA Stream Compaction Test (Ruoyu Fan) + Block size for naive scan: 1024 + Block size for up-sweep: 1024 + Block size for down-sweep: 1024 + Block size for boolean mapping: 1024 + Block size for scattering: 1024 + Block sizes for radix sort: 1024 1024 1024 1024 + +**************** +** SCAN TESTS ** +**************** +Array size (power of two): 256 +Array size (non-power of two): 253 + [ 4 13 47 25 0 4 9 20 40 2 18 9 28 ... 43 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0ms (std::chrono Measured) + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 6036 6079 ] +==== cpu scan, non-power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 5984 5984 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 6036 6079 ] + elapsed time: 0.031232ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 5984 5984 ] + elapsed time: 0.030816ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 6036 6079 ] + elapsed time: 0.064288ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 5984 5984 ] + elapsed time: 0.06384ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 6036 6079 ] + elapsed time: 0.02128ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed +==== thrust scan, non-power-of-two ==== + [ 0 4 17 64 89 89 93 102 122 162 164 182 191 ... 5984 5984 ] + elapsed time: 0.013152ms (CUDA Measured) + elapsed time: 0ms (std::chrono Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** +Array size (power of two): 256 +Array size (non-power of two): 253 + [ 2 1 3 1 2 2 1 0 0 2 2 1 0 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 3 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact without scan, non-power-of-two ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 2 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== cpu compact with scan ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 3 3 ] + elapsed time: 0ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 
3 3 ] + elapsed time: 0.072192ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + [ 2 1 3 1 2 2 1 2 2 1 1 1 2 ... 2 3 ] + elapsed time: 0.071296ms (CUDA Measured) + passed + +***************************** +** RADIX SORT TESTS ** +***************************** +Array size (power of two): 256 +Array size (non-power of two): 253 +Max value: 100 + [ 54 13 47 25 50 54 9 20 40 2 18 9 28 ... 43 0 ] +==== std::sort, power-of-two ==== + [ 0 0 1 1 1 2 2 2 2 2 3 3 4 ... 98 98 ] + elapsed time: 0ms (std::chrono Measured) +==== thrust unstable sort, power-of-two ==== + [ 0 0 1 1 1 2 2 2 2 2 3 3 4 ... 98 98 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== thrust stable sort, power-of-two ==== + [ 0 0 1 1 1 2 2 2 2 2 3 3 4 ... 98 98 ] + elapsed time: 0ms (std::chrono Measured) + elapsed time: 0.001152ms (CUDA Measured) + passed +==== radix sort, power-of-two ==== + [ 0 0 1 1 1 2 2 2 2 2 3 3 4 ... 98 98 ] + elapsed time: 0.603968ms (CUDA Measured) + passed +==== std::sort, non power-of-two ==== + [ 0 1 1 1 2 2 2 2 2 3 3 4 4 ... 98 98 ] + elapsed time: 0ms (std::chrono Measured) +==== radix sort, non power-of-two ==== + [ 0 1 1 1 2 2 2 2 2 3 3 4 4 ... 98 98 ] + elapsed time: 0.605184ms (CUDA Measured) + passed