[FEATURE] Investigate Rust for Hemera loop performance
Python loops are extremely slow, largely because of dynamic typing: every iteration pays for type checks and object boxing at runtime. In contrast, Rust and C are statically typed and compile to native code. Originally, I was going to use C via Cython, but I've decided to use Rust instead for its memory safety.
Using Rust for the loops in Hemera has already shown greatly improved performance.
Additional Context
As can be seen below, the `_generate_string_buffer` method is the most time-consuming method in the Hemera terminal effects module. It converts the delta frame to its color-formatted `str` representation before printing to the terminal.
Line profiling and cProfile have shown that the loop iteration takes ~75% of the total time, a significant bottleneck considering that the remaining 25% is spent on heavily penalized system calls for writing to the terminal. The ratio should be the reverse.
The line profile output shows just how challenging a task this is for Python: the loop had to iterate over 6,472,336 subpixel pairs across 197 frames within the 2.95 s total time. Several optimizations have already been made to minimize the number of memory retrievals and conditional checks, with the results listed below. Even with these numbers, Rust should still provide a significant performance boost.
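For a rough sense of scale, the profiler totals can be turned into per-frame and per-iteration costs. This is only a back-of-the-envelope sketch: it divides the whole function time by the inner-loop hit count, so it slightly overstates the cost of the loop itself.

```python
# Figures taken from the line profile in this issue.
total_seconds = 2.94979        # total time reported for _generate_string_buffer
frames = 197                   # number of calls (one per frame)
inner_iterations = 6_472_336   # hits on the inner `for x in range(w)` line

# Average cost of one call and of one inner-loop iteration.
ms_per_frame = total_seconds / frames * 1e3
ns_per_iteration = total_seconds / inner_iterations * 1e9

print(f"{ms_per_frame:.1f} ms per frame")
print(f"{ns_per_iteration:.0f} ns per inner-loop iteration")
```

Hundreds of nanoseconds per iteration for a few array reads and comparisons is the kind of overhead a compiled loop is expected to remove.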
Addendum
Current Optimizations in Processing Subpixel Pairs:
| Stage | Subpixel Pairs Processed | Reduction |
| --- | --- | --- |
| Total input subpixel pairs | 10,423,270 | 0% |
| Only pairs in changed rows | 6,472,336 | 37.9% |
| Only changed subpixel pairs | 203,758 | 96.1% |
Current Optimizations in ANSI Code Generation:
| ANSI Code | Count of Codes Issued | Count without Change Detection | Reduction |
| --- | --- | --- | --- |
| Cursor Movement | 43,202 | 6,472,336 | 99.3% |
| Foreground Color | 66,449 | 6,472,336 | 99.0% |
| Background Color | 54,511 | 6,472,336 | 99.2% |
| All ANSI Codes | 164,162 | 19,417,008 | 99.2% |
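The reduction column for the ANSI codes can be reproduced from the counts as `1 - issued / baseline`. The short sketch below does that arithmetic; the helper name `reduction_pct` is made up for illustration and is not part of Hemera.

```python
def reduction_pct(issued: int, baseline: int) -> float:
    """Percentage of codes avoided relative to emitting one code per subpixel pair."""
    return round((1 - issued / baseline) * 100, 1)


# Without change detection, every iterated subpixel pair would get each code.
baseline = 6_472_336
issued_codes = {
    "Cursor Movement": 43_202,
    "Foreground Color": 66_449,
    "Background Color": 54_511,
}

for name, issued in issued_codes.items():
    print(f"{name}: {reduction_pct(issued, baseline)}%")

# Three code types per pair, so the combined baseline is three times larger.
total_issued = sum(issued_codes.values())
print(f"All ANSI Codes: {reduction_pct(total_issued, 3 * baseline)}%")
```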
Code to optimize with Rust:
    # Iterate frame
    for y in range(h):
        row_buffer.clear()
        # Check if row has changes
        if row_sums[y] > empty_sum:
            for x in range(w):
                sum_color = sum_frame[y, x]
                # If the current pixel is printable, get the fg color and calculate the bg color
                if sum_color != empty_sum:
                    fg_color = delta_frame[y, x]
                    bg_color = sum_color - fg_color
                    # bg_color = delta_frame[1, y, x]
                    # Skip cursor movement if it's the same row/column as the last printed pixel
                    if last_subpixel_sum == empty_sum:
                        row_buffer.append(f"\033[{y + 1};{x + 1}H")
                    # Only write color change sequences when necessary (skip if same as last)
                    # Foreground color check/caching
                    if fg_color != last_ansi_fg_color:
                        row_buffer.append(ansi_fg[fg_color])
                        last_ansi_fg_color = fg_color
                    # Background color check/caching
                    if bg_color != last_ansi_bg_color:
                        row_buffer.append(ansi_bg[bg_color])
                        last_ansi_bg_color = bg_color
                    # Add the printed character
                    row_buffer.append("▀")
                # Cache the last sum
                last_subpixel_sum = sum_color
            # Add the row buffer for the changed row
            buffer.write("".join(row_buffer) + "\n")
    # Output the accumulated buffer to stdout
    self.write_to_term(buffer.getvalue())
    self.flush_to_term()
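When porting this loop, it may help to have a dependency-free reference version whose output a Rust implementation can be compared against byte for byte. The sketch below mirrors the loop above using plain lists instead of NumPy arrays. It makes some assumptions that the issue does not confirm: colors are taken to be 256-color palette indices, a sum of `0` is taken to mean "unchanged", and the `ansi_tables` helper and its escape sequences are hypothetical stand-ins for the real `self.ansi_fg` / `self.ansi_bg` tables.

```python
def ansi_tables():
    """Hypothetical 256-color lookup tables (the real tables are not shown in this issue)."""
    fg = [f"\033[38;5;{i}m" for i in range(256)]
    bg = [f"\033[48;5;{i}m" for i in range(256)]
    return fg, bg


def generate_string_buffer(delta_frame, sum_frame, ansi_fg, ansi_bg):
    """Reference version of the loop: returns the string instead of writing it.

    delta_frame[y][x] holds the fg color, sum_frame[y][x] holds fg + bg.
    """
    out = []
    last_fg = last_bg = 0
    last_sum = 0
    for y, (delta_row, sum_row) in enumerate(zip(delta_frame, sum_frame)):
        # Skip rows without any changed subpixel pair.
        if sum(sum_row) == 0:
            continue
        row = []
        for x, sum_color in enumerate(sum_row):
            if sum_color != 0:
                fg = delta_row[x]
                bg = sum_color - fg
                # Move the cursor only when the previous pair was not printed.
                if last_sum == 0:
                    row.append(f"\033[{y + 1};{x + 1}H")
                # Emit color codes only when the color actually changes.
                if fg != last_fg:
                    row.append(ansi_fg[fg])
                    last_fg = fg
                if bg != last_bg:
                    row.append(ansi_bg[bg])
                    last_bg = bg
                row.append("▀")
            last_sum = sum_color
        out.append("".join(row) + "\n")
    return "".join(out)


# Tiny 3x3 frame: two adjacent changed pairs in the middle row.
fg_table, bg_table = ansi_tables()
delta = [[0, 0, 0], [0, 5, 5], [0, 0, 0]]
sums = [[0, 0, 0], [0, 12, 12], [0, 0, 0]]
print(repr(generate_string_buffer(delta, sums, fg_table, bg_table)))
```

Returning the string rather than writing it to the terminal keeps the function pure, which makes it straightforward to compare against the output of a compiled version on the same input frames.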
Benchmarks Before Rust Optimization:
    Total time: 2.94979 s
    File: /home/charles/projects/nyx-engine/nyx/hemera_term_fx/hemera_term_fx.py
    Function: _generate_string_buffer at line 134

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       134                                               def _generate_string_buffer(self, delta_frame: np.ndarray):
       135                                                   """Convert the delta frame to its color-formatted `str` representation form before printing
       136                                                   to the terminal.
       137                                                   Args:
       138                                                       delta_frame (np.ndarray): The delta frame to process and print.
       139                                                   """
       140                                                   # Calculate sum of fg + bg
       141       197    4969184.0  25224.3      0.2          delta_frame, sum_frame = self.sum_bg(delta_frame)
       142                                                   # Calculate the sum of each row
       143       197    6658238.0  33798.2      0.2          row_sums = np.sum(sum_frame, axis=1)
       144                                                   # Start the string buffer
       145       197     349312.0   1773.2      0.0          buffer = io.StringIO()
       146                                                   # Initialize loop variables outside the loop
       147       197     355082.0   1802.4      0.0          empty_pixel = np.uint8(0)
       148       197     205343.0   1042.4      0.0          empty_sum = np.uint16(0)
       149       197      30633.0    155.5      0.0          fg_color, bg_color = empty_pixel, empty_sum
       150       197      17450.0     88.6      0.0          last_ansi_fg_color, last_ansi_bg_color = empty_pixel, empty_sum
       151       197      16754.0     85.0      0.0          sum_color, last_subpixel_sum = empty_sum, empty_sum
       152       197      59706.0    303.1      0.0          h, w = delta_frame.shape
       153       197      28575.0    145.1      0.0          row_buffer = []
       154       197      25603.0    130.0      0.0          ansi_fg = self.ansi_fg
       155       197      21032.0    106.8      0.0          ansi_bg = self.ansi_bg
       156
       157                                                   # Iterate frame
       158     21867    2102357.0     96.1      0.1          for y in range(h):
       159     21670    4101178.0    189.3      0.1              row_buffer.clear()
       160
       161                                                       # Check if row has changes
       162     21670    4708098.0    217.3      0.2              if row_sums[y] > empty_sum:
       163   6472336  556593912.0     86.0     18.9                  for x in range(w):
       164   6458880  936312570.0    145.0     31.7                      sum_color = sum_frame[y, x]
       165
       166                                                               # If the current pixel is printable, get the fg color and calculate the bg color
       167   6458880  660925447.0    102.3     22.4                      if sum_color != empty_sum:
       168    203758   35263674.0    173.1      1.2                          fg_color = delta_frame[y, x]
       169    203758   32254690.0    158.3      1.1                          bg_color = sum_color - fg_color
       170                                                                   # bg_color = delta_frame[1, y, x]
       171
       172                                                                   # Skip cursor movement if it's the same row/column as the last printed pixel
       173    203758   21897722.0    107.5      0.7                          if last_subpixel_sum == empty_sum:
       174     43202   14534454.0    336.4      0.5                              row_buffer.append(f"\033[{y + 1};{x + 1}H")
       175
       176                                                                   # Only write color change sequences when necessary (skip if same as last)
       177                                                                   # Foreground color check/caching
       178    203758   21848698.0    107.2      0.7                          if fg_color != last_ansi_fg_color:
       179     66449   10345818.0    155.7      0.4                              row_buffer.append(ansi_fg[fg_color])
       180     66449    6274707.0     94.4      0.2                              last_ansi_fg_color = fg_color
       181                                                                   # Background color check/caching
       182    203758   20488517.0    100.6      0.7                          if bg_color != last_ansi_bg_color:
       183     54511    8241801.0    151.2      0.3                              row_buffer.append(ansi_bg[bg_color])
       184     54511    5274209.0     96.8      0.2                              last_ansi_bg_color = bg_color
       185
       186                                                                   # Add the printed character
       187    203758   19928811.0     97.8      0.7                          row_buffer.append("▀")
       188
       189                                                               # Cache the last sum
       190   6458880  526746257.0     81.6     17.9                      last_subpixel_sum = sum_color
       191
       192                                                           # Add the row buffer for the changed row
       193     13456   10964123.0    814.8      0.4                  buffer.write("".join(row_buffer) + "\n")
       194
       195                                                   # Output the accumulated buffer to stdout
       196       197   37649331.0 191113.4      1.3          self.write_to_term(buffer.getvalue())
       197       197     601186.0   3051.7      0.0          self.flush_to_term()