Optimizations
Once we had the basic functionality for training a model done, we decided to focus on optimizations. This is dangerous territory, as we would discover, but in many cases there are quite a few things you can do to optimize your code and bring the gate count down. There are two main steps to take here: find opportunities for optimization and then design the actual optimizations themselves. We'll touch on both in this section. Lastly, we'll also share how optimizations can lead to underconstraining your circuit and the lessons we learned there.
How to find opportunities for optimizations
Somewhere, far far away in a Noir Discord thread, we noticed the mention of a "profiler" that is available in the Noir Github. Notably, the Aztec team itself uses it to profile large circuits. With a profiler, you are able to identify costly parts of your code and thus get direction on where to focus optimization efforts. We definitely wanted to give this tool a try! The only downside: it came with the warning "This profiler is not yet documented or advertised publicly as it is still slightly in flux and used only by the Aztec team." Luckily, we like a challenge.
As it turns out, using the profiler is totally doable. You can find the profiler tool here. These are the steps we took to make it work:
- Git clone the Noir repo.
- Check out the tag of the nargo version you intend to use. Make sure the installed `bb` version aligns with it.
- Set `PATH_TO_ARTIFACT` to where the `json` of your compiled circuit is located, for example `target/test.json`.
- Set `PATH_TO_OUTPUT` to where the `svg` file will be saved, for example `/output`.
- Set `PATH_TO_BB` to where `bb` is installed.
- Obtain the gates flamegraph:

```
cargo run --package noir_profiler --bin noir-profiler gates-flamegraph --artifact-path PATH_TO_ARTIFACT --output PATH_TO_OUTPUT --backend-path PATH_TO_BB
```

- Obtain the opcodes flamegraph:

```
cargo run --package noir_profiler --bin noir-profiler opcodes-flamegraph --artifact-path PATH_TO_ARTIFACT --output PATH_TO_OUTPUT
```
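As a concrete illustration, with a compiled artifact at `target/test.json` and `bb` installed through `bbup`, the gates invocation would look something like the following (the paths here are illustrative and depend on your own setup):

```
cargo run --package noir_profiler --bin noir-profiler gates-flamegraph \
    --artifact-path ./target/test.json \
    --output ./output \
    --backend-path ~/.bb/bb
```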
The one we focused on is the gates flamegraph. The first iterations looked something like this:
We highly recommend the use of this profiler, and have also mentioned this to the Aztec/Noir team. Apart from helping you steer your optimization efforts in a certain direction, it also helps you develop intuition for the cost of certain operations in circuits (in Noir). Below we will discuss some of the optimizations we implemented.
Unnecessary re-calculations
As developers, we often rely on compilers to optimize code for us, and we don't always pay attention to one "simple" multiplication more or less. However, in the world of ZKPs that can be costly. In this case it cost us 127,000 gates. Let us explain.
As discussed above, the `sigmoid` function is an important building block for the training function. Our implementation uses an approximation rather than the sigmoid function itself. In this approximation, we decide which of 5 cases we are in and, based on that, return one of the following outputs:
```rust
let outputs = [
    Quantized::new(6), // 0.0001
    (Quantized::new(1819) * x) + Quantized::new(9502), // 0.02776 and 0.145
    (Quantized::new(11141) * x) + Quantized::new(32768), // 0.17 and 0.5
    (Quantized::new(1819) * x) + Quantized::new(56031), // 0.02776 and 0.85498
    Quantized::new(65529), // 0.9999
];
```
The culprit was `(Quantized::new(1819) * x)`, which appears twice. Of course we could have spotted that directly in the code, but it was the flamegraph that made it stand out. We might have assumed this would be optimized away somehow, but that was not the case. The fix was simple: assign the value to a variable and use it in both spots. This brought the gate count down by roughly 127K gates at the time we benchmarked it.
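Applied to the snippet above, the fix looks roughly like this (a sketch using a `temp` variable, mirroring the version of the array shown later in this section):

```rust
// Compute the repeated product once and reuse it in both cases
let temp = Quantized::new(1819) * x;
let outputs = [
    Quantized::new(6),                                   // 0.0001
    temp + Quantized::new(9502),                         // 0.02776 and 0.145
    (Quantized::new(11141) * x) + Quantized::new(32768), // 0.17 and 0.5
    temp + Quantized::new(56031),                        // 0.02776 and 0.85498
    Quantized::new(65529),                               // 0.9999
];
```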
Optimizing the dot product
The flamegraph revealed which operations in our code were expensive: Quantized multiplication, and in particular the scaling down after the multiplication.
Strictly speaking, after a multiplication you need to scale down to get the right answer. But what if you don't need the "right answer" right away? This was the case for the dot product in the code for obtaining the prediction:
```rust
fn get_prediction<let M: u32>(
    weights: [Quantized; M],
    inputs: [Quantized; M],
    bias: Quantized,
) -> Quantized {
    let mut z = Quantized::zero();
    for i in 0..M {
        z += weights[i] * inputs[i];
    }
    approx_sigmoid(z + bias)
}
```
In each loop iteration a multiplication with a scale down happens, before all the resulting values are added together. On top of the expensive scaling down operation, each quantized multiplication performs 2 bitsize checks in order to prevent overflow during the multiplication (code ref). Instead of using this straightforward dot product, we customized it to scale down fewer times and adjusted the bitsize checks to fit within our larger functionality and be cheaper.
First, let's look at delayed scaling down. The opportunity in the loop body is to realize that we can perfectly well add multiple scaled-up values (the result of naively multiplying the quantized values without scaling down) and scale down just once after all additions have taken place. Since scaling down is so expensive, this seems like a good idea. Here we're not yet taking possible overflow into account; we'll get to that next. The intermediate solution looks something like this:
```rust
fn get_prediction<let M: u32>(
    weights: [Quantized; M],
    inputs: [Quantized; M],
    bias: Quantized,
) -> Quantized {
    // let z = weights dot_product inputs
    let mut z = 0;
    for i in 0..M {
        // Perform operations directly on Field elements, scale down at the end
        z += weights[i].x * inputs[i].x;
    }
    // Scale the intermediate value down due to multiplications and add bias
    approx_sigmoid(Quantized { x: scale_down(z) } + bias)
}
```
The number of scaling down operations is brought down from `M` to 1. On top of that, we're doing Field multiplications instead of Quantized multiplications, which is cheaper since we avoid the previously mentioned bitsize checks in quantized multiplication. However, we now risk overflow.
We need to bring some bitsize checks back to make sure no intermediate overflow happens. Recall that the `inputs` are restricted to 20 bits from the start, so we don't need additional bit checks on those. We just need to make sure that the `weights` are at most 105 bits and the intermediate `z` values don't exceed 125 bits. That way the multiplication on the right-hand side is at most 125 bits (20 + 105), and the addition at most 126 bits, falling exactly within the necessary bounds:
```rust
assert_bitsize::<105>(weights[i]);
assert_bitsize::<125>(Quantized { x: z });
z += weights[i].x * inputs[i].x;
```
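Putting the two pieces together, the optimized prediction looks roughly like this (a sketch that simply combines the snippets above; the exact code in the repository may differ slightly):

```rust
fn get_prediction<let M: u32>(
    weights: [Quantized; M],
    inputs: [Quantized; M],
    bias: Quantized,
) -> Quantized {
    let mut z: Field = 0;
    for i in 0..M {
        // Keep the intermediate sum within bounds: weights at most 105 bits,
        // the running sum at most 125 bits (inputs are already 20-bit checked)
        assert_bitsize::<105>(weights[i]);
        assert_bitsize::<125>(Quantized { x: z });
        // Multiply raw Field values; no per-iteration scale down
        z += weights[i].x * inputs[i].x;
    }
    // Scale down once after all additions and add the bias
    approx_sigmoid(Quantized { x: scale_down(z) } + bias)
}
```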
The same type of optimization has been applied in the `train` function (ref).
A surprising optimization
During our quest to squeeze out all the gates we possibly could, we encountered the following optimization, which we only came to understand with help from the Noir team. Before:
```rust
let mut index = 4;
if x <= cuts[0] {
    index = 0;
} else if x <= cuts[1] {
    index = 1;
} else if x <= cuts[2] {
    index = 2;
} else if x <= cuts[3] {
    index = 3;
}
outputs[index]
```
After (~3k gates cheaper at the time):
```rust
let mut res = outputs[4]; // Default to the last value in case x is above all cuts
// Determine the correct interval by checking against each cut
if x <= cuts[0] {
    res = outputs[0];
} else if x <= cuts[1] {
    res = outputs[1];
} else if x <= cuts[2] {
    res = outputs[2];
} else if x <= cuts[3] {
    res = outputs[3];
}
res
```
Note that the `outputs` array is already known beforehand in both cases:
```rust
let outputs = [
    Quantized::new(6), // 0.0001, 0.0001 / 2^-16 = 6.5536
    temp + Quantized::new(9502), // 0.02776 and 0.145, 0.02776 / 2^-16 = 1819.27936, 0.145 / 2^-16 = 9502.72
    (Quantized::new(11141) * x) + Quantized::new(32768), // 0.17 and 0.5, 0.17 / 2^-16 = 11141.12, 0.5 / 2^-16 = 32768
    temp + Quantized::new(56031), // 0.02776 and 0.85498, 0.85498 / 2^-16 = 56031.96928
    Quantized::new(65529), // 0.9999 / 2^-16 = 65529.4464
];
```
@TomAFrench of the Aztec team pointed out that because the array content is known at compile time, at runtime only the correct one has to be selected in the optimized code snippet. This is called value merging.
In the earlier version of the code, we don't know at compile time which entry of `outputs` we're going to read, so the value merging has to happen on the index we'll be reading from. This, Tom explained to us, forces the generation of a memory block and a dynamic read from the array, which is more expensive.
The difference is that in the optimized code the array access can be replaced with the actual value, whereas in the initial code it can't. A good reminder that everything known at compile time will be cheaper within circuits, and that reasoning about how to do the least amount of work at runtime will help us optimize the code.
Lesson learned: optimization or underconstraining?
One of the early optimizations we tried to introduce was for the byte decomposition of a Field. If you recall the section on multiplication for fixed point arithmetic above, we started out scaling down after multiplication using a trick to decompose the Field into bytes and then chop off 2 of them, resulting in a division by 2^16. It so happens that byte decomposition is quite expensive in circuits, and our code used a lot of multiplications and thus this scaling down trick. When studying the flamegraph, we knew: this is where we could make a big difference!
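For reference, the scaling-down trick we started out with works roughly like this (a simplified sketch; `scale_down_via_bytes` is an illustrative name, not the exact code from our repository):

```rust
// Divide a scaled-up value by 2^16 by decomposing it into big-endian bytes
// and dropping the last two bytes before recomposing.
fn scale_down_via_bytes(value: Field) -> Field {
    let bytes: [u8; 32] = value.to_be_bytes();
    let mut result: Field = 0;
    // Recompose only the first 30 bytes; leaving out the last 2 bytes
    // amounts to an integer division by 256 * 256 = 2^16
    for i in 0..30 {
        result = result * 256 + (bytes[i] as Field);
    }
    result
}
```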
After browsing through the Noir documentation, we found an interesting code snippet:
```rust
fn test_to_be_bytes() {
    let field = 2;
    let bytes: [u8; 8] = field.to_be_bytes();
    assert_eq(bytes, [0, 0, 0, 0, 0, 0, 0, 2]);
    assert_eq(Field::from_be_bytes::<8>(bytes), field);
}
```
The Field type has both a function to decompose into bytes, `to_be_bytes`, and one to create a Field from bytes, `from_be_bytes`. What if we performed the first action in an unconstrained function and used the second one to constrain the actual byte decomposition? It seemed worth a try, and since we were still learning about optimizations in Noir we were not sure of the outcome. This is the code we used (do not use it!):
```rust
unsafe {
    bytes = get_bytes(temp);
}
assert(Field::from_be_bytes::<32>(bytes) == temp);
```
To our surprise, this indeed led to a gate count improvement.
Unfortunately, this is exactly when we introduced a security issue.
Once again, @TomAFrench from the Noir team helped us understand what was going on and pointed out the issue. Basically, our 'optimization' took a shortcut compared to the byte decomposition function from the standard library. The check we bypassed ensured that the value represented by the byte array actually falls within the Field. In our implementation, a byte array encoding x + p, for the modulus p of the field, would be accepted as a decomposition of x, while the correct implementation adds an additional check to prevent that. In conclusion, our circuit became underconstrained due to our optimization efforts.
To fix the problem, we had to go back to the (constrained) version of `to_be_bytes` that we started out with. In the end, the code became obsolete because we used the improved scaling-down code that Tom proposed, which doesn't use byte decomposition at all. However, for the future we are certainly warned and understand the workings of these functions better.
This was a harsh lesson to learn, and we hope our experience helps you avoid making the same mistake.