In this report, we share our project and experiences working on Noir Research Grant Request (NRG) #2 on Private Shared States. For this grant, we implemented machine learning functionality in the context of co-snarks, where MPC and ZK are combined to perform a multiparty computation and obtain a proof of that computation. Specifically, we implemented logistic regression in Noir and executed it with co-noir, a co-snark tool for Noir developed by Taceo. We also set up a benchmarking suite to get insights both into circuit size in Noir and into execution time when running with co-noir. The benchmarks show that our circuit takes ~1.3 million gates to train a model on 30 samples of the Iris dataset for 20 epochs. Using co-noir, our implementation takes ~1.1 hours on a local machine using a protocol for three parties. We share more details about practical results in the benchmarking section. The library can be found here.
In this post, we give an introduction to the concepts we've worked with (MPC, ZK and logistic regression) and present the details of the implementation, the optimizations we applied, benchmarks and lessons learned.
Introduction
Nowadays, machine learning (ML) is widespread because of its rich field of applications to real-world problems and the vast amount of data available. However, the fact that there is a lot of information does not mean that all of it can be used to build ML models. Some of this information has restricted use according to laws, company rules and privacy rights. This is where cryptography comes into play. In the context of ML, cryptography provides guarantees that ML models can be trained using restricted information without breaking the rules or rights imposed by its owners. In particular, two cryptographic tools can be used along with ML to train models with security guarantees: secure multi-party computation (MPC) and zero-knowledge proofs (ZK). The goal of this project is to train a logistic regression model in a distributed and collaborative way with two main features: (1) the distributed interaction does not reveal additional information beyond the final trained model, and (2) the training is publicly auditable, which means that any third party can confirm that the model was trained with the claimed data and in a correct way.
Let's begin by explaining the concept of MPC. In MPC, a group of parties want to evaluate a publicly known function on private inputs so that each party learns the output of the evaluation but no additional information. The difficulty here is that the inputs belong to different parties. For example, let's suppose that each party $P_i$ has an input $x_i$, and the parties want to compute the evaluation $y = f(x_1, \dots, x_n)$ in such a way that they do not learn anything beyond the output of the function. To accomplish this task, the parties engage in a protocol, which means that they participate in a communication session between them following specific rules. At the end of the session, the parties obtain the result of the evaluated function while keeping their inputs private. However, not all participants have the best intentions: some of them may try to learn more information beyond the output of the function, or may try to prevent certain parties from learning the correct output. These malicious parties are called corrupted parties. Moreover, we allow corrupted parties to collaborate and exchange information. Theoretically, we can represent this by saying that there is an adversary, similar to a mastermind, that controls the actions of the corrupted parties and can read all the messages that the corrupted parties receive.
You may be asking: "Fine, but what is the relationship between ML and MPC?". Well, training an ML model can be written as a function whose inputs are the training data samples, and the output is the parameters of the trained model. Also, there are situations in which multiple owners have their data samples but can not jointly use the information because of privacy regulations. For example, consider a set of hospitals that want to train an ML model to predict whether a patient has breast cancer by analyzing X-ray images. Each hospital has its data but can not gather all the images in one place to train a robust ML model because sharing those images violates patient privacy. As a solution, each hospital can participate as a party in an MPC protocol to train a model with joint information, keep their information secret from the other hospitals participating in the protocol, and obtain the parameters of the trained model.
Let's cover the second world: zero-knowledge proofs. A ZK proof allows an entity holding a piece of information, called the Prover, to prove a mathematical statement involving that piece to another entity called the Verifier without revealing any additional information. An example in the context of ML can be framed as follows. Imagine that the Prover and the Verifier have the parameters of a trained ML model; the Prover can prove to the Verifier the following statement: "I have a training dataset that was used to train the model you have." The Prover can prove that statement without revealing the training dataset.
Now, we can mix all the ingredients (ML, MPC, and ZK proofs) in a recipe to privately train logistic regression models collaboratively such that other third parties can verify that the training was done using the correct inputs and following the proper rules. ML contributes to the logistic regression training and all the theories behind the training algorithms, MPC is the ingredient that contributes to the collaborative training in a private way, and ZK is the ingredient that provides public auditability of the training process.
There are some security assumptions that we need to remark on. First, MPC protocols provide the following privacy guarantee: "the messages received by a party do not reveal additional information beyond the result of the function evaluation." That does not mean that MPC protocols protect the private information of the parties from whatever is leaked by the output of the computation itself. This is a very standard MPC security assumption. In the context of ML, MPC does not protect the privacy of the datasets owned by each party from the information that the trained model may leak. Advanced adversaries can analyze the trained model and derive information about the private data of the parties involved in the protocol. Tools like differential privacy can be used to prevent this kind of attack. Second, in this project we depend on the support offered by the co-noir tool. Hence, we use the Rep3 protocol (which refers to the protocol of Araki et al. with the modifications presented in Eerikson et al.) and the protocol based on Shamir secret sharing. Given that co-noir currently supports a security model with an honest majority and semi-honest adversaries, we use the same security model. Third, during this project we assume that three parties are engaged in the protocol and that one of them is corrupted, as a starting point.
To achieve the main goal in practice, we used the co-noir tool. This tool allows us to write the training algorithm in the Noir programming language, train the model using a distributed protocol, and generate a proof of the training execution in a distributed way in parallel. All of this without revealing any additional information about the data samples provided by each participant.
In the project, we achieved the following results:
- An implementation of a fixed-point library used in the logistic regression algorithm.
- An implementation of a training algorithm for logistic regression using the Noir programming language that is compatible with the co-noir tool. This training algorithm supports multi-class classification.
- We applied optimizations to the Noir code to reduce the number of gates and optimize the training time using co-noir.
- We benchmarked our library using the Iris dataset and the Wine dataset. We chose those datasets because they are simple and a common starting point in ML, which lets us set a baseline for more complicated datasets.
The resulting circuit takes ~1.3 million gates to train a model on 30 samples of the Iris dataset for 20 epochs. Using co-noir, our implementation takes ~1.1 hours on a local machine using a protocol for three parties. We share more details about practical results in the benchmarking section.
This project results from a grant titled "NRG #2: Publicly Verifiable & Private Collaborative ML Model Training" funded by Aztec Labs. We want to thank Aztec Labs for their support.
Fixed-point arithmetic
In machine learning, it is common to use decimal values in the datasets. For example, the Iris dataset contains 150 data samples with features of flowers and the species each sample belongs to: "setosa", "versicolor" or "virginica". This is a snippet with some of the entries:
sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
7.0 | 3.2 | 4.7 | 1.4 | versicolor |
7.7 | 2.8 | 6.7 | 2.0 | virginica |
6.7 | 3.3 | 5.7 | 2.1 | virginica |
The problem here is that the standard Noir data types do not support decimal values, so one of the first tasks was to decide how to represent these data points in our system. There were three things to consider:
- We need support for decimal values.
- We need support for negative values.
- We need to achieve good performance.
After some testing, we decided to go with the strategy proposed by Catrina and Saxena, in which decimal values can be represented as fixed-point values, and the latter can be encoded as elements in a finite field. Following that idea, we store a signed and scaled fixed-point value in a single Field element:

pub struct Quantized {
    pub x: Field,
}
- Dealing with signed values: if the value is positive, it is represented as itself. If it is negative, it is represented as $p - |x|$, where $p$ is the modulus of the field we're working over.
- Bitsizes: a Field element has 254 bits, where the first half can be used for positive values and the second half for negative ones. However, we want to make use of the function `decompose`, which splits a Field up into the first 128 bits and the remaining 126 bits. So for both positive and negative values we use a maximum of 126 bits, and the two "left over" bits in the middle always stay unused.
- Prevent overflow/underflow: to maintain the above bitsizes at all times, we add checks on the bitsize of inputs before performing arithmetic. More on this later on.
- Scaling the values: all values we work with have been multiplied by $2^{16} = 65536$. This means they are scaled by $2^{16}$. For example, `Quantized { x: 65536 }` equals 1, and `Quantized { x: p - 1 }` equals $-2^{-16}$ (the smallest negative value we can represent).
Visually, you can think of a Field element with space for 254 bits like this: `|_,_,...,_,_|`. We can represent both positive and negative numbers in this type by occupying different areas within that Field. Positive numbers will only have at most 126 bits set in the first half: `|x,x,x,x,..,x, .. ,_,_,_,_|`. Negative numbers will only have at most 126 bits set in the second half: `|_,_,_,_, .. x,x,x,..,x|`. We prevent the overlap that could occur due to arithmetic by asserting the proper bitsizes on inputs before doing any addition, subtraction or multiplication. The benefit of this choice is that Field is already the standard type of Noir; all other types such as u8, u16, u32 etc. are, under the hood, Fields with additional range checks. This means that Fields are the most performant, which is exactly what we are looking for.
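To make the encoding concrete, here is a small plain-Rust sketch (not Noir) of the same convention. A toy 61-bit prime stands in for the real BN254 modulus so everything fits in an i128, and the helper names quantize and dequantize are ours, purely for illustration.

// Toy 61-bit prime standing in for the BN254 modulus, so values fit in an i128.
const P: i128 = 2_305_843_009_213_693_951; // 2^61 - 1
const SCALE: i128 = 1 << 16;

// Encode a decimal value: scale by 2^16 and map negatives to p - |value|.
fn quantize(v: f64) -> i128 {
    let scaled = (v * SCALE as f64) as i128; // truncates, like the examples below
    if scaled >= 0 { scaled } else { P + scaled }
}

// Decode back to a decimal value, interpreting "large" elements as negatives.
fn dequantize(x: i128) -> f64 {
    let signed = if x > P / 2 { x - P } else { x };
    signed as f64 / SCALE as f64
}

fn main() {
    assert_eq!(quantize(1.0), 65536);
    assert_eq!(quantize(-3.1), P - 203161);
    // Addition works directly on the encodings thanks to the modular arithmetic.
    let sum = (quantize(4.5) + quantize(-3.1)) % P;
    println!("4.5 + (-3.1) = {}", dequantize(sum)); // ~1.39999
}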
Simple arithmetic
Another benefit of this approach is that arithmetic is rather straightforward, thanks to the modular arithmetic in a field. Adding a pair of `Quantized` values works by simply adding the underlying Field values, and the same holds for subtraction. For example, let's say we have the decimal values $4.5$ and $-3.1$. First we represent them as `Quantized`:
- $4.5 \cdot 2^{16} = 294912$
- $-3.1 \cdot 2^{16} = -203161.6$. Notice that this value has some decimals left, so we truncate it to $-203161$.
In Noir we can write this as follows:
let a: Field = 294912;
let a_quantized = Quantized { x: a };
let b: Field = -203161; // This automatically becomes p - 203161
let b_quantized = Quantized { x: b };
Then, we add them as follows:
let addition_quantized = a_quantized + b_quantized;
The above operation is equal to $294912 + (p - 203161) = p + 91751$, which equals $91751$ modulo $p$.
Now, if we calculate the decimal representation of the result (i.e. converting back from `Quantized` to a decimal value), we obtain $91751 / 2^{16} \approx 1.39999$, and if we simply compute $4.5 - 3.1 = 1.4$, we see that this is indeed the expected value (up to the precision lost by truncation).
Multiplication
The most difficult arithmetic operation is multiplication, due to the scale we are using. When we naively multiply two values, the result will be one order of scale too large: if we are working with $x \cdot 2^{16}$ and $y \cdot 2^{16}$, a straightforward multiplication results in $x \cdot y \cdot 2^{32}$. A division by the original scale is needed to get the result right. This means we have to do the multiplication and then scale down by dividing by $2^{16}$. How to approach this?
Our initial idea was to convert our Field element to bytes and then scale down by simply truncating it by 2 bytes, since the scale $2^{16}$ corresponds exactly to 2 bytes. Afterwards, just convert the leftover bytes back to a Field. That would look something like this (pseudocode):
pub struct Quantized {
    pub x: Field,
}

pub fn chop_off_two_bytes(x: Field) -> Field {
    let bytes: [u8; 32] = x.to_be_bytes();
    let mut truncated = [0; 32];
    for i in 0..30 {
        truncated[i + 2] = bytes[i];
    }
    Field::from_be_bytes::<32>(truncated)
}

let res = a.x * b.x;
let scaled_down = chop_off_two_bytes(res);
But wait! What if our initial multiplication is a negative value? Then this whole approach doesn't make sense, because the bits are positioned in "reverse" order. To address this, in the case that we're working with a negative value, temporarily switch to a positive version of that value to do the scaling, and then switch back to negative. This is the idea:
let mut res = a.x * b.x;
let negative = is_negative(res);
if negative {
    res = p - res;
}
let mut scaled_down = chop_off_two_bytes(res);
if negative {
    scaled_down = p - scaled_down;
}
It turns out that there is a more performant way to do the flipping of the sign, which will be discussed later on.
To improve the division approach, we ended up incorporating a suggestion by Tom French of the Aztec team (sign flipping left out for clarity):
let res = a.x * b.x;
// Cast x to a u16 to preserve only the lowest 16 bits.
let lowest_16_bits = res as u16;
// Subtract off the lowest 16 bits so they are cleared.
let res_with_cleared_lower_bits = res - lowest_16_bits as Field;
// The lowest 16 bits are clear, `res_with_cleared_lower_bits` is divisible by `65536`,
// therefore field division is equivalent to integer division.
let final_res: Field = res_with_cleared_lower_bits / 65536;
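As a sanity check of why this works, here is a plain-Rust sketch over the integers: once the lowest 16 bits are cleared, the value is exactly divisible by 2^16, so no information is lost in the division (in the circuit this is a field division, but the reasoning is the same). The example values are made up.

fn main() {
    let a: u128 = 98_304; // 1.5 in fixed point
    let b: u128 = 21_845; // ~0.33333 in fixed point
    let res = a * b;      // product is scaled by 2^32 instead of 2^16

    let lowest_16_bits = res & 0xFFFF;  // `res as u16` in the Noir snippet
    let cleared = res - lowest_16_bits; // now exactly divisible by 65536
    let scaled_down = cleared / 65536;

    assert_eq!(cleared % 65_536, 0);
    assert_eq!(scaled_down, 32_767); // ~0.49999 * 2^16, as expected for 1.5 * 0.33333
    println!("scaled down result: {}", scaled_down);
}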
Restrictions on bitsizes
As described above, a positive value will have maximum 126 bits in the first half of the Field element, while a negative value has a maximum of 126 bits at the end of the 254 bits of a Field. To maintain this invariant, we must assert limits on the bitsizes before performing arithmetic operations:
- multiplication: both inputs can have max 63 bits. Then the result can become 63+63=126 bits.
- addition and subtraction: both inputs can have max 125 bits. Then the result can become 125 + 1 carry bit = 126 bits.
- division ($x / y$): the numerator $x$ can have 109 bits and the denominator $y$ can have 126 bits. We'll explain this next.
For division we have to keep in mind that the result must have the correct scale, just like with multiplication. However, in this case a naive division would lead to a scaled-down result. To prevent this, we scale the numerator up by $2^{16}$ before performing the division. Since the scale itself takes 17 bits, the initial numerator input must be limited to $126 - 17 = 109$ bits. The denominator can have 126 bits.
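A minimal integer sketch of that scaling rule for division, with made-up values:

fn main() {
    let scale: u128 = 1 << 16;
    let x: u128 = 3 * scale; // 3.0 in fixed point
    let y: u128 = 2 * scale; // 2.0 in fixed point

    // Naive division drops the scale entirely: (3 * 2^16) / (2 * 2^16) = 1.
    assert_eq!(x / y, 1);

    // Scaling the numerator up by 2^16 first keeps the result in fixed point: 1.5 * 2^16.
    let q = (x * scale) / y;
    assert_eq!(q, 98_304);
    println!("3.0 / 2.0 in fixed point: {}", q);
}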
In the standalone library for the fixed-point arithmetic, these checks have been added to the respective functions. For the ML library, however, we opted to use a "stripped" version of Quantized directly in the library itself, which doesn't contain any bitsize checks. Instead, the bitsize checks have been added throughout the ML functionality itself. This was done to reduce the number of bitsize checks needed by tailoring them to the actual functionality. Throughout the code, the bitsize checks can be found together with the reasoning behind them.
Using the fixed-point library
The quantized functionality can be found and used in a separate library called noir-fixed-point. Here, we present some instructions on how to use it.
First, you need to add the library as a dependency in your own Noir project:
[dependencies]
fixed_point = { tag = "v0.1.2", git = "https://github.com/hashcloak/noir-fixed-point" }
Then, you can create values and perform arithmetic as it is shown in the following example:
let a = Quantized::new(98304);  // Represents 1.5
let b = Quantized::new(147456); // Represents 2.25
let add = a + b; // (98304 + 147456) / 2^16 = 3.75
let sub = a - b; // (98304 - 147456) / 2^16 = -0.75
let mul = a * b; // (98304 * 147456) / 2^(16 + 16) = 3.375
Logistic regression
Now that we have the most important building block down, we can move on to the next challenge: implementing logistic regression in Noir.
Logistic regression is a classification algorithm in ML that estimates the probability of a sample belonging to a specific class, such as Class A or Class B. For example, in the case of medical images, logistic regression can determine whether cancer is present or not, with the classes being "Yes" and "No".
There are two important phases in machine learning: the training phase and the inference phase. In the first one, you use a dataset that links samples to output classes to calculate the parameters of a prediction model. Samples have a certain number of features, which make them unique, and a label that indicates which class they belong to. For example, for a medical image, we don't use the image directly, but certain numerical representations of aspects of the image; the labels would be "cancer" and "no cancer". The weights represent the output of your training (you "trained a model"). Then, in the second phase, the inference, you take a sample that was not in the dataset and, by doing a calculation involving the weights, predict which class it falls in.
In this project, we have focused only on the training phase, specifically by implementing logistic regression. In this algorithm, we use the following strategy: make a guess using the sigmoid function, determine how "bad" or "good" that guess was using a loss function, and iteratively update the weights to keep improving the model.
Now, let's introduce the actual logistic regression model. Let $\mathbf{x} = (x_1, \dots, x_M)$ be a sample represented as a vector of features in the dataset of breast cancer, let's encode "Cancer" as 1 and "Not cancer" as 0, and let's call this response variable $y$. We want to estimate $P(y = 1 \mid \mathbf{x})$. The estimation of this probability can be done using the following model:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x} + b),$$

where $\mathbf{w}$ and $b$ are the parameters of the model that we want to estimate, and $\sigma$ is the sigmoid function given by

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

The goal of the training process is to estimate the parameters given a training dataset. This estimation can be done using the gradient descent algorithm, presented next. Let $E$ be the number of epochs, $N$ the number of samples, $M$ the number of features, and $\alpha$ the learning rate.
Algorithm for logistic regression:
- Initialize $\mathbf{w}$ and $b$ with zeroes.
- Repeat $E$ times:
  1. For each $i$ in $\{1, \dots, N\}$, compute the prediction $\hat{y}_i = \sigma(\mathbf{w} \cdot \mathbf{x}_i + b)$.
  2. Update $\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\,\mathbf{x}_i$ and $b \leftarrow b - \frac{\alpha}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)$.
- Output $\mathbf{w}$ and $b$.
The training algorithm is implemented here in the repository.
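To make the update rule above concrete, here is a plain-Rust, floating-point sketch of the training loop (our reading of the algorithm; the Noir version operates on Quantized values and, for multi-class problems, repeats this once per class). The tiny dataset is made up for illustration.

// Logistic function used for the predictions.
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

// Batch gradient descent for a single binary classifier.
fn train(inputs: &[Vec<f64>], labels: &[f64], epochs: usize, learning_rate: f64) -> (Vec<f64>, f64) {
    let n = inputs.len() as f64;
    let m = inputs[0].len();
    let mut weights = vec![0.0; m];
    let mut bias = 0.0;

    for _ in 0..epochs {
        // Accumulate the gradient over all samples, then apply one update per epoch.
        let mut grad_w = vec![0.0; m];
        let mut grad_b = 0.0;
        for (x, &y) in inputs.iter().zip(labels.iter()) {
            let z: f64 = x.iter().zip(weights.iter()).map(|(xi, wi)| xi * wi).sum::<f64>() + bias;
            let err = sigmoid(z) - y;
            for j in 0..m {
                grad_w[j] += err * x[j];
            }
            grad_b += err;
        }
        for j in 0..m {
            weights[j] -= learning_rate / n * grad_w[j];
        }
        bias -= learning_rate / n * grad_b;
    }
    (weights, bias)
}

fn main() {
    // Tiny made-up dataset: one feature, label 1 for the larger values.
    let inputs = vec![vec![0.1], vec![0.4], vec![0.6], vec![0.9]];
    let labels = vec![0.0, 0.0, 1.0, 1.0];
    let (weights, bias) = train(&inputs, &labels, 100, 0.5);
    println!("weights = {:?}, bias = {}", weights, bias);
}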
Sigmoid function
The sigmoid function that we mentioned before is defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$. This function is an important subroutine of logistic regression and, well, it looks pretty bad to implement in a ZK language like Noir.
But not to worry! From the ML functionality in the MPC library MP-SPDZ we learned that this can be done differently, namely by approximation. This approach, implemented here in MP-SPDZ, was published by Hong et al. in this paper and works as follows: based on four cut points, the input is placed into one of five intervals, and on each interval the sigmoid is replaced by either a constant or a linear function $a \cdot x + b$ with fixed coefficients.
In our library this is implemented here.
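As a reference for how such an approximation behaves, here is a plain-Rust (floating-point) sketch of a five-piece approximation. The slopes and intercepts match the constants that appear in the Noir snippet shown later in this post; the cut points (±2.5 and ±5) are our assumption, based on the MP-SPDZ version, and are not taken from the library.

// Five-piece approximation of the sigmoid function (cut points are assumed, see above).
fn approx_sigmoid(x: f64) -> f64 {
    if x <= -5.0 {
        0.0001
    } else if x <= -2.5 {
        0.02776 * x + 0.145
    } else if x <= 2.5 {
        0.17 * x + 0.5
    } else if x <= 5.0 {
        0.02776 * x + 0.85498
    } else {
        0.9999
    }
}

fn main() {
    for x in [-6.0, -3.0, 0.0, 3.0, 6.0] {
        let exact = 1.0 / (1.0 + (-x).exp());
        println!("x = {x}: approx = {:.4}, exact = {:.4}", approx_sigmoid(x), exact);
    }
}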
Input sample bitsize restrictions
As discussed in the previous section, the underlying `Quantized` type does not come without restrictions. In particular, we are dealing with limits on the number of bits the (intermediate) values in the operations can occupy. To make sure none of the intermediate values overflow, there are assertions on the bitsize of values throughout the various functions. All of these can be found in the `ml.nr` file of the library, but here we want to point out an important limit on the input samples.
The function `train_multi_class` is the main entry point to the library and it takes the following arguments:

epochs: u64,
inputs: [[Quantized; M]; N],
labels: [[Quantized; N]; C],
learning_rate_ratio: Quantized
The following restrictions are important:
- inputs: max 20 bits
- labels: max 17 bits (they should be either 0 or 1; represented in `Quantized`, this takes max 17 bits)
- learning_rate_ratio = learning_rate * ratio: max 11 bits. This supports values up to roughly $2047 / 2^{16} \approx 0.031$ in decimal.
The bitsize limit imposed on the input samples is a choice that aligns with the datasets we're testing with for this project, as well as a tradeoff between allowing for more precision versus more iterations. If the dataset you are working with is not capped by 20 bits in quantized representation, it is possible to normalize all input samples (so they have values between 0 and 1, or -1 and 1), which will make them no more than 17 bits long.
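As an illustration of that normalization step, here is a plain-Rust sketch that min-max normalizes each feature to [0, 1] and then quantizes with scale 2^16, using a few Iris rows from the table earlier (this is a data-preparation step outside the circuit, not part of the library):

fn main() {
    let samples = vec![
        vec![5.1, 3.5, 1.4, 0.2],
        vec![7.0, 3.2, 4.7, 1.4],
        vec![7.7, 2.8, 6.7, 2.0],
    ];
    let num_features = samples[0].len();

    // Per-feature minimum and maximum over the dataset.
    let mut mins = vec![f64::INFINITY; num_features];
    let mut maxs = vec![f64::NEG_INFINITY; num_features];
    for s in &samples {
        for j in 0..num_features {
            mins[j] = mins[j].min(s[j]);
            maxs[j] = maxs[j].max(s[j]);
        }
    }

    // Normalize each feature to [0, 1] and quantize with scale 2^16;
    // the resulting values need at most 17 bits.
    let quantized: Vec<Vec<u64>> = samples
        .iter()
        .map(|s| {
            (0..num_features)
                .map(|j| (((s[j] - mins[j]) / (maxs[j] - mins[j])) * 65536.0).round() as u64)
                .collect()
        })
        .collect();
    println!("{:?}", quantized);
}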
Train your own model
To train a model with your data, you first need to import the library in the `Nargo.toml` file as follows:
[dependencies]
noir_mpc_ml = { git = "https://github.com/hashcloak/noir-mpc-ml", tag = "v0.1.2" }
Suppose that you want to train a logistic model with a dataset with 30 samples, 4 features and 3 possible classes for classification. Then, your code should look like this:
use noir_mpc_ml::ml::train_multi_class;
use noir_mpc_ml::quantized::Quantized;

fn main(
    epochs: u64,
    data: [[Quantized; 4]; 30],
    labels: [[Quantized; 30]; 3],
    learning_rate_ratio: Quantized,
    parameters: pub [([Quantized; 4], Quantized); 3],
) {
    let parameters_train = train_multi_class(epochs, data, labels, learning_rate_ratio);
    assert(parameters == parameters_train);
}
Let us explain the parameters of the `main()` function:
- The `data` parameter corresponds to the dataset. This parameter is an array of the form `[[Quantized; M]; N]`, where `M` is the number of features and `N` is the number of samples.
- The `labels` parameter is an array of the form `[[Quantized; N]; C]`, where `C` is the number of classes in the classification problem and `N` is again the number of samples. If the $i$-th sample belongs to class $c$, then `labels[c][i] == Quantized { x: 65536 }` (which is the value 1 represented in the field), and the other positions for that sample should be equal to `Quantized { x: 0 }` (see the preparation sketch below).
- The `learning_rate_ratio` parameter equals `learning_rate * ratio`, where `learning_rate` is the learning rate used during training (assumed to be a value between 0 and 1) and `ratio` is the value of `1 / N`. This product is supplied by the Prover, since we want to avoid computing it inside the circuit. In the code, the combined value `learning_rate_ratio` is asserted to have at most 11 bits.
- The `epochs` parameter is the number of epochs that will be used for the training.
- The `parameters` input corresponds to the trained parameters the Prover claims as the result; the circuit asserts that they equal the output of `train_multi_class`.
The complete implementation and instructions can be found in the GitHub repository of the ML library implemented in Noir.
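For reference, here is a plain-Rust sketch of how the Prover's inputs could be prepared outside the circuit: one-hot encoding the labels following the labels[c][i] convention described above, and precomputing learning_rate_ratio. The numbers are made up and the code is not part of the library.

fn main() {
    let num_classes = 3;
    let class_of_sample = [0usize, 2, 1, 0, 1]; // made-up class labels for N = 5 samples
    let n = class_of_sample.len();

    // labels[c][i] = 65536 if sample i belongs to class c, otherwise 0.
    let mut labels = vec![vec![0u64; n]; num_classes];
    for (i, &c) in class_of_sample.iter().enumerate() {
        labels[c][i] = 65_536;
    }

    // learning_rate_ratio = learning_rate * (1 / N), quantized with scale 2^16.
    let learning_rate = 0.1;
    let learning_rate_ratio = ((learning_rate / n as f64) * 65536.0).round() as u64;

    println!("labels = {:?}", labels);
    println!("learning_rate_ratio (quantized) = {}", learning_rate_ratio); // 1311 for N = 5
}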
Optimizations
Once we had the basic functionality for training a model done, we decided to focus on optimizations. This is dangerous territory, as we would discover, but in many cases there are quite a few things you can do to optimize your code and bring the gate count down. There are two main steps here: find opportunities for optimization and then design the actual optimizations themselves. We'll touch on both in this section. Lastly, we'll also share how optimizations can lead to underconstraining your circuit, and our lessons learned there.
How to find opportunities for optimizations
Somewhere, far far away in a Noir Discord thread, we noticed the mention of a "profiler" that is available in the Noir GitHub. Notably, the Aztec team itself uses it to profile large circuits. With a profiler, you are able to identify costly parts of your code and thus get direction on where to focus optimization efforts. We definitely wanted to give this tool a try! The only downside: it came with the warning "This profiler is not yet documented or advertised publicly as it is still slightly in flux and used only by the Aztec team." Luckily, we like a challenge.
As it turns out, using the profiler is totally doable. You can find the profiler tool here. These are the steps we took to make it work:
- Git clone the Noir repo.
- Check out the tag of the nargo version you intend to use. Make sure the installed `bb` version aligns with it.
- Set `PATH_TO_ARTIFACT` to where the `json` of your compiled circuit is located, for example `target/test.json`.
- Set `PATH_TO_OUTPUT` to where the `svg` file will be saved, for example `/output`.
- Set `PATH_TO_BB` to where `bb` is installed.
- Obtain the gates flamegraph: `cargo run --package noir_profiler --bin noir-profiler gates-flamegraph --artifact-path PATH_TO_ARTIFACT --output PATH_TO_OUTPUT --backend-path PATH_TO_BB`
- Obtain the opcodes flamegraph: `cargo run --package noir_profiler --bin noir-profiler opcodes-flamegraph --artifact-path PATH_TO_ARTIFACT --output PATH_TO_OUTPUT`
The one we focused on is the gates flamegraph. The first iterations looked something like this:
We highly recommend the use of this profiler, and have also mentioned this to the Aztec/Noir team. Apart from helping you steer your optimization efforts in a certain direction, it also helps develop intuition for the cost of certain operations in (Noir) circuits. Below we discuss some of the optimizations we implemented.
Unnecessary re-calculations
As developers, we often rely on compilers to optimize code for us, and we don't always pay attention to one "simple" multiplication more or less. However, in the world of ZKPs that can be costly; in this case it cost us about 127,000 gates. Let us explain.
As discussed above, the `sigmoid` function is an important building block of the training function. In this implementation, it performs an approximation instead of the sigmoid function itself: we decide in which of the 5 cases we find ourselves and, based on that, return one of the following outputs:
let outputs = [
    Quantized::new(6),                                   // 0.0001
    (Quantized::new(1819) * x) + Quantized::new(9502),   // 0.02776 and 0.145
    (Quantized::new(11141) * x) + Quantized::new(32768), // 0.17 and 0.5
    (Quantized::new(1819) * x) + Quantized::new(56031),  // 0.02776 and 0.85498
    Quantized::new(65529),                               // 0.9999
];
The culprit was `(Quantized::new(1819) * x)`, which appears twice. Of course we could have seen that directly in the code, but it stood out to us in the flamegraph. We might have assumed it would be optimized away somehow, but that was not the case. The fix was simple: assign the value to a variable (the `temp` in the snippet further below) and use it in both spots. This brought the gate count down by about ~127K gates at the time we were benchmarking this.
Optimizing the dot product
The flamegraph revealed the expensive operations in our code: Quantized multiplication, and in particular the scaling down after the multiplication.
Strictly speaking, after a multiplication it is necessary to scale down to get the right answer. But what if you don't need the "right answer" right away? This was the case for the dot product in the code for obtaining the prediction:
fn get_prediction<let M: u32>(
    weights: [Quantized; M],
    inputs: [Quantized; M],
    bias: Quantized,
) -> Quantized {
    let mut z = Quantized::zero();
    for i in 0..M {
        z += weights[i] * inputs[i];
    }
    approx_sigmoid(z + bias)
}
In the loop body (`z += weights[i] * inputs[i];`), each iteration performs a multiplication followed by a scale down, before all resulting values are added together. On top of the expensive scaling-down operation, each Quantized multiplication performs 2 bitsize checks to prevent overflow during the multiplication (code ref). Instead of using this straightforward dot product, we customized it to scale down fewer times and adjusted the bitsize checks to fit our larger functionality and be cheaper.
First, let's look at delaying the scale down. The opportunity in that loop is to realize that we can perfectly add multiple scaled-up values (obtained by naively multiplying the Quantized values without scaling down) and scale down just once after all additions have taken place. Since scaling down is so expensive, this seems like a good idea. Here, we're not yet taking into account possible overflow, which we'll talk about next. The intermediate solution looks something like this:
fn get_prediction<let M: u32>(
    weights: [Quantized; M],
    inputs: [Quantized; M],
    bias: Quantized,
) -> Quantized {
    // let z = weights dot_product inputs
    let mut z = 0;
    for i in 0..M {
        // Perform operations directly on Field elements, scale down at the end
        z += weights[i].x * inputs[i].x;
    }
    // Scale the intermediate value down due to multiplications and add bias
    approx_sigmoid(Quantized { x: scale_down(z) } + bias)
}
The number of scale-down operations is brought down from `M` to 1. In addition, we're doing Field multiplications instead of Quantized multiplications, which is cheaper since we avoid the previously mentioned bitsize checks inside the Quantized multiplication. However, we risk overflow.
We need to bring some bitsize checks back to make sure no intermediate overflow happens. Recall that `inputs` are restricted to 20 bits from the start, so we don't need additional bit checks on those. We just need to make sure that the `weights` have at most 105 bits and the intermediate `z` values don't go over 125 bits. In this way, each product on the right-hand side has at most $20 + 105 = 125$ bits, and the addition at most 126 bits, falling exactly within the necessary bounds:
assert_bitsize::<105>(weights[i]);
assert_bitsize::<125>(Quantized { x: z });
z += weights[i].x * inputs[i].x;
The same type of optimization has been applied in the `train` function (ref).
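To convince ourselves that the delayed scale-down gives the expected result, here is a small plain-Rust check over signed integers (made-up values; the real code works over the field with the sign encoding described earlier):

fn main() {
    let scale: i128 = 1 << 16;
    let weights: [i128; 3] = [98_304, -16_384, 131_072]; // 1.5, -0.25, 2.0 in fixed point
    let inputs: [i128; 3] = [32_768, 262_144, 81_920];   // 0.5,  4.0, 1.25 in fixed point

    // Accumulate the raw 2^32-scaled products, then scale down once at the end.
    let acc: i128 = weights.iter().zip(inputs.iter()).map(|(w, x)| w * x).sum();
    let dot = acc / scale;

    // Reference: 1.5 * 0.5 - 0.25 * 4.0 + 2.0 * 1.25 = 2.25, i.e. 147456 in fixed point.
    assert_eq!(dot, 147_456);
    println!("dot product in fixed point: {}", dot);
}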
A surprising optimization
During our quest to squeeze out all the gates we possibly could, we encountered the following optimization, which we only came to understand with help from the Noir team. Before:
let mut index = 4;
if x <= cuts[0] {
    index = 0;
} else if x <= cuts[1] {
    index = 1;
} else if x <= cuts[2] {
    index = 2;
} else if x <= cuts[3] {
    index = 3;
}
outputs[index]
After (~3k gates cheaper at the time):
let mut res = outputs[4]; // Default to the last index in case x is above all cuts
// Determine the correct interval index by checking against each cut
if x <= cuts[0] {
    res = outputs[0];
} else if x <= cuts[1] {
    res = outputs[1];
} else if x <= cuts[2] {
    res = outputs[2];
} else if x <= cuts[3] {
    res = outputs[3];
}
res
Note that the `outputs` array is already known beforehand in both cases:
let outputs = [
    Quantized::new(6), // 0.0001, 0.0001 / 2^-16 = 6.5536
    temp + Quantized::new(9502), // 0.02776 and 0.145, 0.02776 / 2^-16 = 1819.27936, 0.145 / 2^-16 = 9502.72
    (Quantized::new(11141) * x) + Quantized::new(32768), // 0.17 and 0.5, 0.17 / 2^-16 = 11141.12, 0.5 / 2^-16 = 32768
    temp + Quantized::new(56031), // 0.02776 and 0.85498, 0.85498 / 2^-16 = 56031.96928
    Quantized::new(65529), // 0.9999 / 2^-16 = 65529.4464
];
@TomAFrench of the Aztec team pointed out that because the array contents are known at compile time, in the optimized code snippet only the correct value has to be selected at runtime. This is called value merging.
In the earlier version of the code, at compile time we don't know which entry of `outputs` we're going to read, which causes the value merging to happen on the index we'll be reading from. This, Tom explained to us, forces the generation of a memory block and a dynamic read from the array, which is more expensive.
The difference is that in the optimized code we could replace the array access with the actual value, whereas in the initial code we can't. A good reminder that everything known at compile time will be cheaper within circuits, and reasoning about how to do the least amount of work at runtime will help optimize the code.
Lesson learned: optimization or underconstraining?
One of the early optimizations we tried to introduce was for the byte decomposition of a Field. If you recall the section on multiplication for fixed-point arithmetic above, we started out scaling down after multiplication using a trick that decomposes the Field into bytes and then chops off 2 of them, resulting in a division by $2^{16}$. It so happens that byte decomposition is quite expensive in circuits, and our code used a lot of multiplications and thus a lot of this scaling-down trick. When studying the flamegraph, we knew: this is where we can make a big difference!
After browsing through the Noir documentation, we found an interesting code snippet:
fn test_to_be_bytes() {
    let field = 2;
    let bytes: [u8; 8] = field.to_be_bytes();
    assert_eq(bytes, [0, 0, 0, 0, 0, 0, 0, 2]);
    assert_eq(Field::from_be_bytes::<8>(bytes), field);
}
The Field type has both a function to decompose into bytes, `to_be_bytes`, and one to create a Field from bytes, `from_be_bytes`. What if we performed the first action in an unconstrained function and used the second one to constrain the actual byte decomposition? It seemed worth a try, and since we were still learning about Noir optimizations we were not sure of the outcome. This is the code we used (do not use it!):
unsafe {
    bytes = get_bytes(temp);
}
assert(Field::from_be_bytes::<32>(bytes) == temp);
To our surprise, this indeed led to a gate count improvement.
Unfortunately, this is exactly when we introduced a security issue.
Once again, @TomAFrench from the Noir team helped us understand and pointed out this issue. Basically, our 'optimization' took a shortcut compared to the byte decomposition function from the standard library. The check we bypassed ensures that the value represented by the byte array actually falls within the field, i.e. is smaller than the modulus $p$. In our implementation, a byte array encoding $x$ and one encoding $x + p$ would both pass the assertion, since they are equal modulo $p$, while the correct implementation adds a check to prevent exactly that. In conclusion, our circuit became underconstrained due to our optimization efforts.
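Here is a toy plain-Rust illustration of the issue, with a small prime standing in for the BN254 modulus: two different byte strings reduce to the same field element, so the recompose-and-compare assertion alone does not pin down the witness.

fn main() {
    let p: u64 = 65_521; // toy prime standing in for the BN254 modulus
    let x: u64 = 1_234;

    let honest = x.to_be_bytes();
    let malicious = (x + p).to_be_bytes(); // a different byte string...

    assert_ne!(honest, malicious);
    // ...yet both satisfy the "recompose and compare modulo p" check we used.
    assert_eq!(u64::from_be_bytes(honest) % p, u64::from_be_bytes(malicious) % p);
    println!("both byte strings pass the modular check");
}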
To fix the problem, we had to go back to the (constrained) version of `to_be_bytes` that we started out with. In the end, this code became obsolete because we switched to the improved scaling-down code that Tom proposed, which doesn't use byte decomposition at all. However, for the future we are certainly warned and understand the workings of these functions much better.
This was a harsh lesson to learn, and we hope our experience helps you avoid making the same mistake.
Co-Noir
The goal of this project was to be able to execute the training of the model with co-noir, a tool for creating collaborative SNARKs built by Taceo. This is what adds the special MPC sauce to the functionality, and we were lucky enough that Taceo had been doing the heavy lifting of creating a prover in MPC. The tool was relatively new, and they added necessary features during the process to make this library practical; we share a bit about that in this section. Furthermore, we've added the steps you can take to run the ML functionality with co-noir yourself.
Dancing with co-noir
After our initial rounds of optimization, we found major gate count improvements by using unconstrained functions (functions that do not require constraints to be enforced in the ZK circuit). However, Brillig (unconstrained) code was not yet supported by co-noir at that point. We communicated this to the Taceo team, and only 10 days later they had already added enough features to their Brillig VM to support all of our optimizations that used unconstrained code :rocket:
Another challenge presented itself once the full code was supported by the co-noir tooling and we kept making the tests heavier by iterating for more epochs; up until then we had been testing with one epoch. First, after increasing to 5 epochs, the network got overloaded by the increasing size of the circuits and the network interface trying to send it all at once. When that was fixed, we got bold and tried to increase from 5 to 10 epochs. This time, the process got killed due to RAM usage. To battle the RAM spikes, the Taceo team implemented a better-suited MPC sorting algorithm within two days, which paved the way to running the training algorithm for 10 and even 20 epochs.
CMUX optimization
During our communications with the Taceo team, they gave us another insight to optimize our code, namely by using conditional-multiplexer (CMUX) friendly code structures. An operation of the form `x = c * (a - b) + b`, with the selector `c` being 0 or 1, is a bit more friendly to ZK/MPC operations, and in typical code it replaces something like `x = if c { a } else { b }`.
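As a quick sanity check of that form (our reading of the general CMUX pattern), a plain-Rust sketch:

// For a selector c in {0, 1}, c * (a - b) + b equals `if c == 1 { a } else { b }`.
fn cmux(c: i64, a: i64, b: i64) -> i64 {
    c * (a - b) + b
}

fn main() {
    let (a, b) = (42, -7);
    assert_eq!(cmux(1, a, b), a);
    assert_eq!(cmux(0, a, b), b);
    println!("cmux(1) = {}, cmux(0) = {}", cmux(1, a, b), cmux(0, a, b));
}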
Following this tip, we changed the code for scaling a value down by the $2^{16}$ factor. This is done after a multiplication or a dot product and requires flipping the sign if we're working with negative values, as explained above in the section about fixed-point values. This is an old snippet, in which the byte decomposition was done twice (because in ZK all branches are always executed):
// ...
let mut temp: Field = self.x * other.x;
// byte decomposition, needed for scaling down
let mut bytes: [u8; 32] = temp.to_be_bytes();
let negative = is_negative(temp);
// In case of a negative value, we flip the sign
// if that is the case, the byte decomposition has to be redone
if negative {
temp = 21888242871839275222246405745257275088548364400416034343698204186575808495616
- temp
+ 1;
bytes = temp.to_be_bytes();
}
// Scaling down
// Flip sign back
// Return result
We first changed the return value of the function `is_negative` from a boolean to a `Field`, where 0 represents false and 1 true. Whether a value is negative can be decided by checking whether it uses the lower or higher bits of the register (lower bits = positive, higher bits = negative). This was the code before:
pub fn is_negative(x: Field) -> bool {
    let (_, higher_bytes) = decompose(x);
    higher_bytes != 0
}
We present next the new version of the `is_negative` function:
// returns 1 for true, 0 for false
pub fn is_negative(x: Field) -> Field {
    let (_, higher_bytes) = decompose(x);
    if higher_bytes == 0 {
        0
    } else {
        1
    }
}
Since `is_negative` now returns 0 or 1, we can use the CMUX operation directly and optimize one byte decomposition away:
// ...
let mut temp: Field = self.x * other.x;
let negative = is_negative(temp);
// Calculate the temp value
temp = negative
* (
21888242871839275222246405745257275088548364400416034343698204186575808495616
- temp
+ 1
- temp
)
+ temp;
// Perform byte decomposition only once
let bytes: [u8; 32] = temp.to_be_bytes();
Run the `ml` functionality with co-noir
To execute the functionality from the `ml` library with co-noir, you can use the skeleton project here. This includes the necessary configuration for the three parties, their respective keys and certificates, and the generators for the BN254 curve. It also contains a script which executes the complete MPC-ZK flow:
- Split input into shares
- Witness generation in MPC
- Proving in MPC
- Creation of verification key
- Verification of proof
Additional examples of a flow like this can be found in the examples folder of co-noir itself.
Then, in `main.nr` you can add the desired functionality. Make sure to either use a local version of the library or replace the `Nargo.toml` import with the correct one. Example code:
use noir_mpc_ml::ml::train_multi_class;
use noir_mpc_ml::quantized::Quantized;

fn main(inputs: [[Quantized; 4]; 30], labels: [[Quantized; 30]; 3]) -> pub [([Quantized; 4], Quantized); 3] {
    let epochs = 10;
    // learning_rate_ratio = learning_rate * (1 / N) = 0.1 * (1 / 30), which is ~218 in Quantized form
    let learning_rate_ratio = Quantized::new(218);
    let parameters = train_multi_class(epochs, inputs, labels, learning_rate_ratio);
    parameters
}
Benchmarking
We performed a benchmarking process to test the performance of our implementation. This benchmarking was done using two datasets: the Iris dataset and the Wine dataset. In this benchmarking, we measured the following performance metrics:
- Accuracy with respect to a clear text implementation using floating-point numbers.
- Number of gates and ACIR opcodes of the Noir implementation for different numbers of samples.
- Training time using co-noir for different numbers of samples.
The benchmarks were executed on a server with an AMD EPYC Processor @ 2.0 GHz and 32 GB of RAM. The versions of the tools used in these benchmarks are:
- Noir: 1.0.0-beta.2
- Barretenberg: 0.72.1
- coNoir: 0.5.0
Datasets
We begin by describing the Iris dataset. It contains 50 samples for each of the three types of Iris flower (setosa, versicolor, and virginica), for a total of 150 samples. For each sample, the dataset contains four features: the length and the width of the sepal and the petal of each flower, measured in centimeters. In this case, a logistic regression model takes the sepal and petal length and width of a new flower (in centimeters) and tells whether this flower is of the setosa, versicolor, or virginica type.
On the other hand, the Wine dataset contains a total of 178 samples, where each sample is one of three types of wines grown in the same region of Italy but derived from three different cultivars. Each sample in the dataset has 13 features presented next:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
Results
For the Iris dataset, we obtained the following results for the number of ACIR opcodes, circuit size and proving time:
Epochs | Train samples | ACIR opcodes | Circuit size | Proving time |
---|---|---|---|---|
10 | 30 | 317,088 | 660,166 | 0m 39.295s |
10 | 50 | 523,048 | 1,085,961 | 1m 15.344s |
20 | 30 | 655,848 | 1,355,005 | 1m 18.012s |
20 | 50 | 1,082,008 | 2,232,450 | 2m 35.643s |
30 | 30 | 994,608 | 2,049,841 | 1m 24.117s |
30 | 50 | 1,640,968 | 3,378,936 | 2m 19.931s |
For the Wine dataset, we obtained the following results for the number of ACIR opcodes, circuit size and proving time:
Epochs | Train samples | ACIR opcodes | Circuit size | Proving time |
---|---|---|---|---|
10 | 30 | 614,088 | 1,402,019 | 1m 19.345s |
10 | 50 | 1,007,788 | 2,301,844 | 2m 37.078s |
20 | 30 | 1,260,378 | 2,866,087 | 2m 18.731s |
20 | 50 | 2,068,678 | 4,708,962 | 1m 49.081s |
30 | 30 | 1,906,668 | 4,330,154 | 1m 32.140s |
For the last table with the Wine dataset, we did not measure the case of 30 epochs and 50 training samples, given that it exceeds the available RAM.
In the case of the co-noir training time, we have the following results for the Iris dataset:
Epochs | Train samples | Training time [sec.] | Accuracy |
---|---|---|---|
10 | 30 | 2,040 | 0.80 |
10 | 50 | 3,545 | 0.55 |
20 | 30 | 4,148 | 0.85 |
The case of 20 epochs and 50 samples was not possible to run because the witness generation takes too long and the co-noir process gets killed due to a timeout.
For the Wine dataset with 10 epochs and 30 training samples, it takes 4,365 seconds.
As a reference, training in Python using `scikit-learn` for 30 and 50 samples takes around ~0.006 seconds on average (yes, there is not much difference between them) on a laptop with a 20-thread 13th Gen Intel® Core™ i7-13700H and 32 GB of RAM. Although it is well understood that it is not possible (or at least very, VERY difficult) to obtain a training time similar to a cleartext implementation, this shows that there is a lot of work to do in the realm of privacy-preserving machine learning to improve the performance of these kinds of protocols. However, when protocols run on servers, it is possible to increase the capabilities of the servers to speed up the training and proving process using co-noir without sacrificing or compromising the security guarantees.
Finally, we compared our Noir implementation using fixed-point numbers with a Rust implementation using floating-point numbers of type `f64`. We found that both implementations obtain exactly the same accuracy in all the examples we ran. This means that using fixed-point numbers in the secure training does not significantly affect the result with respect to a floating-point training. To reproduce these experiments, you can use the logistic regression implementation in Rust presented in this repository, along with the scripts `run_single_test.sh`, `accuracy_evaluation/evaluate_float_model.py` and `accuracy_evaluation/generate_rust_dataset.py` to compare both accuracies.
Lessons learned, discussion, and final words
Over the course of this work, we have learned a lot about Noir, how to optimize it, logistic regression and the co-noir tool. In this final section we'll share our main lessons learned and some thoughts about the delivered library.
This work wouldn't have been the same without the use of the profiler tool, which we discussed in the Optimizations section. We've mentioned this to the Noir team as well, and hope it will soon be documented and recommended to all Noir developers. Although it doesn't do the optimization work for you, it points you in the right and most impactful direction. We've experienced this to be extremely helpful and it has helped us learn a lot about how to write more optimized Noir code.
The second lesson that stands out was the interaction and collaboration with the team that is building co-noir: Taceo. Apart from figuring out how to build this library, we had to be strategic and make sure we kept in close contact with the co-noir team in order for all the functionality to be supported. If you are planning to build something that will be run with co-noir, we highly recommend asking questions in the Taceo Discord because the team is very approachable and pro-active. This dynamic helped us create a co-noir supported version and prevented us from going down rabbit holes that would not work for our goal (for example the use of recursion, which we will discuss in a bit).
Finally, during this project we experienced how the quest for optimizations can lead to underconstrained circuits. It was extremely helpful to have the Noir team take a look at our code and point out where we went wrong. This lesson is very valuable for future endeavors in Noir, as we will be much more cautious with optimizing our code.
Moving on from the lessons learned, we want to highlight a few limitations of the current system. The benchmarks above all depend on the number of samples, features, epochs and target classes. In the collection of datasets that the Iris and Wine datasets stem from, there are more datasets we could use, but the current functionality has difficulty supporting them. For example, the Digits dataset has 10 target classes and Diabetes has 27. This means the training algorithm has to be executed 10 or 27 times, which causes a huge increase in gate count and execution time and makes it impractical at this moment. Furthermore, increasing the number of epochs or training samples also has a negative impact on performance. Depending on how much they are increased, either the gate count will be very high (or the execution time with co-noir very long), or the process will simply be killed because the circuit is too big or the network communication is too much.
Also, we briefly dipped our toes into the possibilities of using recursion or folding, which seemed like opportunities for optimization. Recursion proves a proof inside another proof, while folding combines multiple instances into a single one. The immediate idea would be to leverage the fact that the `train` function is run once per class, completely independently, in order to obtain a multi-class output, and to optimize the process using one of the aforementioned techniques. From what we understand, folding will be supported by the Aztec system, but not necessarily in vanilla Noir, and as of now it is still a work in progress. We considered trying recursion, but it seems folding will be preferred over recursion (according to a Discord thread), and furthermore co-noir isn't planning to support either of them anytime soon.
There are also some alternatives that were not covered in this project. One of them is the Softmax method to train multi-class logistic models, which uses the Softmax function to estimate the probability that a sample belongs to a certain class. To implement this, we would first need to research which approximation method works in this case, similarly to how the sigmoid is computed using an approximation. It is important to remark that if Softmax can be implemented, this could be a noticeable optimization for training multi-class models, because it would replace the repeated training per class (as we did for this project) with a single training using Softmax. We also wanted to test other, bigger datasets like the Digits dataset, but this was limited by Noir and co-noir performance.
As an additional exercise, let us imagine what would happen if co-noir's performance improved by a factor of 100-1000x over the current one. Such an increase may allow training logistic regression models with more samples, more features, and more target classes. An application that could benefit from this is training ML models on image-based datasets. Training a high-quality model requires datasets with a large number of features and samples, on which our solution has a practical limitation as the circuit size gets very large. This would expand the application of our techniques to medicine (as explained in the Introduction), with models trained on X-ray, echography, and MRI images, which have many features and require training on a large number of samples. Another example of datasets with many features are applications related to credit behavior (as in the South German Credit dataset), where privacy and public auditability are also desired features. Finally, the circuit also gets larger as the number of target classes increases, which is another practical limitation; a speed-up would open the path to training models with a large number of target classes (as in the Student Performance dataset), which opens the possibility of training models on private demographic data.
We'd like to part ways by giving some ideas for future directions and, of course, thanking you for sticking around until the end! An interesting improvement could be made if a parallelizable for-loop were possible, similar to @for_range_opt in the MP-SPDZ library. This would make it possible to execute the training for each target class in parallel, giving a big performance improvement. Of course, this can be implemented in a frontend which calls the training in separate threads, but it would be a nice-to-have in Noir itself. We opened an enhancement feature request for this in Noir here. Another thing we will keep our eyes on are the other MPC protocols that will be supported by co-noir. As discussed in the introduction, currently they (and thus we) have been working with Rep3 and Shamir secret sharing with 3 parties. If the developments during this work are any indication, we're sure more features will be added to the co-noir tool soon, and those would be very interesting to try out.
Finally, this work implements a building block of machine learning, which we achieved by using an approximation of the sigmoid function. Without that approximation, we are unsure whether implementing the functionality would have been possible at all. This opens up the possibility of implementing other, potentially complex, ML building blocks with similar approximation algorithms, both in vanilla Noir and in the context of co-noir.
Once again, we'd like to thank Aztec Labs for sponsoring this work through the grant for NRG #2 and for their valuable feedback. We also want to thank the Taceo team for their development of the co-noir tool and their help with every question we had. Keep in touch with our team at HashCloak via our website and follow our updates on X/Twitter.
References
- Keller, M. (2020). MP-SPDZ: A Versatile Framework for Multi-Party Computation. Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 1575–1590. Presented at the Virtual Event, USA. doi:10.1145/3372297.3417872.
- Catrina, O., & Saxena, A. (2010). Secure Computation with Fixed-Point Numbers. In R. Sion (Ed.), Financial Cryptography and Data Security (pp. 35–50). Berlin, Heidelberg: Springer Berlin Heidelberg.
- Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York, Springer.
- Cramer, R., Damgård, I., & Nielsen, J. B. (2015). Secure multiparty computation and secret sharing. Cambridge UP.
- Hong, C., Huang, Z., Lu, W.-J., Qu, H., Ma, L., Dahl, M., & Mancuso, J. (2020). Privacy-preserving collaborative machine learning on genomic data using TensorFlow. CoRR, abs/2002.04344. Retrieved from https://arxiv.org/abs/2002.04344.
- Araki, T., Barak, A., Furukawa, J., Lichter, T., Lindell, Y., Nof, A., … Weinstein, O. (2017). Optimized Honest-Majority MPC for Malicious Adversaries — Breaking the 1 Billion-Gate Per Second Barrier. 2017 IEEE Symposium on Security and Privacy (SP), 843–862. doi:10.1109/SP.2017.15
- Eerikson, H., Keller, M., Orlandi, C., Pullonen, P., Puura, J., & Simkin, M. (2020). Use Your Brain! Arithmetic 3PC for Any Modulus with Active Security. In Y. Tauman Kalai, A. D. Smith, & D. Wichs (Eds.), 1st Conference on Information-Theoretic Cryptography (ITC 2020) (p. 5:1-5:24). doi:10.4230/LIPIcs.ITC.2020.5