In this report, we share our work and experiences on Noir Research Grant Request (NRG) #2 on Private Shared States. For this grant, we implemented machine learning functionality in the context of co-snarks, where MPC and ZK are combined to jointly compute a function and obtain a proof of that computation. Specifically, we implemented logistic regression in Noir and executed it with co-noir, a co-snark tool for Noir developed by Taceo. We also set up a benchmarking suite to gain insight into both the circuit size in Noir and the execution time when running with co-noir. The benchmarks show that our protocol takes ~1.3 million gates to train a model on 30 samples of the Iris dataset over 20 epochs; with co-noir, this takes ~1.1 hours on a local machine using a three-party protocol. The library can be found here.

In this post, we give an introduction to the concepts we worked with: MPC, ZK, and logistic regression. We also present the details of the implementation, the optimizations we applied, our benchmarks, and lessons learned.

Introduction

Nowadays, machine learning (ML) is widespread because of its rich field of application to real-world problems and the vast amount of data available. However, the fact that there is a lot of data does not mean that all of it can be used to build ML models. Some of this information has restricted use according to law, company rules, or privacy rights. This is where cryptography comes into play. In the context of ML, cryptography makes it possible to train ML models on restricted information without breaking the rules set by its owners. In particular, two cryptographic tools can be used along with ML to train models with security guarantees: secure multi-party computation (MPC) and zero-knowledge proofs (ZK). The goal of this project is to train a logistic regression model in a distributed and collaborative way with two main features: (1) the distributed interaction does not reveal additional information beyond the final trained model, and (2) the training is publicly auditable, which means that any third party can confirm that the model was trained with the claimed data and in a correct way.

[Figure: big picture of the collaborative training setup]

Let's begin by explaining the concept of MPC. In MPC, a group of parties wants to evaluate a publicly known function on private inputs so that each party learns the output of the evaluation but no additional information. The difficulty is that the inputs belong to different parties. For example, suppose that each party P_i has an input x_i, so together they want to compute the evaluation y = f(x_1, ..., x_n) in such a way that no party learns anything beyond the output of the function. To accomplish this task, the parties engage in a protocol, which means that the parties participate in a communication session between them following specific rules. At the end of the session, the parties obtain the result of the evaluated function while keeping their inputs private. However, not all participants have the best intentions: some of them may try to learn information beyond the output of the function, or may try to prevent certain parties from learning the correct output. These misbehaving parties are called corrupted parties. Moreover, we allow corrupted parties to collaborate and exchange information. Theoretically, we represent this by saying that there is an adversary, similar to a mastermind, that controls the actions of the corrupted parties and can read all the messages that the corrupted parties receive.
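
To build some intuition for how this is possible, consider additive secret sharing, the classical building block behind many MPC protocols, including the ones used later in this post. To share a secret x in Z_p, a party samples random values x_1 and x_2 and sets x_3 = x - x_1 - x_2 mod p; each share goes to a different party. For example, with p = 97 and x = 10, the shares could be x_1 = 42, x_2 = 71, and x_3 = 10 - 42 - 71 mod 97 = 91. Any single share looks uniformly random, and so does any pair of shares; only the three together reconstruct x = 42 + 71 + 91 mod 97 = 10. Additions can then be computed locally on shares, while multiplications require interaction between the parties.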

You may be asking: "Fine, but what is the relationship between ML and MPC?" Well, training an ML model can be written as a function whose inputs are the training data samples and whose output is the parameters of the trained model. Also, there are situations in which multiple owners have data samples but cannot jointly use the information because of privacy regulations. For example, consider a set of hospitals that want to train an ML model to predict whether a patient has breast cancer by analyzing X-ray images. Each hospital has its own data but cannot gather all the images in one place to train a robust ML model, because sharing those images violates patient privacy. As a solution, each hospital can participate as a party in an MPC protocol to train a model on the joint information, keep its data secret from the other hospitals participating in the protocol, and obtain the parameters of the trained model.

Let's cover the second world: zero-knowledge proofs. A ZK proof allows an entity holding a piece of information, called the Prover, to prove a mathematical statement involving that information to another entity, called the Verifier, without revealing anything beyond the validity of the statement. An example in the context of ML can be framed as follows. Imagine that both the Prover and the Verifier have the parameters of a trained ML model; the Prover can prove to the Verifier the following statement: "I have a training dataset that was used to train the model you have." The Prover can prove that statement without revealing the training dataset.
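
Slightly more formally (and idealizing training as a deterministic function), the statement above relates a public value to a private witness: the public value is the trained model M, the witness is the dataset D, and the Prover shows knowledge of some D such that Train(D) = M, where Train is the publicly known training algorithm. The proof convinces the Verifier that this equation holds without revealing D.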

Now, we can mix all the ingredients (ML, MPC, and ZK proofs) in a recipe to privately train logistic regression models collaboratively, such that third parties can verify that the training was done using the correct inputs and following the proper rules. ML contributes the logistic regression training and the theory behind the training algorithms, MPC contributes the collaborative training in a private way, and ZK provides the public auditability of the training process.

[Figure: the Prover and the Verifier]

There are some security assumptions that we need to remark on. First, MPC protocols provide the following privacy guarantee: "the messages received by a party do not reveal additional information beyond the result of the function evaluation." That does not mean that MPC protocols protect the private information of the parties from whatever the output of the computation itself leaks. This is a very standard MPC security assumption. In the context of ML, it means MPC does not protect the privacy of the datasets owned by each party from the information that the trained model may leak: a sophisticated adversary can analyze the trained model and infer information about the private data of the parties involved in the protocol. Tools like differential privacy can be used to mitigate such attacks. Second, in this project we depend on the support offered by co-noir. Hence, we use the Rep3 protocol (the protocol of Araki et al. with the modifications presented in Eerikson et al.) and the protocol based on Shamir secret sharing. Given that co-noir currently supports a security model with an honest majority and semi-honest adversaries, we use the same security model. Third, during this project we assume that three parties are engaged in the protocol and that, as a starting point, at most one of them is corrupted.
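
To give a flavor of Rep3: it is a three-party scheme built on the additive sharing described earlier, where each party holds two of the three shares. If x = x_1 + x_2 + x_3 mod p, then P_1 holds (x_1, x_2), P_2 holds (x_2, x_3), and P_3 holds (x_3, x_1). A single semi-honest party sees only two shares, which reveal nothing about x, while any two parties together can reconstruct it, which matches the honest-majority setting above. The replication is also what lets the parties multiply shared values with little communication.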

To achieve the main goal in practice, we used the co-noir tool. This tool allows us to write the training algorithm in the Noir programming language, train the model using a distributed protocol, and generate a proof of the training execution in a distributed way. All of this happens without revealing any additional information about the data samples provided by each participant.
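
At a high level, a run of our setup with co-noir proceeds in the following steps (this is a sketch of the workflow; the exact commands are in the co-noir documentation):

  • The training circuit is compiled with the Noir compiler, just like any other Noir program.
  • Each data owner secret-shares its private inputs and sends one share to each computing party.
  • The parties jointly run witness generation for the circuit under MPC, so no party ever sees the plaintext inputs of the others.
  • The parties then collaboratively produce a proof over the shared witness, and the resulting proof can be verified by anyone like an ordinary Noir proof.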

In the project, we achieved the following results:

  • An implementation of a fixed-point arithmetic library used in the logistic regression algorithm (see the sketch after this list).
  • An implementation of a training algorithm for logistic regression in the Noir programming language, compatible with the co-noir tool. The training algorithm supports multi-class classification.
  • Optimizations to the Noir code that reduce the number of gates and improve the training time under co-noir.
  • A benchmark of our library on the Iris dataset and the Wine dataset. We chose these datasets because they are simple, standard starting points in ML, so they give us a baseline for more complicated datasets.
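
To illustrate the kind of code involved, here is a minimal, illustrative sketch of fixed-point arithmetic in Noir. The names (fp_mul, sigmoid_approx, SCALE), the scaling factor, and the sigmoid approximation are simplifications made up for this post, not the actual API of our library. The underlying ideas are standard: circuits cannot work with floating-point numbers, so real values are encoded as scaled integers, and non-polynomial functions like the sigmoid are replaced by circuit-friendly approximations (below, a piecewise-linear one in the style of SecureML).

```rust
// Illustrative sketch only: not the actual API of our fixed-point library.
// A real number r is encoded as the signed integer round(r * 2^16).
global SCALE: i64 = 65536; // 2^16 represents 1.0

// Multiply two fixed-point values and rescale the result.
// Assumes the intermediate product a * b fits in an i64.
fn fp_mul(a: i64, b: i64) -> i64 {
    (a * b) / SCALE
}

// Circuit-friendly piecewise-linear sigmoid in the style of SecureML:
// 0 for x <= -1/2, x + 1/2 on [-1/2, 1/2], and 1 for x >= 1/2.
fn sigmoid_approx(x: i64) -> i64 {
    let half = SCALE / 2;
    if x <= -half {
        0
    } else if x >= half {
        SCALE
    } else {
        x + half
    }
}

fn main(x: i64, y: i64) -> pub i64 {
    // Example: evaluate sigma(x * y) in fixed point.
    sigmoid_approx(fp_mul(x, y))
}
```

Every comparison, division, and rescaling in code like this translates into constraints, which is why the gate-count optimizations mentioned above matter so much in practice.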

In the end, we obtained a protocol that takes ~1.3 million gates to train a model on 30 samples of the Iris dataset over 20 epochs. Using co-noir, the training takes ~1.1 hours on a local machine with a three-party protocol. We give more details about the practical results in our benchmarking section.

This project results from a grant titled "NRG #2: Publicly Verifiable & Private Collaborative ML Model Training" funded by Aztec Labs. We want to thank Aztec Labs for their support.