Benchmarking

We benchmarked our implementation on two datasets: the Iris dataset and the Wine dataset. We measured the following performance metrics:

Accuracy relative to a cleartext implementation that uses floating-point numbers.
Number of gates and ACIR opcodes of the Noir implementation for different numbers of epochs and training samples.
Training time using co-noir for different numbers of epochs and training samples.

The benchmarks were run on a server with an AMD EPYC processor @ 2.0 GHz and 32 GB of RAM. The versions of the tools used in these benchmarks are:

Noir: 1.0.0-beta.2
Barretenberg: 0.72.1
coNoir: 0.5.0

Datasets

We begin with the Iris dataset. It contains 50 samples for each of three types of Iris flower (setosa, versicolor, and virginica), for a total of 150 samples. Each sample has four features: the length and the width of the sepal and of the petal, measured in centimeters. In this case, a logistic regression model takes the length and the width of the sepal and petal of a new flower, in centimeters, and predicts whether the flower is a setosa, a versicolor, or a virginica.
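To make the classification step concrete, here is a minimal sketch of how a trained three-class logistic regression could label one flower. It assumes a one-vs-rest strategy (one binary classifier per flower type), which the text above does not confirm, and the weights and biases are made up for illustration; they are not the output of the training benchmarked here. The names `PARAMS`, `class_scores`, and `classify` are hypothetical.

```python
import math

CLASSES = ["setosa", "versicolor", "virginica"]

# Hypothetical parameters, one (weights, bias) pair per class. They are made up
# for illustration and are NOT the result of the benchmarked training.
PARAMS = {
    "setosa":     ([0.4, 1.3, -2.0, -1.0], 0.3),
    "versicolor": ([0.2, -1.5, 0.5, -0.8], 1.0),
    "virginica":  ([-1.5, -1.2, 2.3, 2.2], -1.0),
}


def sigmoid(z: float) -> float:
    """Logistic function mapping a real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def class_scores(features):
    """Return the one-vs-rest score of each class for one flower."""
    scores = {}
    for name in CLASSES:
        weights, bias = PARAMS[name]
        z = sum(w * x for w, x in zip(weights, features)) + bias
        scores[name] = sigmoid(z)
    return scores


def classify(features):
    """Pick the class whose binary classifier gives the highest score."""
    scores = class_scores(features)
    return max(scores, key=scores.get)


# Sepal length, sepal width, petal length, petal width, all in centimeters.
sample = [5.1, 3.5, 1.4, 0.2]
print(classify(sample))  # prints "setosa" with the illustrative weights above
```

With these illustrative weights, a flower with short, narrow petals scores highest for setosa. A softmax formulation would be an equally plausible reading of the description.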

The Wine dataset, in turn, contains 178 samples, each belonging to one of three types of wine grown in the same region of Italy but derived from three different cultivars. Each sample has the following 13 features:

Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD280/OD315 of diluted wines
Proline

Results

For the Iris dataset, we obtained the following ACIR opcode counts, circuit sizes (number of gates), and proving times:

Epochs	Train samples	ACIR opcodes	Circuit size	Proving time
10	30	317,088	660,166	0m 39.295s
10	50	523,048	1,085,961	1m 15.344s
20	30	655,848	1,355,005	1m 18.012s
20	50	1,082,008	2,232,450	2m 35.643s
30	30	994,608	2,049,841	1m 24.117s
30	50	1,640,968	3,378,936	2m 19.931s

For the Wine dataset, we obtained the following ACIR opcode counts, circuit sizes (number of gates), and proving times:

Epochs	Train samples	ACIR opcodes	Circuit size	Proving time
10	30	614,088	1,402,019	1m 19.345s
10	50	1,007,788	2,301,844	2m 37.078s
20	30	1,260,378	2,866,087	2m 18.731s
20	50	2,068,678	4,708,962	1m 49.081s
30	30	1,906,668	4,330,154	1m 32.140s

The Wine table omits the case of 30 epochs and 50 training samples because that configuration exhausts the available RAM.
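The opcode counts in the two tables above can be checked for how they grow with the number of epochs. The sketch below only reprocesses the reported numbers; it is an observation drawn from the tables, not a claim made by the benchmark itself. The helper name `opcode_increments` is hypothetical.

```python
# ACIR opcode counts copied from the Iris and Wine tables,
# keyed by (epochs, train_samples).
IRIS = {
    (10, 30): 317_088, (20, 30): 655_848, (30, 30): 994_608,
    (10, 50): 523_048, (20, 50): 1_082_008, (30, 50): 1_640_968,
}
WINE = {
    (10, 30): 614_088, (20, 30): 1_260_378, (30, 30): 1_906_668,
    (10, 50): 1_007_788, (20, 50): 2_068_678,
}


def opcode_increments(table, samples):
    """Return the opcode increase between consecutive epoch settings."""
    epochs = sorted(e for (e, s) in table if s == samples)
    counts = [table[(e, samples)] for e in epochs]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


for name, table in (("Iris", IRIS), ("Wine", WINE)):
    for samples in (30, 50):
        print(name, samples, opcode_increments(table, samples))
# Iris 30 [338760, 338760]
# Iris 50 [558960, 558960]
# Wine 30 [646290, 646290]
# Wine 50 [1060890]
```

Wherever three measurements exist, each additional 10 epochs adds exactly the same number of opcodes. This suggests that the opcode count grows linearly with the number of epochs for a fixed number of training samples. The proving times do not follow the same regular pattern.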

For the co-noir training time and accuracy, we obtained the following results on the Iris dataset:

Epochs	Train samples	Training time [sec.]	Accuracy
10	30	2,040	0.80
10	50	3,545	0.55
20	30	4,148	0.85

We could not run the case of 20 epochs and 50 samples: witness generation takes too long, and the co-noir process is killed by a timeout.

For the Wine dataset with 10 epochs and 30 training samples, training takes 4,365 seconds.

As a reference, training in Python with scikit-learn on 30 or 50 samples takes about 0.006 seconds on average (there is not much difference between the two) on a laptop with 20 × 13th Gen Intel® Core™ i7-13700H and 32 GB of RAM. Although it is well understood that matching the training time of a cleartext implementation is not possible (or at least very, very difficult), this gap shows that there is a lot of work to do in privacy-preserving machine learning to improve the performance of these kinds of protocols. However, when protocols run on servers, the servers' capabilities can be increased to speed up the training and proving process with co-noir without compromising the security guarantees.
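To put the gap in numbers, the sketch below divides the reported co-noir training times for Iris by the reported scikit-learn time. The ratios are only indicative: the two measurements come from different machines, and a scikit-learn fit does not necessarily run the same number of iterations as the epochs used here. The helper name `slowdown` is hypothetical.

```python
# Reported timings from the text above, in seconds.
PLAINTEXT_SECONDS = 0.006  # scikit-learn, 30 or 50 samples (laptop)
CO_NOIR_SECONDS = {        # Iris on the server, keyed by (epochs, train_samples)
    (10, 30): 2040,
    (10, 50): 3545,
    (20, 30): 4148,
}


def slowdown(secure_seconds: float, plaintext_seconds: float) -> float:
    """Return how many times slower the secure training is."""
    return secure_seconds / plaintext_seconds


for config, seconds in CO_NOIR_SECONDS.items():
    print(config, f"{slowdown(seconds, PLAINTEXT_SECONDS):.1e}x")
# (10, 30) 3.4e+05x
# (10, 50) 5.9e+05x
# (20, 30) 6.9e+05x
```

Under these caveats, the secure training is slower by a factor in the hundreds of thousands, that is, more than five orders of magnitude.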

Finally, we compared our Noir implementation, which uses fixed-point numbers, with a Rust implementation that uses floating-point numbers of type f64. Both implementations obtained exactly the same accuracy in all the examples we ran. This indicates that using fixed-point numbers in the secure training does not significantly affect the result relative to floating-point training. To reproduce these experiments, you can use the logistic regression implementation in Rust presented in this repository. You can use the Rust implementation along with the scripts run_single_test.sh, accuracy_evaluation/evaluate_float_model.py, and accuracy_evaluation/generate_rust_dataset.py to compare both accuracies.
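A small sketch can illustrate why fixed-point arithmetic tends to track floating-point results closely. It assumes a representation with 16 fractional bits and truncating multiplication; the precision and rounding actually used by the Noir implementation are not stated here and may differ. All names (`FRACTIONAL_BITS`, `to_fixed`, `fixed_mul`, `fixed_dot`) and the sample values are hypothetical.

```python
FRACTIONAL_BITS = 16          # assumption: the precision is not stated in the text
SCALE = 1 << FRACTIONAL_BITS


def to_fixed(x: float) -> int:
    """Encode a real number as an integer scaled by 2**FRACTIONAL_BITS."""
    return round(x * SCALE)


def to_float(a: int) -> float:
    """Decode a fixed-point integer back to a float."""
    return a / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values and rescale (rounds toward -infinity)."""
    return (a * b) >> FRACTIONAL_BITS


def fixed_dot(ws, xs):
    """Dot product of two fixed-point vectors."""
    return sum(fixed_mul(w, x) for w, x in zip(ws, xs))


weights = [0.4, 1.3, -2.0, -1.0]   # hypothetical model weights
features = [5.1, 3.5, 1.4, 0.2]    # hypothetical feature vector

float_result = sum(w * x for w, x in zip(weights, features))
fixed_result = to_float(fixed_dot([to_fixed(w) for w in weights],
                                  [to_fixed(x) for x in features]))
error = abs(float_result - fixed_result)
print(f"float={float_result:.6f} fixed={fixed_result:.6f} error={error:.2e}")
```

With this precision, the dot product that feeds the sigmoid differs from the f64 value by far less than 0.001. Such a difference changes a predicted class only when a sample lies extremely close to the decision boundary, which is consistent with the identical accuracies reported above.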