Benchmarking

We benchmarked our implementation using two datasets: the Iris dataset and the Wine dataset. In these benchmarks, we measured the following performance metrics:

  • Accuracy with respect to a clear-text implementation using floating-point numbers.
  • Number of gates and ACIR opcodes of the Noir implementation for different numbers of training samples.
  • Training time using co-noir for different numbers of training samples.

The benchmarks were executed on a server with an AMD EPYC processor @ 2.0 GHz and 32 GB of RAM. The versions of the tools used in these benchmarks are:

  • Noir: 1.0.0-beta.2
  • Barretenberg: 0.72.1
  • coNoir: 0.5.0

Datasets

We begin by describing the Iris dataset. The Iris dataset contains 50 samples for each type of Iris flower: setosa, versicolor, and virginica, for a total of 150 samples. Each sample has four features: the length and the width of the sepal and the petal of the flower, measured in centimeters. Given these four measurements for a new flower, a logistic regression model predicts whether the flower is of the setosa, versicolor, or virginica type.

On the other hand, the Wine dataset contains a total of 178 samples, where each sample belongs to one of three types of wine grown in the same region of Italy but derived from three different cultivars. Each sample in the dataset has the following 13 features:

  1. Alcohol
  2. Malic acid
  3. Ash
  4. Alcalinity of ash
  5. Magnesium
  6. Total phenols
  7. Flavanoids
  8. Nonflavanoid phenols
  9. Proanthocyanins
  10. Color intensity
  11. Hue
  12. OD280/OD315 of diluted wines
  13. Proline
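Both datasets described above ship with scikit-learn, which makes it easy to inspect them. The following sketch (not part of the original benchmark code) loads them and confirms the sample and feature counts stated above:

```python
# Load the two benchmark datasets with scikit-learn and check their shapes.
from sklearn.datasets import load_iris, load_wine

iris = load_iris()
wine = load_wine()

print(iris.data.shape)          # (150, 4): 150 samples, 4 features
print(wine.data.shape)          # (178, 13): 178 samples, 13 features
print(list(iris.target_names))  # ['setosa', 'versicolor', 'virginica']
```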

Results

For the Iris dataset, we obtained the following ACIR opcode counts, circuit sizes, and proving times:

| Epochs | Train samples | ACIR opcodes | Circuit size | Proving time |
|-------:|--------------:|-------------:|-------------:|-------------:|
| 10     | 30            | 317,088      | 660,166      | 0m 39.295s   |
| 10     | 50            | 523,048      | 1,085,961    | 1m 15.344s   |
| 20     | 30            | 655,848      | 1,355,005    | 1m 18.012s   |
| 20     | 50            | 1,082,008    | 2,232,450    | 2m 35.643s   |
| 30     | 30            | 994,608      | 2,049,841    | 1m 24.117s   |
| 30     | 50            | 1,640,968    | 3,378,936    | 2m 19.931s   |

For the Wine dataset, we obtained the following ACIR opcode counts, circuit sizes, and proving times:

| Epochs | Train samples | ACIR opcodes | Circuit size | Proving time |
|-------:|--------------:|-------------:|-------------:|-------------:|
| 10     | 30            | 614,088      | 1,402,019    | 1m 19.345s   |
| 10     | 50            | 1,007,788    | 2,301,844    | 2m 37.078s   |
| 20     | 30            | 1,260,378    | 2,866,087    | 2m 18.731s   |
| 20     | 50            | 2,068,678    | 4,708,962    | 1m 49.081s   |
| 30     | 30            | 1,906,668    | 4,330,154    | 1m 32.140s   |

In the last table, for the Wine dataset, we did not measure the case of 30 epochs and 50 training samples because it exhausts the available RAM.
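A pattern worth noting in the Iris table is that the ACIR opcode count grows roughly linearly in the product of epochs and training samples. The sketch below (our own sanity check, not part of the benchmark suite) fits a line to the published numbers:

```python
# Check that ACIR opcodes grow ~linearly in (epochs * train samples).
# Data copied from the Iris table: (epochs, samples, ACIR opcodes).
rows = [
    (10, 30, 317_088),
    (10, 50, 523_048),
    (20, 30, 655_848),
    (20, 50, 1_082_008),
    (30, 30, 994_608),
    (30, 50, 1_640_968),
]

# Least-squares fit: opcodes ~= a * (epochs * samples) + b
xs = [e * s for e, s, _ in rows]
ys = [o for _, _, o in rows]
n = len(rows)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

for (e, s, o), x in zip(rows, xs):
    print(f"{e:>2} epochs, {s} samples: measured {o:>9,}, linear fit {a * x + b:>11,.0f}")
```

The fit predicts each measured opcode count to within a few percent, which matches the intuition that the unrolled training circuit repeats one gradient step per epoch per sample.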

For the co-noir training time, we obtained the following results on the Iris dataset:

| Epochs | Train samples | Training time [sec.] | Accuracy |
|-------:|--------------:|---------------------:|---------:|
| 10     | 30            | 2,040                | 0.80     |
| 10     | 50            | 3,545                | 0.55     |
| 20     | 30            | 4,148                | 0.85     |

We could not run the case of 20 epochs and 50 samples: witness generation takes too long, and the co-noir process is killed by a timeout.

For the Wine dataset with 10 epochs and 30 training samples, training takes 4,365 seconds.

As a reference, training in Python with scikit-learn takes about 0.006 seconds on average for both 30 and 50 samples (yes, there is not much difference between the two), on a laptop with a 20-core 13th Gen Intel® Core™ i7-13700H and 32 GB of RAM. Although it is well understood that matching the training time of a clear-text implementation is not possible (or at least very, VERY difficult), this gap shows there is still much work to do in privacy-preserving machine learning to improve the performance of such protocols. However, when these protocols run on servers, the servers' capabilities can be scaled up to speed up the training and proving process with co-noir without sacrificing or compromising the security guarantees.
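The clear-text baseline above can be reproduced with a few lines of scikit-learn. Absolute timings depend on hardware, so the ~0.006 s figure is only indicative; the sketch below shuffles the Iris dataset first (its samples are ordered by class) and times training on 30 and 50 samples:

```python
# Time scikit-learn's LogisticRegression on small Iris subsets.
import time
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)
# Shuffle so that small prefixes contain more than one class.
X, y = shuffle(X, y, random_state=0)

for n in (30, 50):
    model = LogisticRegression(max_iter=1000)
    start = time.perf_counter()
    model.fit(X[:n], y[:n])
    elapsed = time.perf_counter() - start
    print(f"{n} samples: trained in {elapsed:.4f} s")
```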

Finally, we compared our Noir implementation using fixed-point numbers with a Rust implementation using floating-point numbers of type f64. Both implementations obtain exactly the same accuracy in all the examples we ran. This means that using fixed-point numbers in the secure training does not significantly affect the result relative to a floating-point training. To reproduce these experiments, you can use the logistic regression implementation in Rust presented in this repository, along with the scripts run_single_test.sh, accuracy_evaluation/evaluate_float_model.py, and accuracy_evaluation/generate_rust_dataset.py, to compare the accuracies.
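The idea behind the fixed-point vs. floating-point comparison can be illustrated in a few lines. The sketch below is a Python analogue, not the repository's Rust code: it quantizes the features to a hypothetical fixed-point grid (the `SCALE` constant is our assumption, not the scale used in the Noir implementation) and compares the resulting accuracies:

```python
# Compare accuracy of training on float vs. fixed-point-quantized features.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

SCALE = 2 ** 16  # hypothetical fixed-point scaling factor

def to_fixed(x):
    # Round each value to the nearest representable fixed-point number.
    return np.round(x * SCALE) / SCALE

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

acc_float = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
acc_fixed = LogisticRegression(max_iter=1000).fit(to_fixed(X_tr), y_tr).score(to_fixed(X_te), y_te)

print(f"float accuracy: {acc_float:.3f}, fixed-point accuracy: {acc_fixed:.3f}")
```

With enough fractional bits the quantization error is far below the feature noise, which is consistent with the finding above that fixed-point training matches the f64 accuracy.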