After running this, it's not clear if these are useful. It's probably the case that we have lots and lots and lots of non-CT code here that needs fixing. I'm not familiar enough with the EC code to say.

However, there are some obvious shortcomings in code that we know to be good:

==21428== Conditional jump or move depends on uninitialised value(s)
==21428==    at 0x4F45559: ec_Curve25519_pt_mul (ecp_25519.c:122)
==21428==    by 0x4F3D75E: ECDH_Derive (ec.c:575)
==21428==    by 0x1137F5: ECDH_Derive (loader.c:1271)
==21428==    by 0x10E100: ectest_ecdh_kat (fbectest.c:143)
==21428==    by 0x10E7B5: main (fbectest.c:273)
==21428==
==21428== Conditional jump or move depends on uninitialised value(s)
==21428==    at 0x4F45435: ec_Curve25519_pt_validate (ecp_25519.c:89)
==21428==    by 0x4F3D33E: EC_ValidatePublicKey (ec.c:452)
==21428==    by 0x4F3D65D: ECDH_Derive (ec.c:555)
==21428==    by 0x1137F5: ECDH_Derive (loader.c:1271)
==21428==    by 0x10E100: ectest_ecdh_kat (fbectest.c:143)
==21428==    by 0x10E7B5: main (fbectest.c:273)

These are both calls to NSS_SecureMemcmp[Zero], which uses a constant-time process. Valgrind doesn't know that though. That suggests that we need some way to (temporarily) unpoison memory for those operations we know to be safe.

Given the number of problems being reported, it might turn out to be time-consuming to understand all of the problems before landing this. Having bugs on file to investigate each item would be what you would normally do. However, we have a replacement strategy for some of these primitives, so maybe we don't want to use that process there. @bbeurdouche probably has opinions on how much effort we want to invest in this.

If we want to run this process and we aren't able to get a good handle on the existing errors, we might need to do some processing of the valgrind output so that we can filter out known problems. A file that lists known functions with apparent non-CT code would be the way I would manage that. Each function could have an open bug associated with it so that we can track investigation and CT fixes or suppressions. The goal here would be to ensure that we don't introduce new non-CT code while we work through the mess we have on hand.

I wouldn't want to land something like this without understanding that strategy.

I've a few tweaks to the setup that I found to be helpful. In addition to the two changes I mention in the review here, calling PR_Cleanup() at the end of bltest and fbectest reduced the number of apparent memory leaks.

cmd/fbectest/fbectest.c
281–282	Probably a taste thing, but I would have used j as a loop index here like so: for (int j = 0; ...; ++j) { if (index < 0 \|\| i + j == index) { if (ectest_validate_point(&nonnist_testvecs_bad_values[j]) == SECSuccess) { printf("not okay %d - %s\n", i + j, nonnist_testvecs_bad_values[j].name); ...
tests/ct/bl.sh
1	This file needs its execute bit set (`chmod +x .../bl.sh`).
tests/ct/bl.txt
46	If this is because you are using ec.sh to provide more complete coverage, it's worth adding a comment here.
tests/ct/ct.sh
1	chmod +x here too
21
27
29
36
37
38

This revision now requires changes to proceed.Dec 23 2021, 1:59 AM

Hey, thanks for the response @mt.

Re: the tweaks suggested in the review comments, I will work those in and update this.

Re: false positives and result analysis: I have written a blog post after looking at the results and analyzing them (not in very much depth but I looked at all of them). TL;DR; is that surely there are false positives but also some concerning results. The false positives like those you mention, however do not stem from the fact that Valgrind wouldn't know that the NSS_SecureMemcmp[Zero] is constant-time it can detect that it is as it doesn't branch on the secret values. However, what Valgrind reports is the branching on the result of NSS_SecureMemcmp[Zero] which is not constant-time (but benign, as the results of some of these comparisons are necessarily public). I think a one-time effort could be put in to define these known false-positives into a Valgrind ignore list after more analysis of them, to get CI to output only true positives on the current code and then potentially a clean result if fixed.

Re: the commented out ECDSA test: That was due to Bug 1743302.

Maybe for PORT_SecureMemcmp[Zero] you can add VALGRIND_MAKE_MEM_DEFINED() calls. Same for some of the other cases that are clearly false positives. For instance, I think that it is reasonable to use that marking when a value is blinded (after applying the blinding, that is).

By the way, I don't know how much effort you are willing to invest in this, so let us know what your constraints are and we'll work with that.

In D132307#4391297, @mt wrote:

Maybe for PORT_SecureMemcmp[Zero] you can add VALGRIND_MAKE_MEM_DEFINED() calls. Same for some of the other cases that are clearly false positives. For instance, I think that it is reasonable to use that marking when a value is blinded (after applying the blinding, that is).

That would get unwieldy quite fast I think and would pollute the crypto primitive code with the Valgrind annotations. If the annotations being not only in the test code but also in the primitive code is not a problem, that that is a possibility. However, I think it is better to use the other Valgrind feature and supply an "ignore file" of calls at particular call-sites that Valgrind's memcheck should ignore. Then this file can have the ignores along with a short description why the leak is a false positive.

By the way, I don't know how much effort you are willing to invest in this, so let us know what your constraints are and we'll work with that.

As a co-author of https://eprint.iacr.org/2021/1650 I would like to see some sort of constant-timeness testing/verification in NSS, whether that ends up being using Valgrind or some other tool (there is quite a lot of the, see https://neuromancer.sk/article/26). And can spend some time on it. Perhaps some other tool would be better suited to the NSS use-case.