==========================================================

Document: N2221

Related:

N2223: Clarifying the C Memory Object Model: Introduction to N2219 - N2222, N2220, N2089, Section 2 of N2012, Sections 3.1 and 3.2 (Q48-59) of N2013, DR338, DR451, N1793, N1818.

1 Summary

This is a revision of N2089

N2089 was based on N2012 (Section 2), adding a concrete Technical Corrigendum proposal for discussion, revising the text, and adding concrete examples (from N2013).

1.1 Problem

Reading uninitialised values has many conceivable semantics in C, including:

  1. undefined behaviour (meaning that a conforming compiler is free to compile the program to arbitrary code, with or without a warning)

  2. making the result of any expression involving that value unpredictable

  3. giving an arbitrary and unstable value (i.e, maybe with a different value on another read)

  4. giving an arbitrary but stable value (with the same value if you read again)

Our survey (Question 2/15) shows that reading of uninitialised values is used in practice, e.g. when copying or serialising a struct which has some uninitialised members or uninitialised padding, (more rarely) when comparing against such a struct, and in debugging. Making such programs have undefined behaviour (option (a)) seems undesirable, permitting future compiler optimisations to break reasonable code. On the other hand, current mainstream compilers do optimise in ways that make (c) and (d) unsound. How much benefit they derive from that is not clear to us.

The C11 standard allows reading of uninitialised values in certain circumstances. For types that (in the particular implementation in question) do not have any trap representations, this is undefined iff "the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken)" (see 6.3.2.1p2 and DR338). This clause seems to have been added to cope with the Itanium NaT, but it's not clear that it correctly models that (see N2220, and for other implementations it seems to complicate and weaken the language for no purpose.

In cases where reading uninitialised values is allowed (for the moment again considering only types without trap representations for the implementation in question), the standard gives an unspecified value, but for several questions, detailed below, the current standard text is unclear how these behave. This makes it unclear exactly what analysis and optimisation is allowed.

1.2 Outline of Proposal

The best solution we can see, given the existing code and compilers, is to adopt a semantics that essentially gives (b).

We do so by a modest change to the C abstract machine: for any scalar type, we extend the set of values of that type with a symbolic "unspecified value" token, then we can give rules defining how that is propagated, e.g. if one adds an unspecified value and a concrete integer. The unspecified value token does not have a bit-level representation. We detail a possible technical corrigendum for all this below, after the examples.

As always, the "as if" rule applies: the fact that the abstract machine manipulates an explicit "unspecified value" token doesn't mean that implementations have to. Normal implementations may at compile-time but will not at runtime: at runtime they will typically have some arbitrary bit pattern where the abstract machine has the unspecified value token, and the looseness of the rules for operations on unspecified values licenses compiler optimisations. Our unspecified value token is (roughly) a language-level analogue of the LLVM undef.

We believe that the same machinery can be used to handle structure padding and the padding after the current member of a union, as in Section 3.3 of our revised N2013.

This semantics seems to be a reasonable and coherent choice, with several benefits:

However, there are some things it does not support:

So all this should be discussed.

Probably there should be a language-level "freeze" of some kind, to support code that knowingly manipulates potentially uninitialised values. We do not here attempt to specify that.

If we are keeping the concept of trap representations, with _Bool as a type (possibly the only one in normal implementations) that has them, then copying a partly uninitialised struct member-by-member will give rise to UB. How about copying a partially uninitialised struct as a whole? Is this an argument for making _Bool have no trap representations, instead making it UB to use non-canonical values in boolean operations or control-flow choices?

Do we need to change the Q50 answer to make control-flow choices based on unspecified values UB rather than (our previous proposal) runtime-nondeterministic? further limits what one can do with uninitialised values. LLVM optimisations hoisting expressions past control-flow need this UB (see Taming Undefined Behavior in LLVM) to be sound. Without changing our source semantics, the mapping to LLVM will need freeze() instructions to be placed around the code of every C controlling expression. It is unclear how costly this is. Changing our source semantics to have control-flow choices based on unspecified values be UB would remove the need for these freeze() instructions, but further limit what one can do with uninitialised values.

A "safe C" might also want to provide (d) as an option (e.g. with a -f(no)-concrete-unspecified-values flag), especially given that a third of our survey respondents believe they can rely on that.

2 Questions and Examples

2.1 Q48 Does reading an uninitialised object give rise to undefined behaviour?

Example trap_representation_1.c 
int main() {
  int i;
  int *p = &i;
  int j=i;   // should this have undefined behaviour?
  // note that i is read but the value is not used
}

Example trap_representation_2.c 
int main() {
  int i;
  int j=i;   // should this have undefined behaviour?
  // note that i is read but the value is not used
}

In C11 the first has defined behaviour and the second has undefined behaviour. In our proposal, both have defined behaviour, with the read of i giving the unspecified value token, and that being written to j. Note that these are minimal test cases: we're not saying that these are in themselves desirable code - in the cases that arise in practice, the creation of a (perhaps partially) uninitialised value and its use are separated. Note also that on many implementations int has no unused representation bit patterns, so on those there can be no trap representations here. In the Itanium case, the NaT bit is per-register data, not a memory-storable bit pattern, and the "address taken" clause of 6.3.2.1p2 seems to have been intended to ensure that i is allocated in memory - but an optimising compiler might easily keep a value in registers down some call hierarchy.

See N2220 for further discussion of trap representations. The Proposed TC here is written assuming Option (b) from there; if one were removing trap representations altogether, it could be simplified somewhat.

2.2 Q49 Can library calls with unspecified-value arguments be assumed to execute with an arbitrary choice of a concrete value (not necessarily giving rise to undefined behaviour)?

We start with this so that printf can be used in later examples.

Example unspecified_value_library_call_argument.c
#include <stdio.h>
int main() 
{
  unsigned char c; 
  unsigned char *p = &c;
  printf("char 0x%x\n",(unsigned int)c);
  // should this have defined behaviour?
}

ISO C11 is unclear. The DR451 CR says "library functions will exhibit undefined behavior when used on indeterminate values" but here we are more specifically looking at unspecified values. We see no benefit from making this undefined behaviour, and we are not aware that compilers assume so. It prevents (e.g.) serialising or debug printing of partially uninitialised structs, or (if padding bytes are treated the same as other uninitialised values) byte-by-byte serialising of structs containing padding. Accordingly, we suggest that library functions such as printf, when called with an unspecified value, are executed by first making an unspecified (nondeterministic) choice at call-time of a concrete value. This permits the instability of uninitialised values that we see in practice.

The DR451 CR also says "The committee agrees that this area would benefit from a new definition of something akin to a 'wobbly' value and that this should be considered in any subsequent revision of this standard. The committee also notes that padding bytes within structures are possibly a distinct form of 'wobbly' representation.". As far as we can see, our proposal subsumes the need for a distinct notion of 'wobbly' representation.

2.3 Q50 Can control-flow choices based on unspecified values be assumed to make an unspecified (arbitrary) choice (not giving rise to undefined behaviour)?

Example unspecified_value_control_flow_choice.c
#include <stdio.h>
int main() 
{
  unsigned char c; 
  unsigned char *p = &c;
  if (c == 'a') 
    printf("equal\n");
  else
    printf("nonequal\n");
  // should this have defined behaviour?
}

ISO C11 is unclear (it does not discuss this). We suggest "yes": permitting a runtime unspecified (nondeterministic) choice at any control-flow choice between specified alternatives based on an unspecified values. The only potential reason otherwise that we are aware of is (as noted by Joseph Myers) jump tables indexed by such a value, if implementations don't do a range check, but that seems likely to lead to security weaknesses. More conservatively, one could conceivably treat any switch whose controlling expression has an unspecified value as having undefined behaviour. Computed gotos (if they were allowed in the standard) on unspecified values should give undefined behaviour.

2.4 Q51 In the absence of any writes, is an unspecified value potentially unstable, i.e., can multiple usages of it give different values?

Example unspecified_value_stability.c
#include <stdio.h>
int main() {
  // assume here that int has no trap representations and 
  // that printing an unspecified value is not itself 
  // undefined behaviour
  int i;
  int *p = &i;
  // can the following print different values?
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
}

Clang sometimes prints distinct values here (this is consistent with the Clang internal documentation: ). Accordingly, we think the answer has to be "yes".

2.5 Q52 Do operations on unspecified values result in unspecified values?

Example unspecified_value_strictness_and_1.c
#include <stdio.h>
int main() {
  unsigned char c; 
  unsigned char *p=&c;
  unsigned char c2 = (c | 1);
  unsigned char c3 = (c2 & 1);
  // does c3 hold an unspecified value (not 1)?  
  printf("c=%i  c2=%i  c3=%i\n",(int)c,(int)c2,(int)c3);
}

An LLVM developer remarks that different parts of LLVM assume that undef is propagated aggressively or that it represents an unknown particular number.

In our proposal, the read of c will give the unspecified value token, the result of the binary operation |, if given at least one unspecified-value arguments, will also be the unspecified value token, which will be written to c2. Likewise, the binary & will be strict in unspecified-value-ness, and c3 will end up as the unspecified value. The printf will then make nondeterministic choices for each of these, allowing arbitrary character-valued integers to be printed by implementations.

Our unspecified value token is a per-scalar-type-object entity, not a per-bit entity (except for single-bit bitfields, where the two coincide).

(Note this would make the N1793 Fig.4 printhexdigit not useful when applied to an uninitialised structure member.)

2.6 Q54 Must unspecified values be considered daemonically for identification of other possible undefined behaviours?

Example unspecified_value_daemonic_1.c
int main() {
  int i;
  int *p = &i;
  int j = i;   
  int k = 1/j; // should this have undefined behaviour?
}

The division operation has undefined behaviour for certain concrete argument values, i.e. 0, to accommodate implementation behaviour. If there is an abstract-machine execution in which the second argument is an unspecified value, then a corresponding execution of an actual implementation might divide by zero, so in the abstract machine division should be daemonic: division by an unspecified value should be just as "bad" as division by zero. The same holds for other partial operations and library calls.

2.7 Q55 Can a structure containing an unspecified-value member can be copied as a whole?

This seems to be relied on in practice, and consistent with the "unspecified value token" semantics we have so far, so we suggest "yes" (except in the (Itanium) implementation-defined case above where all reads of uninitialised values give UB). The copy will have an unspecified value for the same member.

Consistent with this, forming a structure value should not be strict in unspecified-value-ness: in the following example, the read of the structure value from s1 and write to s2 should both be permitted, and should copy the value of i1=1. The read of the uninitialised member should not give rise to undefined behaviour (is this contrary to the last sentence of 6.3.2.1p2, or could the structure not `have been declared with the register storage class'' in any case?) . Whats2.i2` holds after the structure copy depends on the rest of the unspecified-value semantics; in our proposal, it holds the unspecified value token.

Example unspecified_value_struct_copy.c
#include <stdio.h>
typedef struct { int i1; int i2; } st;
int main() {
  st s1;
  s1.i1 = 1;
  st s2;
  s2 = s1; // should this have defined behaviour?
  printf("s2.i1=%i\n",s2.i1);
}

This and the following questions investigate whether the property of being an unspecified value is associated with arbitrary (possibly aggregate) C values, or with "leaf" (scalar-type) values, or with individual bitfields, or with individual representation bytes of values, or with individual representation bits of values.

In principle there is a similar question for unions: can a union value as a whole be an unspecified value? There might be a real semantic difference, between an unspecified value as whole and a union that contains a specific member which itself is an unspecified value. But it's unclear whether there is a test in ISO C that distinguishes the two.

2.8 Q56 Given multiple bitfields that may be in the same word, can one be a well-defined value while another is an unspecified value?

Example besson_blazy_wilke_bitfields_1u.c
#include <stdio.h>
struct f {
  unsigned int a0 : 1; unsigned int a1 : 1;
} bf ;
int main() {
  unsigned int a;
  bf.a1 = 1; 
  a = bf.a1; 
  printf("a=%u\n",a);
}

This example is from Besson, Blazy, and Wilke 2015.

For consistency with the rest of our per-leaf-value proposal, we suggest "yes".

2.9 Q57 Are the representation bytes of an unspecified value themselves also unspecified values? (not an arbitrary choice of concrete byte values)

Example unspecified_value_representation_bytes_1.c
#include <stdio.h>
int main() {
  // assume here that the implementation-defined 
  // representation of int has no trap representations
  int i;
  unsigned char c = * ((unsigned char*)(&i)); 
  // does c now hold an unspecified value?
  printf("i=0x%x  c=0x%x\n",i,(int)c);
  printf("i=0x%x  c=0x%x\n",i,(int)c);
}

The best answer to this is unclear from all points of view: ISO C11 doesn't address the question; we don't know whether existing compilers assume these are unspecified values, and we don't know whether existing code relies on them not being unspecified values.

For stylistic consistency one might take the answer to be "yes", but then (given the suggested answers above) a bytewise hash or checksum computation involving them would produce an unspecified value. In a more concrete semantics, it could produce different results in different invocations, even if the value is not mutated in the meantime.

We don't have sufficient grounds for a strong conclusion at present. We tentatively suggest "yes".

2.10 Q58 If one writes some but not all of the representation bytes of an uninitialized value, do the other representation bytes still hold unspecified values?

Example unspecified_value_representation_bytes_4.c
#include <stdio.h>
int main() {
  // assume here that the implementation-defined
  // representation of int has no trap representations
  int i;
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
  unsigned char *cp = (unsigned char*)(&i);
  *(cp+1) = 0x22;
  // does *cp now hold an unspecified value?
  printf("*cp=0x%x\n",*cp);
  printf("*cp=0x%x\n",*cp);
}

This too is unclear. One could take the first such access as "freezing" the unspecified value and its representation bytes, but we don't know whether that would be sound with respect to current compiler behaviour. The simplest choice is "yes".

2.11 Q59 If one writes some but not all of the representation bytes of an uninitialized value, does a read of the whole value still give an unspecified value?

Example unspecified_value_representation_bytes_2.c
#include <stdio.h>
int main() {
  // assume here that the implementation-defined
  // representation of int has no trap representations
  int i;
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
  * (((unsigned char*)(&i))+1) = 0x22;
  // does i now hold an unspecified value?
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
}

Again "yes" is the simplest choice, but one could argue instead that a read of the whole should give any nondeterministically chosen value consistent with the concretely written bytes.

3 Proposed Technical Corrigendum

3.1 Values

Values, unspecified values, indeterminate values, and trap representations are currently defined as follows (3.19):

3.19
1 value
precise meaning of the contents of an object when interpreted as having a specific type

3.19.2
1 indeterminate value
either an unspecified value or a trap representation

3.19.3
1 unspecified value
valid value of the relevant type where this International Standard imposes no requirements on which value is chosen in any instance
2 NOTE
An unspecified value cannot be a trap representation.

3.19.4
1 trap representation
an object representation that need not represent a value of the object type

There "unspecified value" is used to speak of a particular but unknown concrete value (one might be misled by the language of 3.19.3 into thinking that an uninitialised variable that cannot be a trap representation must hold some particular concrete value at runtime, and hence that it will be stable if read multiple times).

Instead, in our proposal there is a "symbolic" unspecified value token at each type, and the abstract machine operates over the disjoint union of the normal concrete semantic values and this token. The notions of value, indeterminate value, and trap representation change accordingly, but in a way that basically preserves the way these terms are currently used:

3.19
1 concrete value
a concrete semantic value (the meaning of the non-trap-representation contents of an object when interpreted as having a specific type)

1 value
either a concrete value or the abstract unspecified value token for a specific type

3.19.2
1 indeterminate value
either a concrete value, or the unspecified value token for a specific type, or a trap representation for a specific type

3.19.3
1 unspecified value
an abstract token, distinct from all concrete values.
2 NOTE unspecified values typically will not have any runtime representation; they are used in the C abstract machine to define what optimisations are allowed, and implementations may use them for compile-time analysis. In the C abstract machine there is a single unspecified value for each type.

3.19.4
1 trap representation
a concrete object representation that does not represent a value of the object type

3.2 Creation of unspecified values

In C11 unspecified values arise in two main ways:

  1. As the initial values of uninitialised objects (for types without trap representations):
  1. As the value of padding bytes within structs/unions

These are unchanged in our proposal, but we have to remove the following, to make reading uninitialised values for types without trap representations be defined behaviour:

3.3 Control-flow based on an unspecified value

Following the general principle that an unspecified value in the C abstract machine should permit the implementation behaviour when given an arbitrary concrete value, control-flow choices between bounded alternatives should be an unspecified choice between them, while control-flow choices to an unbounded choice of locations should give unspecified behaviour. Accordingly, we suggest the following changes to the standard text:

For switch statements, we could either follow the above or make it unspecified behaviour.

3.4 Operations on unspecified values should be daemonic

The semantics of expression operators should be "daemonic" when given an operand that evaluates to an unspecified value. That is, for any operand of an expression operator, if there is a given concrete value for which the behaviour of that operator is undefined, then if that operand is an unspecified value, then the behaviour of the operator is undefined (because an exceptional condition may occur).

For example if the value of the second operand of the / operator is zero, it gives undefined behaviour, and hence if the second operand is an unspecified value, it should likewise give undefined behaviour. Similarly if the value of any operand of a signed + operator is unspecified, the operation is undefined.

Accordingly, we suggest the following change to the text of the standard:

To give an implementation intuition: daemonicity means that implementations do not need to take care not to introduce undefined behaviours when choosing a concrete value for something that in the C abstract machine is an unspecified value.

3.5 Operations on unspecified values should be strict

We suggest that the semantics of expression operators should be made strict on unspecified values. That is, except in the cases of daemonic undefined behaviour above, if the value of any operand of an expression operator is an unspecified value, then the result is an unspecified value.

For example if the value of the first operand of the / operator is unspecified, then the result has unspecified value.

Accordingly, we suggest the following changes to the text of the standard:

3.6 Strict and daemonic expression details

Fleshing out the details of strict and daemonic treatment of unspecified values, we propose the following changes (one could just rely on the diffs from the two sections above, but that seems likely to lead to confusion - it seems worth adding the more verbose but detailed diff below too).

3.7 Calling standard library functions on unspecified values

To permit (for example) bytewise marshalling of structs that may contain unspecified value members or unspecified value padding, we have to allow invocation of standard library functions that make system calls with unspecified value arguments. The following sentence should therefore be added to clause (§7.1.4#1):

If an argument to a standard library function has an unspecified value, it is replaced by a nondeterministic concrete value in an unspecified fashion.

NOTE: This also makes standard library functions that have undefined behaviour for specific arguments daemonic, in the same way as operators.

NOTE: It might be preferable to restrict these conversions to library functions performing I/O (or more generally having to do with system calls).

NOTE: A return from main() with an unspecified value should be similar, making a nondeterministic choice of a concrete value in an unspecified fashion.

3.8 Representations of types

§6.2.6 Representations of types has to be adapted to account for the fact that values can be either concrete values (which have representations) or unspecified value tokens (which do not), and to specify that when one reads a byte from an unspecified value, one gets an unspecified value at type unsigned char.

3.9 Mixed unspecified value and concrete representation bytes

Consider either (1) a object which has not been initialised, and therefore has an unspecified value, which then has one or more (but not all) of its representation byte written to with a concrete value, or (2) an object that has been initialised but which has one or more (but not all) of its representation bytes overwritten from some other uninitialised object. In either case we have a mix of unspecified-value and concrete representation bytes. When reading from the whole object, the read value should be the unspecified value of the appropriate type. In particular if the read value is then stored back to the object, the representation bytes that were made concrete disappear.