One Grumpy Code Monkey: Pointers in C

One of the biggest hurdles for people learning C is learning about pointers.

Conceptually, pointers are simple beasties: the value of a pointer expression is the location (address) of another object in memory. What tends to trip people up are the following:

C (and C++) pointer declaration and expression syntax, especially with complex pointer types;
Passing pointers as arguments to functions;
The affect of the base type on pointer arithmetic;
Using multiple levels of indirection; and
The behavior of array expressions with respect to pointers.

Pointer declarations and expressions

C declaration syntax is expression-centric, rather than object-centric; another way of describing it is that C follows a "declaration mimics use" paradigm. In short, the form of a C declaration should closely match the form of an expression of that type in an executable statement. For example, if we have a pointer to an integer and we want to access that integer value, we dereference the pointer using the unary * operator, like so:


x = *p;

The type of the expression *p is int, so the declaration of the pointer p is written as


int *p;

In the declaration above, int is the type specifier and *p is the declarator. The declarator introduces the name of the item being declared, along with additional type information not provided by the type specifier. The int-ness of p is provided by the type specifier int, while the pointer-ness is provided by the * in the declarator. Note that in executable code, whitespace doesn't matter:


x = * p;

will be parsed as x = (*p);. Similarly, whitespace doesn't matter in the declaration, so any of the following are legal:


int *p;
int* p;
int * p;

In all cases, the * operator is part of the declarator, so all of the above variations parse as


int (*p);

This means that in a declaration like


int* a, b;

only a is declared as a pointer to int; b is declared as a regular int.

Similarly, if we have an array of pointers to integers, and we want to access a particular integer value, we subscript the array and dereference the result, like so:


x = *ap[i];

Again, the type of the expression *ap[i] is int, so the declaration of the array is written as


int *ap[N];

Similar to the simple pointer declaration above, the int-ness of ap is specified by the type specifier, while the array-ness and pointer-ness are specified by the declarator. Since the subscript operator has a higher precendence than the dereference operator, the expression *ap[i] is parsed as *(ap[i]). Suppose that instead of an array of pointers to integers, we had a pointer to an array of integers; to access a particular integer value, we'd write


x = (*pa)[i];

because we have to dereference the pointer to the array before applying the subscript operator. By now, it should be clear that the declaration for a pointer to an array of integers would be


int (*pa)[N];

Pointer expressions and declarations can get arbitrarily complex: you can have arrays of pointers, pointers to arrays, pointers to arrays of pointers, pointers to functions, pointers to functions returning pointers, arrays of pointers to functions returning pointers to arrays, pointers to pointers, pointers to pointers to pointers, etc. Here are the basic rules for any type T:


T *p;      // p is a pointer to T
T *a[N];   // a is an N-element array of pointers to T
T (*a)[N]; // a is a pointer to an N-element array of T
T *f();    // f is a function returning a pointer to T
T (*f)();  // f is a pointer to a function returning T

As mentioned above, the * operator is the dereference operator; given any expression p of type T * (pointer to T), the value of the
expression *p is the value of the thing being pointed to and its type is T. The inverse of the * operator is the unary &, or address-of operator. Given any lvalue expression e of type T, the value of the expression &e is the address of e and the type of the expression is T *.

Given the following declarations


int i = 1;
int *p = &i;

and the following memory map:

Item	Address	Contents
Item	Address	0x00	0x01	0x02	0x03
i	0x80C01230	0x00	0x00	0x00	0x01
p	0x80C01234	0x80	0xC0	0x12	0x30

then the following are all true:

Expression	Value	Type
i	0x00000001	int
&i	0x80C01230	int * // pointer to int
p	0x80C01230	int * // pointer to int
*p	0x00000001	int
&p	0x80C01234	int ** // pointer to pointer to int

Note the wrinkle introduced with the last expression &p; we have a pointer to pointer to int (again, the type of an expression &e is T *, and in this case T is int *). It's possible (in some cases unavoidable) to have multiple levels of indirection with pointers.

Pointers and function arguments

One of the primary reasons we need to use pointers in C is to allow functions to modify their arguments. C passes all function arguments by value; the formal argument is a distinct object in memory from the actual argument and receives a copy of the value in the actual argument. For example, given the code


void swap(int x, int y)
{
  int tmp = x;
  x = y;
  y = tmp;
}

int main(void)
{
  int a = 5, b = 10; 
  printf("before: a = %d, b = %d\n", a, b);
  swap(a, b);
  printf("after: a = %d, b = %d\n", a, b);
  return 0;
}

the formal parameters x and y are distinct objects in memory from a and b, so the changes to x and y in the swap function are not reflected in a and b. Again, a hypothetical memory map may help:

At the beginning of swap
Item	Address	Contents
Item	Address	0x00	0x01	0x02	0x03
a	0x80C01230	0x00	0x00	0x00	0x05
b	0x80C01234	0x00	0x00	0x00	0x0A
x	0x8800C000	0x00	0x00	0x00	0x05
y	0x8800C004	0x00	0x00	0x00	0x0A

At the end of swap
Item	Address	Contents
Item	Address	0x00	0x01	0x02	0x03
a	0x80C01230	0x00	0x00	0x00	0x05
b	0x80C01234	0x00	0x00	0x00	0x0A
x	0x8800C000	0x00	0x00	0x00	0x0A
y	0x8800C004	0x00	0x00	0x00	0x05

Modifying the contents of address 0x8800C000, or x, has no effect on the contents of 0x80C01230, or a.

Since we want the swap function to modify the values of a and b in main, we must pass pointers to those objects so that the function may access those memory locations:


void swap(int *x, int *y)
{
  int tmp = *x;
  *x = *y;
  *y = tmp;
}

int main(void)
{
  int a = 5, b = 10;
  printf("before: a = %d, b = %d\n", a, b);
  swap(&a, &b);
  printf("after: a = %d, b = %d\n", a, b);
  return 0;
}

Instead of passing a and b to swap, we pass the expressions &a and &b, which are pointer values. Similarly, instead of declaring swap's parameters x and y as integers, we declare them as pointers to integers.

The memory map now looks like this:

At the beginning of swap
Item	Address	Contents
Item	Address	0x00	0x01	0x02	0x03
a	0x80C01230	0x00	0x00	0x00	0x05
b	0x80C01234	0x00	0x00	0x00	0x0A
x	0x8800C000	0x80	0xC0	0x12	0x30
y	0x8800C004	0x80	0xC0	0x12	0x34

At the end of swap
Item	Address	Contents
Item	Address	0x00	0x01	0x02	0x03
a	0x80C01230	0x00	0x00	0x00	0x0A
b	0x80C01234	0x00	0x00	0x00	0x05
x	0x8800C000	0x80	0xC0	0x12	0x30
y	0x8800C004	0x80	0xC0	0x12	0x34

The value of the parameter x is the address of a, or 0x80C01230; thus, the expression *x refers to the same area of memory as a, so assigning to the expression *x in swap has the same effect as assigning to the expression a in main.

Basically, if you want a function to modify something in its caller, you must pass a pointer to that thing.

Types and pointer arithmetic

When you index into an array using the [] operator or apply the ++ or -- operators to an array expression, you are performing pointer arithmetic. The result of the operation depends on the base type of the pointer. When you advance a pointer value by 1, that's in terms of the size of the base type; if the base type is 1 byte wide, you advance 1 byte. If the base type is 4 bytes wide, you advance 4 bytes.

Array indexing is implemented in terms of pointer arithmetic; when you write the expression a[i], it is evaluated as *(a + i) (remembering that i indicates a the number of elements past a, not the number of bytes; if the base type of a is N bytes wide, then the result of a + i is i * N bytes from a).

Note that this introduces two quirks of C syntax. The first is that array indexing is commutative; if a[i] is a legal expression, then so is i[a] (a + i == i + a), and they give the same result. The second is that we may apply the subscript operator to a pointer expression as well as an array expression (subject to certain restrictions). We'll get into this a little more in the section on pointers and arrays.

Multiple levels of indirection

There are a number of occasions when you have to deal with pointers to pointers. One common one is when you want a function to modify a pointer value. It's similar to the swap issue above, but instead of wanting the function to modify an integer in main, we want it to modify a pointer to integer:


  void swap(int **x, int **y)
  {
    int *tmp = *x;
    *x = *y;
    *y = tmp;
  }

  int main(void)
  {
    int a = 5;
    int b = 10;
    int *pa = &a;
    int *pb = &b;

    printf("before: a = %d, b = %d, pa = %p, pb = %p\n", a, b, (void *) pa, (void *) pb);
    swap(&pa, &pb);
    printf("after:  a = %d, b = %d, pa = %p, pb = %p\n", a, b, (void *) pa, (void *) pb);
    return 0;
  }

At the beginning of swap
Item	Address	Contents
Item	Address	0x00	0x01	0x02	0x03
a	0x80C01230	0x00	0x00	0x00	0x05
b	0x80C01234	0x00	0x00	0x00	0x0A
pa	0x80C01238	0x80	0xC0	0x12	0x30
pb	0x80C0123C	0x80	0xC0	0x12	0x34
x	0x8800C004	0x80	0xC0	0x12	0x38
y	0x8800C008	0x80	0xC0	0x12	0x3C

At the end of swap
Item	Address	Contents
Item	Address	0x00	0x01	0x02	0x03
a	0x80C01230	0x00	0x00	0x00	0x05
b	0x80C01234	0x00	0x00	0x00	0x0A
pa	0x80C01238	0x80	0xC0	0x12	0x34
pb	0x80C0123C	0x80	0xC0	0x12	0x30
x	0x8800C004	0x80	0xC0	0x12	0x38
y	0x8800C008	0x80	0xC0	0x12	0x3C

This example was kind of artificial; we'll see some more realistic scenarios in the next section.

Arrays and pointers

There is a close correspondence between arrays and pointers in C, as described in the section above on pointer arithmetic. This falls out of C's treatment of array expressions.

Except when it is the operand of the sizeof or unary & operator, or it is a string literal being used to initialize another array in a declaration, an expression with type "N-element array of T" will have its type implicitly converted ("decay") to type "pointer to T", and its value will be the address of the first element in the array.

This behavior is responsible for more grief among C programmers than just about any other aspect of the language. When you pass an array argument as a parameter to a function such as printf(), what actually gets passed is a pointer value, not an array.

For example:


#include <math.h>
#include <time.h>

void randomize(int *arr, size_t elements)
{
  size_t i;
  for (i = 0; i < elements; i++)
    arr[i] = rand();
}

int main(void)
{
  int values[20];
  size_t i;

  randomize(values, 20);
  for (i = 0; i < 20; i++)
    printf("values[%2d] = %d\n", i, values[i]);

  return 0;
}

Note that the type of values in main is int [20] (20-element array of int), but the parameter arr in randomize is type int *. When the compiler sees the array expression values in the call to randomize, it implicitly converts the type from an array of integer to a pointer to integer. Thus the declaration of arr as a pointer to int in randomize, rather than an array.

In an effort to be confusing, C allows you to declare the parameter using array syntax, such as


void randomize(int arr[], size_t elements)

or even


void randomize(int arr[20], size_t elements)

However, in both cases, the declaration is interpreted as int *arr. Note that this is only true in the context of a function parameter declaration.

As mentioned above, one of the exceptions to the conversion rule is when the array expression is the operand of the unary & operator; if the type of an expression e is "N-element array of T", then the type of the expression &e is "pointer to N-element array of T", or T(*)[N]. The following tables show the types of various expressions involving arrays:

Declaration: T a[N];
Expression	Type	Decays to
a	T [N]	T *
&a	T (*)[N]
*a	T
a[i]	T

Declaration: T a[N][M];
Expression	Type	Decays to
a	T [N][M]	T (*)[M]
&a	T (*)[N][M]
*a	T [M]	T *
a[i]	T [M]	T *
&a[i]	T (*)[M]
*a[i]	T
a[i][j]	T

Declaration: T a[N][M][L];
Expression	Type	Decays to
a	T [N][M][L]	T (*)[M][L]
&a	T (*)[N][M][L]
*a	T [M][L]	T (*)[L]
a[i]	T [M][L]	T (*)[L]
&a[i]	T (*)[M][L]
*a[i]	T [M]	T *
a[i][j]	T [M]	T *
&a[i][j]	T (*)[M]
*a[i][j]	T
a[i][j][k]	T

The pattern for higher-dimensional arrays should be clear from here on out.

One Grumpy Code Monkey

Thursday, November 04, 2010

Pointers in C

0 Comments:

About Me

Previous Posts