Thursday, November 04, 2010

Pointers in C

One of the biggest hurdles for people learning C is learning about pointers.

Conceptually, pointers are simple beasties: the value of a pointer expression is the location (address) of another object in memory. What tends to trip people up are the following:


  • C (and C++) pointer declaration and expression syntax, especially with complex pointer types;
  • Passing pointers as arguments to functions;
  • The affect of the base type on pointer arithmetic;
  • Using multiple levels of indirection; and
  • The behavior of array expressions with respect to pointers.



Pointer declarations and expressions

C declaration syntax is expression-centric, rather than object-centric; another way of describing it is that C follows a "declaration mimics use" paradigm. In short, the form of a C declaration should closely match the form of an expression of that type in an executable statement. For example, if we have a pointer to an integer and we want to access that integer value, we dereference the pointer using the unary * operator, like so:

x = *p;

The type of the expression *p is int, so the declaration of the pointer p is written as

int *p;

In the declaration above, int is the type specifier and *p is the declarator. The declarator introduces the name of the item being declared, along with additional type information not provided by the type specifier. The int-ness of p is provided by the type specifier int, while the pointer-ness is provided by the * in the declarator. Note that in executable code, whitespace doesn't matter:

x = * p;

will be parsed as x = (*p);. Similarly, whitespace doesn't matter in the declaration, so any of the following are legal:

int *p;
int* p;
int * p;

In all cases, the * operator is part of the declarator, so all of the above variations parse as

int (*p);

This means that in a declaration like

int* a, b;

only a is declared as a pointer to int; b is declared as a regular int.

Similarly, if we have an array of pointers to integers, and we want to access a particular integer value, we subscript the array and dereference the result, like so:

x = *ap[i];

Again, the type of the expression *ap[i] is int, so the declaration of the array is written as

int *ap[N];

Similar to the simple pointer declaration above, the int-ness of ap is specified by the type specifier, while the array-ness and pointer-ness are specified by the declarator. Since the subscript operator has a higher precendence than the dereference operator, the expression *ap[i] is parsed as *(ap[i]). Suppose that instead of an array of pointers to integers, we had a pointer to an array of integers; to access a particular integer value, we'd write

x = (*pa)[i];

because we have to dereference the pointer to the array before applying the subscript operator. By now, it should be clear that the declaration for a pointer to an array of integers would be

int (*pa)[N];

Pointer expressions and declarations can get arbitrarily complex: you can have arrays of pointers, pointers to arrays, pointers to arrays of pointers, pointers to functions, pointers to functions returning pointers, arrays of pointers to functions returning pointers to arrays, pointers to pointers, pointers to pointers to pointers, etc. Here are the basic rules for any type T:

T *p; // p is a pointer to T
T *a[N]; // a is an N-element array of pointers to T
T (*a)[N]; // a is a pointer to an N-element array of T
T *f(); // f is a function returning a pointer to T
T (*f)(); // f is a pointer to a function returning T

As mentioned above, the * operator is the dereference operator; given any expression p of type T * (pointer to T), the value of the
expression *p is the value of the thing being pointed to and its type is T. The inverse of the * operator is the unary &, or address-of operator. Given any lvalue expression e of type T, the value of the expression &e is the address of e and the type of the expression is T *.

Given the following declarations

int i = 1;
int *p = &i;

and the following memory map:





























ItemAddressContents
0x000x010x020x03
i0x80C012300x000x000x000x01
p0x80C012340x800xC00x120x30


then the following are all true:
































ExpressionValueType
i0x00000001int
&i0x80C01230int * // pointer to int
p0x80C01230int * // pointer to int
*p0x00000001int
&p0x80C01234int ** // pointer to pointer to int


Note the wrinkle introduced with the last expression &p; we have a pointer to pointer to int (again, the type of an expression &e is T *, and in this case T is int *). It's possible (in some cases unavoidable) to have multiple levels of indirection with pointers.

Pointers and function arguments

One of the primary reasons we need to use pointers in C is to allow functions to modify their arguments. C passes all function arguments by value; the formal argument is a distinct object in memory from the actual argument and receives a copy of the value in the actual argument. For example, given the code

void swap(int x, int y)
{
int tmp = x;
x = y;
y = tmp;
}

int main(void)
{
int a = 5, b = 10;
printf("before: a = %d, b = %d\n", a, b);
swap(a, b);
printf("after: a = %d, b = %d\n", a, b);
return 0;
}

the formal parameters x and y are distinct objects in memory from a and b, so the changes to x and y in the swap function are not reflected in a and b. Again, a hypothetical memory map may help:














































Item Address Contents
0x00 0x01 0x02 0x03
a 0x80C01230 0x00 0x00 0x00 0x05
b 0x80C01234 0x00 0x00 0x00 0x0A
x 0x8800C000 0x00 0x00 0x00 0x05
y 0x8800C004 0x00 0x00 0x00 0x0A
At the beginning of swap

















































Item Address Contents
0x00 0x01 0x02 0x03
a 0x80C01230 0x00 0x00 0x00 0x05
b 0x80C01234 0x00 0x00 0x00 0x0A
x 0x8800C000 0x00 0x00 0x00 0x0A
y 0x8800C004 0x00 0x00 0x00 0x05
At the end of swap




Modifying the contents of address 0x8800C000, or x, has no effect on the contents of 0x80C01230, or a.

Since we want the swap function to modify the values of a and b in main, we must pass pointers to those objects so that the function may access those memory locations:

void swap(int *x, int *y)
{
int tmp = *x;
*x = *y;
*y = tmp;
}

int main(void)
{
int a = 5, b = 10;
printf("before: a = %d, b = %d\n", a, b);
swap(&a, &b);
printf("after: a = %d, b = %d\n", a, b);
return 0;
}


Instead of passing a and b to swap, we pass the expressions &a and &b, which are pointer values. Similarly, instead of declaring swap's parameters x and y as integers, we declare them as pointers to integers.

The memory map now looks like this:














































Item Address Contents
0x00 0x01 0x02 0x03
a 0x80C01230 0x00 0x00 0x00 0x05
b 0x80C01234 0x00 0x00 0x00 0x0A
x 0x8800C000 0x80 0xC0 0x12 0x30
y 0x8800C004 0x80 0xC0 0x12 0x34
At the beginning of swap

















































Item Address Contents
0x00 0x01 0x02 0x03
a 0x80C01230 0x00 0x00 0x00 0x0A
b 0x80C01234 0x00 0x00 0x00 0x05
x 0x8800C000 0x80 0xC0 0x12 0x30
y 0x8800C004 0x80 0xC0 0x12 0x34
At the end of swap




The value of the parameter x is the address of a, or 0x80C01230; thus, the expression *x refers to the same area of memory as a, so assigning to the expression *x in swap has the same effect as assigning to the expression a in main.

Basically, if you want a function to modify something in its caller, you must pass a pointer to that thing.

Types and pointer arithmetic

When you index into an array using the [] operator or apply the ++ or -- operators to an array expression, you are performing pointer arithmetic. The result of the operation depends on the base type of the pointer. When you advance a pointer value by 1, that's in terms of the size of the base type; if the base type is 1 byte wide, you advance 1 byte. If the base type is 4 bytes wide, you advance 4 bytes.

Array indexing is implemented in terms of pointer arithmetic; when you write the expression a[i], it is evaluated as *(a + i) (remembering that i indicates a the number of elements past a, not the number of bytes; if the base type of a is N bytes wide, then the result of a + i is i * N bytes from a).

Note that this introduces two quirks of C syntax. The first is that array indexing is commutative; if a[i] is a legal expression, then so is i[a] (a + i == i + a), and they give the same result. The second is that we may apply the subscript operator to a pointer expression as well as an array expression (subject to certain restrictions). We'll get into this a little more in the section on pointers and arrays.

Multiple levels of indirection

There are a number of occasions when you have to deal with pointers to pointers. One common one is when you want a function to modify a pointer value. It's similar to the swap issue above, but instead of wanting the function to modify an integer in main, we want it to modify a pointer to integer:

void swap(int **x, int **y)
{
int *tmp = *x;
*x = *y;
*y = tmp;
}

int main(void)
{
int a = 5;
int b = 10;
int *pa = &a;
int *pb = &b;

printf("before: a = %d, b = %d, pa = %p, pb = %p\n", a, b, (void *) pa, (void *) pb);
swap(&pa, &pb);
printf("after: a = %d, b = %d, pa = %p, pb = %p\n", a, b, (void *) pa, (void *) pb);
return 0;
}
































































At the beginning of swap
Item Address Contents
0x00 0x01 0x02 0x03
a 0x80C01230 0x00 0x00 0x00 0x05
b 0x80C01234 0x00 0x00 0x00 0x0A
pa 0x80C01238 0x80 0xC0 0x12 0x30
pb 0x80C0123C 0x80 0xC0 0x12 0x34
x 0x8800C004 0x80 0xC0 0x12 0x38
y 0x8800C008 0x80 0xC0 0x12 0x3C






























































At the end of swap
Item Address Contents
0x00 0x01 0x02 0x03
a 0x80C01230 0x00 0x00 0x00 0x05
b 0x80C01234 0x00 0x00 0x00 0x0A
pa 0x80C01238 0x80 0xC0 0x12 0x34
pb 0x80C0123C 0x80 0xC0 0x12 0x30
x 0x8800C004 0x80 0xC0 0x12 0x38
y 0x8800C008 0x80 0xC0 0x12 0x3C



This example was kind of artificial; we'll see some more realistic scenarios in the next section.

Arrays and pointers

There is a close correspondence between arrays and pointers in C, as described in the section above on pointer arithmetic. This falls out of C's treatment of array expressions.

Except when it is the operand of the sizeof or unary & operator, or it is a string literal being used to initialize another array in a declaration, an expression with type "N-element array of T" will have its type implicitly converted ("decay") to type "pointer to T", and its value will be the address of the first element in the array.

This behavior is responsible for more grief among C programmers than just about any other aspect of the language. When you pass an array argument as a parameter to a function such as printf(), what actually gets passed is a pointer value, not an array.

For example:

#include <math.h>
#include <time.h>

void randomize(int *arr, size_t elements)
{
size_t i;
for (i = 0; i < elements; i++)
arr[i] = rand();
}

int main(void)
{
int values[20];
size_t i;

randomize(values, 20);
for (i = 0; i < 20; i++)
printf("values[%2d] = %d\n", i, values[i]);

return 0;
}


Note that the type of values in main is int [20] (20-element array of int), but the parameter arr in randomize is type int *. When the compiler sees the array expression values in the call to randomize, it implicitly converts the type from an array of integer to a pointer to integer. Thus the declaration of arr as a pointer to int in randomize, rather than an array.

In an effort to be confusing, C allows you to declare the parameter using array syntax, such as

void randomize(int arr[], size_t elements)

or even

void randomize(int arr[20], size_t elements)

However, in both cases, the declaration is interpreted as int *arr. Note that this is only true in the context of a function parameter declaration.

As mentioned above, one of the exceptions to the conversion rule is when the array expression is the operand of the unary & operator; if the type of an expression e is "N-element array of T", then the type of the expression &e is "pointer to N-element array of T", or T(*)[N]. The following tables show the types of various expressions involving arrays:




























Declaration: T a[N];
Expression Type Decays to
a T [N] T *
&a T (*)[N]
*a T
a[i] T











































Declaration: T a[N][M];
Expression Type Decays to
a T [N][M] T (*)[M]
&a T (*)[N][M]
*a T [M] T *
a[i] T [M] T *
&a[i] T (*)[M]
*a[i] T
a[i][j] T


























































Declaration: T a[N][M][L];
Expression Type Decays to
a T [N][M][L] T (*)[M][L]
&a T (*)[N][M][L]
*a T [M][L] T (*)[L]
a[i] T [M][L] T (*)[L]
&a[i] T (*)[M][L]
*a[i] T [M] T *
a[i][j] T [M] T *
&a[i][j] T (*)[M]
*a[i][j] T
a[i][j][k] T



The pattern for higher-dimensional arrays should be clear from here on out.

0 Comments:

Post a Comment

<< Home