» CS 430: Lecture 4

CS 430: Lecture 4 – Types

February 16, 2021 by Chris Johnson. Filed under lectures, programming languages, spring-2021.

Dear students,

You know how in English nouns and verbs are independent? You can pick any noun and mix it with any nearly any verb to for a grammatical sentence. Butter dances. Fish sneeze. We have a similar situation in programming languages. The grammars of our languages allow us to mix data and operations in ways that don’t make sense. "abcdefghijklmnopqrtuvwxyz" / 7. 62.length. We generally can’t change the grammars to prevent the expression of this nonsense. Instead, we add a notion of types to our data and our operations.

Imagine you didn’t have taste buds. Could anything go wrong? Taste buds are generally designed to help us eat good things and avoid bad things, much like a type system is designed to make sure operations are digesting good operands and not incompatible ones.

Today we’ll discuss types and how they affect programming languages and the humans that write code in them.

Type

A type describes a set of legal values and the set of operations that the data supports. In some instances, the set of values might be fully and explicitly enumerated, as in an enum. In other cases, the set of values is implicit and determined by rules made about the interpretation of memory, as with double or int. The set of operations may be builtin, as with int and double division. In other cases, you as a developer may be the one deciding what operations are supported. This is especially clear in object-oriented programming. When you create a class, you are creating a new type. When you add a new method to that class, you are giving that type a new operation.

In explicitly typed languages, types help document our code. Human readers get hints about function’s purpose by looking at the types of its return value and parameters. Folks have proposed adding implicit typing to Java, which makes types less visible to the programmer. Those proposals were rejected for years. Consider Gilad Bracha’s response to one of them:

Humans benefit from the redundancy of the type declaration in two ways. Humans benefit from the redundancy of the type declaration in two ways. First, the redundant type serves as valuable documentation—readers do not have to search for the declaration of getMap() to find out what type it returns. Second, the redundancy allows the programmer to declare the intended type, and thereby benefit from a cross check performed by the compiler.

In Java 10, implicit typing was added. Implicit typing is handy when you don’t really care about the type, as in this code:

for (var element : items) {
  System.out.println(element);
}

Keeping types explicit is not a bad idea, however.

Types are also useful in our development environments. If we are using autocomplete to fill in a parameter in a function call, the IDE can filter the list of possible completions to just those variables that are in scope and of the appropriate type.

Primitives

The simplest types we call primitives. These include the various integer types. Generally they are encoded using twos-complement, in which negative numbers are expressed as their positive form, but inverted and incremented. Some types allow negative numbers, some don’t. Java, for example, has no unsigned integer types. However, Java 8 introduced some methods in the Integer class for performing unsigned operations.

When we need numbers with fractional components, we turn to float or double. These numbers are encoded using a scientific notation called the IEEE-754 floating point standard. Floating-point numbers are decomposed into three fields: a sign, a base, and an exponent. Any finite representation of real numbers is prone to numerical drift, but floating-point numbers are tricky because the drift is not uniform. There are more representable numbers close to 0 than farther away. Some languages like SQL support simpler fixed-point numbers, where the number of digits after the decimal is predetermined.

Some languages have logical types ranging from a one-bit Boolean value to a multi-valued bitfield. Other languages assume the user can just use integers to represent true and false values.

Logical and numeric types are directly supported in the CPUs of today, which have circuitry dedicated to arithmetic, inversion, shifting, and comparison. Other primitive types are jury-rigged into the system and don’t have special hardware support.

Though computers started as an appliance that mathematicians and physicists used to process numbers, characters got their own types in the 1950s. There were various standards for mapping the arrangement of bits to alphabetic symbols. The 7-bit ASCII standard emerged as the standard in the United States, but was superceded by Unicode, which allowed for more than just 127 characters by growing the number of bytes used to represent a symbol. Many of our systems use UTF-8, a standard that is compatible with ASCII. However, if the first bits of the byte are 110, then the first byte is combined with the second byte. If the first bits of the byte are 1110, then three bytes are combined. If the first bits are 11110, then four bytes are combined. Through this scheme, all 1.1 million Unicode characters can be represented.

Compound Types

As atoms combine to form molecules, primitive types combine to form compound types. These compound types are found in today’s mainstream languages:

If we order values in a sequence of arbitrary length, we have an array or list.
If we order values in a sequence of fixed length, we have a tuple.
If we have a dynamic mapping between names and values, we have an associative array or dictionary.
If we have a static mapping between names and values, we have a record or struct or class.
If we have a container for values of several possible types, we have a union.

Programming languages people sometimes talk about product types and sum types. A product type is a type with multiple independent fields. The number of possible values is the product of the number of possible values of each field. A sum type is a union, which has a single shared field in which values of several types can be stored. The number of possible values is the sum of the number of possible values of each type.

Arrays and Lists

Arrays are generally contiguous in memory, with all of their elements right next to each other. Processing them in a loop tends to be fast because of caching. Accessing one element from RAM automatically brings in neighboring elements into the CPU’s cache. Lists are generally scattered across RAM, with each element storing an explicit link to the next element.

At their simplest, arrays are lines of data. But they can also be two-dimensional tables or three-dimensional volumes, and so on. Multidimensional arrays can take on several forms in memory:

Array of arrays: each element of the outer array points to that element’s inner array. Inner arrays are not likely to be adjacent in memory, meaning that caching is less effective.
Row major: the first row of the array is laid out contiguously in memory, followed by the second row, followed by the third, and so on. Caching is effective if the inner loop advances from column to column.
Column major: the first column of the array is laid out contiguously in memory, followed by the second column, followed by the third, and so on. Caching is effective if the inner loop advances from row to row.

When all the inner arrays are the same size, we have a rectangular array. When the inner arrays differ in size, we have jagged arrays. Only arrays of arrays can be jagged.

With lists, the programmer maintains links to each element. To reach the fifth element, one must iterate the through five links, making subscripting a linear time operation. With arrays, subscripting is constant time. The address of an element i in a 1D array is computed as base_address + i * sizeof(element). The address of an element (c, r) in a row-major 2D array is computed as r * width + c.

Strings

The product type of characters is strings. Different design decisions have been made regarding strings. They are generally implemented as arrays of characters. Sometimes this implementation is thinly veiled, as in C. Sometimes the backing array is hidden, as in Java. Which approach is better? There’s no easy answer to this kind of question. The transparency of C strings can makes them fast, versatile, and vulnerable. The opacity of Java strings makes them slower, constrained, and safe.

Additionally, we have several choices for representing the length of a string. In C, we terminate the string with the null terminator. At least, we are supposed to. Many say this was a bad design decision. It’s easy to obliterate the null terminator and write into memory that we don’t own. When malicious users do this, they are exploiting a buffer overflow vulnerability. Null termination turns certain operations, like calculating the length, into a linear time algorithm. An alternative is store the length as an integer.

The designers of Java made a bold move and decided to make strings immutable. James Gosling explained this choice in an interview:

From a strategic point of view, they tend to more often be trouble free. And there are usually things you can do with immutables that you can’t do with mutable things, such as cache the result. If you pass a string to a file open method, or if you pass a string to a constructor for a label in a user interface, in some APIs (like in lots of the Windows APIs) you pass in an array of characters. The receiver of that object really has to copy it, because they don’t know anything about the storage lifetime of it. And they don’t know what’s happening to the object, whether it is being changed under their feet.
You end up getting almost forced to replicate the object because you don’t know whether or not you get to own it. And one of the nice things about immutable objects is that the answer is, “Yeah, of course you do.” Because the question of ownership, who has the right to change it, doesn’t exist.

One of the things that forced Strings to be immutable was security. You have a file open method. You pass a String to it. And then it’s doing all kind of authentication checks before it gets around to doing the OS call. If you manage to do something that effectively mutated the String, after the security check and before the OS call, then boom, you’re in. But Strings are immutable, so that kind of attack doesn’t work. That precise example is what really demanded that Strings be immutable.

Alas, Java library developers and their clients must continue to be paranoid about other objects being mutated outside of their control. Some of the clients’ fear could have been mitigated by a mechanism of declaring parameters as constant.

Pointers and References

We can also have values that are links to other values. These links are likely implemented using the linked value’s address. When this addressing scheme is transparent, we have pointers. A value that holds an address has a pointer type. When the addresses are hidden or the linking is implemented in some other way, we have references.

Pointer types support incrementing and decrementing to advance through memory. When we dereference a pointer, we are taking away it its address nature and reaching in for the linked value. References generally don’t have any special operations beyond their linked value’s type.

One of the legal values for pointers in C and C++ is NULL, which is just a synonym for integer 0. C++11 introduced nullptr, which has a proper pointer type.

To revisit the behavior of pointers, we’ll walk through this code together:

int x = 1;
int y[4] = {2, 3, 4, 5};
int *p = &x;
*p = 6;
p = y;
*p = 7;
p++;
*p = 9;

Pointers are a feature of languages with addressable memory. Dynamic allocation of memory is technically a separate feature. C and C++ happen to support both features, so we lump them together. There’s no dynamic allocation in the code above, but there are pointers.

When pointers combine with dynamic allocation and explicit deallocation, pointers can be dangerous. Suppose p points to a block of memory. When we free(p), we release the block of memory but do nothing to p. It still points to the block. If we dereference p, we’ll access memory that is no longer ours. It might have been reallocated to some other object, or it might not have. In this situation, we say that p is a stale pointer. We’d prefer code to reliably crash than to exhibit nondeterministic behavior.

There are at least two strategies for dealing with stale pointers. One is to use tombstones. Instead of a pointing directly at the block of memory, a pointer points at another pointer, which points at the block. When you free the pointer, it reclaims the block and nulls out the tombstone. One can achieve a similar effect without the indirection of tombstones by exclusively using this version of free in C:

void freenull(void *p) {
  free(*p);
  *p = NULL;
}

We pass in the address of the pointer so that it can be modified by the function.

The other strategy is to use locks and keys, in which each allocation generates an identifier called a lock. The lock is stored in both the allocated block and in the pointer (the key). When a pointer is dereferenced, the locks are compared. If they match, the access is allowed. If not, an exception is thrown. When a block is freed, its lock is set to some other value, ensuring that future dereferences will have mismatched locks.

There’s another danger with pointers and dynamic allocation. If p gets overwritten or goes out of scope before we’ve called free, we have a memory leak.

One strategy for preventing memory leaks is to take away the responsibility of freeing memory from the programmer. Instead, some mechanism of automatic garbage collection is used.

A predictable approach to garbage collection is reference counting, in which a counter is stored with each dynamically allocated object. Every time the object is assigned to a new reference, its counter is incremented. Every time a reference is lost, as in a reassignment or by the reference going out of scope, the counter is decremented. If the counter ever goes to 0, the memory is freed. C++ provides optional reference counting via several smart pointer types available in its standard library.

A less predictable approach to garbage collection is the mark-and-sweep algorithm, in which the executing process is occasionally paused and the memory examined. The runtime has a list of all the objects that have been allocated on the heap. All references currently on the stack are traversed to see which of these objects are pointed to, recursing on any references found in the objects. Whatever objects are found not to have references are freed. Java has several garbage collection schemes available. Mark-and-sweep is one of them.

Type Systems

Once we’ve built up this notion of classifying data into types, we build up a bunch of rules about how this data may be operated on. This set of rules is called a type system. Language designers have a number of decisions to make about the type system.

Explicit vs. Implicit

Should the types be explicit, present in a variable’s declaration? Or should they be implicit? Both systems burden and benefit the programmer in different ways. Not having to write statements like this seems like a win:

Person person = new Person();
// Implicit typing can save a Person.
// var person = new Person();

But a type system with implicit types will likely lead to situations where the programmer has a different view of the program than the compiler.

Type Checking

How should types be compared when determining what is legal? The act of verifying a program’s adherence to the rules of the type system is called type checking. Any violation is a type error.

The type checker first examines the types to see if they are equivalent. Some type systems require that two types must have the exact same named type, a scheme that we call name equivalence. Under name equivalence, strings can be assigned to strings, booleans to booleans, and URL to URL. If instead of comparing names, the checker compares the underlying shape of the type, we have structure equivalence. Consider this C code:

typedef float celsius;
typedef float fahrenheit;

celsius a = 25.7f;
fahrenheit b = a;

The types have different names. Does C allow this code? It does. We can think of it as allowing structural equivalence because both types are shaped like a float.

However, consider this C code:

typedef struct { int x; } box;
typedef struct { int x; } bin;

box c;
c.x = 5;

bin d;
d = c;

The C compiler rejects this code. For record types, C requires name equivalence.

One can persuasively argue that C only allows name equivalence, as the typedef command just provides an alias to an existing named type, not a new type. The only way to get a completely new type in C is by defining a struct.

If two types are not equivalent, should we reject the code? Sometimes mismatches are clearly wrong, such as when we assign a Random to a String variable or null to an int. Other mismatches are not errors. For example, there may be rules for translating data of one type to another, such as passing an int to a function that expects a long. The int is automatically promoted to a long, and no type error is issued. This translation is called coercion. When there’s a sanctioned path to go from one type to another, we say the first type is compatible with the other.

In other situations, one type is a subtype of another. When we have a polymorphic hierarchy, we can write a function that targets the supertype, but any of its inheritors can be handled too. The types are neither equivalent nor in need of coercion.

In still other situations, code is written in which a type is left as a parameter. A placeholder type is used instead of an actual type. We call these templates in C++ or generics in Java. It’s up to the client of the code to decide what the actual type should be. The type checker does its usual job of comparing types against the type parameters.

Static vs. Dynamic

If we check perform type checking at compile time, we have static type checking. If we check types at runtime, we have dynamic type checking. Static type checking leads to much better runtime performance because we don’t have to keep asking questions about the types. But compilation is a lot slower. With dynamic typing, we have to continually inspect our data and ask whether an operation is legal.

Strong vs. Weak

There are several labels we put on type systems: static vs. dynamic, explicit vs. implicit, and strong vs. weak. The last of these is ill-defined. If you say that a language is strongly or weakly typed, you have said nothing. Generally, strong typing means that the type rules are enforced, that type errors are detected.

Aren’t type errors detected in every language? JavaScript checks for type errors. So do C++, C, Java, Ruby, Python, Haskell, and so on.

But some languages allow the type checking to be subverted. In C, you can cast a pointer to type A to a pointer of type B and perform B-ish operations on something that is not of type B. Similarly, you can reach inside a union and access the value through the lens of a different type, violating the actual type. That’s weak typing.

Sincerely,

P.S. It’s time for a haiku!

Black, woman, or young
voter_t‘s no static type
Amendments tweaked it