teaching machines

CS 430: Lecture 3 – Variables

February 2, 2021 by . Filed under lectures, programming languages, spring-2021.

Dear students,

Computer science has its roots in mathematics. It was the mathematicians who first had problems they wanted to mechanize. Computer science soon grew into a discipline of its own that appeals to people who don’t like mathematics. My hot take is that there’s one big reason for its success: computer science has more human-friendly names than f and x. When we name our data, we have a variable.

Making variables is a bit like breathing. We do it so often we don’t really think about it. But if we stop and examine what’s going on in our machine around variables, we find quite a lot going on. Let’s talk about some of those goings-on today.

von Neumann Architecture

An early and still prevalent model of a computer is the von Neumann architecture, which portrays the computer as a processor, a bank of memory, and a bus connecting the two. As the program executes, it reads from and writes to the cells of memory.

Cells are identified by an addressing scheme. What languages have you programmed in where you used addresses to refer to memory? Certainly C, but you’ve probably also used a language that uses addresses exclusively: the language of your spreadsheet software.

From the perspective of a von Neumann programmer, a variable is a named memory cell. We give names to things to attach identity and significance. When we call a cell i, we're imbuing it with semantic meaning, saying by convention that the cell holds an index.

When we name a memory cell, we downplay the importance of the data’s location. We generally don’t care where the data is placed in memory. Array variables offer a compromise. We give a meaningful name to a collection of values. We don’t really care where it appears in memory, but we do expect all the values to be located together in a contiguous sequence.

The von Neumann architecture departs from a mathematical representation of the universe. Math doesn’t have a bank of memory cells that change over time. When you name a value in math, you establish an unchangeable, immutable truth. When you name a memory cell in the von Neumann architecture, all you do is give the cell a nickname for easy future reference. The value inside that cell is mutable.

Static vs. Dynamic

Let’s examine six properties of variables. These properties take effect at different times in the life of our program. There are four rough “eras” in which we or the computer make decisions about these properties: language design time, compile time, load time, and runtime.

Any decisions that have been made before execution and that do not change as the program executes are static. Any decisions that are determined or are altered during runtime are dynamic.

Name

The most visible property of a variable is its name. We use identifiers to name variables. As we discussed last time, identifiers tend to start with a letter or underscore, and subsequent symbols are letters, numbers, or underscores.

Some languages use other symbols to denote additional semantic meaning. In Perl, the name of an array variable starts with @, the name of a scalar (single-valued) variable starts with $, and the name of a hash (a collection of key-value pairs) starts with %. In Ruby, instance variables start with @, class variables (Ruby’s analog of static variables) start with @@, and global variables start with $. In JavaScript, you may have seen $(this) when working with jQuery. That $ is a legal identifier; it’s an alias for the longer jQuery function. When you compile a Java class that has an inner class, the qualified name of the inner class is Outer$Inner. In Ruby, method names can end in ?, which implies that the method is a boolean predicate.

Different languages have different conventions for how variable names are spelled. In Java and JavaScript, the convention is to use camelCase. In C, Python, and Ruby, the convention is to use snake_case. In CSS, the convention is to use kebab-case.

Some companies, including Microsoft and Xerox, followed a naming convention called Hungarian notation. In Hungary, family names come first and personal names second. In Hungarian notation, a variable’s name includes information about its family or type at its start. A string may be named sName or a count may be named nPeople. Strict adherence to this notation is generally not recommended by today’s software development influencers, for myriad reasons. Changing a variable’s type is challenging if the type is hardcoded in the name. Variable names cluster together in autocomplete menus, requiring more keystrokes to disambiguate them. And modern IDEs make it easy to see a variable’s type without encoding it in the name.

Not all names that fit the identifier rules can be used to name variables. Such words are called reserved words. In Java, goto is a reserved word; you can’t name a variable goto. Sometimes language specifications attach special meaning to particular names. These special names are called keywords. In many languages, keywords and reserved words are the same set of names, but this is not always the case. Java does not attach special meaning to goto; it just outlaws it.

Names are generally a static decision. You can find the names in the source code before execution. But it’s also possible for new variables to appear during execution, as happens when one runs eval on a string of user input.

Address

As mentioned earlier, the name of a variable provides a convenience over using the raw address of a memory cell. A name can appear in a couple of different contexts with different semantic meanings. Consider x in these two lines of code:

x = 7
y = x

In the first line, we care about the address of the memory cell to which x is bound. We drop a 7 in that cell with no concern for what was in it previously. We call x in this context an lvalue, which historically meant a value that could appear on the left-hand side of an assignment but which now refers to a value that has a unique identity. In this case, that unique identity is a memory cell.

The address that a variable binds to may be determined statically or dynamically. In C, for example, variables declared as static are statically allocated, which means memory is set aside for them in the executable at compile time. We can also dynamically allocate chunks of memory that are not bound up in the function lifecycle by using malloc.

An interesting consequence of the von Neumann architecture is that one can have multiple names referring to the same memory cell. This is called aliasing. There are some nice advantages to this. One can pass around copies of pointers to large chunks of memory very cheaply. But we can also trick ourselves into thinking that two variables are independent when, in fact, they point to the same memory.

Value

In the second line of the preceding code, we care about the value in the memory cell to which x is bound. We pull a 7 from that cell with no concern for where it’s located. We call x in this context an rvalue, which historically meant a value that could appear on the right-hand side of an assignment statement or be passed as a parameter.

Values are usually bound dynamically, at runtime, and they often change as the program executes. That’s why we call them “variables.” The first time we assign a variable is its initialization.

Some languages support constants, which are variables whose values can’t be changed. In C, we declare a constant variable with the const modifier. We tend to typeset the names of constants using capital letters.

In Java, we can declare a variable as final. In JavaScript, we use const. In Kotlin, we use val. It’s easy to think that the values of these variables are immutable, but that’s wrong. This Java code compiles fine, and the array is not immutable:

final String[] states = {"NJ", "MD"};
states[0] = "VA";

These modifiers do not make the value constant. They make the name unable to be reassigned to some other value.

In C and C++, there are actually two places where we can add the const modifier. In C++ syntax:

const int *xs = new int[2];
// xs[0] = 18; // Illegal, value is immutable
xs = nullptr; // Legal, name can be reassigned

int * const ys = new int[2];
// ys = nullptr; // Illegal, name cannot be reassigned
ys[0] = 18; // Legal, ys is mutable

In some languages, all variables are constants. They simply cannot be changed. Such languages are called pure, which is a pompous term, like when politicians say they are going to “right-size” an agency through budget cuts. We will look at a pure functional language this semester: Haskell.

Type

Type refers to the “shape” of data and the operations that it supports. In assembly, type is not a property of the memory; it’s baked into the instructions. The addl instruction, for example, adds two 32-bit long words together. The memory cells it references do not know they hold long words. High-level languages add type information in various ways to ensure that the instructions and data jibe.

Some languages attach type information to the variable. The compiler tracks the type of each variable. This is static typing. Since the type is known at compile time, variable references can be type-checked early. Decisions about what code should be run can also be made early. For example, the << operator is overloaded many times in C++. When the compiler sees cout << name << endl, it can resolve the right overload and load its address into the executable so that nothing needs to be figured out at runtime.

Other languages attach types to the values stored in the variables. The types are not associated with the variables. This is dynamic typing. When the code x / 10 is run, the value stored in x is retrieved from memory. The interpreter must figure out what x is before it knows how to execute /. In some languages, a type tag is stored with the value. The interpreter might look at this tag, see that the value’s an integer, and invoke an integer division routine. In other languages, the value has no type tag, but it does have a list of the operations it supports. For example, in Ruby we can say:

> 10.methods
 => [:-@, :**, :<=>, :upto, :<<, :<=, :>=, :==, :chr, :===, :>>, :[], :%, :&, :inspect, :*, :+, :ord, :-, :/, :size, ...]

When we ignore the type and instead directly ask about the operations a value supports, we have duck typing.

Type systems can be divided in other independent ways. Languages with explicit typing require the programmer to announce a variable’s type in the source code, as in the declaration int nfingers = 10. Other languages have implicit typing, in which the types are not mentioned in the source code, as in nfingers = 10. When we see implicit types, we generally think of dynamic typing, but statically-typed languages can be implicit too. In modern C++11 and beyond, we can say:

auto ages = {5, 7, 10, 12};

Other statically-typed languages support such type inference too, including Java, C#, Haskell, Kotlin, and Rust.

Lifetime

Variables have various durations or lifetimes. The textbook defines four categories of lifetimes: static variables, whose memory persists for the entire program run; stack-dynamic variables, which come alive when their declaration is reached and die when their scope is exited; explicit heap-dynamic variables, which are allocated and freed by explicit instructions like malloc and free; and implicit heap-dynamic variables, which are allocated anew on every assignment, as in many scripting languages.

Scope

Just as lifetime is the time during which a variable is bound to memory, scope is the space in which a variable is visible. There are many possible ways to bound the scope of a variable. To restrict access to an instance variable to just the code of its class, we mark it private. To also grant access to subclasses, we mark it protected. To make a variable visible to many top-level functions, we make it global by declaring it at the top level. To make a variable local, we declare it within a function.

What happens when an inner scope introduces a variable with the same name as a variable in the outer scope? This doubling up of identifiers is called shadowing. It is legal in some languages and instances and illegal in others. Consider this code:

class Monster {
  private int hitPoints;

  public Monster(int hitPoints) {
    this.hitPoints = hitPoints;
  }
}

The constructor parameter shadows and supersedes the instance variable from the outer scope. Java allows us to access both but gives preference to the inner scope; we must qualify accesses to the outer variable with this. But shadowing a parameter with a local variable is not allowed in Java:

class Monster {
  private int hitPoints;

  public Monster(int hitPoints) {
    int hitPoints = ...;
  }
}

When we have multiple regions of code declaring and accessing variables, a choice arises: which variable do we access? There are two possibilities: static (or lexical) scoping, in which a name resolves to the nearest declaration in the surrounding program text, and dynamic scoping, in which a name resolves to the most recent declaration still alive on the call stack.

Let’s examine these two schemes with some real pseudocode:

function main()
  int a = 1
  int b = 2

  function f()
    int a = 0
    b += 1
    print(a, b)

  function g()
    int b = 4
    print(a, b)
    f()
    print(a, b)

  print(a, b)
  g()
  print(a, b)

First we sketch out the referencing environment of these functions by listing which variables are in scope, qualifying them by their owners. For this example, we consider the environments only at the ends of the functions.

Function Static Dynamic
main main.a, main.b main.a, main.b
f f.a, main.b f.a, g.b
g main.a, g.b main.a, g.b

In general, we can’t determine the referencing environment for dynamic scoping just by looking at the code. What appears on the call stack may depend on conditional statements, random numbers, and user input, all of which are unpredictable.

Let’s consider the runtime behavior of these programs by inspecting memory at various checkpoints.

Static Scoping

With static scoping, memory looks like this when we reach the first print statement:

Function a b
main 1 2

After function g declares b, we see:

Function a b
main 1 2
g 4

After function f declares a and updates b, we see:

Function a b
main 1 3
g 4
f 0

Note how main.b from the outer definition was updated. That’s static scoping in action.

When f finishes and control returns to g, we see:

Function a b
main 1 3
g 4

When g finishes and control returns to main, we see:

Function a b
main 1 3

This static scoping narrative leads to the following output:

1 2
1 4
0 3
1 4
1 3

Dynamic Scoping

With dynamic scoping, memory looks like this when we reach the first print statement:

Function a b
main 1 2

After function g declares b, we see:

Function a b
main 1 2
g 4

After function f declares a and updates b, we see:

Function a b
main 1 2
g 5
f 0

Note how g.b from the calling function was updated. That’s dynamic scoping in action.

When f finishes and control returns to g, we see:

Function a b
main 1 2
g 5

When g finishes and control returns to main, we see:

Function a b
main 1 2

This dynamic scoping narrative leads to the following output:

1 2
1 4
0 5
1 5
1 2

Dynamic scoping is harder to reason about than static scoping, and we don’t see it in many places these days. One place it survives is Unix environment variables: the environment variables a process sees come from the environment of the process that invoked it. We use this to our advantage at the command line.

Case Studies

Consider these three kinds of variables: an instance variable, a static variable, and the iterator of a for loop.

For each of the six properties that we’ve discussed, are those properties bound statically or dynamically? Let’s talk through each. For the instance variable, we have:

For the static variable, we have:

For the iterator, we have:

Conclusion

Variables are more complex creatures than you may have imagined. There’s appreciable depth to how they are named, how they are treated either as a memory address or as a value, how they are locked up or not locked up with a certain type of data, how long they live, and what code can access them. These are decisions our language designers belabor. When they make good decisions, we reap the benefits of hardly thinking about what’s going on under the hood.

Sincerely,