CS 430: Lecture 3 – Variables
Dear students,
Computer science has its roots in mathematics. It was the mathematicians who first had problems they wanted to mechanize. Computer science soon grew into a discipline of its own that appeals to people who don’t like mathematics. My hot take is that there’s one big reason for its success: computer science has more human-friendly names than f and x. When we name our data, we have a variable.
Making variables is a bit like breathing. We do it so often we don’t really think about it. But if we stop and examine what’s going on in our machine around variables, we find quite a lot going on. Let’s talk about some of those goings-on today.
von Neumann Architecture
An early and still prevalent model of a computer is the von Neumann architecture, which portrays the computer as a processor, a bank of memory, and a bus connecting the two. As the program executes, it reads from and writes to the cells of memory.
Cells are sometimes identified by an addressing scheme. What languages have you programmed in where you used addresses to refer to memory? Certainly C, but you’ve probably also used a language that uses addresses exclusively: the language of your spreadsheet software.
In most languages, we refer to cells by giving them names. The names impose identity and meaning that the cells do not otherwise have. When we name a cell `i`, we're saying that it holds an index. When we name a cell `balance`, we're saying that it holds a number that goes up and down. When we name a cell `foo`, we're saying to the cell, "Your life will be short and meaningless."
When we name a memory cell, we downplay the importance of the data’s location. We generally don’t care where the data is placed in memory. Array variables strike a middle ground between not caring and caring about the location. With an array, we give a meaningful name to a collection of values. We don’t really care where it appears in memory, but we do expect all the values to be located together in a contiguous sequence.
The von Neumann architecture departs from a mathematical representation of the universe. Math doesn’t have a bank of memory cells that change over time. When you name a value in math, you establish an unchangeable, immutable truth. When you name a memory cell in the von Neumann architecture, all you do is give the cell a nickname for easy future reference. The value inside that cell is mutable.
Static vs. Dynamic
Let’s examine six properties of variables. These properties take effect at different times in the life of our program. There are four rough “eras” in which we or the computer makes decisions about these properties:
- Language design time
- Compile time
- Load/link time
- Runtime
Any decisions that have been made before execution and that do not change as the program executes are static. Any decisions that are determined or are altered during runtime are dynamic.
Name
The most visible property of a variable is its name. We use identifiers to name variables. As we discussed last time, identifiers tend to start with a letter or underscore, and subsequent symbols are letters, numbers, or underscores.
Some languages use other symbols to denote further semantic meaning. In Perl, the name of an array variable starts with `@`, the name of a scalar (single-valued) variable starts with `$`, and the name of a hash (a collection of key-value pairs) starts with `%`. In Ruby, instance variables start with `@`, class (static) variables start with `@@`, and global variables start with `$`. In JavaScript, you may have seen `$(this)` when working with jQuery. That `$` is a legal identifier; it's an alias for the longer `jQuery` function. When you compile a Java class that has an inner class, the qualified name of the inner class is `Outer$Inner`. In Ruby, method names can end in `?`, which implies that the method is a boolean predicate.
Different languages have different conventions for how variable names are spelled. In Java and JavaScript, the convention is to use `camelCase`. In C, Python, and Ruby, the convention is to use `snake_case`. In CSS, the convention is to use `kebab-case`.
Some companies, including Microsoft and Xerox, followed a naming convention called Hungarian notation. In Hungary, family names come first and personal names come second. In Hungarian notation, a variable's name includes information about its family, or type, at its start. A string may be named `sName`, and a count may be named `nPeople`. Strict adherence to this notation is generally not recommended by today's software development influencers, for several reasons. Changing a variable's type is challenging if the type is hardcoded in the name. Variable names cluster in autocomplete menus, requiring more keystrokes to disambiguate. And modern IDEs make it easy to see a variable's type without encoding it in the name.
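As a quick sketch of the contrast (the class name and values here are invented for illustration), compare Hungarian-style names with plain descriptive ones in Java:

public class NamingDemo {
    public static void main(String[] args) {
        // Hungarian notation: the prefix encodes the type or "family."
        String sName = "Ada";
        int nPeople = 3;

        // Modern convention: descriptive names; the IDE reports the types.
        String name = "Ada";
        int peopleCount = 3;

        System.out.println(sName + " " + nPeople + " " + name + " " + peopleCount);
    }
}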
Not all names that fit the identifier rules can be used to name variables. Such words are called reserved words. In Java, `goto` is a reserved word; you can't name a variable `goto`. Sometimes language specifications attach special meaning to particular names. These special names are called keywords. In many languages, keywords and reserved words are the same set of names, but this is not always the case. Java does not attach special meaning to `goto`; it just outlaws it.
Names are generally a static decision. You can find the names in the source code before execution. But it's also possible for new variables to appear during execution, as happens when one runs `eval` on a string of user input.
Address
As mentioned earlier, the name of a variable provides a convenience over using the raw address of a memory cell. A name can appear in a couple of different contexts with different semantic meanings. Consider `x` in these two lines of code:
x = 7
y = x
In the first line, we care about the address of the memory cell to which `x` is bound. We drop a 7 in that cell with no concern for what was in it previously. We call `x` in this context an lvalue, which historically meant a value that could appear on the left-hand side of an assignment but which now refers to a value that has a unique identity. Usually identity means the address of the memory cell. Languages often provide an equality operator for checking if two lvalues have the same identity. In Java, that operator is `==`.
The address that a variable binds to may be determined statically or dynamically. In C, for example, variables declared as `static` are statically allocated, which means memory is set aside for them in the executable at compile time. Any other variable is allocated at runtime. When we call a function, its parameters and local variables are assigned addresses on the stack. When the function returns, these cells are reclaimed and the addresses should no longer be used. We can also dynamically allocate chunks of memory that are not bound up in the function lifecycle using `malloc`.
An interesting consequence of the von Neumann architecture is that one can have multiple names referring to the same memory cell. This is called aliasing. There are some nice advantages to this. One can pass around copies of pointers to large chunks of memory very cheaply. But we can also trick ourselves into thinking that two variables are independent when, in fact, they point to the same memory.
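Here's a minimal Java sketch of aliasing (the names are invented for illustration): two names bound to the same array, so a write through one name is visible through the other.

public class AliasDemo {
    public static void main(String[] args) {
        int[] xs = {1, 2, 3};
        int[] ys = xs;             // ys is an alias: both names refer to the same array

        ys[0] = 99;                // a write through one name...
        System.out.println(xs[0]); // ...is visible through the other: prints 99
    }
}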
Value
In the second line of the preceding code, we care about the value in the memory cell to which `x` is bound. We pull a 7 from that cell with no concern for where it's located. We call `x` in this context an rvalue, which historically meant a value that could appear on the right-hand side of an assignment statement or be passed as a parameter. Languages often provide an equality operator for checking if two rvalues have the same value. In Java, that operator is `equals`.
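A small Java sketch of the distinction (the class name and strings are invented for illustration): `==` asks whether two names refer to the same cell, while `equals` asks whether the cells hold equivalent values.

public class EqualityDemo {
    public static void main(String[] args) {
        // Two distinct String objects that happen to hold the same characters.
        String a = new String("tuna");
        String b = new String("tuna");

        System.out.println(a == b);      // false: different identities, different cells
        System.out.println(a.equals(b)); // true: same value
    }
}

Had we written the strings as plain literals, interning might make `==` happen to return true, which is exactly why relying on identity when we mean value is a trap.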
Values are usually bound dynamically, at runtime, and they often change as the program executes. That’s why we call them “variables.” The first time we assign a variable is its initialization.
Some languages support constants, which are variables whose values can't be changed. In C, we declare a constant variable with the `const` modifier. We tend to typeset the names of constants using capital letters.
In Java, we can declare a variable as `final`. In JavaScript, we use `const`. In Kotlin, we use `val`. It's easy to think that the values of these variables are immutable, but that's wrong. This Java code compiles fine, and the array is not immutable:
final String[] states = {"NJ", "MD"};
states[0] = "VA";
These modifiers do not make the value constant. They make the name unable to be reassigned to some other value.
In C and C++, there are actually two places where we can add the `const` modifier:
const int *xs = new int[2];
// xs[0] = 18; // Illegal, value is immutable
xs = nullptr; // Legal, name can be reassigned
int * const ys = new int[2];
// ys = nullptr; // Illegal, name cannot be reassigned
ys[0] = 18; // Legal, the values ys points to are mutable
In some languages, all variables are constants. They simply cannot be changed. Such languages are called pure, which is a pompous term, like when politicians say they are going to “right-size” an agency through budget cuts. We will look at a pure functional language this semester: Haskell. No variables can ever change their value in Haskell.
Type
Type refers to the "shape" of data and the operations that it supports. In assembly, type is not a property of a memory cell. The type of the data is imposed by the instructions. The `addl` instruction, for example, adds two longs together. The memory cells it references do not know they hold longs. Higher-level languages add type information in various ways to ensure that the instructions and data jibe.
Some languages attach type information to the variable. The compiler tracks the type of a variable. This is static typing. Since the type is known at compile time, the variable references can be type-checked early. Decisions about what code should be run can also be determined early. For example, there are many overloaded definitions of `<<` in C++. When the compiler sees `cout << name << endl`, it knows the type of `cout` and can load up the address of the right version of `<<` into the executable so it doesn't need to be figured out at runtime.
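We won't reproduce the C++ `<<` machinery here, but here's an analogous Java sketch (the names are invented for illustration): the compiler chooses between overloads using the static types it already knows, so no dispatch decision is left for runtime.

public class OverloadDemo {
    static void describe(int x)    { System.out.println("an int: " + x); }
    static void describe(String x) { System.out.println("a String: " + x); }

    public static void main(String[] args) {
        // The compiler picks the overload from each argument's static type.
        describe(7);       // prints "an int: 7"
        describe("seven"); // prints "a String: seven"
    }
}

Contrast this with the dynamically typed case below, where the decision has to wait until the value's type is known at runtime.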
Other languages attach types to the values stored in the variables. The types are not associated with the variables. This is dynamic typing. When the code `x / 10` is run, the value stored in `x` is retrieved from memory. The interpreter must figure out what `x` is before it knows how to execute `/`. In some languages, a type tag is stored with the value. The interpreter might look at this tag, see that the value is an integer, and invoke an integer division routine. In other languages, the value has no type tag, but it does have a list of the operations it supports. For example, in Ruby we can say:
> 10.methods
=> [:-@, :**, :<=>, :upto, :<<, :<=, :>=, :==, :chr, :===, :>>, :[], :%, :&, :inspect, :*, :+, :ord, :-, :/, :size, ...]
When we ignore the type and instead directly ask about the operations a value supports, we have duck typing.
Type systems can be divided in other, independent ways. Languages with explicit typing require the programmer to announce a variable's type in the source code, as in the declaration `int nfingers = 10`. Other languages have implicit typing, in which the types are not mentioned in the source code, as in `nfingers = 10`. When we see implicit types, we generally think of dynamic typing, but statically-typed languages can be implicit too. In modern C++ (C++11 and beyond), we can say:
auto ages = {5, 7, 10, 12};
Other statically-typed languages support such type inference too, including Java, C#, Haskell, Kotlin, and Rust.
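For example, here's a small Java sketch (the class name is invented for illustration) using `var`, which Java has supported for local variables since Java 10:

import java.util.List;

public class InferenceDemo {
    public static void main(String[] args) {
        // The compiler infers the static types from the initializers.
        var count = 10;                     // inferred as int
        var names = List.of("Ada", "Alan"); // inferred as List<String>

        System.out.println(count + " " + names);
    }
}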
Lifetime
Variables have various durations or lifetimes. The textbook defines the following four categories of lifetimes:
- Static variables are available for the entire duration of the program’s execution.
- Stack-dynamic variables are local variables tied to the lifecycle of their containing function. They are implicitly allocated space in the stack frame when the function is first called. When the function returns, the stack frame is implicitly deallocated and the local variables are wiped from memory.
- Explicit heap-dynamic variables live independent of the function lifecycle. They, or the structure they belong to, are explicitly allocated on the heap using `malloc` or `new`. In C and C++, it is the programmer's responsibility to deallocate these variables when they are no longer needed. In other languages, a garbage collector runs every so often and frees any memory to which the program no longer has references.
- Implicit heap-dynamic variables also live on the heap apart from the function lifecycle, but there's no explicit allocation. A mere assignment triggers the allocation. If the variable being assigned already pointed to heap-allocated memory, and no aliases also point to it, then that memory will be implicitly reclaimed. A Java sketch of the first three categories follows this list.
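Here is a minimal Java sketch of the first three categories (the names are invented for illustration). Java has no implicit heap-dynamic variables in the textbook's sense; a language like JavaScript or Python, where a plain assignment allocates, illustrates that fourth category better.

public class LifetimeDemo {
    // Static lifetime: one cell, alive for the entire execution.
    static int callCount = 0;

    static int[] makeBuffer() {
        // Stack-dynamic: allocated in this call's stack frame, reclaimed on return.
        int size = 4;

        // Explicit heap-dynamic: new allocates the array on the heap, so it
        // outlives this call as long as some reference to it survives.
        int[] buffer = new int[size];

        callCount += 1;
        return buffer;
    }

    public static void main(String[] args) {
        int[] buffer = makeBuffer();
        System.out.println(buffer.length + " " + callCount); // prints "4 1"
    }
}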
Scope
Just as lifetime refers to the time in which a variable is visible, scope is the space in which a variable is visible. There are many possible ways to bound the scope of a variable. To restrict access to an instance variable to just the code of the class, we mark it `private`. To also grant access to subclasses, we mark it `protected`. To make a variable visible to many top-level functions, we make it global by declaring it at the top level. To make a variable local, we declare it within a function.
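Here's a small Java sketch of a few of these boundaries (the class names are invented for illustration): a `private` member, a `protected` member, and a local variable.

class Creature {
    private int secret = 1;   // visible only inside Creature
    protected int shared = 2; // also visible in subclasses

    void act() {
        int local = 3;        // visible only inside act
        System.out.println(secret + shared + local);
    }
}

class Dragon extends Creature {
    void breathe() {
        System.out.println(shared);    // legal: protected members are visible here
        // System.out.println(secret); // illegal: private to Creature
    }
}

public class ScopeDemo {
    public static void main(String[] args) {
        Dragon smaug = new Dragon();
        smaug.act();     // prints 6
        smaug.breathe(); // prints 2
    }
}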
What happens when an inner scope introduces a variable with the same name as a variable in the outer scope? This doubling up of identifiers is called shadowing. It is legal in some languages and instances and illegal in others. Consider this code:
class Monster {
private int hitPoints;
public Monster(int hitPoints) {
this.hitPoints = hitPoints;
}
}
The constructor parameter shadows and supersedes the instance variable from the outer scope. Java allows us to access both, but gives preference to the inner scope. We must qualify the access to the variable from the outer scope. But shadowing a parameter with a local variable is not allowed in Java:
class Monster {
private int hitPoints;
public Monster(int hitPoints) {
int hitPoints = ...;
}
}
When we have multiple regions of code declaring and accessing variables, a choice arises: which variable do we access? There are two possibilities:
- When accessing `x`, we look in the scope of the local code structure first. If we don't find it, we proceed to the surrounding code structure. We resolve variables by looking through the hierarchy of definitions. Because we are using the code as a map for determining this scope, this scheme is called static scoping (because the compiler can see the scoping hierarchy at compile time) or lexical scoping (because the source code reveals the scoping hierarchy).
- When accessing `x`, we look for the most recent declaration of `x` on the stack. We resolve variables by looking through the hierarchy of function calls. Because we can only determine what variable is being accessed at runtime, this scheme is called dynamic scoping.
Let’s examine these two schemes with some real pseudocode:
function main()
    int a = 1
    int b = 2

    function f()
        int a = 0
        b += 1
        print(a, b)

    function g()
        int b = 4
        print(a, b)
        f()
        print(a, b)

    print(a, b)
    g()
    print(a, b)
First we sketch out the referencing environment of these functions by listing which variables are in scope, qualifying them by their owners. For this example, we consider the environments only at the ends of the functions.
| Function | Static | Dynamic |
|---|---|---|
| `main` | `main.a`, `main.b` | `main.a`, `main.b` |
| `f` | `f.a`, `main.b` | `f.a`, `g.b` |
| `g` | `main.a`, `g.b` | `main.a`, `g.b` |
In general, we can’t determine the referencing environment for dynamic scoping just by looking at the code. What appears on the call stack may depend on conditional statements, random numbers, and user input, all of which are unpredictable.
Let’s consider the runtime behavior of these programs by inspecting memory at various checkpoints.
Static Scoping
With static scoping, memory looks like this when we reach the first print statement:
| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 2 |
After function `g` declares `b`, we see:

| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 2 |
| `g` | | 4 |
After function `f` declares `a` and updates `b`, we see:

| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 3 |
| `g` | | 4 |
| `f` | 0 | |
Note how `main.b` from the outer definition was updated. That's static scoping in action.
When `f` finishes and control returns to `g`, we see:

| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 3 |
| `g` | | 4 |
When `g` finishes and control returns to `main`, we see:

| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 3 |
This static scoping narrative leads to the following output:
1 2
1 4
0 3
1 4
1 3
Dynamic Scoping
With dynamic scoping, memory looks like this when we reach the first print statement:
| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 2 |
After function `g` declares `b`, we see:

| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 2 |
| `g` | | 4 |
After function `f` declares `a` and updates `b`, we see:

| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 2 |
| `g` | | 5 |
| `f` | 0 | |
Note how `g.b` from the calling function was updated. That's dynamic scoping in action.
When `f` finishes and control returns to `g`, we see:

| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 2 |
| `g` | | 5 |
When `g` finishes and control returns to `main`, we see:

| Function | `a` | `b` |
|---|---|---|
| `main` | 1 | 2 |
This dynamic scoping narrative leads to the following output:
1 2
1 4
0 5
1 5
1 2
Dynamic scoping is harder to reason about than static scoping, and we don't see it in many places these days. One place we do see it is Unix environment variables: the environment variables a process sees come from the invoking environment, much as a dynamically scoped variable resolves to a declaration in the calling function. We use this to our advantage at the command line.
Case Studies
Consider these three kinds of variables, sketched below:

- a `private` instance variable in Java
- a `public static final` variable in Java
- an iterator `i` declared in a `for` loop in C
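Here's a Java sketch of the first two kinds (the names are invented for illustration); the third comes from C, so an analogous Java `for` loop iterator stands in for it.

public class CaseStudyDemo {
    // A public static final variable.
    public static final int MAX_HIT_POINTS = 100;

    // A private instance variable: one cell per object, living on the heap
    // inside the object that hosts it.
    private int hitPoints = MAX_HIT_POINTS;

    public static void main(String[] args) {
        System.out.println(new CaseStudyDemo().hitPoints);

        // A loop iterator like C's i: a stack-dynamic local variable.
        for (int i = 0; i < 3; i++) {
            System.out.println(i);
        }
    }
}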
For each of the six properties that we’ve discussed, are those properties bound statically or dynamically? Let’s talk through each. For the instance variable, we have:
- name: static, since it’s known at compile time
- address: dynamic, since it doesn’t appear in memory until an object is dynamically allocated
- value: dynamic, since its value is subject to assignment statements
- type: static, since the compiler knows the type
- lifetime: explicit heap-dynamic, since it lives as long as its explicitly dynamically-allocated host object lives
- scope: static, visible in all non-static methods in the class
For the `static` variable, we have:
- name: static, since it’s known at compile time
- address: static, since the compiler can set aside space for it at compile time
- value: static, since the compiler knows the value and it can’t be reassigned
- type: static, since the compiler knows the type
- lifetime: static, since’s it’s available throughout the execution
- scope: static, visible in all code that can access the class
For the iterator, we have:
- name: static, since it’s known at compile time
- address: dynamic, since space is set aside for it on the stack each time the host function is called
- value: dynamic, since it keeps getting updated
- type: static, since the compiler knows the type
- lifetime: stack-dynamic, since it’s a local variable
- scope: static, visible within the loop
Conclusion
Variables are more complex creatures than you may have imagined. There’s appreciable depth to how they are named, how they are treated either as a memory address or as a value, how they are locked up or not locked up with a certain type of data, how long they live, and what code can access them. These are decisions our language designers belabor. When they make good decisions, we reap the benefits of hardly thinking about what’s going on under the hood.
P.S. It’s time for a haiku!
That poor Foo Jr.
He never knew his father
In fact, no one did