teaching machines

CS 268: Lecture 1 – Web History

February 4, 2020 by . Filed under lectures, spring-2020, webdev.

Welcome to CS 268: Web Systems!

In this class, we will investigate how to make software that runs in a web browser. It’s quite likely that web development is going to be one of the most important skills you can have in this stage of the technological era. Standalone desktop apps will never go away entirely, but capitalism makes the web hard to beat. You can write your software once for all computers, distribute it through the internet, control who accesses it, and deploy fixes more or less immediately.

Let’s have a look at how the web came to be.

History

Internet

It was the 1960s. Democrats John Kennedy and Lyndon Johnson served as presidents, and the nation was strained in a battle for civil rights. World War II had ended two decades earlier, but the United States and Soviet Union were still glaring at each other across the Bering Sea. Several years before, in 1957, the Soviet Union had launched Sputnik I, the first satellite, into orbit. Kennedy’s predecessor Dwight Eisenhower had responded by forming the Advanced Research Projects Agency (ARPA) to spur American innovation. The Department of Defense had funding to unite researchers and military leaders to restore America’s lead in science and technology.

Alexander Graham Bell’s company American Telephone and Telegraph, founded in 1885, still enjoyed a government-authorized monopoly of the phone network. By some accounts, the Department of Defense looked at the way the phone system worked with concern. In order for two parties to exchange data and voice, a dedicated circuit was established between them. This circuit-switching was fast but expensive. Considerable infrastructure was required to support dedicated circuits. The network was also vulnerable. If a missile took down part of the network, connections would be severed. Others, namely the non-military scientists and engineers who partnered with the Department of Defense, downplay the Cold War paranoia. They cite the fact that there were computers spread across the country, and they needed a standard and reliable way of communicating.

Whatever the primary motive, the Department of Defense funded the development of a network that could withstand disruption. The result was a network between computers called the ARPANET. The first message on the network went from UCLA to the Stanford Research Institute in 1969. The payload was the text LOGIN, but the receiving computer crashed after receiving just LO.

To communicate across the ARPANET, the sending computer broke a message up into small chunks and pushed them onto the network. But the chunks didn’t follow an established circuit. Instead, each chunk or packet was directed to the receiving computer by way of routers, which were distributed around the country and which maintained a directory of all the computers on the network. This packet-switching meant that some chunks might follow different routes. Because some routes were slower than others, the packets might arrive out of order. The receiving computer would reorganize them. The eventual protocols specifying how packets were formed, delivered, and reassembled across the ARPANET were TCP and IP.
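The chunk-and-reorder idea can be sketched in a few lines of Ruby. This is only a toy model that runs on one machine: the Packet structure and the packetize and reassemble helpers are names invented for illustration, and real TCP numbers bytes rather than whole chunks.

```ruby
# A toy model of packet switching. No real network is involved.
# Packet is a hypothetical structure invented for this sketch.
Packet = Struct.new(:sequence, :payload)

# Break a message into numbered chunks of at most `size` characters.
def packetize(message, size)
  message.chars.each_slice(size).each_with_index.map do |chars, index|
    Packet.new(index, chars.join)
  end
end

# Put the packets back in order and glue their payloads together.
def reassemble(packets)
  packets.sort_by(&:sequence).map(&:payload).join
end

packets = packetize('LOGIN to the ARPANET', 4)
arrived = packets.shuffle   # simulate packets taking different routes
puts reassemble(arrived)
```

Because every packet carries its own sequence number, the receiver can restore the message no matter what order the packets arrive in.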

As the ARPANET grew bigger and scaled beyond its military roots, it was renamed the Internet.

The key takeaway here is that the Russians had the foresight to launch Sputnik in 1957 so that they would have a vehicle for interfering with American elections in 2016. That is indirection at its finest.

Web

The Internet soon crossed over into Europe. One of the European hubs of the network was CERN, a nuclear physics laboratory in Geneva, Switzerland, which is known today for building the Large Hadron Collider.

In the early 1980s, CERN employed one Tim Berners-Lee, a second-generation computer scientist from Britain. (In fact, both his parents were computer scientists, and he also married a computer scientist.)

Berners-Lee had a vision for sharing research on the Internet: instead of just presenting flat listings of text, the text would contain links to other documents, forming a vast graph of the world’s information. Using a NeXT computer from Steve Jobs’ company, he wrote a web server and a web client (or browser). By late 1990, he had identified the key technologies of his system: HTML for marking up documents, URIs for addressing them, and HTTP for transferring them.

He soon began serving out the world’s first web page. This would have meant nothing if he had not also released his software for free. CERN charged no royalties. A freely accessible structured information graph was born. Berners-Lee called it the World Wide Web.

Several years later, Marc Andreessen led a team in Illinois that built Mosaic, which supported images and which they ported to Windows. Andreessen and some of his team went on to form Netscape Communications Corporation, whose Netscape Navigator was the world’s most popular browser in the 1990s. But Microsoft licensed Mosaic and released Internet Explorer, which in time displaced Netscape and led to an antitrust case in the United States. Microsoft’s tight integration of Internet Explorer in their operating system was declared illegal. But the verdict came too late for Netscape to recover.

Before they dissolved as a company, Netscape released its code base to support the development of a comprehensive suite of Internet tools, including a browser, an email client, an HTML editor, an address book, and a chat client. The group that took over the development was the Mozilla Foundation, which eventually pruned away everything but the browser, which they named Firefox, and the email client, which they named Thunderbird. Firefox was the only serious competitor to Internet Explorer in the 2000s, but then Google released Chrome in 2008. Chrome today accounts for around 65% of the world’s web traffic. Apple’s Safari accounts for around 15%. But the shares differ by region.

Of course, technology alone is not what makes the web interesting or successful. The web has become what it is because of Berners-Lee’s original spirit of sharing content and the tooling for distributing and accessing it. And that’s what we do on the Internet. We share everything.

Computers Talking

If we compare network technology to physical transportation, the Internet is the network of roads and the web is one of the many goods that can be shipped across it. The computers providing the goods and consuming the goods are the endpoints. We call a computer that provides the goods a server, and its goods are called services. The computer that requests and consumes a service is a client.

Most any computer can be a server. Two things are required of the server. It must have an IP address which uniquely identifies it on the network. It must also have a program that’s running in an infinite loop, waiting for clients to connect to it.

If we are in business, we probably want a server that is connected to a fast network and is itself fast. We will also buy a domain name from a domain name registrar. The registrar will help maintain a mapping called the Domain Name System (DNS), which maps domain names to our servers’ IP addresses. But none of this is necessary.
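To see the name-to-address mapping at work, we can ask the operating system's resolver what addresses sit behind a name. This is a minimal sketch: addresses_for is a hypothetical helper name, and the example looks up localhost so that it works without a network connection.

```ruby
require 'socket'

# Ask the operating system's resolver for the IP addresses behind a name.
# `addresses_for` is a hypothetical helper name chosen for this sketch.
def addresses_for(hostname)
  Addrinfo.getaddrinfo(hostname, nil, nil, :STREAM).map(&:ip_address).uniq
end

# 'localhost' normally resolves without leaving the machine.
puts addresses_for('localhost').inspect
```

Swapping in a public domain name should print that server's addresses instead, provided the machine can reach a DNS server.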

How does a client send a request to a server? Our operating systems provide a door to the internet called a socket. The server opens its door and leaves it open indefinitely. The client opens its door only when it has a request.

The server may provide many services. It may serve out web pages, a shared Minecraft world, email, and Git repositories, amongst other things. So that a computer may offer many services at once, it has many doors. These doors are called ports. Web pages are often delivered across ports 80 or 443 and Minecraft across port 25565.
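Here is a rough sketch of both sides of a conversation, run on a single machine. The tiny echo-style protocol is made up for illustration. The server binds to port 0, which asks the operating system for any free port, and it handles just one client so that the script ends. A real server would keep looping.

```ruby
require 'socket'

# Open a listening socket on the local machine. Port 0 asks the operating
# system for any free port, so this sketch won't collide with a real service.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]

# The server waits for a client, reads its request, and sends a reply.
# A real server would loop forever. This one stops after a single client.
server_thread = Thread.new do
  client = server.accept
  request = client.gets.chomp
  client.puts "You asked for: #{request}"
  client.close
end

# The client opens its door only when it has a request.
socket = TCPSocket.new('127.0.0.1', port)
socket.puts 'a shared Minecraft world'
reply = socket.gets
socket.close

server_thread.join
server.close
puts reply
```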

HTTP

Let’s have a look at how a web client (a browser) issues a request to a web server. The protocol that both client and server follow is what Berners-Lee called HTTP. The client initiates a conversation by sending a request to a server. The request includes a method and a set of headers describing the requested service.

To retrieve a page, the client sends a GET request that has at the very least a Host header. The message might look like this:

GET /index.html HTTP/1.1
Host: twodee.org

We specify the path to the resource we are requesting, which happens to be a plain text HTML document. We also specify the version of HTTP we are using. The latest is HTTP/2, which was modeled on Google’s experimental SPDY protocol. Around 43% of current traffic uses HTTP/2. But HTTP/1.1 is simpler to inspect.

Note the blank line at the end of the message. According to the HTTP specification, each line should end with both a carriage return and a linefeed (a CRLF or a literal \r\n).
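To make the line endings visible, we might assemble the message as a single string. The build_request helper below is a hypothetical name, not something from Ruby's standard library, and it assumes the header names and values are already well formed.

```ruby
CRLF = "\r\n"

# Assemble a request: the request line, one line per header, and then the
# blank line that marks the end of the message.
# `build_request` is a hypothetical helper written for this sketch.
def build_request(method, path, headers)
  lines = ["#{method} #{path} HTTP/1.1"]
  headers.each { |name, value| lines << "#{name}: #{value}" }
  lines.join(CRLF) + CRLF + CRLF
end

request = build_request('GET', '/index.html', 'Host' => 'twodee.org')
puts request.inspect
```

Printing with inspect shows each \r\n explicitly, including the pair that forms the blank line.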

We can write a little script that submits this request to the server. I will use Ruby because it is lightweight and fun.

First we create a socket to the HTTP door (port 80) on our server, which is identified by its IP address. Then we send our request and collect up the response.

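#!/usr/bin/env ruby

require 'socket'

# Connect to the web server's HTTP door (port 80) by IP address.
socket = TCPSocket.open('138.68.15.70', 80)

# Send the request line, the Host header, and the terminating blank line.
socket.write("GET /index.html HTTP/1.1\r\n")
socket.write("Host: twodee.org\r\n")
socket.write("\r\n")

# Echo the response line by line until the server closes the connection.
while line = socket.gets
  puts line
end

socket.close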

The response might look something like this:

HTTP/1.1 200 OK
Date: Tue, 28 Jan 2020 19:48:34 GMT
Server: Apache/2.4.18 (Ubuntu)
Vary: Host,Accept-Encoding
Last-Modified: Tue, 09 Aug 2016 19:47:25 GMT
ETag: "6b-539a8cd574b68"
Accept-Ranges: bytes
Content-Length: 107
Content-Type: text/html

<!DOCTYPE html>
<html>
<head>
  <title>...</title>
</head>
<body>
Hello, client. It's me, server.
</body>
</html>
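A response like this one has three parts: a status line, a run of headers, and, after a blank line, the body. As a side sketch, we could pull those parts apart as follows. The parse_response helper is hypothetical, and it assumes a well-formed response with simple Name: value headers. A shortened, hard-coded response stands in for a real one so that no network is needed.

```ruby
# Split a raw HTTP response into its status line, headers, and body.
# `parse_response` is a hypothetical helper written for this sketch. It
# assumes every header line has the form "Name: value".
def parse_response(raw)
  head, body = raw.split("\r\n\r\n", 2)
  status_line, *header_lines = head.split("\r\n")
  version, code, reason = status_line.split(' ', 3)
  headers = header_lines.to_h { |line| line.split(': ', 2) }
  { version: version, code: code.to_i, reason: reason, headers: headers, body: body }
end

# A shortened stand-in for a real response.
raw = "HTTP/1.1 200 OK\r\n" \
      "Content-Length: 32\r\n" \
      "Content-Type: text/html\r\n" \
      "\r\n" \
      "Hello, client. It's me, server.\n"

response = parse_response(raw)
puts response[:code]
puts response[:headers]['Content-Type']
puts response[:body]
```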

When we run the socket script, we see a slight delay before it finishes. That’s because we are using HTTP/1.1. In earlier versions of HTTP, the connection to the server closed immediately by default. If the HTML content included a reference to an image, the client had to establish a new TCP connection to retrieve the image. Such building up and tearing down of sockets is slow.

In HTTP/1.1, the server by default keeps the socket open for a bit in case the client sends a followup request. If the client is going to make only a single request, it specifies the Connection header with the value close:

socket.write("GET /index.html HTTP/1.1\r\n")
socket.write("Host: twodee.org\r\n")
socket.write("Connection: close\r\n")
socket.write("\r\n")

This removes the delay. This behavior is not exactly significant, but I include it to illustrate how technology evolves. In fact, a recurring theme of this course is that a technology progresses from slow but understandable to fast but incomprehensible. You are coming into web development at the incomprehensible stage, but it’s only going to get more so.

HTTP has several other methods that we will see through the semester, namely POST, PUT, and DELETE.

These correspond quite nicely to SQL’s INSERT, UPDATE, and DELETE. And GET is like SELECT. For the time being, we will focus only on GET.

Normally we don’t issue GET requests directly. Rather, we enter a URI in the web browser, and it opens a socket and issues the request on our behalf.
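We can roughly imitate the browser's first step with Ruby's URI library, which splits an address into the pieces needed to open a socket and write a request. The request_for helper is a hypothetical name, and a real browser sends many more headers than this.

```ruby
require 'uri'

# Work out where to connect and what to send for a given address.
# `request_for` is a hypothetical helper written for this sketch.
def request_for(address)
  uri = URI.parse(address)
  path = uri.path.empty? ? '/' : uri.path
  {
    host: uri.host,
    port: uri.port,   # defaults to 80 for http and 443 for https
    request: "GET #{path} HTTP/1.1\r\nHost: #{uri.host}\r\n\r\n"
  }
end

plan = request_for('http://twodee.org/index.html')
puts plan[:host]
puts plan[:port]
puts plan[:request].inspect
```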

HTML

The content sent back by the server is a hierarchy or tree of information, marked up in HTML. But computers don’t exchange trees across the network. They exchange flat sequences of bytes. Therefore, the tree must be serialized. To mark the start and end of a node in the tree, we use an element, which has this form:

<element>content...</element>

Each <...> is called a tag. <element> is the opening tag, and the </element> is the closing tag.

The content itself is a mix of child elements and implicit text nodes.

Each element may be shaped by zero or more attributes:

<element attribute0="value0" attribute1="value1">content</element>

Whitespace in between attributes, tag names, and top-level punctuation is not significant. For example, this element is equivalent to the preceding:

<element
  attribute0="value0"
  attribute1="value1">content</element>

But whitespace in an attribute or in the content may be significant.
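To make the idea of serializing a tree concrete, here is a small sketch that walks a tree and emits opening tags, attributes, content, and closing tags. The Node structure and the serialize helper are hypothetical, and the sketch skips details that real HTML needs, such as escaping special characters and elements that have no closing tag.

```ruby
# Node is a hypothetical tree structure for this sketch: a tag name, a hash
# of attributes, and a list of children. Each child is either another Node
# or a plain string, which plays the role of a text node.
Node = Struct.new(:name, :attributes, :children)

# Flatten a tree into a string of tags and text.
def serialize(node)
  return node if node.is_a?(String)

  attributes = node.attributes.map { |key, value| %( #{key}="#{value}") }.join
  content = node.children.map { |child| serialize(child) }.join
  "<#{node.name}#{attributes}>#{content}</#{node.name}>"
end

tree = Node.new('p', { 'class' => 'greeting' }, [
  'Hello, ',
  Node.new('em', {}, ['client']),
  '.'
])

puts serialize(tree)
```

The browser does the reverse: it reads the flat sequence of tags and text and rebuilds the tree.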

In-class Exercise

As your teacher, I am going to do what I can to give you opportunities to learn web development. But I am only a small part of your learning. Really, my job is to make you—encourage you—to do things that engage your brain. So, in most of our lectures, I will stop talking for a while and get you talking with a neighbor as you complete an exercise.

So, here’s your exercise. First, visit Crowdsource and claim a task number. Then investigate the HTML element below that corresponds to your task number:

  1. html
  2. head
  3. body
  4. p
  5. a
  6. img
  7. div
  8. span
  9. iframe
  10. table, tr, td
  11. ul
  12. ol
  13. img
  14. h1, h2, …
  15. main
  16. nav

Read up on your element’s attributes. Write a short snippet of HTML demonstrating how your element and its attributes might be used. Submit it on Crowdsource. We will examine your submissions in lecture.

When you search the web for your element, W3Schools is often going to be the top result. I suggest you favor Mozilla’s Developer Network over W3Schools for more comprehensive documentation.

TODO

Here’s your TODO list for next time:

See you next time.

Sincerely,

P.S. It’s time for a haiku!

“Hello, world,” I lied
There was no world till later
When its Tim had come