teaching machines

Seven Digit Study

My son and I are on our way to school. It’s just the two of us. These drives should be a great opportunity for some genuine father-son talk, but it rarely happens. My mind is usually anxious about the day ahead, the traffic around us, and the hazards of winter. Even when we do talk, I can barely hear my son’s quiet voice above the engine noise and the blowing air of the heater. But today, a good conversation does happen.

He wonders aloud about what the most common seven-digit number is. Maybe it’s 1000000 or 9999999? I think to myself that such a question has no answer but then remember I’ve got a duty to develop scientific inquiry in my child. I suggest that we don’t have to guess. We could do a real study of real human beings, asking each one to identify the first seven-digit number that comes to mind. He likes the idea.

We discuss the difficulties of getting a good sample of the population. Perhaps he could ask kids at school? Then I remember that Amazon has a service called Mechanical Turk for getting a bunch of humans to complete tasks that a machine can’t do, like navigating a webpage to test its usability or identifying the contents of images. We make a plan to look into it.

Some weeks later, our stay-at-home orders arrive from our governor, giving us plenty of time together. We set up a requester account in Mechanical Turk and figure out how to create a task that asks the workers a single question. There are two tricky parts:

  • Validating the form input to only accept a number made of seven digits.
  • Wording the question to be short and unambiguous without being leading. We don’t want to include an example of a seven-digit number that a worker could copy and paste.

For a title, we enter “7-digit Number Survey.” For a description, we enter “This is a very simple survey asking you to think of a number. 2 cents for your 2 cents!” We enter this source for the form shown to workers, which includes the validation:

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form answer-format="flatten-objects">
  <div>
    <p>Enter a 7-digit number without spaces or punctuation:</p>
    <crowd-input id="my-number-input" name="number" required></crowd-input>
  </div>
</crowd-form>

<script>
var numberInput = document.getElementById('my-number-input');
var form = document.querySelector('crowd-form');

function validate(event) {
  var text = numberInput.value;
  if (!text.match(/^\d{7}$/)) {
    alert('Your number must be 7 digits and contain no spaces or punctuation.');
    event.preventDefault();
  }
}

form.addEventListener('submit', validate);
</script>

This source produces a simple form: the prompt followed by a single text field.

We’re ready to deploy our task, but first we need to decide how many responses we want and how much we want to pay for each response. I pony up $20 for the cause. If we give each human worker $0.02, then we can get 1000 responses. That should be plenty, we think. Amazon asks for an additional $10 fee. I enter a 16-digit number and then our task is live.
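The budget arithmetic is easy to double-check in Ruby. This is only a sketch: the method and variable names are ours, and working in whole cents is a choice we make here to sidestep floating-point rounding.

```ruby
# Budget check in whole cents to avoid floating-point rounding.
# The figures come from the paragraph above; the method name is illustrative.
def responses_affordable(budget_cents, reward_cents)
  budget_cents / reward_cents # integer division
end

budget_cents = 20 * 100 # the $20 we put up
reward_cents = 2        # $0.02 per response
fee_cents    = 10 * 100 # Amazon's additional $10 fee

puts "responses: #{responses_affordable(budget_cents, reward_cents)}"
puts "total cost: $#{(budget_cents + fee_cents) / 100}"
```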

The results start flying in. By the end of the first day, we have collected about 500 of the responses. By the end of the second day, 900 responses. By noon on the third day, we have our 1000 responses. We take a moment to soak in the data. It is beautiful. It is ours. We quickly dive into analyzing it.

My son is at the computer, and I tell him what code to write. We use Ruby. We load in the table of data, cull the column we want, and sort the numbers.

require "csv"
data = CSV.read("7digit.csv", headers: :first_row)
strings = data.map { |row| row["Answer.number"] }
ints = strings.map(&:to_i).sort

We keep two versions of the list of numbers: one a list of strings and one a list of equivalent integers. We want both versions because some operations are easier on strings and some are easier on integers. For example, it’s slightly easier to count digits in the string "0000001" than the equivalent integer 1.
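A small sketch of the difference, using the example above. Nothing here touches our real data.

```ruby
# The same response in both representations.
s = "0000001"
i = s.to_i # leading zeros are dropped, leaving the integer 1

puts s.size                # 7, the digits as the worker typed them
puts i.digits.size         # 1, the digits left after conversion
puts i.to_s.rjust(7, "0")  # 0000001, padded back out to seven characters
```

Counting digits on the integer takes the extra padding step, which is why we keep the strings around.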

The low-hanging fruit of our analysis is to count how much of the data was of the proper form, containing exactly seven digits.

count7 = strings.count { |n| n.size == 7 }
puts "7 digits: #{count7}"

The output makes us cheer.

7 digits: 1000

Our input validation worked. All 1000 responses are usable.
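Strictly speaking, the size check would also count a seven-character answer like "12a4567". The form already rejects those, so the result is the same here, but a stricter check is a cheap safeguard. A sketch on a made-up sample, not our survey data; the method name is ours:

```ruby
# Hypothetical sample, not the survey data.
SAMPLE = ["1234567", "0000100", "12a4567", "123456", "12345678"].freeze

# \A and \z anchor to the whole string. In Ruby, ^ and $ anchor to a line,
# so a trailing newline could slip past them.
def seven_digits?(s)
  s.match?(/\A\d{7}\z/)
end

valid = SAMPLE.count { |s| seven_digits?(s) }
puts "7 digits: #{valid} of #{SAMPLE.size}"
```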

We then determine the minimum and maximum responses.

min = strings.min
puts "minimum: #{min}"
max = strings.max
puts "maximum: #{max}"

The output is both surprising and unsurprising.

minimum: 0000100
maximum: 9999999

Not a single worker has responded with 0000000. We become curious about the other repeating-digit numbers and calculate the frequency of 1111111, 2222222, and so on.

(0..9).each do |i|
  count = strings.count { |s| s =~ /^#{i}{7}$/ }
  puts "#{i.to_s * 7} -> #{count}"
end

The output shows that only 7777777 gets more than a few hits.

0000000 -> 0
1111111 -> 1
2222222 -> 0
3333333 -> 1
4444444 -> 0
5555555 -> 2
6666666 -> 0
7777777 -> 6
8888888 -> 0
9999999 -> 2

It’s clear that some workers have picked the same number as someone else, but how many?

unique_count = strings.uniq.size
puts "unique: #{unique_count}"

The output shows that uniqueness is far more common than duplication.

unique: 876

My son wants to know the sum and average.

sum = ints.sum
puts "sum: #{sum}"
average = sum / 1000.0
puts "average: #{average}"

The output surprises him.

sum: 4931037193
average: 4931037.193

He is initially perplexed that the sum and average look so similar, but then he remembers how the average is calculated and the result makes sense. If the numbers were uniformly distributed, we’d expect an average of about 5000000. The average is less than this. Does that mean anything? I don’t know.
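One rough way to poke at that question is to compare the gap with the spread we'd expect from chance alone. The sketch below assumes, purely for illustration, that responses are drawn uniformly from 0000000 through 9999999. Our data clearly isn't uniform, so treat this as a back-of-the-envelope check and not a proper test. The method name and parameters are ours.

```ruby
# Back-of-the-envelope check, ASSUMING uniform responses over 0..n_values-1.
# Returns how many standard errors the observed mean sits below the expected mean.
def gap_in_standard_errors(observed_mean, n_values, sample_size)
  expected_mean = (n_values - 1) / 2.0
  # Standard deviation of a discrete uniform distribution on 0..n_values-1.
  std_dev = Math.sqrt((n_values**2 - 1) / 12.0)
  # Standard error of the mean for a sample of this size.
  std_error = std_dev / Math.sqrt(sample_size)
  (expected_mean - observed_mean) / std_error
end

# 10_000_000 possible answers, 1000 responses, and the average from our output.
z = gap_in_standard_errors(4_931_037.193, 10_000_000, 1000)
puts "gap in standard errors: #{z.round(2)}"
```

Under that assumption the gap comes out to less than one standard error, which hints that the lower average could easily be chance.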

We wonder about the median and debate how it should be calculated in a list of even length. My son says if there is no middle number, then it’s the average of the middle two numbers.

median = (ints[499] + ints[500]) / 2.0
puts "middle: #{median}"

The output shows a median greater than the mean.

middle: 5004019.0

We don’t know if this arrangement of the average and median is significant or not.
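Our median line hard-codes the two middle positions of a list of exactly 1000 numbers. A more general version that follows my son's rule might look like this sketch. The method name is ours, and it assumes a non-empty list.

```ruby
# Median of a non-empty list of numbers: the middle value for odd lengths,
# the average of the two middle values for even lengths.
def median(numbers)
  sorted = numbers.sort
  mid = sorted.size / 2
  if sorted.size.odd?
    sorted[mid]
  else
    (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

puts median([3, 1, 2])    # odd length: 2
puts median([4, 1, 3, 2]) # even length: 2.5
```

For a list of 1000 numbers, this picks indices 499 and 500, the same two we used.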

Somewhere my son has learned about perfect numbers, which are numbers whose proper factors (every factor except the number itself) sum to the number. Consider 28, whose proper factors are [1, 2, 4, 7, 14]. Add these up and one gets 28. He wonders if there are any perfect numbers in the data.

perfect = ints.select do |n|
  (1..n / 2).select { |x| n % x == 0 }.sum == n
end
puts perfect

This code runs slowly because we ask each number if it’s perfect by calculating all of its factors and summing them up. We would have been better off collecting a list of all perfect numbers with 7 digits or fewer and just looking for those in the list. Either way, there are no perfect numbers in the responses.
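The faster approach could look like the sketch below. There are only four perfect numbers with seven digits or fewer: 6, 28, 496, and 8128. The next one, 33550336, has eight digits. The sample list is made up for illustration, and the method name is ours.

```ruby
# Every perfect number with seven digits or fewer.
# The next perfect number, 33_550_336, has eight digits.
KNOWN_PERFECT = [6, 28, 496, 8128].freeze

# Keep only the responses that appear in the known list.
def perfect_responses(ints)
  ints.select { |n| KNOWN_PERFECT.include?(n) }
end

# Hypothetical responses, not the survey data.
sample = [1234567, 8128, 8675309, 28]
puts perfect_responses(sample).inspect
```

A lookup like this does a handful of comparisons per response instead of millions of divisions.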

Are people more likely to choose an odd number or an even number?

oddcount = ints.count { |n| n % 2 == 1 }
puts "odd: #{oddcount}"

The output tells us that the odds are for the odds.

odd: 589

We wonder how frequently each digit is seen. This code gets a bit more involved.

digits = Hash.new(0)
strings.each do |n|
  n.chars.each { |digit| digits[digit] += 1 }
end
digits = digits.sort_by { |n, count| count }.reverse.to_h
digits.each do |n, count|
  puts "#{n} -> #{count}"
end

In retrospect, I see that a Hash is probably not necessary. An array of 10 counters would have been a bit simpler, but not that much. The output is interesting.

5 -> 838
4 -> 810
7 -> 806
2 -> 758
6 -> 712
3 -> 703
1 -> 678
8 -> 646
9 -> 622
0 -> 427

We are nearly twice as likely to see a 5 as a 0. There’s probably some Freudian explanation for our zero-aversion, but I don’t mention that to my son.
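For the curious, the array-of-counters alternative I mentioned might look like this sketch. It runs on a made-up sample, not our data, and the method name is ours.

```ruby
# Count digits with an array of 10 counters; the index is the digit itself.
def digit_counts(strings)
  counts = Array.new(10, 0)
  strings.each do |s|
    s.each_char { |c| counts[c.to_i] += 1 }
  end
  counts
end

# Hypothetical responses, not the survey data.
sample = ["1234567", "7777777", "1000000"]
counts = digit_counts(sample)

# Pair each count with its digit and print from most to least frequent.
counts.each_with_index
      .sort_by { |count, _digit| -count }
      .each { |count, digit| puts "#{digit} -> #{count}" }
```

The counting gets simpler, but the sorting step needs the index carried along, so the savings are modest.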

Lastly we examine the frequencies of the complete 7-digit numbers. The code is similar to the code we use to calculate the frequencies of digits.

popular = Hash.new(0)
strings.each do |n|
  popular[n] += 1
end
popular = popular.sort_by { |n, count| count }.to_h
popular.each do |n, count|
  puts "#{n} -> #{count}"
end

The output is long, and I include only the numbers that appeared more than once.

2938475 -> 2
6767676 -> 2
5678456 -> 2
6969696 -> 2
1029384 -> 2
6543210 -> 2
5555555 -> 2
8675301 -> 2
1597535 -> 2
1236547 -> 2
9999999 -> 2
1357924 -> 2
2345678 -> 2
7894561 -> 3
7654321 -> 5
9876543 -> 5
7777777 -> 6
1000000 -> 7
8675309 -> 14
1234567 -> 78
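On Ruby 2.7 or later, the counting loop can be condensed with Enumerable#tally. Here is a sketch on made-up data that also keeps only the answers appearing more than once, like the list above. The method name is ours.

```ruby
# Count each distinct answer, keep the repeats, and sort by ascending count.
def repeated_answers(strings)
  strings.tally
         .select { |_answer, count| count > 1 }
         .sort_by { |_answer, count| count }
         .to_h
end

# Hypothetical responses, not the survey data.
sample = %w[1234567 8675309 1234567 5550123 1234567 8675309]
repeated_answers(sample).each do |answer, count|
  puts "#{answer} -> #{count}"
end
```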

Thus we have an answer to our question. 1234567 is the most popular response by a long shot. My son is very confused by the runner-up 8675309, but I delay explaining to him what it means to children of the 1980s in hopes that he will investigate. He asks his brothers if they know the number, but they are all younger than him. 1000000 is the first number that truly requires all seven digits, so it has significance to us. 7777777 must be doubly lucky, or septuply perhaps. 9876543 and 7654321 are just decreasing sequences. That 7894561 occurs three times strikes us as peculiar, but then we see that it too is a sequence. It is the arrangement of numbers on a 10-key keypad. I’m fairly sure 8675301 appears because people mishear lyrics as they’re using the bathroom on the right. Many of the rest are increasing or decreasing sequences. My son gets no help from me figuring out 6969696. We assume that 2938475 and 1029384 are just flukes, but we are disturbed that they have 29384 in common. In writing this, we see that these numbers are not flukes at all. They are shaped by typing from either end of the top row of the keyboard, working inward and alternating between strokes.

In conclusion, my son and I feel great about having a question, gathering data, and analyzing the results. That we administered this survey through a computer is clear. Many of the most frequent numbers were clearly chosen because of typing ergonomics. Perhaps we need to do a face-to-face survey to get results that aren’t tainted by our technological interfaces. But not until the governor says we can.
