teaching machines

Identifiers Starting with Numbers

May 26, 2016. Filed under programming languages, public.

I was skimming through an introduction to a new programming language and ran across the familiar rule that identifiers in this language must start with a letter or underscore. Because I’ve been feeling a bit subversive lately, I wondered what would happen if this rule went away. What if identifiers could start with a number?

Certainly we’d encounter ambiguity issues with tokens like this:

0xFF -- hexadecimal 255 as an int? Or a variable name?
500L -- 500 as a long? Or a variable name?
1e3 -- 1000 as a floating-point number? Or a variable name?

Then I headed to the Internet to see if there were deeper technical issues. A bunch of folks on Stack Overflow said that initial numbers are outlawed because ambiguities like the ones above would make lexing and parsing more difficult. Because I’ve been feeling a bit subversive lately, I couldn’t agree with them. Many of our lexer generators follow two rules that eliminate ambiguity:

  1. The longest match wins. If we try lexing the string 0xFFZZZ, and we have a rule [0-9]+ for integers and [A-Za-z0-9]+ for identifiers, the identifier rule will win because it can match more of the string.
  2. Among equally long matches, the first match wins. If we try lexing the string 0xFF, and we also have a rule 0x[0-9A-Fa-f]+ for hexadecimal integers, both that rule and our identifier rule match all four characters. Whichever one is specified first wins. So, as long as we list the patterns for integer literals before the pattern for identifiers, the lexer will classify things like 0xFF as integers. Only tokens that aren’t fully matched by an earlier integer pattern will get classified as identifiers.
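The two rules above can be sketched in a few lines of Python. This is only an illustration of the tie-breaking logic, not how ANTLR or any other lexer generator is actually implemented. The RULES table and the classify helper are hypothetical names made up for this example, and the patterns are simplified integer and identifier rules.

```python
import re

# Ordered (name, pattern) rules, mirroring a lexer-generator specification.
# Order matters only for breaking ties between equally long matches.
RULES = [
    ("DECIMAL_INTEGER", re.compile(r"[0-9]+")),
    ("HEXADECIMAL_INTEGER", re.compile(r"0x[0-9A-Fa-f]+")),
    ("LONG_INTEGER", re.compile(r"[0-9]+L")),
    ("IDENTIFIER", re.compile(r"[0-9A-Za-z_]+")),
]


def classify(text, pos=0):
    """Return (rule_name, lexeme) for the token that starts at pos."""
    best_name, best_lexeme = None, ""
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # A strictly longer match replaces the current best (longest match wins).
        # An equally long match does not, so the earlier rule keeps the tie
        # (first match wins).
        if match and len(match.group()) > len(best_lexeme):
            best_name, best_lexeme = name, match.group()
    if best_name is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best_name, best_lexeme


if __name__ == "__main__":
    for sample in ["0xFF", "0xFFZZZ", "500L", "500LL", "1e3"]:
        print(sample, classify(sample))
```

Under these rules 0xFF and 500L come out as integer literals, while 0xFFZZZ, 500LL, and 1e3 fall through to the identifier rule. (1e3 does so only because this sketch has no floating-point rule.)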

As a proof of concept, I wrote up a little ANTLR grammar for Numid, a language that lets you add and subtract numbers and store values in variables whose identifiers may start with a number:

grammar Numid;

program
  : statement* EOF
  ;

statement
  : expression NEWLINE # Print
  | NEWLINE # Blank
  ;

expression
  : LEFT_PARENTHESIS expression RIGHT_PARENTHESIS # Parenthesized
  | DECIMAL_INTEGER # DecimalInteger
  | HEXADECIMAL_INTEGER # HexadecimalInteger
  | LONG_INTEGER # LongInteger
  | IDENTIFIER # Identifier
  | expression op=(ADD|SUBTRACT) expression # Additive
  | IDENTIFIER EQUALS expression # Assignment
  ;

// The integer rules come before IDENTIFIER so that they win ties
// between equally long matches.
DECIMAL_INTEGER: '-'? [0-9]+;
HEXADECIMAL_INTEGER: '0x' [0-9A-Fa-f]+;
LONG_INTEGER: '-'? [0-9]+ 'L';
IDENTIFIER: [0-9A-Za-z_]+;
NEWLINE: '\r'? '\n';
LEFT_PARENTHESIS: '(';
RIGHT_PARENTHESIS: ')';
ADD: '+';
SUBTRACT: '-';
EQUALS: '=';
WHITESPACE: [ \t]+ -> skip;

I went ahead and made a little in-browser calculator (with no error checking) so you can test out this language.

Try typing in these terrible but legal expressions:

5LL = 6
0xABCDEFG = 9
5LL + 0xABCDEFG
5LL - 5L
1l1l1l1 = 100
1l1l1l1 - 1l1l1l1
00xAB = -23
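For anyone who wants to evaluate these lines offline, here is a minimal Python sketch of a Numid evaluator. It is not the code behind the calculator and does not use ANTLR. The tokenize, parse_atom, parse_additive, and evaluate helpers are hypothetical names for this example. The token patterns are copied from the grammar, but as a simplification the sketch only recognizes an assignment at the start of a line.

```python
import re

# Token rules in the same order as the Numid grammar; ties go to the earlier rule.
TOKEN_RULES = [(name, re.compile(pattern)) for name, pattern in [
    ("DECIMAL_INTEGER", r"-?[0-9]+"),
    ("HEXADECIMAL_INTEGER", r"0x[0-9A-Fa-f]+"),
    ("LONG_INTEGER", r"-?[0-9]+L"),
    ("IDENTIFIER", r"[0-9A-Za-z_]+"),
    ("LEFT_PARENTHESIS", r"\("),
    ("RIGHT_PARENTHESIS", r"\)"),
    ("ADD", r"\+"),
    ("SUBTRACT", r"-"),
    ("EQUALS", r"="),
    ("WHITESPACE", r"[ \t]+"),
]]


def tokenize(line):
    """Split a line into (kind, text) pairs using longest match, then first match."""
    tokens, pos = [], 0
    while pos < len(line):
        best = None
        for name, pattern in TOKEN_RULES:
            match = pattern.match(line, pos)
            if match and (best is None or match.end() > best[1].end()):
                best = (name, match)
        if best is None:
            raise SyntaxError(f"unexpected character {line[pos]!r}")
        if best[0] != "WHITESPACE":  # whitespace is skipped, as in the grammar
            tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


def parse_atom(tokens, i, variables):
    """Evaluate a literal, variable, or parenthesized expression at index i."""
    if i >= len(tokens):
        raise SyntaxError("unexpected end of line")
    kind, text = tokens[i]
    if kind == "DECIMAL_INTEGER":
        return int(text), i + 1
    if kind == "HEXADECIMAL_INTEGER":
        return int(text, 16), i + 1
    if kind == "LONG_INTEGER":
        return int(text[:-1]), i + 1  # drop the trailing 'L'
    if kind == "IDENTIFIER":
        return variables[text], i + 1
    if kind == "LEFT_PARENTHESIS":
        value, i = parse_additive(tokens, i + 1, variables)
        if i >= len(tokens) or tokens[i][0] != "RIGHT_PARENTHESIS":
            raise SyntaxError("expected ')'")
        return value, i + 1
    raise SyntaxError(f"unexpected token {text!r}")


def parse_additive(tokens, i, variables):
    """Evaluate a left-associative chain of + and - starting at index i."""
    value, i = parse_atom(tokens, i, variables)
    while i < len(tokens) and tokens[i][0] in ("ADD", "SUBTRACT"):
        operator = tokens[i][0]
        right, i = parse_atom(tokens, i + 1, variables)
        value = value + right if operator == "ADD" else value - right
    return value, i


def evaluate(line, variables):
    """Evaluate one line, storing the result if it is an assignment."""
    tokens = tokenize(line)
    is_assignment = [kind for kind, _ in tokens[:2]] == ["IDENTIFIER", "EQUALS"]
    value, end = parse_additive(tokens, 2 if is_assignment else 0, variables)
    if end != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[end][1]!r}")
    if is_assignment:
        variables[tokens[0][1]] = value
    return value


if __name__ == "__main__":
    variables = {}
    for line in ["5LL = 6", "0xABCDEFG = 9", "5LL + 0xABCDEFG", "5LL - 5L",
                 "1l1l1l1 = 100", "1l1l1l1 - 1l1l1l1", "00xAB = -23"]:
        print(line, "->", evaluate(line, variables))
```

Run on the seven lines above, in order, the sketch yields 6, 9, 15, 1, 100, 0, and -23.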

We can verify that 5LL and 0xABCDEFG are indeed identifiers by examining ANTLR’s token stream:

[@0,0:2='5LL',<IDENTIFIER>,1:0]
[@1,4:4='=',<EQUALS>,1:4]
[@2,6:6='6',<DECIMAL_INTEGER>,1:6]
[@3,7:7='\n',<NEWLINE>,1:7]
[@4,8:16='0xABCDEFG',<IDENTIFIER>,2:0]
[@5,18:18='=',<EQUALS>,2:10]
[@6,20:20='9',<DECIMAL_INTEGER>,2:12]
[@7,21:21='\n',<NEWLINE>,2:13]
[@8,22:24='5LL',<IDENTIFIER>,3:0]
[@9,26:26='+',<ADD>,3:4]
[@10,28:36='0xABCDEFG',<IDENTIFIER>,3:6]
[@11,37:37='\n',<NEWLINE>,3:15]
[@12,38:40='5LL',<IDENTIFIER>,4:0]
[@13,42:42='-',<SUBTRACT>,4:4]
[@14,44:45='5L',<LONG_INTEGER>,4:6]
[@15,46:46='\n',<NEWLINE>,4:8]
[@16,47:53='1l1l1l1',<IDENTIFIER>,5:0]
[@17,55:55='=',<EQUALS>,5:8]
[@18,57:59='100',<DECIMAL_INTEGER>,5:10]
[@19,60:60='\n',<NEWLINE>,5:13]
[@20,61:67='1l1l1l1',<IDENTIFIER>,6:0]
[@21,69:69='-',<SUBTRACT>,6:8]
[@22,71:77='1l1l1l1',<IDENTIFIER>,6:10]
[@23,78:78='\n',<NEWLINE>,6:17]
[@24,79:83='00xAB',<IDENTIFIER>,7:0]
[@25,85:85='=',<EQUALS>,7:6]
[@26,87:89='-23',<DECIMAL_INTEGER>,7:8]
[@27,90:90='\n',<NEWLINE>,7:11]
[@28,91:91='\n',<NEWLINE>,8:0]
[@29,92:91='<EOF>',<EOF>,9:0]

In conclusion, only humans get confused when identifiers start with digits. Lexers and parsers can deal.