We have a course, Theory of Computations (actually more of a formal-languages course), which lists the Cinderella Book as one of its recommended readings. One assignment for this course involved designing a program to separate the tokens of a C++ program. After a little searching, I found out that such a program is actually a lexer (lexical analyzer), that it uses regular expressions, and that there are various automated tools to generate lexers.
But I decided to design one on my own before looking into lexer implementation strategies in detail. So I began by reading some material from “Engineering a Compiler” and got something running. Here is the crude working code. It’s hardly practical as it stands, but I was behind schedule on submissions, so I hacked my way through. As soon as possible, I’ll work on a better implementation.
Source Code:
```python
import re
from deftok import keywords, expressions, separators
import os

tokens = []


def matchNow(line, pos, id):
    """
    checks for the regular expression of the given 'id', if it is present
    in the rulebase; otherwise appends 'id' itself to the token list.
    also adds the type of the token to the tokens list
    """
    global tokens
    ret = 0
    if id in expressions.keys():
        pattern = re.compile(expressions[id])
        found = pattern.match(line, pos)
        tok = line[pos:found.end()]
        ret = found.end()
    else:
        tok = id
        ret = pos + 1
    if tok in keywords:
        b = 'KEYWORD'
    elif id == 'i':
        b = 'IDENTIFIER'
    elif tok in separators:
        b = 'SEPARATOR'
    elif id == 'n':
        b = 'NUMERAL'
    else:
        b = 'OPERATOR'
    tup = (tok, b)
    tokens.append(tup)
    return ret


def checkLine(line, pos):
    """
    removes whitespace, checks the first character of 'line' at 'pos',
    and calls matchNow() appropriately
    """
    global tokens
    pattern = re.compile(r'\s+')
    found = pattern.match(line, pos)
    if found:
        pos = found.end()
    if pos < len(line):
        curr = line[pos]
    else:
        return
    endofprev = pos
    if curr.isalpha() or curr == '_':
        endofprev = matchNow(line, pos, 'i')
    elif curr.isdigit():
        endofprev = matchNow(line, pos, 'n')
    elif curr in ['@', '$']:
        endofprev = pos + 1
    else:
        endofprev = matchNow(line, pos, curr)
    checkLine(line, endofprev)


def main(name):
    """
    responsible for file handling, and concatenation of lines
    ending with backslash-newline
    """
    fp = open(name, 'r')
    temp = ""
    for line in fp:
        if len(line) > 1 and line[-2] == "\\":
            temp = temp + line[:-2]
            continue
        line = temp + line[:]
        temp = ""
        checkLine(line, 0)
    fp.close()
    fp = open('outputof' + name + '.txt', 'w')
    for token in tokens:
        token1, token2 = token
        fp.write(token1 + '\t' + token2 + '\n')
    fp.flush()
    os.fsync(fp.fileno())
    fp.close()


if __name__ == "__main__":
    import sys
    main(sys.argv[1])
```
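Assuming the script above is saved as, say, lexer.py (the file name is my choice) alongside the definitions file deftok.py that it imports, it can be run as `python lexer.py test.cpp`; the token/type pairs then land in a file named outputoftest.cpp.txt.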
Here is the definitions file. It holds the list of keywords, the list of separators, and a dictionary mapping each “id” to a regular expression:
```python
# list of keywords
keywords = ['auto', 'const', 'double', 'float', 'int', 'short', 'struct',
            'unsigned', 'break', 'continue', 'else', 'for', 'long', 'signed',
            'switch', 'void', 'case', 'default', 'enum', 'goto', 'register',
            'sizeof', 'typedef', 'volatile', 'char', 'do', 'extern', 'if',
            'return', 'static', 'union', 'while', 'asm', 'dynamic_cast',
            'namespace', 'reinterpret_cast', 'try', 'bool', 'explicit', 'new',
            'static_cast', 'typeid', 'catch', 'false', 'operator', 'template',
            'typename', 'class', 'friend', 'private', 'this', 'using',
            'const_cast', 'inline', 'public', 'throw', 'virtual', 'delete',
            'mutable', 'protected', 'true', 'wchar_t']

# list of separators
separators = ['(', ')', '[', ']', '{', '}', ',', ';', ':']

# rulebase
expressions = {
    '+': r'\+[+=]|\+',
    '-': r'-[-=>]|-',
    '!': r'!=|!',
    '*': r'\*=|\*',
    '/': r'/=|/',
    '%': r'%=|%',
    '<': r'<<=|<[<=]|<',
    '>': r'>>=|>[>=]|>',
    '&': r'&[&=]|&',
    '^': r'\^=|\^',
    '|': r'\|[|=]|\|',
    '=': r'==|=',
    'i': r'[A-Za-z_]+\w*',
    'n': r'[0-9]+\.[0-9]+|[0-9]+',
    '.': r'\.[0-9]+|\.'}
```
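One small detail worth pointing out (my note, not part of the assignment): the operator patterns rely on Python’s re module trying alternatives left to right, so the two-character forms are listed before the single-character fallback. A quick check:

```python
import re

# The '+' rule from the rulebase above: '++' and '+=' are tried before a lone '+'.
plus = re.compile(r'\+[+=]|\+')

print(plus.match('++count').group())  # '++'
print(plus.match('+= 1').group())     # '+='
print(plus.match('+ b').group())      # '+'
```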
Problems with this program:
- It is ugly.
- Tokens like “9xyz” are accepted as valid identifiers. The regular expressions need to change (one possible tightening is sketched after this list).
- Apart from the regular-expression rulebase, not much of the code is customizable. For example, the classification of identifiers is rigid and the output-file formatting is hard-coded; in fact, even the rulebase is forced to work around the built-in functions checkLine() and matchNow().
- Invalid tokens should raise an exception; that is not implemented here (the sketch below shows one way to do it).
- Possibly, the entire lexer should be packaged in a class that is iterable over itself, so an instance can be used as <instancename>.next() (a rough class-based sketch also follows below).
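On the “9xyz” and exception points, here is a minimal, self-contained sketch of one possible fix. The names LexError and match_numeral are mine (they are not in the code above), and the tightened pattern is only one option: it refuses to match a digit run that is immediately followed by a letter or underscore, and raises an error instead of silently splitting the input.

```python
import re

class LexError(Exception):
    """Raised when a rule applies at the current position but fails to match."""
    pass

# Tightened numeral rule: the digits must not be followed by a letter or an
# underscore, so "9xyz" is rejected instead of being split into a numeral
# and an identifier.
NUMERAL = re.compile(r'(?:[0-9]+\.[0-9]+|[0-9]+)(?![A-Za-z_])')

def match_numeral(line, pos):
    found = NUMERAL.match(line, pos)
    if found is None:
        raise LexError('invalid token starting at: ' + line[pos:])
    return line[pos:found.end()], found.end()

if __name__ == '__main__':
    print(match_numeral('42 + x', 0))   # ('42', 2)
    try:
        match_numeral('9xyz = 0', 0)
    except LexError as err:
        print(err)                       # invalid token starting at: 9xyz = 0
```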
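On the last point, this is only a rough sketch of what a class-based, iterable lexer could look like. The class name, the tiny inline rulebase, and the example input are illustrative (they ignore keywords and most of the real deftok rules); the point is the iterator protocol, which also allows the <instancename>.next() usage mentioned above.

```python
import re

class Lexer(object):
    """Sketch of the 'package it in a class' idea: the lexer is an iterator
    over (token, kind) pairs instead of filling a global list."""

    # Toy rulebase for illustration; a real one would come from deftok.
    rules = [('NUMERAL', r'[0-9]+\.[0-9]+|[0-9]+'),
             ('IDENTIFIER', r'[A-Za-z_]\w*'),
             ('OPERATOR', r'[+\-*/%=<>!&|^]+'),
             ('SEPARATOR', r'[()\[\]{},;:]')]

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        # skip whitespace, then stop at end of input
        ws = re.compile(r'\s+').match(self.text, self.pos)
        if ws:
            self.pos = ws.end()
        if self.pos >= len(self.text):
            raise StopIteration
        # the first rule that matches at the current position wins
        for kind, pattern in self.rules:
            found = re.compile(pattern).match(self.text, self.pos)
            if found:
                tok = self.text[self.pos:found.end()]
                self.pos = found.end()
                return (tok, kind)
        raise StopIteration  # or raise a LexError for an unmatched character

    next = __next__  # so <instancename>.next() works too (Python 2 style)


for token in Lexer('int count = 42;'):
    print(token)
```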
There might be other issues as well. Once I get through my exams, let’s see if I can get a better version of this rudimentary lexer up and running.
Namaste.
Just to let you know, your header is not displaying correctly in Firefox.
Umm, it looks good on mine… I’ll check it out anyway, thanks.
Thanks – where is the article source?
Hey, sorry, but I didn’t get you…
Source?? I actually wrote this stuff for my course completion 🙂
If you mean the sources I referred to: I actually looked up a few lexer implementation strategies on the internet. I don’t have the particular links since I formatted my computer 😦 But Google it and you should be fine.