
23 Nov 10

My First Lexer (in Python)…

There is a course we have, Theory of Computations (actually more about formal languages), which uses the Cinderella Book as one of the recommended readings. There was an assignment for this course, which involved designing a program to separate the tokens of a C++ program. After a little search-engining, I found out that such a program is actually a lexer (lexical analyzer), that it works on regular expressions, and that there are various automated tools to generate lexers.
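Loosely speaking, the lexer walks over the source text and, at each position, tries the regular expression of each token class and takes whichever one matches. Here is a minimal sketch of just that idea (the token names and patterns are made up for illustration; they are not the ones used in the assignment code below):

import re

# Illustrative rulebase: (token class, regular expression) pairs.
RULES = [('IDENTIFIER', r'[A-Za-z_]\w*'),
         ('NUMERAL',    r'\d+(\.\d+)?'),
         ('OPERATOR',   r'[+\-*/=<>!]=?'),
         ('SEPARATOR',  r'[()\[\]{},;:]')]

def next_token(text, pos):
    """Return (lexeme, class, new position) for the token starting at 'pos', or None."""
    for kind, pattern in RULES:
        m = re.match(pattern, text[pos:])
        if m:
            return m.group(0), kind, pos + m.end()
    return None

print(next_token("count = 42;", 0))   # ('count', 'IDENTIFIER', 5)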

But I decided to design one on my own before looking into the lexer implementation strategies in detail. So I began by reading some material from “Engineering a Compiler”, and got something running. Here is the crude working code. It's immensely unproductive as far as applicability goes, but I was behind schedule on submissions, so I crooked my way through. As soon as possible, I'll work on a better implementation.

Source Code:


import re
from deftok import keywords, expressions, separators
import os
    
tokens=[]

def matchNow(line,pos,id):
    """
    checks for the regular expression of the given 'id', if it is present in rulebase
    otherwise appends it to token list
    also adds type of token to tokens list
    """
    global tokens
    ret=0
    if id in expressions.keys():
        pattern=re.compile(expressions[id])
        found=pattern.match(line,pos)
        tok=line[pos:found.end()]
        ret=found.end()
    else:
        tok=id
        ret=pos+1
    if tok in keywords:
        b='KEYWORD'
    elif id=='i':
        b='IDENTIFIER'
    elif tok in separators:
        b='SEPARATOR'
    elif id=='n':
        b='NUMERAL'
    else:
        b='OPERATOR'
    tup=(tok,b)
    tokens.append(tup)
    return ret

def checkLine(line,pos):
    """
    removes whitespace
    checks first character of line at 'pos', and calls matchNow() appropriately
    """
    global tokens
    pattern=re.compile(r'\s+')
    found=pattern.match(line,pos)
    if found:
        pos=found.end()
    if pos<len(line):
        curr=line[pos]
    else:
        return
    endofprev=pos
    if curr.isalpha() or curr=='_':
        endofprev=matchNow(line,pos,'i')
    elif curr.isdigit():
        endofprev=matchNow(line,pos,'n')
    elif curr in ['@','$']:
        endofprev=pos+1
    else:
        endofprev=matchNow(line,pos,curr)
    checkLine(line,endofprev)

def main(name):
    """
    responsible for file handling, and concatenation of lines with backslash-newline
    """
    fp=open(name,'r')
    temp=""
    for line in fp:
        if len(line)>1 and line[-2]=="\\":
            temp=temp+line[:-2]
            continue
        line=temp+line[:]
        temp=""
        checkLine(line,0)
    fp.close()
    fp=open('outputof'+name+'.txt','w')
    for token in tokens:
        token1,token2=token
        fp.write(token1+'\t'+token2+'\n')
    fp.flush()
    os.fsync(fp.fileno())
    fp.close()
    
if __name__=="__main__":
    import sys
    main(sys.argv[1])

Here is the definitions file (deftok.py). It has a list of keywords and separators, and a dictionary mapping ids to regular expressions…

#list of keywords
keywords=['auto', 'const', 'double', 'float', 'int', 'short', 'struct', 'unsigned', 'break', 'continue', 'else', 'for', 'long', 'signed', 'switch', 'void', 'case', 'default', 'enum', 'goto', 'register', 'sizeof', 'typedef', 'volatile', 'char', 'do', 'extern', 'if', 'return', 'static', 'union', 'while', 'asm', 'dynamic_cast', 'namespace', 'reinterpret_cast', 'try', 'bool', 'explicit', 'new', 'static_cast', 'typeid', 'catch', 'false', 'operator', 'template', 'typename', 'class', 'friend', 'private', 'this', 'using', 'const_cast', 'inline', 'public', 'throw', 'virtual', 'delete', 'mutable', 'protected', 'true', 'wchar_t']

#list of separators
separators=['(', ')', '[', ']', '{', '}', ',', ';', ':']

#rulebase
expressions={   '+':r'\+[+=]|\+',
 '-':r'-[-=>]|-',
 '!':r'!=|!',
 '*':r'\*=|\*',
 '/':r'/=|/',
 '%':r'%=|%',
 '<':r'<<=|<[<=]|<', '>':r'>>=|>[>=]|>',
 '&':r'&[&=]|&',
 '^':r'\^=|\^',
 '|':r'\|[|=]|\|',
 '=':'==|=',
 'i':r'[A-Za-z_]+\w*',
 'n':r'[0-9]+\.[0-9]+|[0-9]+',
 '.':r'\.[0-9]+|\.'}
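To run it on a source file, something like the following should do; the filenames are just examples, and I am assuming the lexer code above is saved as lexer.py alongside deftok.py.

import lexer               # assuming the lexer code above lives in lexer.py
lexer.main('test.cpp')     # 'test.cpp' is a made-up input file

# Equivalent to running: python lexer.py test.cpp
# This writes outputoftest.cpp.txt with one tab-separated (token, type) pair per line:
#   int     KEYWORD
#   main    IDENTIFIER
#   (       SEPARATOR
#   ...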

Problems with this program:

  1. It is ugly.
  2. Input like “9xyz” is not rejected; it silently splits into a numeral and a valid identifier. A change in the regular expressions is required.
  3. Apart from the regular expression rulebase, not much of the code is customizable. E.g., the classification of identifiers is rigid and the formatting of the output file is hard-coded; in fact, even the rulebase is forced to work around the built-in functions checkLine() and matchNow().
  4. Invalid tokens should raise an exception, which is not done here.
  5. Possibly, the entire lexer should be packaged in a class that is iterable over itself, so that an instance can be used as <instancename>.next() (see the rough sketch after this list).
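To make point 5 (and points 2 and 4 along the way) a little more concrete, here is a rough, untested sketch of the direction I have in mind; it is not a drop-in replacement for the code above, and the token classes and patterns are deliberately simplified:

import re

class LexError(Exception):
    """Raised when no rule matches at the current position (point 4)."""
    pass

class Lexer:
    # Trying NUMERAL before IDENTIFIER, with a lookahead that forbids a
    # trailing letter or underscore, rejects input like "9xyz" (point 2).
    RULES = [('NUMERAL',    r'\d+(\.\d+)?(?![A-Za-z_])'),
             ('IDENTIFIER', r'[A-Za-z_]\w*'),
             ('SEPARATOR',  r'[()\[\]{},;:]'),
             ('OPERATOR',   r'[+\-*/%<>=!&|^]+')]

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def __iter__(self):
        return self

    def next(self):                               # __next__ under Python 3
        # skip whitespace
        self.pos += re.match(r'\s*', self.text[self.pos:]).end()
        if self.pos >= len(self.text):
            raise StopIteration
        for kind, pattern in self.RULES:
            m = re.match(pattern, self.text[self.pos:])
            if m:
                self.pos += m.end()
                return m.group(0), kind
        raise LexError("no rule matches at position %d" % self.pos)

    __next__ = next                               # so it also works as a Python 3 iterator

# keyword classification is left out of the sketch, so 'int' shows up as IDENTIFIER
for token in Lexer("int x = 42;"):
    print(token)    # ('int', 'IDENTIFIER'), ('x', 'IDENTIFIER'), ('=', 'OPERATOR'), ...

The rulebase could just as well be passed to the constructor instead of being hard-wired, which would also go some way towards point 3.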

There might be other issues as well. Once I am done with my exams, let's see if I can get a better version of this rudimentary lexer up and running.

Namaste.
