23 Nov 10

My First Lexer (in Python)…

There is a course we have, Theory of Computations (actually more about formal languages), which uses the Cinderella Book as one of the recommended readings. An assignment for this course involved writing a program that separates a C++ program into its tokens. After a little search-engining, I found out that such a program is actually a lexer (lexical analyzer), that it typically uses regular expressions, and that there are various automated tools to generate lexers.

But I decided to design one on my own before looking into lexer implementation strategies in detail. So I began by reading some stuff from “Engineering a Compiler”, and got something running. Here is the crude working code. It's immensely impractical as far as general applicability goes, but I was behind schedule on submissions, so I hacked my way through. As soon as possible, I'll work on a better implementation.

Source Code:

import re
from deftok import keywords,expressions,separators
import os
    
tokens=[]

def matchNow(line,pos,id):
    """
    checks for the regular expression of the given 'id', if it is present in rulebase
    otherwise appends it to token list
    also adds type of token to tokens list
    """
    global tokens
    ret=0
    if id in expressions.keys():
        pattern=re.compile(expressions[id])
        found=pattern.match(line,pos)
        tok=line[pos:found.end()]
        ret=found.end()
    else:
        tok=id
        ret=pos+1
    if tok in keywords:
        b='KEYWORD'
    elif id=='i':
        b='IDENTIFIER'
    elif tok in separators:
        b='SEPARATOR'
    elif id=='n':
        b='NUMERAL'
    else:
        b='OPERATOR'
    tup=(tok,b)
    tokens.append(tup)
    return ret

def checkLine(line,pos):
    """
    removes whitespace
    checks first character of line at 'pos', and calls matchNow() appropriately
    """
    global tokens
    pattern=re.compile(r'\s+')
    found=pattern.match(line,pos)
    if found:
        pos=found.end()
    if pos<len(line):
        curr=line[pos]
    else:
        return
    endofprev=pos
    if curr.isalpha() or curr=='_':
        endofprev=matchNow(line,pos,'i')
    elif curr.isdigit():
        endofprev=matchNow(line,pos,'n')
    elif curr in ['@','$']:
        endofprev=pos+1
    else:
        endofprev=matchNow(line,pos,curr)
    checkLine(line,endofprev)

def main(name):
    """
    responsible for file handling, and concatenation of lines with backslash-newline
    """
    fp=open(name,'r')
    temp=""
    for line in fp:
        if len(line)>1 and line[-2]=="\\":
            temp=temp+line[:-2]
            continue
        line=temp+line[:]
        temp=""
        checkLine(line,0)
    fp.close()
    fp=open('outputof'+name+'.txt','w')
    for token in tokens:
        token1,token2=token
        fp.write(token1+'\t'+token2+'\n')
    fp.flush()
    os.fsync(fp.fileno())
    fp.close()
    
if __name__=="__main__":
    import sys
    main(sys.argv[1])

Here is the definitions file. It has lists of keywords and separators, and a dictionary mapping “ids” to regular expressions…

#list of keywords
keywords=['auto', 'const', 'double', 'float', 'int', 'short', 'struct', 'unsigned', 'break', 'continue', 'else', 'for', 'long', 'signed', 'switch', 'void', 'case', 'default', 'enum', 'goto', 'register', 'sizeof', 'typedef', 'volatile', 'char', 'do', 'extern', 'if', 'return', 'static', 'union', 'while', 'asm', 'dynamic_cast', 'namespace', 'reinterpret_cast', 'try', 'bool', 'explicit', 'new', 'static_cast', 'typeid', 'catch', 'false', 'operator', 'template', 'typename', 'class', 'friend', 'private', 'this', 'using', 'const_cast', 'inline', 'public', 'throw', 'virtual', 'delete', 'mutable', 'protected', 'true', 'wchar_t']

#list of separators
separators=['(', ')', '[', ']', '{', '}', ',', ';', ':']

#rulebase
expressions={   '+':r'\+[+=]|\+',
 '-':r'-[-=>]|-',
 '!':r'!=|!',
 '*':r'\*=|\*',
 '/':r'/=|/',
 '%':r'%=|%',
 '<':r'<<=|<[<=]|<',
 '>':r'>>=|>[>=]|>',
 '&':r'&[&=]|&',
 '^':r'\^=|\^',
 '|':r'\|[|=]|\|',
 '=':'==|=',
 'i':r'[A-Za-z_]+\w*',
 'n':r'[0-9]+\.[0-9]+|[0-9]+',
 '.':r'\.[0-9]+|\.'}
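
Just to see the two files working together, here is a tiny usage sketch. The file names lexer.py and deftok.py are only placeholders for wherever the two listings above are saved; the expected output is what I get by tracing checkLine() and matchNow() by hand.

# minimal usage sketch, assuming the lexer above is saved as lexer.py
# and this definitions file as deftok.py in the same directory
import lexer

lexer.checkLine('int x = a + 9;\n', 0)
print(lexer.tokens)
# tracing the rulebase by hand, this should print:
# [('int', 'KEYWORD'), ('x', 'IDENTIFIER'), ('=', 'OPERATOR'),
#  ('a', 'IDENTIFIER'), ('+', 'OPERATOR'), ('9', 'NUMERAL'), (';', 'SEPARATOR')]

# running the whole thing on a file works the same way; with a made-up test.cpp,
# lexer.main('test.cpp') tokenizes the file and writes everything collected in
# lexer.tokens, one "token<TAB>type" pair per line, to outputoftest.cpp.txt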

Problems with this program:

  1. It is ugly.
  2. Malformed tokens like “9xyz” slip through silently (as a numeral followed by an identifier) instead of being flagged as invalid. A change in the regular expressions is required.
  3. Apart from the regular expression rulebase, not much of the code is customizable. E.g. the classification of identifiers is rigid, and the formatting of the output file is hard-coded; in fact, even the rulebase is forced to work around the built-in functions checkLine() and matchNow().
  4. Invalid tokens should raise an exception, which is not implemented here.
  5. Possibly, the entire lexer should be packaged in a class that is iterable over itself, so that an instance can be consumed as <instancename>.next() (see the rough sketch after this list).
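
For that last point, here is a rough sketch of the direction I have in mind. It is not a finished implementation: the names Lexer and LexError are made up, and collapsing the rulebase into a single master regular expression is only one possible approach. The negative lookahead after the numeral rule is what stops “9xyz” from being silently split into a numeral and an identifier; it raises an error instead.

# rough sketch only, under the assumptions stated above
import re
from deftok import keywords

class LexError(Exception):
    """raised when no rule matches at the current position"""
    pass

class Lexer(object):
    # one master pattern; named groups double as the token type.
    # the (?![A-Za-z_]) lookahead rejects numerals glued to letters, e.g. "9xyz"
    master = re.compile(r'''
          (?P<IDENTIFIER>[A-Za-z_]\w*)
        | (?P<NUMERAL>(?:[0-9]+\.[0-9]+|[0-9]+)(?![A-Za-z_]))
        | (?P<SEPARATOR>[()\[\]{},;:])
        | (?P<OPERATOR><<=|>>=|[+\-*/%!<>=&^|]=|\+\+|--|->|&&|\|\||<<|>>|[+\-*/%!<>=&^|.])
    ''', re.VERBOSE)
    whitespace = re.compile(r'\s+')

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def __iter__(self):
        return self

    def next(self):                     # used as <instancename>.next()
        ws = self.whitespace.match(self.text, self.pos)
        if ws:
            self.pos = ws.end()
        if self.pos >= len(self.text):
            raise StopIteration
        found = self.master.match(self.text, self.pos)
        if not found:
            raise LexError('invalid token at position %d' % self.pos)
        self.pos = found.end()
        tok, kind = found.group(), found.lastgroup
        if kind == 'IDENTIFIER' and tok in keywords:
            kind = 'KEYWORD'
        return (tok, kind)

    __next__ = next                     # so it also works with Python 3's next()

# usage would then reduce to:
#   for token, kind in Lexer(open('test.cpp').read()):
#       ...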

There might be other issues as well. Once my exams are over, let's see if I can get a better version of this rudimentary lexer up and running.

Namaste.

4 Responses to “My First Lexer (in Python)…”


  1. November 29, 2010 at 4:15 pm

    Just to let you know, your header is not displaying correctly in Firefox.

  2. April 14, 2011 at 5:52 am

    Thanks – where is article source?

    • April 16, 2011 at 8:43 pm

      Hey, sorry, but I didn't get you…
      Source?? I actually wrote this stuff for my course completion 🙂
      If you mean the sources I referred to: I looked up a few lexer implementation strategies on the internet. I don't have the particular links since I formatted my comp 😦 But google it and you should be fine

