Building a Programing Language – Part I (Creating BrainLess)

BrainlessLabs.com

Rate me:

4.92/5 (25 votes)

11 Oct 2013LGPL317 min read

62K

600

114

This is the first of a series of articles where we will explore about writing compilers.

Clone at github.

1. Introduction

I will build up this article, as on need basis. Like the article, “Building a Language”, if I remember the article correctly by the creator of Self language. I will introduce terms whenever needed, I will try not to define them, but the term will be a conclusion. Even the math I will do will gradually change, as per the need. We cannot leave mathematics, and understand the essence of these, but I will try not to overhaul this text with funny symbols and imaginary numbers. We will build something, learn and then formalize. Writing a compiler is certainly far easier than enlightening a dragon. I want these articles to be simple, so that even history graduates can understand, but basic knowledge of C/C++ is required.

In this article, my intention is to give a brief overview of what a language is, and some formal tools to study language. These tools will make our life easier as we go along the way of compiler /translator writing.

2. What is a Programing Language, How Is It Useful?

Animals also talk. It’s just that we can’t normally understand. When birds talk, though we do not know their language, we sometimes think its soothing, but sometimes we get agitated. This is the reaction to some protocol we do not follow quite well. That’s what happens when we give some unrecognized file format to it shouts at us. Let me make this clear, first of all why the heck is a language a protocol between objects, why not living objects. Well computers do communicate, and they are not living. Even matter follows a certain protocol of communication that is why if a quark reverses its spin the other counter part quark also does the same. This is why when there is a gravitational pull, things tend to get attracted. They are following a certain protocol, we encode it in terms of “Laws of Nature”.

We now draw our attention to human communication. From the start of civilization, we communicate. We use hand symbols, draw things (writing is also a kind of drawing), and talk. In each of them, we associate a meaning with a particular symbol. When we speak, we associate a meaning or a semantic value by social convention. This is basically a pairing of meaning with a symbol. In mathematics, there is an elegant way of representing this information, i.e., the concept of relation. Suppose “S” is the set of all symbols we have chosen, and “M” is the set of all meanings associated. Well, I guess we cannot have a symbol first and tell “hey I want to have a meaning for it”, mostly it will be the other way around. To represent a meaning, we will need a symbol. So I will map a meaning with a symbol by the symbol M->S or just to signify that it is an ordered thing we make it a relation:

R: M->S

Now coming to the question of symbols; these symbols used to represent meaning are arbitrary. Any grammar or semantic rule can be mapped to any symbol arbitrarily. It’s just a commonly agreed upon convention. No one is biting you if you go and change the symbol time to time, but that means again coming to an agreement which seems like a boring affair.

2.1 What are Symbols?

Now, we will come to the symbols. Symbol does not mean that it is a just a glyph. Say the example of the English language. We have an “Alphabet” set; we make another set from it, which is called as word. A word is a combination of glyphs or symbols. We can think of words also as a really big symbol, which is in order a group of symbols. No we can say that the set “S” is really a set of words. Where each symbol of “S” is a combination of more primitive symbols taken from set called “Alphabet”, in short “A”. Let’s take another example. We have a group of 3D objects, like nuts, bolts, wheel, etc. This is the set “A”. We can make less abstract or more meaningful 3D objects from this set like a wheel, with Tyre and tube. This is a combination of the 3D objects from set “A”. This set is “S”. But one thing is that only a certain combination sequence of the objects from “A” will make something useful. Perhaps as this is a virtual 3D environment, we can combine the nuts, bolts in some awkward manner also. But if we do not make a wheel, or a steering wheel or an engine from the objects from “A”, then it’s of no use to us. Only these things can be added to make a proper vehicle. But how can all members prepare the same type of wheel? Well, we need to have a blueprint. This blue print on which everybody constructs these objects is called a “Grammar”. This is a set of rules, that all need to follow to make something meaningful. We will call this set of rules as “G” for grammar.

2.2 Talking to a Computer

Now let us focus our attention to the language of computers. Programing languages were there before even computers were there. These languages were used to direct the behavior of machines such as Jacquard looms (did you watch the Hollywood movie “Wanted”? they used to code messages in clothes) and player pianos. Even Arabs made mechanical watches that used to tell time and play music. A programming language is a protocol by which we talk to the computer and direct it to perform certain tasks. Now we all know that computer is a set of switches that go either “0” or “1”. So the only language that computer understand is “0” or “1”. So say to boot, we need the following sequence of binary bits “0000” followed by “1111” for example. But writing something big in “0” and “1” is a daunting task. It is mechanical, error prone and repeating. We know that computers are good at doing such tasks. So if at all we could do some hard toil and make the computer follow the instruction, it will make our life easier. Here assembly language comes to our rescue. But computers cannot understand letters and words. So if I write MOV A, B, then how is the computer is going to understand. Say we write ADD A, B and save it in a file. We know that computers understand binary numbers. First, what do we mean by computer understands? Computer has an inbuilt interpreter which is the processor. Here the set of alphabets are A = {0, 1}.

We currently know that “0000” followed by “1111” is a reboot instruction. This is two words or symbols. Each symbol consists of 4 alphabets in this case.

S = {"0000", "1111"}
M = "0000" "1111" = reboot

Now we should have a proper set of rules in place so that all can follow and processor can understand everyone's language. So what is the set of rules?

Rules G = {Each member of “S” can have only 4 “0” s or “1”s or a combination of these, Each member of “M” consists of members from “S”}

Now as we extend the set S and M the set G also extends.

Coming back to the file that we saved say fl.txt. Info stored is ADD A, B. The file will look as follows in binary mode:

“Binary Value for A” “Binary Value for D” “Binary Value for D” “Binary Value for SPACE” “Binary Value for A” “Binary Value for ,” “Binary Value for SPACE” “Binary Value for B” “Binary Value for END OF FILE”

Say "Binary Value for A" = 0001
"Binary Value for D" = 0010
"Binary Value for SPACE" = 0011
"Binary Value for ," = 0100
"Binary Value for EOF" = 0101
"Binary Value for B" = 0110
"Binary Value for ADD instruction" = "0001" "0011"

So the add instruction should look like 000100110110

01 represents the variable/register A and 10 represent B. But instead, our file looks like:

000100100010001100010100001101100101

If we feed the above statement to the execution box, then it will either shout or will give you something junk. So what do we do? We will write a translator that understands this code and gives us 000100110110. In other words:

R : 000100100010001100010100001101100101 -> 000100110110

So here we are not exactly communicating with the computer directly, but we communicate with an abstract entity that translates our message for the computer. We call this translator “Assembler”.

3. Our First Language, BrainLess

Enough of talking. Let's do some work also. We will create our first language. This is a variety of the BrainF language with some changes. Generally, when we read something like English, our brain tends to read “words” instead of grouping individual characters and then forming a word. This is the reason I will not write our first language in English. It will be a symbolic language, so that we get to know what our brain really does when it tries to read some new language. We will apply that bluntly in a program. Here it goes:

3.1 The BrainLess Virtual Machine

Imagine we have a tape of length 100. We can operate on this tape machine. Our language will define operations on this tape machine; our compiler/interpreter will read the instruction and do the operations on this tape. We have a read/write head that will read or write onto the tape machine. We have instruction for reading and writing, moving the head. If we want to read the content of the 3^rd slot, then we first position the head on that place and then we issue the read instruction. The machine is illustrated in Figure 1.

Figure – 1

We have the following instructions.

>>: Read a value form user and store it in the tape (*p = getch() in C language)

. : Display the value where currently the head is pointing

! [N]: Display the contents of the stack starting from the place where the head is pointing to, to the N^th place.

++: Increment the content where the head is pointing now (++*p)

--: Decrement the content where the head is pointing now (--*p)

+*: Take the content where the head is pointing now with the next content where the head is pointing, in C language terms *p = *p + *(p+1)

-* : Do as above but in this case, subtract *p = *p - *(p+1)

>: Increment the head by one p = p+1

<: Decrement the head by one p = p-1

So let’s write a small program to print “Hello World”.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<![11]

Now let us interpret it.

Reading user input is given in black, incrementing the head in red:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<![11]

3.2 Ambiguity in BrainLess and Resolving It

Great till now. But wait; we can create another meaning too, there are 32 ‘>’s. So it can also mean increment the progress 32 times. Now the whole meaning changes. There can be another meaning too, get 16 user inputs and store in the same place. This sentence is ambiguous. It can only be interpreted if we know what the program does, in this case printing hello world. This gives us some idea that though the syntax of the above program is correct, the meaning/semantics is ambiguous. But how do you tell the computer? Well this program should only tell the computer or in this case our tape machine. These kinds of statements are called ambiguous statements. They arise in computer science, compilers, logic, artificial intelligence. We will keep an eye on them while we progress. For now, we need to avoid them in our program. There can be many ways, but I will use one simple method, i.e., to move the head one position instead of ‘>’ use ‘*’. So now the program becomes:

>>*>>*>>*>>*>>*>>*>>*>>*>>*>>*>><<<<<<<<<<![11]

Now suppose we want to do it in another way, we will use ‘>*’ to increment the character. Then the program becomes:

>>>*>>>*>>>*>>>*>>>*>>>*>>>*>>>*>>>*>>>*>><<<<<<<<<<![11]

Now let us interpret this program. We will consider only one part or store and increment head, i.e., >>>*

Let's do it the way our parser will see this. We will get the characters one by one:

The character first seen by the parser will be ‘>’. Once we see this, we have two options:

This ‘>’ can be followed by another ‘>’ in which case it will be the operation read from user and store.
‘>’ can have a ‘*’, in which case it will be incrementing the head.

Well now also we have an ambiguity. Where to go? We will decide it based up on the next character. So to interpret the meaning of ‘>’ we need to ‘look ahead’. Let's look ahead one more character, i.e. ‘*’. Now we have a clear meaning, ‘>*’. So only looking ahead one character helps us resolve ambiguity. This is how it happened, read the characters from the program file, we will read it from the left side, and we will look-ahead one character, i.e., from Left, Look-ahead of 1 character. This can be represented in short as LL(1). We will look more about LL class, where there can be ‘k’ characters look ahead represented as LL(k).

3.3 Revised BrainLess

The revised language has the below instructions now.

>>: Read a value form user and store it in the tape (*p = getch() in C language)

. : Display the value where currently the head is pointing

! [N]: Display the contents of the stack starting from the place where the head is pointing to, to the Nth place.

++: Increment the content where the head is pointing now (++*p)

--: Decrement the content where the head is pointing now (--*p)

+*: Take the content where the head is pointing now with the next content where the head is pointing, in C language terms *p = *p + *(p+1)

-* : Do as above, but in this case subtract *p = *p - *(p+1)

*: Increment the head by one p = p+1

<: Decrement the head by one p = p-1

Just for fun, let us look at how the binary file looks; it’s given in Figure - 2.

Figure – 2

When writing the parser, we will read it in binary mode.

4. Formal Tools and Models to Help Us Along the Way

To make our task easier, we will take the help of some tools also. These are called formal methods; you can call them anything you want. Before that, I will give some definitions. Let’s come to the point where we imagine ourselves as the parser. Let me give a random set of characters one by one. First character ‘>’. When we receive this character, what comes to our mind? OK now we can have ‘>’, or ‘*’. This is our current state of mind, we are expecting some more characters and our mind is uncertain. We will represent these states by circles. When we are in a state, we are waiting for some input. Based on this input, we will transit to another state of mind. Like if we get a ‘*’, our mind will be certain that it is a head increment. The inputs will be given by arrows. The direction of the arrow will tell from what state to what we transit, and the label on the arrow will tell on what symbol. Let us take an example of recognizing the ‘>*’. We will construct an abstract machine for this purpose. The time till we are not giving any characters to this machine, it will stay in the special state of not knowing what will happen. This is the “Start” state. See figure – 3.

Figure – 3

Here as you can see, the “start” state as “q0”. The symbol “>” makes a transition from “q0” to “q1”. When we are in “q1” and we get any of “>” or “*” we recognize it as one of storing or moving head statements. So we go to a recognized state, which marks the final state of recognizing a certain token. This final state is marked by two concentric circles. Here the arrows mark the conditions which make the transition from one state to another. When we are resting on “q0” state, there will be a transition to “q1” state provided we get a “>” as next character. It is like a chemical reaction, the molecules are in the “q0” state initially, “>” can be thought as making the temperature high. Then only there is a transition to “q1”. But hey in a chemical reaction, a transition to another state can depend on many other states like say introduction of chemical “x”. How will we mark that? It will be like figure 4.

Figure – 4

Here the black rectangle tells that it will only allow transition to “p1” state provided that two of the conditions are fulfilled. Fortunately, we will first consider only one condition for some time. The other kind of multiple conditions are called “Petri’s net”. The single condition transition is called “Finite automata”. Here all the rectangles/circles are called “nodes” and the arrows are called “arcs”. Now a combination of these nodes and arrows are what we call “graphs” or more specifically directed graphs. So graphs are tools to help us study languages. There are a lot of properties of these automata; we will consider them in a separate thread.

Each of these tokens can be recognized by finite automata. If after the string of characters is empty, the finite automata state is not final, then there is some error. But these tell us only how to recognize tokens. They do not give us any meaning of token, they just tell us if a particular string satisfies a certain structure or not by simulating it through their states. Few of the automation are given in the below figure. This is by no means a complete automation. You can finish the complete and post me if you like. I will put it in this page with your name below it.

Figure – 5

5. Brainless Interpreter Code and Example

First, we will create a simple interpreter for our simple language. I have attached the source for the simple language BrainLess with this article. In the next article, we will enhance the language/compiler. I have created a solution with Visual Studio C++ 2008 express edition. Open the solution file and compile it. It will create an "exe" called BrainLess.exe. You can go to command prompt and then type BrainLess <File Name>.

I have supplied a sample script file “m.bl” with the code. I have given comments in the source code to make it easier to understand.

6. What's in the Next Article?

In the next article, we will make some more improvements to brainless. The improvements I am planning are:

Adding functions to BrainLess
Adding loops
Adding conditional statements

If the readers want any other improvements, then they can speak up. I will try to see if we can do it.

7. Conclusion

Please provide your comments and reviews for this article and help me enhance the other articles. You can use the source any way as you wish, but if you are distributing it please don’t forget to mention the copyright notice. I have few ideas for the improvement for this language, please let me know what more features you would like in this language. But one thing is this language is meant to be very simple, so the features should be simple only.

8. Background

There is a plethora of material about languages in the net, so why another article? Well logically suppose there were N books about languages, but why did the N+1 th book came up? This means different books give different views to languages, some are easy to follow, and some are boring. Each give different views for a certain group of people. But my perspective about learning is little different. Some time back when I wrote a letter to a professor, he gave me a wonderful suggestion. “If you think you like something try to teach it, implement, and try to write about it. This will increase your knowledge about it, and the more flaws you find in your learning, the more refined your knowledge you will get. So I thought of writing few articles about, and implement, so here it is.

9. History

BrainLess coming to life. Version 0.1

Update

I have moved the code to Google code. Further updates will be available at http://code.google.com/p/brainless-labs/.

License

This article, along with any associated source code and files, is licensed under The GNU Lesser General Public License (LGPLv3)

Written By

BrainlessLabs.com

Architect

India

I like to explore different aspects of technology. Try new things, and get delighted. My interests are programming language, and Imaging. But its not hard to work on other things also. Algorithms delight me over a coffee break.

I basically code in C++, but JAVA is not so alien for me. I know few scripting languages also. Basically I feel that knowing a programing language is just a matter of getting introduced to it.

For my other articles check my blog on homepage:

http://brainlesslabs.com/

https://github.com/BrainlessLabsInc

http://www.luxrender.net/en_GB/authors_contributors - SMISRA

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.