Modified: 05th Jun 16 |

Regex Programming Language

Version:: 0.4.0, August 25th, 2013
License:: Copyright © 2014 Daniel Korsgaard, Freeware
Download:: Win32: regexpl-0.2.1.zip; Win32: regexpl-0.3.0.zip; Win32 and Linux: regexpl-0.4.0.zip

RegexPL (Regular Expression Programming Language) is an esoteric programming language that performs computation solely by means of matching strings against regular expressions and substituting parts of the matched string.

An example with explanation

It is always nice to see some examples to get a feeling of what this language is all about, so here is a tiny example that illustrates how to compute a given fibonacci number. There are two approaches, the famous exponential approach, and then I also implemented a much faster linear approach, similar to what you would do with a loop construct in imperative languages.

But lets start with the 'Main' function, which is also the entry point of all RegexPL scripts.

# Entry point
def Main()
    inp = readline("Get what fibbonacci number? ")
    nbr = {([0-9]+)} inp
        ! fast_fibbo(nbr[1])
    ! "You need to enter a number"

What does this function really do? First of all, it calls the built in function 'readline' (there is explanations for all built in functions in the download archive). the 'readline' function, takes any number of arguments, and prints them to the console, and waits for the user to enter some text followed by a press on the enter-key. The entered text gets assigned to the 'nbr' variable.

Now the 'nbr' variable is tested against the regular expression /^([0-9]+)$/, in fact, "/^([0-9]+)$/" (without quotes) could have been used instead {([0-9]+)}. If the regular expression does match on the contents of 'nbr' then the following group of indented lines will be executed, otherwise they will be skipped.

Suppose the regular expression does match 'nbr', then the function 'fast_fibbo' will be called. The argument passed to 'fast_fibbo' is capture group 1 from the previous regex test, which happen to have captured the number the user have entered. (Note: lookup "regex capture groups" if you're unfamiliar with them.) Since there is an exclamation mark in front of the call to 'fast_fibbo' the result of that call, will be returned from the 'Main' function, exclamation marks are similar to 'return' in other languages.

If the regular expression did not match 'nbr', the indented group will be skipped, and the string "You need to enter a number" will be returned from Main. Whatever is returned from main, will be output to the console, as the interpreter exits. It is possible to output to the console without returning from Main, simply by calling 'writeline' with whatever you want to output.

Slow Fibonacci

# slow exponential time Fibonacci calc
def slow_fibbo(x)
    {0} x ! "0"
    {1} x ! "1"
    a = slow_fibbo(add(x, "-1"))
    b = slow_fibbo(add(x, "-2"))
    ! add(a, b)

The slow Fibonacci calculator 'slow_fibbo' takes one argument which must represent a number between zero and positive infinity. This number tells what Fibonacci number we are interrested in computing.

As per the definition of Fibonacci, if F_n denotes the n^th fibonacci number starting from n = 0, then F₀ = 0, and F₁ = 1. Simply as a direct effect of these simple definitions, we know that if the argument to slow_fibbo matches {0}, then the result is "0", which will be returned in the immediately following return-expression. Similarly if the argument to slow_fibbo is 1, then "1" is returned.

However, if we didnt look for neither F₀ nor F₁, then we have to compute the result. And again, our approach comes directly from the definition, that says F_n = F_n-1 + F_n-2. So we recursively ask slow_fibbo to return the (n-1)^th number, simply by subtracting "1" from x, using the build in 'add' function. (Adding -1 is the same as subtracting 1.)

The variable 'a' now contains the (x-1)^th fibonacci number, we compute the (x-2)^th fibonacci number in much the same manner, and assigns it to the variable 'b'. And at last, we add those two, as per the definition, and then returns the result of that.

Fast Fibonacci

# fast linear time Fibonacci calc
def fast_fibbo(x)
    {0} x ! "0"
    ! ffibbo_core(x, "0", "1")
 
def ffibbo_core(x, a, b)
    {1} x ! b
    new_x = add(x, "-1")
    new_b = add(a, b)
    ! ffibbo_core(new_x, b, new_b)

Since there is a slow_fibbo, there's also a fast one. Actually a much much faster one, it calculates the thousandth fibonacci number in a matter of seconds, where as the slow_fibbo will take ages just to compute the 50th fibonacci number. I will not explain the code as elaborate as with the slow_fibbo, but I will explain how it generally works.

Basically, the routine is split into two functions, one that contains the necessary initialization (fast_fibbo) and another that performs the actual computation.

If we read the definition carefully enough to actually understand what happens, then we will know that the next number in the sequence can be computed directly, only by knowing the two numbers immediately preceeding the unknown number. The definition also directly gives us the first two numbers of the sequence, namely 0 and 1.

All we have to do from here, is to add those two numbers together, and we now have the three first numbers of the fibonacci sequence. Since we dont need the very first number of the sequence (the number 0), we might as well throw it away, and we can still calculate the fourth.

The two last parameters of ffibbo_core will always contain the last two numbers of the sequence that we have computed so far. the first parameter, tells us how many fibonacci numbers that we have yet to compute, before we found the one we're interested in. If 'x' is counted down to "1", then we know 'b' contains the answer (notice the initial values ffibbo_core is called with. Otherwise, we decrement 'x', and computes a new 'b', that is, a new tail of the sequence. We then recursively call ffibbo_core, where we throw the value of 'a' away, because we do not need it no more.

Why oh why!?

I've always thought that the usual esoteric programming languages people come up with are quite dull. In general it seems as if the challenge is to make something that is as terse to program as possible, and I just don't find that interesting at all.

What I do find interesting is, however, radically and fundamentally different concepts for modeling computation, and that is exactly what RegexPL is. The plain fundamental computational power of RegexPL is based on its ability to recognize any regular expression combined with the ability to branch arbitrarily. And last but not least, the last bit of power comes from the ability to recurse infinitely and arbitrarily. Together these three very well known concepts make up a Turing complete language.

The language is however far more practical than it needs to be in order to be Turing complete, because I believe it is far more fun to program and experiment with something you actually have a chance to comprehend.

Too late

Only a day after I showed RegexPL to a few guys on IRC, someone made me aware that this form of computation had already been discovered ten years ago. See: http://esolangs.org/wiki/Thue. I already suspected this at the time I created RegexPL though.

It all started

Some day I suddently got this crazy idea, that it might be possible to create a Turing Complete programming language solely based on selective text matching and substitution.

I quickly began scribbling down arbitrary syntax in an attempt to visualize my idea. The first lines of code I wrote back then, doesn't look much different than the syntax I ended up implementing.

RegexPL Reference

def Main()
    !"Hello World!"

Yep you guessed it, that's the famous "hello world" program, written in RegexPL source code.

But lets start by defining the fundamentals of the syntax and corresponding semantics, and then turn back to how to compose programs.

Tokens

(backslashes are stripped by the system, and as such this section may appear to be wrong)

First off, RegexPL is a white space sensitive language. In practice, you will only be concerned with where you break lines, and by how much you indent code.

There are only two keywords, namely data and def, used to declare data and functions respectively.

Identifiers starts with a letter or underscore folloved by letters numbers and underscores. Regex: [a-zA-Z_][a-zA-Z0-9_]*

Integers consists of digits from zero to nine inclusive. Regex: [0-9]+

Text literals starts with a double quote followed by any sequence of characters except double quotes but including backslash escaped characters, and ending with a double quote. Regex: "([^\"]|\\.)*"

Regular expressions either start and ends with a slash, or starts with an opening curly brace and end with a closing curly brace. The text between the begin and end characters must be a valid Perl Compatible Regular Expression. When curly braces are used to enclose the regular expression, it is actually just a shortcut for

Comments starts with a hash followed by any sequence of characters except newlines. Regex: #[^\r\n]*
Comments are completely ignored in the parsing step.

Supporting syntax such as round opening and closing parentheses, square opening and closing brackets, comma, equality sign, and exclamation mark.

Grammar

First the grammar of expressions will be explained, then the grammar of declarations, and lastly the grammar of statements.

Declarations are the outer most constructs of the language and reside in global scope. Inside declarations there may be statements, and statements often contain expressions.

Declarations

Functions are declared with the def keyword followed by an identifier followed by parameters enclosed in parentheses. The body of the function consists of one or more statements, where the statements are separated by one or more newlines. The grammar is:

'def' <identifier> '(' (<identifier> (',' <identifier>)*)? ')' <newline> <statement>*

Data declarations declares global constants, but the current virtual machine does not support these, and as such they will remain undocumented for now.

Statements

Statements are probably the most prominent part of RegexPL, since these are the most complex, and it is here the major part of the power of RegexPL is found. There are five statements in total, but conceptually there are only two. These two are return and test statements. A test statement can have four different forms, but they all basically performs exactly the same operation.

Return statements evaluates the given expression, and returns the resulting value from the current function.

! <expr>

Chained test statement. There are two versions of this statement. Common for both, is that they take an expression and a regex. The expression is evaluated, and the resulting value, will then be matched against the regular expression. If there is a match, then the following statement will be executed, otherwise the following statement will be skipped.

<label> = <regex> <expression> <statement>

In case the given regular expression happens to be a capturing regular expression, it will be possible to access these captured groups by indexing on the label. Consider the following example:

foo = /^a(.*)e(.*)$/ "abcdefgh" ! foo[1] " " foo[2]

The returned value of that statement will be "bcd fgh". Also note that the global match is available through index zero.

The alternative version of the chained test statement has the label part cut away. In this case, it is not possible to access any of the matching groups. But it is an easy way to tell if there was a match. Consider the following example:

/foo(bar)/ x !"There was a match"

The grammar for the alternative chained test is:

<regex> <expression> <statement>

Branched test statement. This test statement is much like the chained test statement, except that the branched test statement can have several sub branches. The grammar is as follows

<label> = <regex> <expression> <statement>*

The nested statements can be seen as a block of statements that will be executed if the result of the expression matches the given regex. Consider the following example:

/Foo/ A /Bar/ B !"A and B" /Baz/ C !"A and C" !"only A" !"not A"

If it happens that A matches Foo, then the nested block will be executed. If it turns out that B matches Bar, then "A and B" will be returned. If it does not match, then C will be tested against Baz. If this is a match, then "A and C" will be returned, otherwise "only A" will be returned.

If however A did not match in the first place, then the nested statements would never be executed, and "not A" would be returned immediately. It is valid to remove the statement that returns "only A" and let the flow of execution escape the nested block, which will then result in "not A" to be returned, even if A in fact did match against Foo.

The branched test statement also have a version with no label, and the rules are similar to that of the chained test statement with no label. The grammar is:

<regex> <expression> <statements>

Expressions

Expressions are made up of string literals, variables, capture group indexing and function calls.

Expressions separated by whitespace are concatenated For instance:

A "b" C()

Will concatenate the variable A with the string "b" and the result of the function call C(). If A evaluates tot he string "a" and the function call C() evaluates to the string "c", then the resulting string of the entire expression will be "abc".

Function calls. A function call takes zero or more arguments, and yields a value. The grammar for a function call expression is as follows:

<identifier> '(' (<identifier> (',' <identifier>)*)? ')'

An example of a function call looks like this:

foo(a, b)

That snippet will call the function foo while passing the results of a and b as arguments.

Capture group indexing. Capture group indexing is used to retrieve the matched string captured by one of the test statements. The syntax for this expression is:

<identifier> '[' <integer> ']'

Combined

An example of a function named Tail, will be presented and explained int he following. The purpose of the function, is to take a sequence of characters (a string) as input, and then remove the first character, and return all the remaining characters of the string. If an empty string is given as argument, then the empty string will be returned.

def Tail(list)
    list_parts = /^.(.*)$/ list
        !list_parts[1]
    !""

First the function Tail is declared to receive a single argument. The first statement captures everything but the first character, unless the string is empty. If the string is not empty, the second statement returns all the remaining characters. If the string was empty, then an empty string will be returned.

Known Problems (0.2.1)

If you attempt to access undefined variables you will get an incomprehensible error message telling that a number could not be resolved. This is also the case when attempting to access declared data definitions.

If you attempt to call a function that is not defined, you will get an incomprehensible error at runtime.

Escaped characters in text strings are not converted to the code units they represent.

A different problem, but still worth to mention, is that backslashes are apparently stripped from this page by my content management system, and as such, some descriptions might be wrong, especially with regards to the regexes describing the tokens. New CMS is actually capable of storing backslash entities in the database! Yay! So all backslashes that are necessary, should be present.

Version 0.4.0 Changes

The error output is now marginally better than before, but still could be a lot better. It still sometimes takes a bit of guessing what went wrong.

The syntax has been updated to be a bit tighter.

A long standing issue with retrieving group 0 of matches, has been fixed.

Some documentation for the system calls that was introduced in 0.3.0, have been included in the package.

A bit more example code have been attached to the package.

^ Top

Previous page: Dot Matrix Marquee Project
Next page: QHA Interpreter

Navigation