Regex+ - Extended Regular Expression Engine

Mirzakhmet Syzdykov

3.29/5 (5 votes)

Sep 23, 2021

CPOL

2 min read

5312

Description of the newly available Regex+ engine for Java

Introduction

The modern regular expression engines don't support extended operators like intersection, subtraction and complement. With our new idea of development and elaborate efforts, we make it possible to use the extended operators in regular expression.

Extended Regular Expression Engine

As stated before, there's more of a theoretical background to study to build the fully POSIX-compliant regular expression engine supporting intersection, subtraction and complement. For this purpose, we override the states with counter-state using the decision making when choosing whether to branch for further evaluation of the matching process.

For parsing purposes, we use the following notation for intersection - "&", subtraction - "-" and complement - "~". Thus, to write the expression for subtraction, for example, we can set "(Hello|world)-world".

The present time full support of our engine for the extended regular expressions is as follows in BNF-notation:

R =
    (A | ["~"] /* complement */ (R) | "." | "[" A* "]") ["+", "*", "?"],
    R1 R2,
    R1 | R2,
    R1 & R2 /* intersection */
    R1 - R2 /* subtraction */

Using the Code

The code usage is defined by first parsing the expression by class Parser. As the parsed data are ready, it's time to build automaton for the abstract node produced by parser. The streaming is vital and the author has implemented the streaming abstract classes and interfaces so that the user can use custom-defined stream, for example, by networking.

Let's see the code usage which can be found in tests prepared for Regex+ engine:

import com.regexplus.automaton.model.Automaton;
import com.regexplus.automaton.model.StringStream;
import com.regexplus.match.common.IMatch;

String pattern = "~(((a+|aa+)-aa)&aaa)";
String string = "ba";

Automaton automaton = new Automaton();

automaton.build(new StringStream(pattern));

if (automaton.matches(new StringStream(string))) {
	List<IMatch> matches = automaton.match(new StringStream(string));
	
	System.out.println("Start: " + matches.get(0).start() + ", 
                        Length: " + matches.get(0).length());
}

Basic Principles

Our basic principle is to create the INode-interface and Node-class along the Parser-class as the parsing engine. Thus, we have to re-define the parsing technique in context-free manner and propose the new node class by overriding the defaults (like NodePaired or Node), if we have to create the new branch for our BNF-grammar.

For the purpose of the non-deterministic finite automata, there's the set of classes and interfaces in the package com.regexplus.automaton like IState, State. Thus to create the new state, just override the defaults or re-implement the interface IState.

There's only matching algorithm in the engine, so we can re-develop the existing one.

We can use deterministic automaton also:

import com.regexplus.automaton.model.Automaton;
import com.regexplus.automaton.dfa.DeterministicAutomaton;
import com.regexplus.automaton.model.StringStream;
import com.regexplus.match.common.IMatch;

String pattern = "~(((a+|aa+)-aa)&aaa)";
String string = "ba";

Automaton automaton = new Automaton();

automaton.build(new StringStream(pattern));

DeterministicAutomaton deterministicAutomaton = new DeterministicAutomaton(automaton);

if (deterministicAutomaton.matches(new StringStream(string))) {
    List<IMatch> matches = deterministicAutomaton.match(new StringStream(string));
    System.out.println("Start: "
        + matches.get(0).start() + ", Length: " + matches.get(0).length());
}

Step by Step Walkthroughs

As we have researched the topic of ERE, there's an underlying problem before our revolutionary solution which includes the quadratic time and space to solve the ERE-matching problem. Our engine works linearly dealing with decision-making counter overridden states which operate in O(1). Moreover, there's a precise concept behind that.

Conclusion and Points of Interest

So the intersection, subtraction and complement in regular expression engine is possible in linear polynomial time with our revolutionary approach.

The GitHub repository is wide open for our community to be supported and extended by you. So please feel free to drop me a comment or join my project.

History

23^rd September, 2021: Initial version
11^th March, 2022: Revised version
23^rd March, 2022: Added deterministic automaton
27^th April, 2022: Added grep feature
1^st May, 2022: Fixed subtraction operator
2^nd November, 2022: Removed links
3^rd November, 2022: Links deleted
4^th November, 2022: License changed