Help needed to verify my understanding of regex coding

Question

I'm a real newbie at regex. I have a banking program and it has regex rules for importing transactions. But sometimes the rules just don't work right. I'm trying to understand just what the rules are saying so that I can make them work the way they should.

I think I've figured out some of it but I just need some verification as to whether I'm on the right track or not.

Here's the actual wording that comes through on my credit card statement:
AT&T*BILL PAYMENT 800-331-0500 TX

Here's the regex rule that my banking program evaluates:
AT&T\*BILL PAYMENT +[0-9][^\p{L}]*T+[0-9][^\p{L}]*S+SQPZ

Here's the way that I understand the regex rule (this is what I'd like to know: am I understanding the expression correctly?):

AT&T : Actual text to display.
\ : Escape character so the next symbol (*) will display.
BILL PAYMENT : Actual text to display.
+[0-9] : If there are numbers, then display those numbers however many times they exist and in the order they exist. Since there aren't any numbers, I THINK it put in 9 spaces instead.
[^\p{L}] : When the ^ is used it can mean the start of a new line, but if it's used inside square brackets it means "not". So the \p{L}, when used with the ^, indicates an Arabic letter that is not a letter {L} or a number {N}.
*T : I think it stands for a Tab.
+[0-9] : As above, if there are numbers then display those numbers however many times they exist and in the order they exist.
[^\p{L}] : As indicated above, it indicates an Arabic letter that is not a letter {L}.
*S+SQPZ : I don't know what these stand for. It would seem that they are specific letters that should be displayed at the end of the line.

So, am I anywhere near right in deciphering the regex?

Thanks for whatever help comes!

Art

Posted 28-Dec-15 10:01am

aajoyce

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Accepted Answer · 2015-12-28T10:17:00

Solution 1

Not quite, because your Regular Expression does not match your example. Can it be fixed?

No! This is because you are missing the whole idea of matching. A regular expression cannot be built from just a single text sample. It describes the set of all matching strings, which you did not define and perhaps do not know.
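To make the mismatch concrete, here is a minimal Python sketch that tests the rule from the question against the statement text. One assumption is labeled in the code: Python's built-in re module does not support \p{L}, so [^\p{L}] is approximated with the ASCII-only class [^A-Za-z]. The second test string is contrived purely to show what the rule would accept; it is not a real transaction.

```python
import re

# Rule from the question. Assumption: [^\p{L}] ("any character that is not a
# letter") is approximated as [^A-Za-z], because Python's built-in re module
# has no \p{L} support.
RULE = r"AT&T\*BILL PAYMENT +[0-9][^A-Za-z]*T+[0-9][^A-Za-z]*S+SQPZ"

# Text from the credit card statement in the question.
sample = "AT&T*BILL PAYMENT 800-331-0500 TX"

# Hypothetical string, built only to satisfy every part of the rule.
contrived = "AT&T*BILL PAYMENT 800-331-0500 T1 SSQPZ"

sample_matches = re.search(RULE, sample) is not None
contrived_matches = re.search(RULE, contrived) is not None

print(sample_matches)     # False
print(contrived_matches)  # True
```

Read piece by piece, the rule is a description of text to find, not text to display: " +" is one or more spaces, [0-9] is exactly one digit, [^\p{L}]* is any run (possibly empty) of non-letter characters, T+ is one or more literal T characters, S+ is one or more literal S characters, and SQPZ is literal text. The sample fails because the "T" in "TX" is not followed by a digit and the text "SQPZ" never appears.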

For example, I can suggest one Regex which matches your example: AT&T\*BILL PAYMENT [0-9]{3}-[0-9]{3}-[0-9]{4} TX. But how do you know that it can be only "TX" at the end? What if it should be two arbitrary Latin letters? Then it should be AT&T\*BILL PAYMENT [0-9]{3}-[0-9]{3}-[0-9]{4} [A-Z]{2}. It also matches your example. But how do you know that it should be two? How do you know it should be upper-case? And so on…
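As a rough illustration of the point above, the following Python sketch checks both suggested expressions against the sample. The two extra strings ending in "CA" and "tx" are hypothetical, invented only to show where the two expressions disagree.

```python
import re

# The two expressions suggested above.
ONLY_TX = r"AT&T\*BILL PAYMENT [0-9]{3}-[0-9]{3}-[0-9]{4} TX"
ANY_TWO_UPPER = r"AT&T\*BILL PAYMENT [0-9]{3}-[0-9]{3}-[0-9]{4} [A-Z]{2}"

sample = "AT&T*BILL PAYMENT 800-331-0500 TX"       # from the question
other_state = "AT&T*BILL PAYMENT 800-331-0500 CA"  # hypothetical
lower_case = "AT&T*BILL PAYMENT 800-331-0500 tx"   # hypothetical


def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for text in (sample, other_state, lower_case):
    print(text, matches(ONLY_TX, text), matches(ANY_TWO_UPPER, text))
```

Both expressions accept the sample, so the sample alone cannot tell you which one is right; only the hypothetical variants separate them.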

Your "problem" is a typical problem which is not really formulated. You need to know the exact definition of the format, which should cover the exact set of all possible string values. It can be defined mathematically, or… by a Regular Expression. :-)

—SA

Comments

aajoyce 28-Dec-15 19:53pm

Thank you Sergey. You're right, that "typical problem" wasn't really formulated because I was trying to work only from the info that I had. Sometimes a transaction would have 3 or 4 rules that seemed to be all different, or if not different, so completely alike in almost every detail that it seemed like it was the same. I would import transactions and thought that the import rule (a regex that the program formulated) was what I wanted but it would never be exactly right and I found myself going over and over the same things, such as selecting the proper category, so it would register correctly in my bank program. I was trying to learn what the various regex codes meant so I could adjust them to give more consistent results. So I found myself removing parts of the regex and not really knowing what I was removing. I will continue to expand my learning of regex and perhaps one day I'll have a better grasp of it. I was also making an assumption that the company, such as AT&T always sent the same text with every transaction and couldn't figure out why my program was forming the rule as it did. Anyway …thanks!

Sergey Alexandrovich Kryukov 28-Dec-15 20:35pm

You are welcome. Well, that's simple enough: you can collect some statistics, hypothesize about the set of input strings and generalize it in some reasonable way, but you cannot be 100% sure unless you receive a formal definition from the original source... It's not really related to your knowledge of Regex, which should not be a problem.
—SA
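The "collect samples and generalize" approach could be sketched in Python roughly as follows. The helper name unmatched and every sample except the first are hypothetical, made up for illustration; real samples would come from your own statements.

```python
import re


def unmatched(pattern, samples):
    """Return the samples that the candidate pattern fails to match in full."""
    compiled = re.compile(pattern)
    return [s for s in samples if compiled.fullmatch(s) is None]


samples = [
    "AT&T*BILL PAYMENT 800-331-0500 TX",  # from the question
    "AT&T*BILL PAYMENT 800-331-0500 CA",  # hypothetical
    "AT&T*BILL PAYMENT 8003310500 TX",    # hypothetical, no dashes
]

# Candidate generalization taken from the answer above.
candidate = r"AT&T\*BILL PAYMENT [0-9]{3}-[0-9]{3}-[0-9]{4} [A-Z]{2}"

misses = unmatched(candidate, samples)
print(misses)
```

Any string left in misses shows where the hypothesis needs widening. An empty list only means the rule agrees with the samples seen so far, not that it is guaranteed correct.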