Hi all.

Is it possible to split input text into a list of words? Words should contain only letters.

For example, given "This is Peter's 5th program." I want to get the words "This", "is", "Peter", "program".

Can I do it using only regular expressions? Or is it better to use myString.Split(' '), analyse each word, and remove punctuation, etc.?

Thanks for any help.

That's actually quite difficult, and probably not suited to a regex at all. The problem is that you want to accept the "Peter" from "Peter's" but discard "5th". What you probably want to do is use a dictionary (a proper one, rather than a .NET Dictionary class) and check for actual words. Otherwise, what are you going to do with "Peters'" or "it's"?
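
A minimal sketch of the dictionary idea, written in Python rather than .NET for brevity. `KNOWN_WORDS` and `dictionary_words` are hypothetical names, and the tiny word set is only a stand-in for a real word list loaded from a file:

```python
import re

# Hypothetical stand-in for a real word list (assumption: lower-case entries).
KNOWN_WORDS = {"this", "is", "peter", "program", "it"}

def dictionary_words(text, known_words=KNOWN_WORDS):
    """Return the tokens whose leading run of letters is a known word."""
    result = []
    for token in text.split():
        # Keep only the leading ASCII letters, so "Peter's" becomes "Peter"
        # and "5th" (which starts with a digit) yields no match at all.
        match = re.match(r"[A-Za-z]+", token)
        if match and match.group().lower() in known_words:
            result.append(match.group())
    return result

print(dictionary_words("This is Peter's 5th program."))
```

Note that this sketch also turns "it's" into "it", which may or may not be what you want.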
 
 
Comments
BoxyBrown 18-Aug-11 12:04pm    
I want to count the words in some text and get the most frequent ones (to learn the most frequent unknown words before reading a new book in a foreign language). I thought I would merge words like peter and peters.
OriginalGriff 18-Aug-11 12:07pm    
Yes, but Peter, Peters, Peter's and Peters' all have different meanings... :laugh:
BoxyBrown 18-Aug-11 12:52pm    
I know, but for my task that difference doesn't matter.
BoxyBrown 18-Aug-11 12:05pm    
Anyway thank you. Now I have information to think about.
I would start with Split. This problem is not really a job for Regex, which would also be hard to maintain.

—SA
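
A rough sketch of the split-then-clean approach, combined with the word count mentioned in the comments. It is in Python rather than .NET, and `split_words` and `most_frequent` are hypothetical names. The punctuation set and the handling of possessives are assumptions that suit this task, not general rules:

```python
from collections import Counter

def split_words(text):
    """Split on whitespace, trim punctuation, keep letters-only tokens."""
    words = []
    for token in text.split():
        token = token.strip(".,;:!?\"()")
        # Assumption: a possessive ending ("Peter's", "Peters'") is noise here.
        if token.endswith("'s"):
            token = token[:-2]
        token = token.rstrip("'")
        # isalpha() rejects tokens containing digits, such as "5th".
        if token.isalpha():
            words.append(token.lower())
    return words

def most_frequent(text, n=3):
    """Return the n most common words with their counts."""
    return Counter(split_words(text)).most_common(n)

print(split_words("This is Peter's 5th program."))
print(most_frequent("Peter's dog and Peter's cat", 1))
```

This does not merge "peter" with "peters"; that would need an extra stemming step.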
 
 
This one will work for the input you specified:
(^|\s)(?<word>[a-zA-Z][a-zA-Z']*)

I agree with OriginalGriff that making a regex that works 100% of the time is, if not impossible, then at least nearly so. If you do not require 100% precision, then the regex should work out for you.
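
To see what the pattern does on the sample sentence, here is a small sketch in Python, where named groups are written `(?P<word>...)` instead of the .NET form `(?<word>...)`. `STRICT` is a hypothetical variant, not part of the answer above: it allows letters only and requires the word to end at whitespace, an apostrophe, common punctuation, or the end of the input.

```python
import re

# The pattern from the answer, with Python's named-group syntax.
ORIGINAL = re.compile(r"(^|\s)(?P<word>[a-zA-Z][a-zA-Z']*)")

# Hypothetical variant: letters only, followed by a delimiter or end of input.
STRICT = re.compile(r"(?:^|\s)(?P<word>[a-zA-Z]+)(?=$|[\s'.,;:!?])")

def find_words(pattern, text):
    """Collect the 'word' group of every match."""
    return [m.group("word") for m in pattern.finditer(text)]

sample = "This is Peter's 5th program."
print(find_words(ORIGINAL, sample))
print(find_words(STRICT, sample))
```

On the sample sentence the original pattern keeps the apostrophe and returns "Peter's", while the variant returns "Peter". Both skip "5th" because it does not start with a letter.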
 
 
Comments
Sergey Alexandrovich Kryukov 18-Aug-11 18:15pm    
Won't really work. What do you do with Unicode characters (and even many other ASCII characters)? A "word" does not necessarily mean "a word composed of Latin letters".
--SA
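
One way to address the Unicode point, sketched in Python under the assumption that "letter" means roughly "any word character that is not a digit or underscore". `UNICODE_WORD` and `unicode_words` are hypothetical names, and the delimiter set is an assumption:

```python
import re

# [^\W\d_] matches a word character that is neither a digit nor an underscore,
# which approximates "any Unicode letter" in Python's re module.
UNICODE_WORD = re.compile(r"(?:^|\s)([^\W\d_]+)(?=$|[\s'.,;:!?])")

def unicode_words(text):
    """Return letters-only words, including non-Latin ones."""
    return [m.group(1) for m in UNICODE_WORD.finditer(text)]

print(unicode_words("Das ist M\u00fcllers 5. Programm."))
```

The approximation is not exact: a few non-letter characters that count as alphanumeric can slip through, so treat it as a starting point.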

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


