Hi,

I'm working on extracting values from meta tags. Most of it works, but I'm stuck on a meta tag like the one below:
<meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image">

From this tag I'm not able to extract the URL string held in the content attribute.

What I have tried:

// Requires: using System.Net; using System.Text.RegularExpressions;
// `url` and `metadata` are defined elsewhere in my code.
Regex meta = new Regex(@"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'" +
                          @"[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>");

string s;
using (WebClient client = new WebClient())
{
    // Add a user agent header in case the
    // requested URI contains a query.
    client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

    // Download the page once; the client is disposed when the block ends.
    s = client.DownloadString(url);
}

MatchCollection mc = meta.Matches(s);
foreach (Match m in mc)
{
    // Group 0 is the whole matched tag. Group 1 is the repeated (\w|-) group,
    // which keeps only its last capture (a single character).
    for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
    {
        metadata.Add(m.Groups[gIdx].Value);
    }
}



Is there a solution?
Updated 9-May-16 4:32am

Use a regex debugger to see where the match fails:
Debuggex: Online visual regex tester. JavaScript, Python, and PCRE.
Paste your regex.
Paste the data to match.
Use the cursor to see where the match fails.
Once you have a valid regex, use the Code Snippet button at the top.

You will see that the problem is not what you think.

perlre - perldoc.perl.org

[Update]
Note: There is more than one regex dialect. JavaScript regex is not C# regex; the differences are in the details.
Find out which dialect C# uses and how it differs.
By the way, JavaScript and C# strings do not handle special characters the same way.
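
To illustrate the point about string handling, here is a minimal sketch of how the same regex text has to be written in the two kinds of C# string literal. In a verbatim string (@"...") a double quote is written as "" and a backslash is literal; in a regular string both need a backslash escape. The class name, method name, and the short pattern are made up for this example and are not the pattern from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteEscapingDemo
{
    // Verbatim string: backslashes are literal, a double quote is written as "".
    public const string VerbatimPattern = @"content\s*=\s*""([^""]*)""";

    // Regular string: backslashes and double quotes each need a backslash escape.
    public const string RegularPattern = "content\\s*=\\s*\"([^\"]*)\"";

    // Returns the first double-quoted content value, or null if none is found.
    public static string FirstContent(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = "<meta content=\"/images/branding/googleg/1x/googleg_standard_color_128dp.png\" itemprop=\"image\">";

        // Both literals produce the same regex text.
        Console.WriteLine(VerbatimPattern == RegularPattern); // True
        Console.WriteLine(FirstContent(html, VerbatimPattern));
        // /images/branding/googleg/1x/googleg_standard_color_128dp.png
    }
}
```

A pattern copied from an online tester usually contains bare " characters, so it will not compile when pasted into either kind of C# literal until the quotes are escaped in the style that literal expects.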
 
Comments
[no name] 9-May-16 7:40am    
I checked the following regex:

@"< meta\s * (?: (?:\b(\w | -) +\b\s * (?:=\s * (?: "[^"]*"|'[^']*'|[^"'<> ]+)\s*)?)*)/?\s*content[\\s]?=[\\s\"\']+(.*?)[\"\']+.*?/>"

It works perfectly fine, but when I use the same pattern in C# it gives me an error.

Am I making a mistake?
Patrice T 9-May-16 8:20am    
Your regex looks overly complicated. It should be possible to simplify it.
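
As a rough sketch of what a simpler pattern might look like, the code below matches a meta tag and captures only the value of its content attribute, in double or single quotes and regardless of attribute order. The class and method names are invented for the example. It assumes that no attribute value contains a > character and that the word "content" does not appear inside another attribute's value, so it is not a general HTML solution.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MetaContentRegex
{
    // Group 1 captures the opening quote, \1 requires the same closing quote,
    // and the named group "value" holds the attribute value in between.
    private static readonly Regex MetaContent = new Regex(
        @"<meta\b[^>]*?\bcontent\s*=\s*([""'])(?<value>.*?)\1[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the content value of every meta tag that has a quoted content attribute.
    public static List<string> Extract(string html)
    {
        var values = new List<string>();
        foreach (Match m in MetaContent.Matches(html))
        {
            values.Add(m.Groups["value"].Value);
        }
        return values;
    }

    public static void Main()
    {
        string html =
            "<meta content=\"/images/branding/googleg/1x/googleg_standard_color_128dp.png\" itemprop=\"image\">" +
            "<meta name='description' content='Hello'/>";

        foreach (string value in Extract(html))
        {
            Console.WriteLine(value);
        }
    }
}
```

Reading the named group instead of looping over every group avoids storing the whole tag and other unwanted captures alongside the value.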
Instead of using Regex, you can use an HTML parser. First of all, I would recommend HTML Agility Pack: Html Agility Pack - Home.

—SA
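
A minimal sketch of the parser approach suggested above, assuming the HtmlAgilityPack NuGet package is referenced by the project. The class and method names are made up for the example; the XPath //meta[@content] selects every meta element that has a content attribute.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class MetaContentParser
{
    // Returns the content attribute of every meta element in the given HTML.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//meta[@content]");
        if (nodes == null)
        {
            return values;
        }

        foreach (HtmlNode node in nodes)
        {
            values.Add(node.GetAttributeValue("content", string.Empty));
        }
        return values;
    }

    public static void Main()
    {
        string html = "<html><head><meta content=\"/images/branding/googleg/1x/googleg_standard_color_128dp.png\" itemprop=\"image\"></head></html>";
        foreach (string value in Extract(html))
        {
            Console.WriteLine(value);
        }
    }
}
```

With a parser, quoting style, attribute order, and unclosed tags are handled by the library, so no pattern has to be maintained for them.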
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


