Detecting URLs in Text

JohnnyCee

9 Apr 2016 · CPOL · 4 min read

This article demonstrates how to use simple parsing techniques to detect URLs in user-supplied text.

Introduction

I am developing a program where one of the minor features is detecting one or more URLs in long, user-supplied text fields in order to wrap the URL in an HTML "A" element, i.e., to "linkify" the URLs. This article describes how I chose to proceed, and why I chose to (mostly) ignore the formal specification of a URL and focus on other factors associated with finding URLs in text.

To Regex, or Not to Regex?

If you need to find URLs in text, you will probably search the web for existing solutions. When you do, you'll find many articles and examples that show how to use Regex patterns to find URLs. None of the items I found seemed appropriate for my application.

The solutions I found were more concerned with the formal specifications for a URL than they were with the tactical problems of finding URLs in user-supplied text. Multiple authors described a fundamental problem: URLs are complex beasts and it is difficult to detect them in plain text because the rules that determine what can and cannot be part of URLs are often at odds with how end users enter them in text.

Valid URLs may contain a wide range of characters, with specific rules for the characters in the domain, path, query, and fragment segments of the URL. As a result, a Regex pattern to match a URL is long and involved. I found patterns that ranged from a couple dozen characters to more than 2,000 characters. Despite the length and complexity of those patterns, none claimed to accurately match all valid URLs in any given text.

Some example patterns exclude URLs that the author wants to ignore, such as "http://localhost", and the like. Removing or modifying those exclusions would require reverse-engineering the patterns, and while I'd probably have to do that before including the pattern in my application, I wasn't looking forward to it!

A more important issue with the patterns I found was the failure to accurately detect the end of the URL.

For example, URLs may legitimately include parentheses (they are common in Wikipedia URLs), but people also often wrap URLs in parentheses, as in "(http://www.example.com)". That combination proved challenging: the patterns I found included the trailing ) in the URL.

In the text I need to process, ending punctuation often follows a URL, like this: "Is it http://example.com?" Most punctuation is valid in URLs, including common sentence-ending punctuation like question mark, comma, period, and exclamation mark, but when that punctuation is followed by white space or the end of the text, the punctuation is typically not part of the URL. The patterns I found did not account for those cases.
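To make the end-of-URL problem concrete, here is a small sketch that uses a deliberately naive pattern of my own choosing: a scheme followed by any run of non-whitespace characters. It is not one of the patterns from the articles mentioned above, and the NaivePatternDemo class and FirstMatch method are hypothetical names used only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class NaivePatternDemo {
   // Deliberately naive: a scheme followed by any run of
   // non-whitespace characters. Illustration only.
   private static readonly Regex Naive = new Regex(@"https?://\S+",
         RegexOptions.IgnoreCase);

   // Returns the first match in the text, or null when there is none.
   public static string FirstMatch(string text) {
      var match = Naive.Match(text);
      return match.Success ? match.Value : null;
   }
}
```

With this pattern, FirstMatch("(http://www.example.com)") returns http://www.example.com) and FirstMatch("Is it http://example.com?") returns http://example.com?, so the closing parenthesis and the question mark both end up inside the match.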

Rather than use a complex Regex pattern that did not suit my requirements, or try to amend those patterns to meet my requirements, I chose a different approach.

Perfect is the enemy of good

For my approach, I decided to focus on the text surrounding the URL candidates, and I defined a relatively simple set of rules:

All URLs must begin with http://, https://, or ftp://, AND
A URL must be surrounded by a delimiter pair, such as ( and ), OR
A URL must be followed by common ending punctuation that is itself followed by whitespace or the end of the text, OR
A URL must be followed by a whitespace character or the end of the text.

I understand that those rules will not detect all URLs (false negatives), and will detect invalid URLs (false positives). For my application, I can accept those failures. I can instruct users how to adapt their text to avoid issues with problematic constructs.
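As a rough illustration of both kinds of failure, the sketch below applies only the first rule, using the same prefix pattern as the code later in this article. The PrefixRuleDemo class and HasCandidate method are hypothetical names introduced for this example.

```csharp
using System.Text.RegularExpressions;

public static class PrefixRuleDemo {
   // Same prefix pattern the article's code uses to find URL candidates.
   private static readonly Regex UrlPrefix = new Regex("(https?|ftp)://",
         RegexOptions.IgnoreCase);

   // True when the text contains at least one URL candidate.
   public static bool HasCandidate(string text) {
      return UrlPrefix.IsMatch(text);
   }
}
```

HasCandidate("Visit www.example.com today") returns false, because the address has no scheme (a false negative). HasCandidate("See http://???") returns true, even though what follows the prefix is not a valid host (a false positive).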

Understanding the code

ReplaceUrls() is a string extension that accepts a delegate method. It finds the URLs, but leaves it up to the delegate to supply the replacement text. Typically, the delegate will wrap the URL in HTML to linkify it. The delegate may choose, based on its own logic, to ignore the URL by returning the URL text unchanged.
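A delegate that links some URLs and ignores others might look like the following sketch. The SelectiveLinkifier class, the LinkifyWebOnly method, and the choice to skip ftp:// URLs are assumptions made for illustration; they are not part of the code in this article. The method has the same shape as the delegate (a string in, a string out), so it could be passed to ReplaceUrls().

```csharp
using System;
using System.Net;

public static class SelectiveLinkifier {
   // Hypothetical evaluator: links http:// and https:// URLs and
   // returns ftp:// URLs unchanged.
   public static string LinkifyWebOnly(string url) {
      if (url.StartsWith("ftp://", StringComparison.OrdinalIgnoreCase)) {
         // Returning the URL unchanged leaves the user's text as typed.
         return url;
      }

      // Encode the URL so characters such as & cannot break the markup.
      var encoded = WebUtility.HtmlEncode(url);
      return String.Format("<a href=\"{0}\">{0}</a>", encoded);
   }
}
```

Called directly, LinkifyWebOnly("ftp://files.example.com/x") returns its argument as-is, while LinkifyWebOnly("http://example.com") returns the URL wrapped in an "A" element.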

ReplaceUrls() uses a simple Regex to find the start of URLs, and one of two methods to find the end of each URL based on whether or not the URL appears to be wrapped in a delimiter pair.

The GetUrlDelimiter() method inspects the text to determine if the URL is wrapped in a delimiter pair. The valid delimiters are ( and ), [ and ], « and », and single or double quotes. If the character immediately before the URL is the opening character of one of the delimiter pairs, ReplaceUrls() uses FindEndOfDelimitedUrl() to find the matching closing character; for parentheses, it counts nested pairs so that parentheses inside the URL are kept. Otherwise, ReplaceUrls() uses FindEndOfUrl() to find the end.
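The delimiter pairs can also be pictured as a small lookup from opening to closing character. The sketch below is an alternative way to express the same idea, not the approach the article's code takes; the DelimiterPairs class and its ClosingFor and OpeningBefore methods are hypothetical names.

```csharp
public static class DelimiterPairs {
   // Returns the closing character for an opening delimiter,
   // or '\0' when the character is not an opening delimiter.
   public static char ClosingFor(char opening) {
      switch (opening) {
         case '(': return ')';
         case '[': return ']';
         case '\u00AB': return '\u00BB';   // « and »
         case '"': return '"';
         case '\'': return '\'';
         default: return '\0';
      }
   }

   // Returns the opening delimiter immediately before position
   // "start" in the text, or '\0' when there is none.
   public static char OpeningBefore(string text, int start) {
      if (start <= 0 || start > text.Length) {
         return '\0';
      }
      var candidate = text[start - 1];
      return ClosingFor(candidate) == '\0' ? '\0' : candidate;
   }
}
```

For "(http://example.com)", where the URL starts at position 1, OpeningBefore returns ( and ClosingFor maps it to ). For "see http://example.com", where the URL starts at position 4, the preceding character is a space, so OpeningBefore returns '\0' and the URL is treated as not delimited.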

Using the code

Here's an example of using ReplaceUrls():

text = text.ReplaceUrls(LinkifyUrl);

...

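public string LinkifyUrl(string url) {
    // HTML-encode the URL so characters such as & and " cannot break the markup
    var encoded = System.Net.WebUtility.HtmlEncode(url);
    return String.Format("<a href=\"{0}\">{0}</a>", encoded);
}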

The code

using System.Text;
using System.Text.RegularExpressions;

namespace JohnCardinal.Html {
   public delegate string UrlEvaluator(string url);

   internal static class UrlExtensions {
      private const char kLeftPointingDoubleAngle = '\u00AB';
      private const char kRightPointingDoubleAngle = '\u00BB';
      private const char kNoDelimiter = '\0';

      private static readonly Regex UrlPrefix = new Regex("(https?|ftp)://",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

      public static string ReplaceUrls(this string text, UrlEvaluator evaluator) {
         var matches = UrlPrefix.Matches(text);
         if (matches.Count == 0) {
            return text;
         }

         int copied = 0;
         var sb = new StringBuilder();

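         foreach(Match match in matches) {
            // Skip a prefix that falls inside a URL already processed,
            // e.g. the second "http://" in "http://a.com/?u=http://b.com";
            // otherwise that text would be appended twice.
            if (match.Index < copied) {
               continue;
            }

            if (match.Index > copied) {
               sb.Append(text, copied, match.Index - copied);
            }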

            var delimiter = GetUrlDelimiter(text, match);
            var end = (delimiter == kNoDelimiter) ?
                  FindEndOfUrl(text, match) :
                  FindEndOfDelimitedUrl(text, match, delimiter);

            var url = text.Substring(match.Index, end - match.Index + 1);
            if (url.Length > match.Length) {
               sb.Append(evaluator(url));
            }
            else {
               sb.Append(url);
            }

            copied = end + 1;
         }

         if (text.Length > copied) {
            sb.Append(text, copied, text.Length - copied);
         }

         return sb.ToString();
      }

      private static int FindEndOfUrl(string text, Match match) {
         const string kWhitespace = " \r\n\t";

         var index = match.Index;
         while (index < text.Length) {
            switch (text[index]) {
               case ' ':
               case '\r':
               case '\n':
               case '\t':
                  // whitespace ends the URL
                  return index - 1;

               case '.':
               case ',':
               case '!':
               case '?':
               case ':':
               case ';':
                  // common punctuation followed by whitespace
                  // ends the URL
                  if (index < text.Length - 1) {
                     if (kWhitespace.IndexOf(text[index + 1]) != -1) {
                        return index - 1;
                     }
                  }
                  // common punctuation at the end of the text
                  // ends the URL
                  else if (index == text.Length - 1) {
                     return index - 1;
                  }
                  break;

            }
            index++;
         }
         return index - 1;
      }

      private static int FindEndOfDelimitedUrl(string text, Match match, char delimiter) {
         var nested = 1;

         var index = match.Index;
         while (index < text.Length) {
            switch (text[index]) {
               case ' ':
               case '\r':
               case '\n':
               case '\t':
                  // whitespace ends the URL
                  return index - 1;

               case '"':
                  if (delimiter == '"') {
                     return index - 1;
                  }
                  break;

               case '\'':
                  if (delimiter == '\'') {
                     return index - 1;
                  }
                  break;

               case '(':
                  if (delimiter == '(') nested++;
                  break;

               case ')':
                  if (delimiter == '(') {
                     nested--;
                     if (nested == 0) {
                        return index - 1;
                     }
                  }
                  break;

               case kRightPointingDoubleAngle:
                  if (delimiter == kLeftPointingDoubleAngle) {
                     return index - 1;
                  }
                  break;

               case ']':
                  if (delimiter == '[') {
                     return index - 1;
                  }
                  break;
            }
            index++;
         }

         return index - 1;
      }

      private static char GetUrlDelimiter(string text, Match match) {
         const string kDelimiters = "\"'([\u00AB";

         if (match.Index > 0) {
            var index = match.Index - 1;

            if (kDelimiters.IndexOf(text[index]) != -1) {
               return text[index];
            }
         }
         return kNoDelimiter;
      }
   }
}

You may want to adjust the delimiters to suit the type of text you need to process.

History

v1.0 - 2016-04-07

v1.1 - 2016-04-09 Removed unnecessary "using" statement

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

JohnnyCee

Web Developer

United States
