Tip/Trick

A Simple and Powerful Library to Deal with Web Robots Control Strategy

6 Mar 2014, MIT License
How to parse robots.txt and robots meta tag

Introduction

In this tip, I'll present my library, WWW RobotRules (https://robotrules.codeplex.com/). It is a simple library for parsing robots.txt files and robots meta tags. The library follows RFC 1808 (relative URLs) and RFC 1945 (HTTP/1.0).

Using the Code

Configuration

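  • RobotRulesUseCache: Boolean, activates or deactivates cache support
  • RobotRulesCacheLibrary: type definition string of the cache implementation, optional if RobotRulesUseCache is False
  • RobotRulesCacheTimeout: cache timeout, written as hh:mm:ss in the sample below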
XML
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="RobotRulesUseCache" value="False"/>
    <add key="RobotRulesCacheLibrary" value="RobotRules.Cache.MemoryCache, RobotRules"/>
    <add key="RobotRulesCacheTimeout" value="00:01:00" />
  </appSettings>
</configuration>  
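
The tip does not show how the library reads these keys, so the following is only a minimal sketch of how the three values could be interpreted. The `CacheSettings` class and its members are hypothetical names introduced for illustration; they are not part of RobotRules.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical helper (NOT part of RobotRules): interprets the three
// appSettings values shown above.
public sealed class CacheSettings
{
    public bool UseCache { get; private set; }
    public string CacheLibrary { get; private set; }
    public TimeSpan CacheTimeout { get; private set; }

    public static CacheSettings From(NameValueCollection appSettings)
    {
        // A missing or malformed flag is treated as "cache disabled".
        bool useCache;
        bool.TryParse(appSettings["RobotRulesUseCache"], out useCache);

        // "00:01:00" parses as one minute; fall back to zero when absent.
        TimeSpan timeout;
        if (!TimeSpan.TryParse(appSettings["RobotRulesCacheTimeout"], out timeout))
        {
            timeout = TimeSpan.Zero;
        }

        return new CacheSettings
        {
            UseCache = useCache,
            // Assembly-qualified type name; only meaningful when UseCache is true.
            CacheLibrary = appSettings["RobotRulesCacheLibrary"],
            CacheTimeout = timeout
        };
    }
}
```

With a reference to System.Configuration, `CacheSettings.From(ConfigurationManager.AppSettings)` would read the values from the file above.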

Use the Library

First, define a new parser with your robot user agent:

C#
using RobotRules;

// The parser identifies itself with this user agent string.
private RobotsFileParser RobotRules = new RobotsFileParser()
{
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};

Then, use it like this:

C#
// Load the robot rules for the site, then check whether the URL may be crawled.
RobotRules.Parse(new Uri("http://blablabla.com"));
if (RobotRules.IsAllowed("Googlebot", new Uri("http://blablabla.com"))) {
   // your code ...
}
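
To make clearer what a check like `IsAllowed` involves, here is a small self-contained sketch of robots.txt matching. It is not the library's implementation: `RobotsTxtSketch` is a hypothetical name, the text is passed in as a string instead of being downloaded, and I assume the longest-match precedence described in RFC 9309 (the library may resolve conflicts differently).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration (NOT the RobotRules implementation).
public static class RobotsTxtSketch
{
    // True when `agent` may fetch `path` according to the robots.txt text.
    public static bool IsAllowed(string robotsTxt, string agent, string path)
    {
        var specific = new List<KeyValuePair<bool, string>>(); // groups naming the agent
        var wildcard = new List<KeyValuePair<bool, string>>(); // "User-agent: *" groups
        bool forAgent = false, forAll = false, inRules = false, foundSpecific = false;

        foreach (string raw in robotsTxt.Split('\n'))
        {
            string line = raw;
            int hash = line.IndexOf('#');            // strip comments
            if (hash >= 0) line = line.Substring(0, hash);
            int colon = line.IndexOf(':');
            if (colon < 0) continue;                 // blank or malformed line

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                // A User-agent line after rule lines starts a new group.
                if (inRules) { forAgent = false; forAll = false; inRules = false; }
                if (value == "*") forAll = true;
                else if (value.Length > 0 &&
                         agent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    forAgent = true;
                    foundSpecific = true;
                }
            }
            else if (field == "allow" || field == "disallow")
            {
                inRules = true;
                if (value.Length == 0) continue;     // empty rule restricts nothing
                var rule = new KeyValuePair<bool, string>(field == "allow", value);
                if (forAgent) specific.Add(rule);
                if (forAll) wildcard.Add(rule);
            }
        }

        // A group naming the agent takes precedence over the wildcard group.
        List<KeyValuePair<bool, string>> rules = foundSpecific ? specific : wildcard;

        bool allowed = true;   // no matching rule means the path is allowed
        int best = -1;
        foreach (var rule in rules)
        {
            if (!path.StartsWith(rule.Value, StringComparison.Ordinal)) continue;
            int length = rule.Value.Length;
            // Longest prefix wins; on a tie, Allow wins.
            if (length > best || (length == best && rule.Key))
            {
                best = length;
                allowed = rule.Key;
            }
        }
        return allowed;
    }
}
```

The sketch only handles plain path prefixes; wildcards such as `*` and `$` inside rules are left out to keep it short.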

This works for robots.txt, but what if the robot control rules are embedded in the HTML itself?

Sample

HTML
<!DOCTYPE html>
 
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title>Test</title>
    <meta name="robots" content="nofollow"/>
</head>
<body>
 
</body>
</html>

No problem: pass the HTML content to the library like this:

C#
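RobotsFileParser RobotRules = new RobotsFileParser()
{
    // Keep the user agent on one line: a verbatim string would otherwise contain the line break.
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};

// Replace "HTML CONTENT" with the markup of the page to check.
RobotControlStrategy strategy = RobotRules.CheckRobotControlStrategy("Googlebot", "HTML CONTENT");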

if (strategy.CanFollow)
{
    // your code
}
if (strategy.CanIndex)
{
    // your code
}
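
For readers curious about what such a check has to do, the sketch below reads the robots meta tag from a page and derives "can index" and "can follow" flags. It is an illustration under my own assumptions, not the library's code: `MetaRobotsSketch` is a hypothetical name, and a regular expression stands in for a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration (NOT the RobotRules implementation).
public static class MetaRobotsSketch
{
    private static readonly Regex MetaTag =
        new Regex(@"<meta\b[^>]*>", RegexOptions.IgnoreCase);

    // Returns the value of a quoted attribute inside a tag, or null if absent.
    private static string Attr(string tag, string name)
    {
        Match m = Regex.Match(
            tag,
            @"\b" + name + @"\s*=\s*[""']([^""']*)[""']",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Check(string html, string agent,
                             out bool canIndex, out bool canFollow)
    {
        // Without any directive, indexing and following are both permitted.
        canIndex = true;
        canFollow = true;

        foreach (Match tag in MetaTag.Matches(html))
        {
            string name = Attr(tag.Value, "name");
            if (name == null) continue;

            // The tag applies to every robot ("robots") or to the named agent.
            bool applies =
                name.Equals("robots", StringComparison.OrdinalIgnoreCase) ||
                name.Equals(agent, StringComparison.OrdinalIgnoreCase);
            if (!applies) continue;

            string content = Attr(tag.Value, "content") ?? "";
            foreach (string part in content.Split(','))
            {
                string directive = part.Trim().ToLowerInvariant();
                // "none" is shorthand for "noindex, nofollow".
                if (directive == "noindex" || directive == "none") canIndex = false;
                if (directive == "nofollow" || directive == "none") canFollow = false;
            }
        }
    }
}
```

For the sample page above, which declares `content="nofollow"`, this sketch reports that the page may be indexed but its links should not be followed.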

Points of Interest

  • Use MEF to load the cache plugin instead of reflection

History

  • V1: 03/06/2014
  • V1.5.2.4
    • ICache now inherits from IDisposable
    • Fix cache initialization
    • RobotsFileParser is disposable
    • RobotsFileParser exposes the method ClearCache()
    • Add new configuration key RobotRulesCacheTimeout to specify cache timeout

License

This article, along with any associated source code and files, is licensed under The MIT License


Written By
Software Developer
France (Metropolitan)
