I am writing a web crawler in Node.js with the request module, using Redis as the URL cache.
What I am trying to accomplish is an endless crawler loop that gets a URL from Redis and makes an HTTPS request.

I have tried to limit the number of parallel requests, for example to 10.
I do not know why, but whether I choose 10 or 1000, the outcome is always the same:
only between 100 and 200 requests get processed per minute.

Please give me a hint about what I am doing wrong. I would like to process 60,000 requests (URLs) per minute, but even if I choose 10k parallel requests as the limit, it processes only between 100 and 200 requests per minute.
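
As a rough sanity check on these numbers: by Little's law, sustained throughput is approximately the number of requests in flight divided by the average time each one takes. The sketch below works through that arithmetic. The 500 ms latency is purely an assumed figure, and `requiredConcurrency` and `impliedLatencyMs` are hypothetical helper names, not part of any library.

```javascript
// Little's law: inFlight = throughputPerSecond * latencySeconds.

// How many requests must be in flight to reach a target rate,
// given an (assumed) average latency per request.
function requiredConcurrency(targetPerMinute, avgLatencyMs) {
  return Math.ceil((targetPerMinute * avgLatencyMs) / 60000);
}

// What average latency a measured rate would imply if `inFlight`
// requests really were running in parallel the whole time.
function impliedLatencyMs(inFlight, observedPerMinute) {
  return (inFlight * 60000) / observedPerMinute;
}

console.log(requiredConcurrency(60000, 500)); // 500
console.log(impliedLatencyMs(10, 200));       // 3000
console.log(impliedLatencyMs(1000, 200));     // 300000
```

If 1000 requests were truly in flight at 200 requests per minute, each would have to take about five minutes on average. That seems unlikely, which suggests the configured limit is not what actually bounds the throughput.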

My code is below.

The crawler loop:
JavaScript
var limit   = 1000;       // parallel requests limit
var running = 0;  

function loop() {
  while(running < limit) {
    req(function(){
      running--;
      loop();
    });  
    running++;
  }
}
loop();

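// and the req function to make the https requests to the urls given by redis:

// select the redis database once; commands are sent in order on this connection
client.select(2);

function req(callback) {
  client.RANDOMKEY([], function (err, url) {
    // check for an error or an empty database before touching the key
    if (err || url == null || url === "") {
      return callback();
    }
    // remove the key so the same url is not crawled again
    client.del(url);
    // the 10 second timeout is an assumed value; without one,
    // a single unresponsive host can occupy a slot indefinitely
    request({ url: url, timeout: 10000 }, function (error, response, body) {
      return callback();
    });
  });
}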
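
For comparison, here is a minimal sketch of the same "at most N requests in flight" idea written with promises. It is not the original code: `crawl`, `nextUrl` and `fetchUrl` are hypothetical names, with `nextUrl` standing in for the Redis lookup and `fetchUrl` for the HTTPS request. Unlike the endless loop above, each lane stops once `nextUrl` returns nothing.

```javascript
// Runs up to `limit` lanes; each lane handles one url at a time.
// `nextUrl` and `fetchUrl` are hypothetical stand-ins supplied by the caller.
async function crawl(limit, nextUrl, fetchUrl) {
  let processed = 0;

  async function lane() {
    for (;;) {
      const url = await nextUrl();
      if (url == null) return; // nothing left: stop this lane
      try {
        await fetchUrl(url);
      } catch (err) {
        // a failed request should not stop the lane
      }
      processed++;
    }
  }

  const lanes = [];
  for (let i = 0; i < limit; i++) lanes.push(lane());
  await Promise.all(lanes);
  return processed;
}

// Demo with an in-memory queue and a fake 10 ms "request".
async function demo() {
  const queue = Array.from({ length: 50 }, (_, i) => "https://example.com/" + i);
  let inFlight = 0;
  let peak = 0;
  const nextUrl = async () => queue.shift();
  const fetchUrl = async (url) => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await new Promise((resolve) => setTimeout(resolve, 10));
    inFlight--;
  };
  const processed = await crawl(5, nextUrl, fetchUrl);
  return { processed, peak };
}
```

Calling `demo()` resolves with the number of URLs processed and the peak number of simultaneous fake requests, which makes it easy to confirm that the limit is actually being reached.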


What I have tried:

I tried playing with setTimeout to fire more concurrent requests, but even that did not work. No matter what I tried, I never got more than 200 requests processed in a minute. I cannot believe this is the limit of what Node.js is capable of.
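
One way to see where the time goes is to count completions separately for the Redis step and the HTTPS step. The following is only a sketch: `createRateMeter` is a hypothetical helper, and the injectable `now` parameter exists only so the meter can be checked with a fake clock.

```javascript
// Counts events and reports the average rate since creation.
// `now` defaults to Date.now; pass a fake clock for testing.
function createRateMeter(now = Date.now) {
  const start = now();
  let count = 0;
  return {
    tick() {
      count++;
    },
    perMinute() {
      const elapsedMs = now() - start;
      return elapsedMs > 0 ? (count * 60000) / elapsedMs : 0;
    },
  };
}

// Example with a fake clock: 100 events in 30 seconds.
let fakeTime = 0;
const meter = createRateMeter(() => fakeTime);
for (let i = 0; i < 100; i++) meter.tick();
fakeTime = 30000;
console.log(meter.perMinute()); // 200
```

Calling `tick()` once in the RANDOMKEY callback and once in the request callback, on two separate meters, would show whether URLs are being handed out faster than responses come back.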
Updated 6-May-20 0:56am
Comments
Richard MacCutchan 4-May-20 15:21pm    
"Captain, I canna change the laws of physics".
[no name] 4-May-20 15:45pm    
... And all ISPs are the same. And all web sites. And all links. And all pages.
ZurdoDev 4-May-20 16:03pm    
It's more likely a limit of what's running it.
NodejsToGo 4-May-20 16:15pm    
I am running on Ubuntu on an Ionos cloud server. I really do not have a clue what is limiting my requests. Do you? Node js could easily run 1000k requests per second normally.
Dave Kreskowiak 4-May-20 17:41pm    
Node js could easily run 1000k requests per second normally.
Yeah, if you were writing a SERVER application that's taking inbound requests.

Outbound requests to that many servers, not so much.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


