I am writing a web crawler in Node.js with the request module and Redis as the URL cache.
What I am trying to accomplish is an endless crawler loop that gets a URL from Redis and makes an HTTPS request.
I have tried to add a limit on parallel requests, for example 10 parallel requests.
I do not know why, but no matter whether I choose 10 or 1000, the outcome is always the same: within one minute only between 100 and 200 requests get processed.
Please give me a hint about where I am going wrong. I would like to process 60,000 requests (URLs) per minute, but even if I set the limit to 10,000 parallel requests, it still processes only between 100 and 200 requests per minute.
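For scale: 60,000 requests per minute is 1,000 per second, so with a limit of 10 in flight each request would have to finish in about 10 ms on average, and with 1,000 in flight in about one second. The 100 to 200 requests per minute I am actually seeing work out to only 2 to 3 requests per second.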
My code is below.

the setup and crawler loop:
var request = require('request');
var redis = require('redis');

var client = redis.createClient();
client.select(2); // the URL cache lives in Redis database 2

var limit = 1000;  // maximum number of requests in flight
var running = 0;   // requests currently in flight

// top up to `limit` in-flight requests; every finished request triggers a refill
function loop() {
    while (running < limit) {
        running++;
        req(function () {
            running--;
            loop();
        });
    }
}

loop();
the request function:

// pick a random URL from Redis, delete it so it is not crawled twice,
// then fetch it
function req(callback) {
    client.RANDOMKEY(function (err, url) {
        if (err || url == null || url === "") {
            return callback();
        }
        client.del(url);
        request(url, function (error, response, body) {
            return callback();
        });
    });
}
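For completeness, this is roughly how I measure the throughput (a simplified sketch of my measurement code; `processed` is just a counter I increment in the request callback):

var processed = 0; // incremented with processed++ inside the request callback in req()

// log and reset the counter once per minute
setInterval(function () {
    console.log(processed + ' requests processed in the last minute');
    processed = 0;
}, 60 * 1000);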
What I have tried:
I tried to play with setTimeout to fire more concurrent requests, but no matter what I tried I never got more than 200 requests processed in a minute. I cannot believe that this is the limit Node.js is capable of.
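One of the setTimeout variants looked roughly like this (reconstructed from memory; the batch size of 100 and the 100 ms delay are just example values I played with):

// fire a batch of requests, then schedule the next batch with setTimeout
function burst() {
    for (var i = 0; i < 100 && running < limit; i++) {
        running++;
        req(function () {
            running--;
        });
    }
    setTimeout(burst, 100);
}
burst();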