I have a Scrapy script I wrote to extract profiles from LinkedIn using a proxy service. The proxy I am using is ScrapeOps. I created a virtual environment and ran pip install scrapeops-scrapy-proxy-sdk. I also added the proxy API key to my Scrapy project settings, following the proxy's usage rules. When I run my Scrapy script, it runs successfully with no errors but returns an empty result. What am I missing?
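
Roughly, the proxy settings I added look like this (a sketch following the SDK's documented pattern; the API key below is a placeholder, not my real key):

Python
# settings.py -- ScrapeOps proxy SDK configuration (sketch)
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'  # placeholder
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}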

Here is my code

What I have tried:

Python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProfilespiderSpider(CrawlSpider):
    name = 'profilespider'
    allowed_domains = ['www.linkedin.com']
    start_urls = ['https://www.linkedin.com/in/reidhoffman?trk=people-guest_people_search-card']

    rules = (
        Rule(LinkExtractor(allow='public_jobs_people-search-bar_search-submit')),
        Rule(LinkExtractor(allow='people-guest_people_search-card'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}

        # Profile summary
        item['name'] = response.css("div.top-card-layout__entity-info-container h1::text").get().strip()
        item['description'] = response.css("div.top-card-layout__entity-info-container h2::text").get().strip()

        try:
            item['location'] = response.css('section.top-card-layout div.top-card__subline-item::text').get()
        except:
            item['location'] = response.css('section.top-card-layout span.top-card__subline-item::text').get().strip()
            if 'followers' in item['location'] or 'connections' in item['location']:
                item['location'] = ''

        contacts = response.css("div.top-card-layout__entity-info-container span.top-card__subline-item::text").getall()
        item['followers'] = contacts[0].replace('followers', '').strip()
        item['connections'] = contacts[1].replace('connections', '').strip()

        # About section
        item['about'] = response.css(".summary p::text").getall()

        # Experience section
        item['experience'] = []
        experience_blocks = response.css('li.experience-item')
        for block in experience_blocks:
            experience = {}

            # Position
            try:
                experience['position'] = block.css('li.experience-item h3::text').get().strip()
            except Exception:
                experience['position'] = ''

            # Organisation
            try:
                experience['organisation_profile'] = block.css('h4 a::attr(href)').get().split('?')[0]
            except Exception:
                print('No organisation profile found')
                experience['organisation_profile'] = ''

            # Dates
            try:
                date_ranges = block.css('span.date-range time::text').getall()
                if len(date_ranges) == 2:
                    experience['start_time'] = date_ranges[0]
                    experience['end_time'] = date_ranges[1]
                    experience['duration'] = block.css('span.date-range__duration::text').get()
                elif len(date_ranges) == 1:
                    experience['start_time'] = date_ranges[0]
                    experience['end_time'] = 'present'
                    experience['duration'] = block.css('span.date-range__duration::text').get()
            except Exception:
                print('No dates')
                experience['start_time'] = ''
                experience['end_time'] = ''
                experience['duration'] = ''

            # Location
            try:
                experience['location'] = block.css('p.experience-item__location::text').get().strip()
            except Exception:
                print('No location')
                experience['location'] = ''

            # Description
            try:
                experience['description'] = block.css('p.show-more-less-text__text--more::text').get().strip()
            except Exception:
                try:
                    experience['description'] = block.css('p.show-more-less-text__text--less::text').get().strip()
                except Exception:
                    print('No description found')
                    experience['description'] = ''

            item['experience'].append(experience)

        # Education section
        item['education'] = []
        education_groups = response.css('li.education__list-item')
        for group in education_groups:
            education = {}

            # University
            education['university_link'] = group.css('h3 a::attr(href)').get().split('?')[0]

            # Degrees
            try:
                degree_info = group.css('h4 span::text').getall()
                if len(degree_info) == 2:
                    education['degree'] = degree_info[0]
                    education['faculty'] = degree_info[1]
            except:
                print('No degrees acquired')

            # Date range
            try:
                date_range = group.css('span.date-range time::text').getall()
                if len(date_range) == 2:
                    education['start_date'] = date_range[0]
                    education['end_date'] = date_range[1]
            except:
                print('No degree dates')

            # Description
            try:
                education['description'] = group.css('div.show-more-less-text p::text').get().strip()
            except:
                education['description'] = ''

            item['education'].append(education)

        # Skills section
        item['skills'] = []
        skills = {}

        try:
            skills = response.css('div.core-section-container__content li.skills__item')
            skills['start_up'] = skills[1].css('li.skills__item a::text').get().strip()
            skills['strategy'] = skills[3].css('li.skills__item a::text').get().strip()
            skills['venture capital'] = skills[5].css('li.skills__item a::text').get().strip()
            skills['Saas'] = skills[7].css('li::text').get().strip()
        except:
            pass

        item['skills'].append(skills)

        yield item



When I run scrapy crawl profilespider -o profiles.json at my command prompt, the profiles.json file comes back empty. Do you know what I am missing?

Here is the log from my console:

Python
(venv) C:\Users\LP\Documents\python\ProfileTest\profilescraper>scrapy crawl profilespider -o profiles.json
2023-01-24 15:35:58 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: profilescraper)
2023-01-24 15:35:58 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 39.0.0, Platform Windows-10-10.0.19044-SP0
2023-01-24 15:35:58 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'profilescraper',
'NEWSPIDER_MODULE': 'profilescraper.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['profilescraper.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-01-24 15:35:58 [asyncio] DEBUG: Using selector: SelectSelector
2023-01-24 15:35:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-01-24 15:35:58 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-01-24 15:35:58 [scrapy.extensions.telnet] INFO: Telnet Password: d126f5d312c5e917
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-01-24 15:35:59 [scrapy.core.engine] INFO: Spider opened
2023-01-24 15:35:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-01-24 15:35:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-01-24 15:36:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=9c79a52d-f08d-4c45-b8d2-f51ec9a4e7a4&url=https%3A%2F%2Fwww.linkedin.com%2Fpub%2Fdir%3FfirstName%3Dreid%26lastName%3Dhoffman%26trk%3Dpublic_jobs_people-search-bar_search-submit> (referer: None)
2023-01-24 15:36:07 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-24 15:36:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 405,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 323474,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 8.263713,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 1, 24, 14, 36, 7, 939751),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2023, 1, 24, 14, 35, 59, 676038)}
2023-01-24 15:36:07 [scrapy.core.engine] INFO: Spider closed (finished)

1 solution

The issue might be with the way you are trying to extract data from the response. In the skills section, you create skills as a dict and then immediately reassign it to the SelectorList returned by response.css(). The lines that follow, such as skills['start_up'] = ..., try to assign to that list with a string key, which raises a TypeError that your bare except silently swallows; the hard-coded positions skills[1], skills[3], and so on also assume the skill items always appear at fixed indexes. You should instead loop over the elements and use the css() method to extract the text from each one.
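
For example, a minimal sketch that collects every skill with a loop instead of hard-coded positions (assuming your li.skills__item selector matches the page the proxy returns):

Python
# Extract the text of each skill item rather than indexing fixed positions
item['skills'] = [
    skill.css('a::text').get(default='').strip()
    for skill in response.css('div.core-section-container__content li.skills__item')
]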

You should also check the structure of the page you are trying to scrape to make sure your CSS selectors actually match it.
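
For example, you can test each selector interactively in scrapy shell; run it from inside your project directory so the proxy middleware from your settings is applied (the URL below is the profile from your start_urls):

Python
scrapy shell "https://www.linkedin.com/in/reidhoffman"
>>> response.css("div.top-card-layout__entity-info-container h1::text").get()
>>> response.css("li.experience-item")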

Additionally, check that the proxy is working correctly and returning valid responses, for example by inspecting the response status codes and bodies, or by verifying that the proxy is not being blocked.
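
For example, a quick (hypothetical) debugging addition at the top of parse_item that logs what actually comes back through the proxy:

Python
def parse_item(self, response):
    # Confirm the proxy returned real profile HTML, not a block or captcha page
    self.logger.info("Got status %s for %s", response.status, response.url)
    self.logger.debug("Body starts with: %s", response.text[:300])
    # ... rest of your parsing code ...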
 
Comments
Asuzor Miracle 30-Jan-23 4:51am    
Thanks for the kind response.

However, I have already tried all the CSS selectors used in the code block, and they all returned the desired results in the console.

Also, the proxy works perfectly: when I run `response.status` in scrapy shell, I get a `200 response`. I had made a mistake in start_urls; I corrected it, but the full script still didn't return any results.

Unfortunately, despite everything I've tried, I am still getting an empty file when I run this program. I would appreciate more suggestions from anybody. Thanks in anticipation.
