Trying to scrape tables spread over multiple pages. At first the loop wouldn't stop; now it stops after two rounds and I get: InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
There are 4 pages in total to scrape in this round.

import pandas as pd
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

results = pd.DataFrame()
stats = 2018

while stats < 2023:
    goToNextStats = True
    desc = 1
    while goToNextStats == True:
        base_URL = 'https://basketball.realgm.com/nba/stats/{}/Averages/Qualified/points/All/desc/{}/Regular_Season'.format(stats, desc)

        response = requests.get(base_URL, headers)
        if response.status_code == 200:
            temp_df = pd.read_html(base_URL)[2]
            temp_df.columns = list(temp_df.iloc[0,:])

            if len(temp_df) == 0:
                goToNextStats = False
                stats += 1
                continue

            print('Aquiring Season: %s\tPage: %s' % (stats, desc))

            temp_df['Season'] = '%s-%s' % (stats-1, stats)

            results = results.append(temp_df, sort=False).reset_index(drop=True)

            desc += 1

results.to_csv('/avg.csv', index=False)



Quote:
InvalidIndexError Traceback (most recent call last)
<ipython-input-78-2c377d5de3a4> in <module>
34 temp_df['Season'] = '%s-%s' %(stats-1, stats)
35
---> 36 results = results.append(temp_df, sort=False).reset_index(drop=True)
37
38 desc+=1
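For reference, this error comes from pandas trying to align column labels that are not unique. A minimal sketch (with a made-up frame, not the real scraped table) of how promoting a header row can produce duplicate columns:

```python
import pandas as pd

# Hypothetical stand-in for a scraped page: the promoted header row
# repeats a label ("FGM"), so the column index is no longer unique.
temp_df = pd.DataFrame([[1, 2, 3]], columns=["Player", "FGM", "FGM"])

# append/concat must reindex columns to align the frames, and pandas
# refuses to reindex a non-unique index -- hence the InvalidIndexError.
print(temp_df.columns.is_unique)  # False
```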


What I have tried:

Checked whether the table number in the URL changes between pages, but that doesn't seem to be the issue.
Posted
Updated 16-Jun-22 8:34am

1 solution

Interesting, what happens if the URL responds with something besides 200 on the following line:

Python
if response.status_code == 200:


Does it loop forever, or fall through?
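If it did hang, the reason would be that nothing changes goToNextStats or desc on a non-200 response. A minimal sketch of the missing branch (the function name is illustrative):

```python
def handle_status(status_code: int) -> bool:
    """Return True to keep paging, False to stop.

    Sketch of the missing else-branch: the original inner loop has no
    exit path for a non-200 response, so the same URL would be
    requested forever.
    """
    if status_code == 200:
        return True   # parse the page and advance to the next one
    return False      # stop instead of silently retrying the same URL

print(handle_status(200))  # True
print(handle_status(404))  # False
```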
 
 
Comments
Rasselbande 16-Jun-22 14:57pm    
falls through
raddevus 16-Jun-22 15:31pm    
This line seems suspect: temp_df = pd.read_html(base_URL)[2]
You're indexing into the list of tables that comes back and taking the third one (index 2)??
What if that fails? What would the value of temp_df be when it reaches the line you indicate is failing: results.append(temp_df, ...)?
Rasselbande 17-Jun-22 4:07am    
I thought about finding a way to use something like "find_all", but that does not work with read_html, right? Since the number of the table changes after a while.
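read_html has no find_all, but its match= argument can stand in for it: only tables whose text matches the pattern are returned, so a shifting table index stops mattering. A minimal sketch with made-up HTML:

```python
from io import StringIO
import pandas as pd

html = """
<table><tr><th>Nav</th></tr><tr><td>menu</td></tr></table>
<table><tr><th>Player</th><th>PPG</th></tr>
<tr><td>Someone</td><td>30.1</td></tr></table>
"""

# match= keeps only tables containing text that matches the regex,
# so the stats table is found even if its position on the page moves.
tables = pd.read_html(StringIO(html), match="Player")
print(tables[0].columns.tolist())  # ['Player', 'PPG']
```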

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
