My objective with this code is to scrape data for some NFL players and then compare them, but I ran into an error yesterday that I can't resolve by myself.

My code:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import matplotlib as mpl
import numpy as np
import matplotlib.pyplot as plt


# seasons to extract data for

years_list = [2001, 2002, 2008, 2012, 2015, 2018, 2020, 2021]
player_list = ['Mac Jones', 'Aaron Rodgers', 'Deshaun Watson', 'Patrick Mahomes',
               'Josh Allen', 'Ryan Tannehill', 'Drew Brees', 'Russell Wilson',
               'Kirk Cousins', 'Tom Brady', 'Derek Carr']

#selecting stats
cols = ['Player', 'Tm','Cmp%', 'Yds', 'TD', 'Int', 'Y/A', 'Rate', 'G']
df_list = []

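# loop over the seasons and extract the passing table for each
for year in years_list:
    url_mac = f'https://www.pro-football-reference.com/years/{year}/passing.htm'
    # .copy() avoids a SettingWithCopyWarning when the Season column is added
    temp_df = pd.read_html(url_mac)[0][cols].copy()
    temp_df['Season'] = year

    df_list.append(temp_df)
    print(f'Collected: {year}')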


data_radar = pd.concat(df_list)

#renaming columns
new_columns = data_radar.columns.values
new_columns[-6] = 'y_sack'
data_radar.columns = new_columns

# pick the top 10 QBs by rating in the last season
# (the site appends '*' and/or '+' to some player names)
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
pieces = []
for player in player_list:
    for suffix in ('*', '*+', '', '+'):
        pieces.append(data_radar[data_radar['Player'] == player + suffix])
mid_data = pd.concat(pieces)


cols = ['Cmp%', 'Yds', 'Int', 'Y/A','Rate', 'G', 'Season']
final_data = pd.DataFrame()


# select the player information columns and sort by season

final_data = mid_data[['Player', 'Tm'] + cols]
final_data.sort_values(by = 'Season', ascending=False)
final_data.drop_duplicates(subset = 'Player')


radar_data = final_data.replace({
    'Tom Brady*': 'Tom Brady',
    'Aaron Rodgers*': 'Aaron Rodgers',
    'Aaron Rodgers*+': 'Aaron Rodgers',
    'Deshaun Watson*': 'Deshaun Watson',
    'Josh Allen*': 'Josh Allen',
    'Derek Carr*': 'Derek Carr',
    'Patrick Mahomes*': 'Patrick Mahomes',
    'Patrick Mahomes*+': 'Patrick Mahomes',
})

final_data



My problem is that sort_values does not order the rows correctly. For example, the Aaron Rodgers rows should be ordered 2008, 2012, 2015, 2018, 2020, 2021, but they come out as 2012, 2015, 2018, 2020, 2018.


I don't know why this happens.
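
One likely cause, sketched here as an illustration rather than a confirmed diagnosis: sort_values and drop_duplicates return a new DataFrame by default and leave the original untouched, so calling them without assigning the result has no visible effect. The code above then shows final_data in the order the rows were collected (grouped by name suffix), not by season. The toy frame below uses made-up rows, not real statistics.

```python
import pandas as pd

# Toy frame with made-up rows (not real statistics), deliberately unsorted
toy = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'B'],
    'Season': [2012, 2018, 2008, 2015],
})

# The result is discarded here, so toy keeps its original row order
toy.sort_values(by='Season')
unchanged_order = toy['Season'].tolist()

# Assigning the result keeps the sorted frame
sorted_toy = toy.sort_values(by='Season').reset_index(drop=True)
sorted_order = sorted_toy['Season'].tolist()

print(unchanged_order)
print(sorted_order)
```

The same applies to drop_duplicates: either assign its return value or chain it onto the sorted frame.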

What I have tried:

I tried other methods, such as groupby followed by drop_duplicates(), but they did not fit my purposes.
I also tried saving the data as xlsx to check whether my replace call was causing the problem, and that did not work either.

What I need is just the first season of each player in which he played more than 10 games; for players in 2021, the number of games played does not matter. My initial idea was to sort with sort_values in ascending order, then delete the duplicates and keep the first.
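
The idea described above could be sketched roughly as follows. The function name first_qualifying_season and its parameters min_games and current_season are hypothetical, and the toy rows are made up. The sketch also assumes that the G column may arrive as text from the scraped table, so it is converted to numbers first.

```python
import pandas as pd

def first_qualifying_season(df, min_games=10, current_season=2021):
    """Return each player's earliest season with more than min_games games.

    Rows from current_season are kept regardless of games played.
    """
    out = df.copy()
    # Strip trailing '*' / '+' markers so each player has a single name
    out['Player'] = out['Player'].str.rstrip('*+')
    # Scraped values may be strings; non-numeric entries become NaN
    out['G'] = pd.to_numeric(out['G'], errors='coerce')
    keep = (out['G'] > min_games) | (out['Season'] == current_season)
    # Sort ascending and assign the result, then keep the first row per player
    out = out[keep].sort_values(by='Season', ascending=True)
    return out.drop_duplicates(subset='Player', keep='first').reset_index(drop=True)

# Toy input with made-up rows (not real statistics)
toy = pd.DataFrame({
    'Player': ['QB One*', 'QB One', 'QB One*+', 'QB Two'],
    'G': ['16', '7', '16', '3'],
    'Season': [2012, 2008, 2015, 2021],
})
result = first_qualifying_season(toy)
print(result)
```

Normalizing the names before dropping duplicates matters here: otherwise 'QB One*' and 'QB One*+' would count as different players and both survive.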
Posted
Updated 24-Sep-21 19:45pm

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)