-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated the players.py file to parse correctly. #5
base: ncaaf
Are you sure you want to change the base?
Conversation
…six for compatibility)
…ls are out of date.
Not sure what the correct way to do it is, but the ncaaf branch really could use updated utils file. After the utils are updated, it may need naming refactor. I've not done all that before, so I'm not sure how it works. I only edited the players file for now because it's what I need to use, but I have some other team stuff I could edit and work on. I'd rather work on this with an updated utils that more reflects master's 0.7.15 version |
… of annual leaders.
I believe the correct way to do that would be to just merge the master branch into the ncaaf branch. A possible issue (although I don't see this being an issue in this case) is that this will get updated copies of all possible files, not only utils.py. If you really want to update only utils.py, then you can do the following (make sure you're on the ncaaf branch first): |
Even more explicit in identifying the h1 containing the player's name by using [itemprop="name"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left pretty granular feedback, so I'd just look at that. Your code quality is good imo, most of the comments are about codebase organization and/or correctness. Thanks so much again for contributing!
rawPos = re.search(r'Position : (\S+)', rawText, re.I).group(1) | ||
allPositions = rawPos.split('/') | ||
# TODO: returning just the first position for those with | ||
# multiple positions. Should return the last position played |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, why do you think I should return the last position played? Regardless, as you can see, this can become a bit of an issue. In most cases positions are clear-cut, but sometimes you run into issues with guys like Jabrill Peppers. I am currently leaning towards just returning a list regardless (e.g., return ['QB'] for Brady and ['DB', 'S', 'LB'] for Sanders), but curious to hear your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that players are typically referenced on most websites by one position. That position is usually the last position played. I'm thinking of Terrelle Pryor, but don't know a ton of these examples. Peppers is a pretty good example with all those postions. I'm fine with a list as I can just parse out the last position played from annual data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to a list.
@@ -139,6 +137,13 @@ def passing(self): | |||
doc = self.getDoc() | |||
table = doc('#passing') | |||
df = sportsref.utils.parseTable(table) | |||
if df.empty and table.length > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry you had to go through this, but this stuff is fixed in the new utils file, as you seem to have discovered. Want to merge in the current master branch to get the newest decorators, utils, etc and then replace this with the original code without this edge case? Same with the other functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll look into a merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged and this works without the if/then at this point.
@@ -147,8 +152,17 @@ def rushing_and_receiving(self): | |||
:returns: Pandas DataFrame with stats. | |||
""" | |||
doc = self.getDoc() | |||
table = doc('#rushing') | |||
table = doc('#all_rushing') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whose page are you using to look at the HTML? The pages I've looked at are all either 'table#rushing' or 'table#receiving', I haven't seen any all_rushing or all_receiving
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the reason I am using all_whatever is because just rushing or passing works fine for the top-most table, but I could not get the pyquery to find the tables with those ids on the lower portion of the player pages. For example, for Jabrill Peppers, the methods cannot see 'rushing' id. For some reason it gets lost after the enormous comments included on those lower tables. That's also why I included the:
if df.empty and table.length > 0:
-
tree = etree.fromstring(str(table))
-
comments = tree.xpath('//comment()')
-
comment = etree.tostring(comments[0])
-
contents = comment.replace("<!--", "").replace("-->", "")
-
table = pq(contents)
-
df = sportsref.utils.parseTable(table)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those lines parse the comment from the 'all_rushing' table which is the only way I could get at that data. It's pretty hacky, but you are welcome to try the other way. I got no results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new utils.get_html function takes care of this (just strips literally all comments). I realize that stripping all comments could be an issue, but for now it works. I actually think I may rewrite the relevant parts of the library using lxml instead of pyquery, because I've seen some benchmarks that it's a bit faster and I've been meaning to learn XPath. Anyways, for now, the new utils.get_html
handles this, so using table#rushing
should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this is removed now and seems to work well. Thanks!!
@@ -157,8 +171,66 @@ def defense(self): | |||
:returns: Pandas DataFrame with defensive stats. | |||
""" | |||
doc = self.getDoc() | |||
table = doc('#defense') | |||
table = doc('#all_defense') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, seems to me like it should be #defense but maybe I'm wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see rushing comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the table definitions are now of the type
table = doc('table#defense')
:returns: Pandas DataFrame with defensive stats. | ||
""" | ||
doc = self.getDoc() | ||
table = doc('#all_scoring') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see rushing comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above. now fixed
dfAll = dfAll.merge(dfScoring, 'outer', mergeList) | ||
return dfAll | ||
|
||
def get_college_leaders_one_year(year): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the sort of thing where I'd prefer to make a ncaaf/seasons.py file with a Season class (compare to to nba.Season in the nba subpackage). This would contain all info on/related to pages like this one. Thank you for making it though! Seems like it would be useful for a draft forecasting model :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yea, that sounds great. I didn't know where to put this as it is appropriate in a seasons class but didn't have a home right now since that class doesn't exist. I agree with you completely. At first I was going to use this method in my forecasting model, but now I'm not, so I don't mind if it's tossed out for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, feel free to just toss it here, and I can refactor it to a Season class when I get to that
:param year: the year of the roster | ||
:return: a dataframe with the player_id, class, and position | ||
""" | ||
url = sportsref.ncaaf.BASE_URL +\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using \ here is frowned upon in Python (that's at least my impression). I think using parentheses is preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
df = sportsref.utils.parseTable(table) | ||
return df | ||
|
||
def get_all_college_teams(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine here; I had a similar function in the nba.teams module, but then I realized the teams change each year as teams are added. If this is a concern for ncaaf, it may make sense to make a similar function as a method within the ncaaf.Season class (that's how I solved it for nba, and I believe nfl).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the way that the cfb is organized, this page contains all teams that have ever played and the year they started and also the year they stopped. I guess there's multiple ways to skin this cat. I'm also ok trying to do it with the year. I think I may have added such an approach in the nfl sportsref. Which way should i code this up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it depends on the use case for this function. I guess there's no harm in leaving this the way it is now; if I change my mind, I may move things around, but for now, this is fine with me.
""" Returns all the college teams from | ||
http://www.sports-reference.com/cfb | ||
|
||
:return: A dataframe with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with what?? the suspense is killing me 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha
@@ -61,14 +61,14 @@ def parseTable(table): | |||
return pd.DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, these changes should instead be added by merging master into ncaaf (or by git checkout master sportsref/utils.py
if necessary).
I had to move the ignore *,+, and other characters used to note things within the parse_table method in utils in order to get rid of * notations on player's years (it designates the player's bowl stats are in the stat line). |
# Conflicts: # sportsref/ncaaf/players.py
So I think with the merge, this is now looking better. Utils is up-to-date and I changed the tables. |
Note: this branch's utils are out of date.