Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated the players.py file to parse correctly. #5

Open
wants to merge 97 commits into
base: ncaaf
Choose a base branch
from

Conversation

matthewstirling
Copy link

Note: this branch's utils are out of date.

mdgoldberg added 30 commits July 4, 2016 18:23
@matthewstirling
Copy link
Author

Not sure what the correct way to do it is, but the ncaaf branch really could use updated utils file. After the utils are updated, it may need naming refactor. I've not done all that before, so I'm not sure how it works. I only edited the players file for now because it's what I need to use, but I have some other team stuff I could edit and work on. I'd rather work on this with an updated utils that more reflects master's 0.7.15 version

@mdgoldberg
Copy link
Owner

I believe the correct way to do that would be to just merge the master branch into the ncaaf branch. A possible issue (although I don't see this being an issue in this case) is that this will get updated copies of all possible files, not only utils.py. If you really want to update only utils.py, then you can do the following (make sure you're on the ncaaf branch first): git checkout master sportsref/utils.py. That should do the trick!

Even more explicit in identifying the h1 containing the player's name by using [itemprop="name"]
Copy link
Owner

@mdgoldberg mdgoldberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left pretty granular feedback, so I'd just look at that. Your code quality is good imo, most of the comments are about codebase organization and/or correctness. Thanks so much again for contributing!

rawPos = re.search(r'Position : (\S+)', rawText, re.I).group(1)
allPositions = rawPos.split('/')
# TODO: returning just the first position for those with
# multiple positions. Should return the last position played
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, why do you think I should return the last position played? Regardless, as you can see, this can become a bit of an issue. In most cases positions are clear-cut, but sometimes you run into issues with guys like Jabrill Peppers. I am currently leaning towards just returning a list regardless (e.g., return ['QB'] for Brady and ['DB', 'S', 'LB'] for Sanders), but curious to hear your thoughts.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that players are typically referenced on most websites by one position. That position is usually the last position played. I'm thinking of Terrelle Pryor, but don't know a ton of these examples. Peppers is a pretty good example with all those postions. I'm fine with a list as I can just parse out the last position played from annual data.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to a list.

@@ -139,6 +137,13 @@ def passing(self):
doc = self.getDoc()
table = doc('#passing')
df = sportsref.utils.parseTable(table)
if df.empty and table.length > 0:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry you had to go through this, but this stuff is fixed in the new utils file, as you seem to have discovered. Want to merge in the current master branch to get the newest decorators, utils, etc and then replace this with the original code without this edge case? Same with the other functions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I'll look into a merge.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged and this works without the if/then at this point.

@@ -147,8 +152,17 @@ def rushing_and_receiving(self):
:returns: Pandas DataFrame with stats.
"""
doc = self.getDoc()
table = doc('#rushing')
table = doc('#all_rushing')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whose page are you using to look at the HTML? The pages I've looked at are all either 'table#rushing' or 'table#receiving', I haven't seen any all_rushing or all_receiving

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason I am using all_whatever is because just rushing or passing works fine for the top-most table, but I could not get the pyquery to find the tables with those ids on the lower portion of the player pages. For example, for Jabrill Peppers, the methods cannot see 'rushing' id. For some reason it gets lost after the enormous comments included on those lower tables. That's also why I included the:

if df.empty and table.length > 0:

  •        tree = etree.fromstring(str(table))
    
  •        comments = tree.xpath('//comment()')
    
  •        comment = etree.tostring(comments[0])
    
  •        contents = comment.replace("<!--", "").replace("-->", "")
    
  •        table = pq(contents)
    
  •        df = sportsref.utils.parseTable(table)
    

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those lines parse the comment from the 'all_rushing' table which is the only way I could get at that data. It's pretty hacky, but you are welcome to try the other way. I got no results.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new utils.get_html function takes care of this (just strips literally all comments). I realize that stripping all comments could be an issue, but for now it works. I actually think I may rewrite the relevant parts of the library using lxml instead of pyquery, because I've seen some benchmarks that it's a bit faster and I've been meaning to learn XPath. Anyways, for now, the new utils.get_html handles this, so using table#rushing should be fine.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this is removed now and seems to work well. Thanks!!

@@ -157,8 +171,66 @@ def defense(self):
:returns: Pandas DataFrame with defensive stats.
"""
doc = self.getDoc()
table = doc('#defense')
table = doc('#all_defense')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, seems to me like it should be #defense but maybe I'm wrong

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see rushing comment above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the table definitions are now of the type
table = doc('table#defense')

:returns: Pandas DataFrame with defensive stats.
"""
doc = self.getDoc()
table = doc('#all_scoring')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see rushing comment above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above. now fixed

dfAll = dfAll.merge(dfScoring, 'outer', mergeList)
return dfAll

def get_college_leaders_one_year(year):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the sort of thing where I'd prefer to make a ncaaf/seasons.py file with a Season class (compare to to nba.Season in the nba subpackage). This would contain all info on/related to pages like this one. Thank you for making it though! Seems like it would be useful for a draft forecasting model :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yea, that sounds great. I didn't know where to put this as it is appropriate in a seasons class but didn't have a home right now since that class doesn't exist. I agree with you completely. At first I was going to use this method in my forecasting model, but now I'm not, so I don't mind if it's tossed out for now.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, feel free to just toss it here, and I can refactor it to a Season class when I get to that

:param year: the year of the roster
:return: a dataframe with the player_id, class, and position
"""
url = sportsref.ncaaf.BASE_URL +\
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using \ here is frowned upon in Python (that's at least my impression). I think using parentheses is preferred.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

df = sportsref.utils.parseTable(table)
return df

def get_all_college_teams():
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine here; I had a similar function in the nba.teams module, but then I realized the teams change each year as teams are added. If this is a concern for ncaaf, it may make sense to make a similar function as a method within the ncaaf.Season class (that's how I solved it for nba, and I believe nfl).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way that the cfb is organized, this page contains all teams that have ever played and the year they started and also the year they stopped. I guess there's multiple ways to skin this cat. I'm also ok trying to do it with the year. I think I may have added such an approach in the nfl sportsref. Which way should i code this up?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it depends on the use case for this function. I guess there's no harm in leaving this the way it is now; if I change my mind, I may move things around, but for now, this is fine with me.

""" Returns all the college teams from
http://www.sports-reference.com/cfb

:return: A dataframe with
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with what?? the suspense is killing me 😃

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha

@@ -61,14 +61,14 @@ def parseTable(table):
return pd.DataFrame()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, these changes should instead be added by merging master into ncaaf (or by git checkout master sportsref/utils.py if necessary).

@matthewstirling
Copy link
Author

I had to move the ignore *,+, and other characters used to note things within the parse_table method in utils in order to get rid of * notations on player's years (it designates the player's bowl stats are in the stat line).

@matthewstirling
Copy link
Author

So I think with the merge, this is now looking better. Utils is up-to-date and I changed the tables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants