I have a web scraping script that was originally created with this project: http://www.freelancer.com/projects/Perl-Python/College-Basketball-Box-Score-Scraper.html and then updated with this project: http://www.freelancer.com/projects/Python-Software-Architecture/Need-Python-Script-Scrape-box.html.
My needs for the script have changed, and I am now looking to switch the website that the box scores are scraped from as well as make a couple of upgrades. I'd like to start pulling the box scores from the Yahoo box scores at [login to view URL]. Since we are in the off-season, an example date with actual box scores can be found at [login to view URL].
I'm looking to get the box scores scraped like they were in the above projects with a couple of changes:
1) The name shows only the first initial, not the whole first name. I can deal with this, but would like to get the first initial and last name in separate columns.
2) For each player, there is a link on the player's name. In that link is a player ID. I'd like to capture that ID as well.
3) The first few columns (FG, 3PT, FT) are combined. I'd like to separate those 3 columns into 6 columns (FGM, FGA, 3PM, 3PA, FTM, FTA) by splitting on the "-".
4) Rebounds are listed a little differently in the Yahoo box score. Instead of having defensive rebounds, offensive rebounds and then total, there is just offensive and total. I'd like to have the quick math done (total minus offensive) to create the defensive rebounds number as well.
5) To help with my data import, I'd still like the columns to be in this order: Date,Team,YahooPlayerID,First Initial,Last Name,MIN,FG,FGA,3P,3PA,FT,FTA,OR,DR,TOT,A,PF,ST,TO,BL,PTS
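For reference, changes 1 through 5 could be sketched in Python roughly as follows. This is only an illustration under assumptions: the `raw` dictionary and its keys are hypothetical stand-ins for whatever the scraper pulls out of one player row, and the player ID is assumed to be the last run of digits in the name link, which would need to be checked against the real Yahoo links.

```python
import re

# Column order requested in point 5.
COLUMNS = ["Date", "Team", "YahooPlayerID", "First Initial", "Last Name",
           "MIN", "FG", "FGA", "3P", "3PA", "FT", "FTA",
           "OR", "DR", "TOT", "A", "PF", "ST", "TO", "BL", "PTS"]


def split_name(display_name):
    """Split 'J. Smith' into ('J', 'Smith'); the last name may contain spaces."""
    first, _, last = display_name.strip().partition(" ")
    return first.rstrip("."), last.strip()


def player_id_from_href(href):
    """Return the last run of digits in the link (assumed URL shape), or ''."""
    matches = re.findall(r"\d+", href)
    return matches[-1] if matches else ""


def split_made_attempted(value):
    """Turn a combined 'made-attempted' cell such as '5-11' into (5, 11)."""
    made, _, attempted = value.partition("-")
    return int(made), int(attempted)


def build_row(game_date, team, raw):
    """Build one CSV row in COLUMNS order from a hypothetical raw player dict."""
    initial, last = split_name(raw["name"])
    fgm, fga = split_made_attempted(raw["fg"])
    tpm, tpa = split_made_attempted(raw["3pt"])
    ftm, fta = split_made_attempted(raw["ft"])
    off, tot = int(raw["off"]), int(raw["tot"])
    return [game_date, team, player_id_from_href(raw["href"]), initial, last,
            int(raw["min"]), fgm, fga, tpm, tpa, ftm, fta,
            off, tot - off, tot,  # defensive rebounds = total - offensive
            int(raw["a"]), int(raw["pf"]), int(raw["st"]),
            int(raw["to"]), int(raw["bl"]), int(raw["pts"])]
```

Rows that are not players (team totals, or players who did not play and have blank cells) would need to be skipped or handled separately before calling `build_row`.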
As far as the language that the project needs to be done in, it was originally done in Python, but I'm happy to have it done in another language. I'd prefer the language to be .NET or something like it because I have familiarity with that, but I am open to suggestions.
I have provided the original source code so that it can be examined and adapted for this project, which should make the work a little quicker.
In the original code, there was also a section that grabbed injuries from the sports network site. I'd like to keep that in the new program. Ideally, I'd like to keep my original program's ability and add the ability to grab the box scores from Yahoo. For example, when the program starts, it asks "Enter a date MMDDYY (leave blank for [yesterday's date], I for injuries)". If I then enter a date or leave it blank and hit enter, it asks whether I'd like to use Yahoo or TSN (leave blank for Yahoo). I hit enter and the script goes and grabs the box scores.
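The startup prompts described above might be handled along these lines in Python. This is a sketch, not the original program's code: the function names are made up for illustration, and accepting "Y" and "T" as shortcuts for the two sources is an assumption.

```python
from datetime import date, datetime, timedelta


def parse_date_input(text, today=None):
    """Interpret the first prompt: 'I' for injuries, blank for yesterday, else MMDDYY."""
    today = today or date.today()
    text = text.strip()
    if text.upper() == "I":
        return "injuries", None
    if not text:
        return "scores", today - timedelta(days=1)
    # Raises ValueError if the text is not a valid MMDDYY date.
    return "scores", datetime.strptime(text, "%m%d%y").date()


def parse_source_input(text):
    """Interpret the second prompt: blank defaults to Yahoo."""
    text = text.strip().lower()
    if text in ("", "y", "yahoo"):
        return "yahoo"
    if text in ("t", "tsn"):
        return "tsn"
    raise ValueError("Unknown source: %r" % text)


def main():
    """Ask both questions and report the choice; the scraping itself is not shown."""
    yesterday = (date.today() - timedelta(days=1)).strftime("%m%d%y")
    mode, game_date = parse_date_input(
        input("Enter a date MMDDYY (leave blank for %s, I for injuries): " % yesterday))
    if mode == "injuries":
        print("Would grab injuries here.")
        return
    source = parse_source_input(input("Yahoo or TSN (leave blank for Yahoo): "))
    print("Would grab %s box scores for %s here." % (source, game_date))
```

Calling `main()` runs the two prompts. Keeping the parsing in separate functions means it can be checked without typing anything at the console.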
I'd also be interested in talking about possibly writing to the csv file as the scores are downloaded. In my current program, if there are 20 box scores for the day, it downloads all 20 box scores and then adds all the data to the csv at the end. If the program errors or breaks before the csv write, the whole program must be run again. Ideally, I'd like the program to somehow "virtually check off" the box scores that are completed and written to the csv, so that if the program breaks, it can be re-run and only needs to download the remaining scores that haven't been grabbed. Of course, this is up for discussion. If it is going to add much expense to the project, I will probably drop the idea, but I'd like to talk about it as an option.
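One possible way to do the "check off" idea, sketched in Python under assumptions: each box score has some unique game ID, `fetch_rows` is a hypothetical function supplied by the scraper that returns the CSV rows for one game, and the list of finished games is kept in a small text file next to the csv.

```python
import csv
import os


def load_done(done_path):
    """Return the set of game IDs already written to the csv."""
    if not os.path.exists(done_path):
        return set()
    with open(done_path) as f:
        return {line.strip() for line in f if line.strip()}


def save_game(csv_path, done_path, game_id, rows):
    """Append one game's rows to the csv, then check the game off."""
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerows(rows)
    with open(done_path, "a") as f:
        f.write(game_id + "\n")


def download_remaining(game_ids, fetch_rows, csv_path, done_path):
    """Download and save only the games not yet checked off."""
    done = load_done(done_path)
    for game_id in game_ids:
        if game_id in done:
            continue  # already in the csv from an earlier run
        rows = fetch_rows(game_id)  # may raise; earlier games stay saved
        save_game(csv_path, done_path, game_id, rows)
        done.add(game_id)
    return done
```

If the program breaks on the 12th game, the first 11 are already in the csv and a re-run skips them. One gap remains in this sketch: a crash between the csv write and the check-off would duplicate that one game on the next run.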
The final product of the project will be:
1) a working executable file that will grab a day's boxscores and deliver them in a .csv file.
2) all source code for the project