It all started when...
[friend] I'm trying to get this table on wikipedia from python...
[me] Sure. What module are you using?
[friend] BeautifulSoup, but man, this is hard. It's this url...
[me] Wait, this is not a Coursera assignment you are asking me to do, is it?
[friend] No, no. I saw this thing using a different programming language and I want to do it in Python.
[me] Ok, sounds reasonable.
The URL
The basic URL that documents James Bond movies on wikipedia is at: https://en.wikipedia.org/wiki/List_of_James_Bond_films but the URL he sent me was: https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363 and hence why it looked like a assignment.
Let me pause for a brief second on this subject. I'm a big fan of reproducible research, and selecting a specific revision of a document is an excellent idea. This page will never change, whereas any given normal URL on wikipedia changes all the time.
I'll have some of that BeautifulSoup
My friend mentioned he was trying to use BeautifulSoup but facing some challenges. BeautifulSoup and lxml are the usual suspects when it comes to doing web scraping (and using requests itself to pull the data in). But I have to admit, most of the time I don't use any of these. You see, I'm lazy, and typically these solutions require too much work. If you want to see what I'm talking about, you can check using-python-beautifulsoup-to-scrape-a-wikipedia-tableI don't like to type more code than I need to. At any rate, the goal was to get the web page, parse two tables and then load the data in a pandas data frame to do further analysis, plots etc.
Enter the Pandas
And it's not even the Kung Fu Panda, just good old Pandas, the data wrangling tool par excellence (https://pypi.python.org/pypi/pandas/0.17.1).
Everybody knows, I hope, that it has a superb support for loading excel and CSV files. It's why Python is the number 1 data wrangling programming language.
But what about loading tables from wikipedia web pages, surely there is nothing that can simplify this, is there? If you've attended all PYPTUG meetings, you already know the answer.
read_html returns a list of dataframes, with each table found on the web page being a dataframe. So to access the box office table on this page, we have to look at the second dataframe, the first being the warning table at the top of the page. Since it is 0 indexed we refer to it with wiki_df[1]. We don't want line 0 because that's sub headers, and we don't want the last two lines because one is a movie that's just been released and the numbers are not in yet, and the other one because it's a total column. How do we do this? Good old Python slices:
And that's it, seriously. One line to ingest, one line to cleanup.
Everybody knows, I hope, that it has a superb support for loading excel and CSV files. It's why Python is the number 1 data wrangling programming language.
But what about loading tables from wikipedia web pages, surely there is nothing that can simplify this, is there? If you've attended all PYPTUG meetings, you already know the answer.
import pandas as pd
wiki_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363", header=0)
read_html returns a list of dataframes, with each table found on the web page being a dataframe. So to access the box office table on this page, we have to look at the second dataframe, the first being the warning table at the top of the page. Since it is 0 indexed we refer to it with wiki_df[1]. We don't want line 0 because that's sub headers, and we don't want the last two lines because one is a movie that's just been released and the numbers are not in yet, and the other one because it's a total column. How do we do this? Good old Python slices:
df = wiki_df[1][1:24]
And that's it, seriously. One line to ingest, one line to cleanup.
The result
Title | Year | Bond actor | Director | Box office | Budget | Salary of Bond actor | Box office.1 | Budget.1 | Salary of Bond actor.1 | |
---|---|---|---|---|---|---|---|---|---|---|
1 | Dr. No | 1962 | Connery, SeanSean Connery | Young, TerenceTerence Young | 59.5 | 1.1 | 0.1 | 448.8 | 7.0 | 0.6 |
2 | From Russia with Love | 1963 | Connery, SeanSean Connery | Young, TerenceTerence Young | 78.9 | 2.0 | 0.3 | 543.8 | 12.6 | 1.6 |
3 | Goldfinger | 1964 | Connery, SeanSean Connery | Hamilton, GuyGuy Hamilton | 124.9 | 3.0 | 0.5 | 820.4 | 18.6 | 3.2 |
4 | Thunderball | 1965 | Connery, SeanSean Connery | Young, TerenceTerence Young | 141.2 | 6.8 | 0.8 | 848.1 | 41.9 | 4.7 |
5 | You Only Live Twice | 1967 | Connery, SeanSean Connery | Gilbert, LewisLewis Gilbert | 101.0 | 10.3 | 0.8 + 25% net merch royalty | 514.2 | 59.9 | 4.4 excluding profit participation |
6 | On Her Majesty's Secret Service | 1969 | Lazenby, GeorgeGeorge Lazenby | Hunt, Peter R.Peter R. Hunt | 64.6 | 7.0 | 0.1 | 291.5 | 37.3 | 0.6 |
7 | Diamonds Are Forever | 1971 | Connery, SeanSean Connery | Hamilton, GuyGuy Hamilton | 116.0 | 7.2 | 1.2 + 12.5% of gross (14.5) | 442.5 | 34.7 | 5.8 excluding profit participation |
8 | Live and Let Die | 1973 | Moore, RogerRoger Moore | Hamilton, GuyGuy Hamilton | 126.4 | 7.0 | n/a | 460.3 | 30.8 | n/a |
9 | man with !The Man with the Golden Gun | 1974 | Moore, RogerRoger Moore | Hamilton, GuyGuy Hamilton | 98.5 | 7.0 | n/a | 334.0 | 27.7 | n/a |
10 | spy who !The Spy Who Loved Me | 1977 | Moore, RogerRoger Moore | Gilbert, LewisLewis Gilbert | 185.4 | 14.0 | n/a | 533.0 | 45.1 | n/a |
11 | Moonraker | 1979 | Moore, RogerRoger Moore | Gilbert, LewisLewis Gilbert | 210.3 | 34.0 | n/a | 535.0 | 91.5 | n/a |
12 | For Your Eyes Only | 1981 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 194.9 | 28.0 | n/a | 449.4 | 60.2 | n/a |
13 | Octopussy | 1983 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 183.7 | 27.5 | 4.0 | 373.8 | 53.9 | 7.8 |
14 | view !A View to a Kill | 1985 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 152.4 | 30.0 | 5.0 | 275.2 | 54.5 | 9.1 |
15 | living !The Living Daylights | 1987 | Dalton, TimothyTimothy Dalton | Glen, JohnJohn Glen | 191.2 | 40.0 | 3.0 | 313.5 | 68.8 | 5.2 |
16 | Licence to Kill | 1989 | Dalton, TimothyTimothy Dalton | Glen, JohnJohn Glen | 156.2 | 36.0 | 5.0 | 250.9 | 56.7 | 7.9 |
17 | GoldenEye | 1995 | Brosnan, PiercePierce Brosnan | Campbell, MartinMartin Campbell | 351.9 | 60.0 | 4.0 | 518.5 | 76.9 | 5.1 |
18 | Tomorrow Never Dies | 1997 | Brosnan, PiercePierce Brosnan | Spottiswoode, RogerRoger Spottiswoode | 338.9 | 110.0 | 8.2 | 463.2 | 133.9 | 10.0 |
19 | world !The World Is Not Enough | 1999 | Brosnan, PiercePierce Brosnan | Apted, MichaelMichael Apted | 361.8 | 135.0 | 12.4 | 439.5 | 158.3 | 13.5 |
20 | Die Another Day | 2002 | Brosnan, PiercePierce Brosnan | Tamahori, LeeLee Tamahori | 431.9 | 142.0 | 16.5 | 465.4 | 154.2 | 17.9 |
21 | Casino Royale | 2006 | Craig, DanielDaniel Craig | Campbell, MartinMartin Campbell | 594.2 | 150.0 | 3.4 | 581.5 | 145.3 | 3.3 |
22 | Quantum of Solace | 2008 | Craig, DanielDaniel Craig | Forster, MarcMarc Forster | 576.0 | 200.0 | 8.9 | 514.2 | 181.4 | 8.1 |
23 | Skyfall | 2012 | Craig, DanielDaniel Craig | Mendes, SamSam Mendes | 1108.6[20] | 150.0[21][22]—200.0[20] | 17.0[23] | 879.8 | 158.1 | 13.5 |
Francois Dion
@f_dion
3 comments:
Just a hint, no need to count rows:
df = wiki_df[1][1:-1]
Or even just:
df = wiki_df[1][1:]
I think you missed this part: "and we don't want the last two lines because one is a movie that's just been released and the numbers are not in yet, and the other one because it's a total column"
Hence why wiki_df[1][1:24]. Or wiki_df[1][1:-2]
Post a Comment