Wednesday, November 25, 2015

Bond. James "import pandas" Bond

It all started when...

    [friend] I'm trying to get this table on wikipedia from python...

[me] Sure. What module are you using?

    [friend] BeautifulSoup, but man, this is hard. It's this url...

[me] Wait, this is not a Coursera assignment you are asking me to do, is it?

    [friend] No, no. I saw this thing using a different programming language and I want to do it in Python.

[me] Ok, sounds reasonable.

The URL

The basic URL that documents James Bond movies on wikipedia is at: https://en.wikipedia.org/wiki/List_of_James_Bond_films but the URL he sent me was: https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363 and hence why it looked like a assignment.


Let me pause for a brief second on this subject. I'm a big fan of reproducible research, and selecting a specific revision of a document is an excellent idea. This page will never change, whereas any given normal URL on wikipedia changes all the time.

I'll have some of that BeautifulSoup

My friend mentioned he was trying to use BeautifulSoup but facing some challenges. BeautifulSoup and lxml are the usual suspects when it comes to doing web scraping (and using requests itself to pull the data in). But I have to admit, most of the time I don't use any of these. You see, I'm lazy, and typically these solutions require too much work. If you want to see what I'm talking about, you can check using-python-beautifulsoup-to-scrape-a-wikipedia-table

I don't like to type more code than I need to. At any rate, the goal was to get the web page, parse two tables and then load the data in a pandas data frame to do further analysis, plots etc.

Enter the Pandas

And it's not even the Kung Fu Panda, just good old Pandas, the data wrangling tool par excellence (https://pypi.python.org/pypi/pandas/0.17.1).

Everybody knows, I hope, that it has a superb support for loading excel and CSV files. It's why Python is the number 1 data wrangling programming language.

But what about loading tables from wikipedia web pages, surely there is nothing that can simplify this, is there? If you've attended all PYPTUG meetings, you already know the answer.

 import pandas as pd  
   
 wiki_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363", header=0)  

read_html returns a list of dataframes, with each table found on the web page being a dataframe. So to access the box office table on this page, we have to look at the second dataframe, the first being the warning table at the top of the page. Since it is 0 indexed we refer to it with wiki_df[1]. We don't want line 0 because that's sub headers, and we don't want the last two lines because one is a movie that's just been released and the numbers are not in yet, and the other one because it's a total column. How do we do this? Good old Python slices:

 df = wiki_df[1][1:24]  

And that's it, seriously. One line to ingest, one line to cleanup.

The result

Title Year Bond actor Director Box office Budget Salary of Bond actor Box office.1 Budget.1 Salary of Bond actor.1
1 Dr. No 1962 Connery, SeanSean Connery Young, TerenceTerence Young 59.5 1.1 0.1 448.8 7.0 0.6
2 From Russia with Love 1963 Connery, SeanSean Connery Young, TerenceTerence Young 78.9 2.0 0.3 543.8 12.6 1.6
3 Goldfinger 1964 Connery, SeanSean Connery Hamilton, GuyGuy Hamilton 124.9 3.0 0.5 820.4 18.6 3.2
4 Thunderball 1965 Connery, SeanSean Connery Young, TerenceTerence Young 141.2 6.8 0.8 848.1 41.9 4.7
5 You Only Live Twice 1967 Connery, SeanSean Connery Gilbert, LewisLewis Gilbert 101.0 10.3 0.8 + 25% net merch royalty 514.2 59.9 4.4 excluding profit participation
6 On Her Majesty's Secret Service 1969 Lazenby, GeorgeGeorge Lazenby Hunt, Peter R.Peter R. Hunt 64.6 7.0 0.1 291.5 37.3 0.6
7 Diamonds Are Forever 1971 Connery, SeanSean Connery Hamilton, GuyGuy Hamilton 116.0 7.2 1.2 + 12.5% of gross (14.5) 442.5 34.7 5.8 excluding profit participation
8 Live and Let Die 1973 Moore, RogerRoger Moore Hamilton, GuyGuy Hamilton 126.4 7.0 n/a 460.3 30.8 n/a
9 man with !The Man with the Golden Gun 1974 Moore, RogerRoger Moore Hamilton, GuyGuy Hamilton 98.5 7.0 n/a 334.0 27.7 n/a
10 spy who !The Spy Who Loved Me 1977 Moore, RogerRoger Moore Gilbert, LewisLewis Gilbert 185.4 14.0 n/a 533.0 45.1 n/a
11 Moonraker 1979 Moore, RogerRoger Moore Gilbert, LewisLewis Gilbert 210.3 34.0 n/a 535.0 91.5 n/a
12 For Your Eyes Only 1981 Moore, RogerRoger Moore Glen, JohnJohn Glen 194.9 28.0 n/a 449.4 60.2 n/a
13 Octopussy 1983 Moore, RogerRoger Moore Glen, JohnJohn Glen 183.7 27.5 4.0 373.8 53.9 7.8
14 view !A View to a Kill 1985 Moore, RogerRoger Moore Glen, JohnJohn Glen 152.4 30.0 5.0 275.2 54.5 9.1
15 living !The Living Daylights 1987 Dalton, TimothyTimothy Dalton Glen, JohnJohn Glen 191.2 40.0 3.0 313.5 68.8 5.2
16 Licence to Kill 1989 Dalton, TimothyTimothy Dalton Glen, JohnJohn Glen 156.2 36.0 5.0 250.9 56.7 7.9
17 GoldenEye 1995 Brosnan, PiercePierce Brosnan Campbell, MartinMartin Campbell 351.9 60.0 4.0 518.5 76.9 5.1
18 Tomorrow Never Dies 1997 Brosnan, PiercePierce Brosnan Spottiswoode, RogerRoger Spottiswoode 338.9 110.0 8.2 463.2 133.9 10.0
19 world !The World Is Not Enough 1999 Brosnan, PiercePierce Brosnan Apted, MichaelMichael Apted 361.8 135.0 12.4 439.5 158.3 13.5
20 Die Another Day 2002 Brosnan, PiercePierce Brosnan Tamahori, LeeLee Tamahori 431.9 142.0 16.5 465.4 154.2 17.9
21 Casino Royale 2006 Craig, DanielDaniel Craig Campbell, MartinMartin Campbell 594.2 150.0 3.4 581.5 145.3 3.3
22 Quantum of Solace 2008 Craig, DanielDaniel Craig Forster, MarcMarc Forster 576.0 200.0 8.9 514.2 181.4 8.1
23 Skyfall 2012 Craig, DanielDaniel Craig Mendes, SamSam Mendes 1108.6[20] 150.0[21][22]—200.0[20] 17.0[23] 879.8 158.1 13.5


Francois Dion
@f_dion

3 comments:

antiphasis said...

Just a hint, no need to count rows:
df = wiki_df[1][1:-1]

Tres Seaver said...

Or even just:

df = wiki_df[1][1:]

Francois Dion said...

I think you missed this part: "and we don't want the last two lines because one is a movie that's just been released and the numbers are not in yet, and the other one because it's a total column"

Hence why wiki_df[1][1:24]. Or wiki_df[1][1:-2]