Raspberry Pi Python Adventures: Bond. James "import pandas" Bond

It all started when...

[friend] I'm trying to get this table on wikipedia from python...

[me] Sure. What module are you using?

[friend] BeautifulSoup, but man, this is hard. It's this url...

[me] Wait, this is not a Coursera assignment you are asking me to do, is it?

[friend] No, no. I saw this thing using a different programming language and I want to do it in Python.

[me] Ok, sounds reasonable.

The URL

The basic URL that documents James Bond movies on wikipedia is at: https://en.wikipedia.org/wiki/List_of_James_Bond_films but the URL he sent me was: https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363 and hence why it looked like a assignment.

Let me pause for a brief second on this subject. I'm a big fan of reproducible research, and selecting a specific revision of a document is an excellent idea. This page will never change, whereas any given normal URL on wikipedia changes all the time.

I'll have some of that BeautifulSoup

My friend mentioned he was trying to use BeautifulSoup but facing some challenges. BeautifulSoup and lxml are the usual suspects when it comes to doing web scraping (and using requests itself to pull the data in). But I have to admit, most of the time I don't use any of these. You see, I'm lazy, and typically these solutions require too much work. If you want to see what I'm talking about, you can check using-python-beautifulsoup-to-scrape-a-wikipedia-table

I don't like to type more code than I need to. At any rate, the goal was to get the web page, parse two tables and then load the data in a pandas data frame to do further analysis, plots etc.

Enter the Pandas

And it's not even the Kung Fu Panda, just good old Pandas, the data wrangling tool par excellence (https://pypi.python.org/pypi/pandas/0.17.1).

Everybody knows, I hope, that it has a superb support for loading excel and CSV files. It's why Python is the number 1 data wrangling programming language.

But what about loading tables from wikipedia web pages, surely there is nothing that can simplify this, is there? If you've attended all PYPTUG meetings, you already know the answer.

 import pandas as pd  
   
 wiki_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363", header=0)

read_html returns a list of dataframes, with each table found on the web page being a dataframe. So to access the box office table on this page, we have to look at the second dataframe, the first being the warning table at the top of the page. Since it is 0 indexed we refer to it with wiki_df[1]. We don't want line 0 because that's sub headers, and we don't want the last two lines because one is a movie that's just been released and the numbers are not in yet, and the other one because it's a total column. How do we do this? Good old Python slices:

 df = wiki_df[1][1:24]

And that's it, seriously. One line to ingest, one line to cleanup.

The result

	Title	Year	Bond actor	Director	Box office	Budget	Salary of Bond actor	Box office.1	Budget.1	Salary of Bond actor.1
1	Dr. No	1962	Connery, SeanSean Connery	Young, TerenceTerence Young	59.5	1.1	0.1	448.8	7.0	0.6
2	From Russia with Love	1963	Connery, SeanSean Connery	Young, TerenceTerence Young	78.9	2.0	0.3	543.8	12.6	1.6
3	Goldfinger	1964	Connery, SeanSean Connery	Hamilton, GuyGuy Hamilton	124.9	3.0	0.5	820.4	18.6	3.2
4	Thunderball	1965	Connery, SeanSean Connery	Young, TerenceTerence Young	141.2	6.8	0.8	848.1	41.9	4.7
5	You Only Live Twice	1967	Connery, SeanSean Connery	Gilbert, LewisLewis Gilbert	101.0	10.3	0.8 + 25% net merch royalty	514.2	59.9	4.4 excluding profit participation
6	On Her Majesty's Secret Service	1969	Lazenby, GeorgeGeorge Lazenby	Hunt, Peter R.Peter R. Hunt	64.6	7.0	0.1	291.5	37.3	0.6
7	Diamonds Are Forever	1971	Connery, SeanSean Connery	Hamilton, GuyGuy Hamilton	116.0	7.2	1.2 + 12.5% of gross (14.5)	442.5	34.7	5.8 excluding profit participation
8	Live and Let Die	1973	Moore, RogerRoger Moore	Hamilton, GuyGuy Hamilton	126.4	7.0	n/a	460.3	30.8	n/a
9	man with !The Man with the Golden Gun	1974	Moore, RogerRoger Moore	Hamilton, GuyGuy Hamilton	98.5	7.0	n/a	334.0	27.7	n/a
10	spy who !The Spy Who Loved Me	1977	Moore, RogerRoger Moore	Gilbert, LewisLewis Gilbert	185.4	14.0	n/a	533.0	45.1	n/a
11	Moonraker	1979	Moore, RogerRoger Moore	Gilbert, LewisLewis Gilbert	210.3	34.0	n/a	535.0	91.5	n/a
12	For Your Eyes Only	1981	Moore, RogerRoger Moore	Glen, JohnJohn Glen	194.9	28.0	n/a	449.4	60.2	n/a
13	Octopussy	1983	Moore, RogerRoger Moore	Glen, JohnJohn Glen	183.7	27.5	4.0	373.8	53.9	7.8
14	view !A View to a Kill	1985	Moore, RogerRoger Moore	Glen, JohnJohn Glen	152.4	30.0	5.0	275.2	54.5	9.1
15	living !The Living Daylights	1987	Dalton, TimothyTimothy Dalton	Glen, JohnJohn Glen	191.2	40.0	3.0	313.5	68.8	5.2
16	Licence to Kill	1989	Dalton, TimothyTimothy Dalton	Glen, JohnJohn Glen	156.2	36.0	5.0	250.9	56.7	7.9
17	GoldenEye	1995	Brosnan, PiercePierce Brosnan	Campbell, MartinMartin Campbell	351.9	60.0	4.0	518.5	76.9	5.1
18	Tomorrow Never Dies	1997	Brosnan, PiercePierce Brosnan	Spottiswoode, RogerRoger Spottiswoode	338.9	110.0	8.2	463.2	133.9	10.0
19	world !The World Is Not Enough	1999	Brosnan, PiercePierce Brosnan	Apted, MichaelMichael Apted	361.8	135.0	12.4	439.5	158.3	13.5
20	Die Another Day	2002	Brosnan, PiercePierce Brosnan	Tamahori, LeeLee Tamahori	431.9	142.0	16.5	465.4	154.2	17.9
21	Casino Royale	2006	Craig, DanielDaniel Craig	Campbell, MartinMartin Campbell	594.2	150.0	3.4	581.5	145.3	3.3
22	Quantum of Solace	2008	Craig, DanielDaniel Craig	Forster, MarcMarc Forster	576.0	200.0	8.9	514.2	181.4	8.1
23	Skyfall	2012	Craig, DanielDaniel Craig	Mendes, SamSam Mendes	1108.6[20]	150.0[21][22]—200.0[20]	17.0[23]	879.8	158.1	13.5

Francois Dion
@f_dion

3 comments:

antiphasis said...: Just a hint, no need to count rows:
df = wiki_df[1][1:-1]; November 26, 2015 at 3:02 AM
Tres Seaver said...: Or even just:

df = wiki_df[1][1:]; November 26, 2015 at 4:18 PM
Francois Dion said...: I think you missed this part: "and we don't want the last two lines because one is a movie that's just been released and the numbers are not in yet, and the other one because it's a total column"

Hence why wiki_df[1][1:24]. Or wiki_df[1][1:-2]; November 26, 2015 at 5:30 PM

Raspberry Pi Python Adventures

Pages

Wednesday, November 25, 2015

Bond. James "import pandas" Bond