Web scraping with Python 3

In this quick blog post I’ll show a simple way to scrape some data from a website using Python.

SETUP

Before we start, you’ll need to install three Python libraries using the usual pip method. If you’re not familiar with how to do this, simply open a command prompt (Windows example) and type the following for each library:

pip install <library name>

Required Libraries

  • requests
  • lxml
  • bs4

Once we’ve installed these we’re good to go!
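If you want to confirm the installs worked before going any further, a quick check along these lines can help. This is only a sketch: the helper name missing_libraries is made up for illustration, and it only checks that each package can be found, not which version you have.

```python
import importlib.util

# The three libraries this post relies on.
REQUIRED = ["requests", "lxml", "bs4"]


def missing_libraries(names):
    """Return the names that Python cannot find an installed package for."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_libraries(REQUIRED)
if missing:
    print("Still to install: pip install " + " ".join(missing))
else:
    print("All three libraries are installed.")
```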

 

EXAMPLE

For this example we’re going to use the Wikipedia page for Burnley F.C. Why? Because it’s the mighty Clarets 🙂, that’s why.

To start, take a look at the web page https://en.wikipedia.org/wiki/Burnley_F.C. Looking at the page, let’s say we want to extract all of the section titles, like the ones below.

[Screenshot: section titles on the Wikipedia page, such as ‘History’]

Move down to that area of the page, left click on the History title, then immediately right click and select the ‘Inspect’ option from the menu. You’ll get a screen similar to the one below.

[Screenshot: the browser’s Inspect panel showing the HTML for the ‘History’ title]

 

Now the aim of this task is to extract those headings.

First we’ll pull down the HTML for the whole page using a get request from the (excellent) requests library.

import requests
r = requests.get('https://en.wikipedia.org/wiki/Burnley_F.C.')
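A request can fail or hang, so it may be worth wrapping the call above. The following is a minimal sketch rather than part of the original script: the function name fetch_html and the 10 second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # The timeout stops the script waiting forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://en.wikipedia.org/wiki/Burnley_F.C.') should give you the page text, and any problem (bad URL, timeout, error status) surfaces as a requests.exceptions.RequestException that you can catch.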

 

Now if you were to look at this data it would be a little overwhelming at this point, just a bowl of text soup. That’s where the perfectly named Beautiful Soup and lxml libraries come into play.

Using Beautiful Soup with lxml as the parser, we’ll load all the elements into a variable, webitemsburnley, which makes the data far easier to read and search.

import bs4
webitemsburnley = bs4.BeautifulSoup(r.text, 'lxml')
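To see what Beautiful Soup gives you without hitting the network, here is a small sketch that parses a made-up HTML snippet in place of r.text. Two assumptions to flag: the sample_html content is invented for illustration, and the sketch uses Python’s built-in 'html.parser' so it runs even if lxml is missing (swapping in 'lxml' should behave the same for a snippet this simple).

```python
import bs4

# Invented stand-in for r.text, just enough structure to experiment with.
sample_html = """
<html>
  <head><title>Burnley F.C. - Wikipedia</title></head>
  <body>
    <h2><span class="mw-headline">History</span></h2>
    <p>Some text about the club.</p>
  </body>
</html>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Once parsed, elements can be reached by tag name or searched by class.
page_title = soup.title.string
first_heading = soup.find("span", class_="mw-headline").get_text()

print(page_title)
print(first_heading)
```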

 

Now if you look back above, you’ll remember that we’re after the section titles. Looking at the HTML tags in the Inspect panel, we can see the class is called mw-headline, so we can use the code below to loop through and print the text of every mw-headline element.

for items in webitemsburnley.select('.mw-headline'):
    print(items.text)
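If you would rather keep the headings than just print them, one option is to collect them into a list. This is a sketch on invented sample HTML (the three heading names are placeholders, not the real page contents) using the built-in 'html.parser'. Page markup can change over time, so it is worth checking the class name in the Inspect panel before relying on it.

```python
import bs4

# Invented snippet that mimics the heading markup described above.
sample_html = """
<h2><span class="mw-headline">History</span></h2>
<h2><span class="mw-headline">Stadium</span></h2>
<h2><span class="mw-headline">Supporters</span></h2>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector; ".mw-headline" matches on the class name.
headings = [item.get_text(strip=True) for item in soup.select(".mw-headline")]

for heading in headings:
    print(heading)
```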

 

However, if you wanted to take just the first value in the list, as an example, you could use the below.

items = webitemsburnley.select('.mw-headline')

print(items[0].getText())

 

Using this you will get the value ‘History’, the first value in the list. Change the index from ‘0’ to ‘1’ and so on to move through the values returned.
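One thing to watch with indexing is that asking for a position that doesn’t exist raises an IndexError. A small guard like the sketch below avoids that. The helper name heading_at and the sample HTML are made up for illustration.

```python
import bs4

# Invented snippet standing in for the parsed Wikipedia page.
sample_html = """
<h2><span class="mw-headline">History</span></h2>
<h2><span class="mw-headline">Stadium</span></h2>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")
items = soup.select(".mw-headline")


def heading_at(elements, index):
    """Return the heading text at index, or None if there is no such item."""
    if 0 <= index < len(elements):
        return elements[index].get_text()
    return None


print(heading_at(items, 0))
print(heading_at(items, 1))
print(heading_at(items, 5))
```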

 

This script is available at the following GitHub location:

https://github.com/DataDoor/Code/blob/master/Python/Webscraping.py
