In this quick blog post I’ll show a simple way to scrape some data from a website using Python.
SETUP
Before we start, you’ll need to install three Python libraries using pip. If you’re not familiar with how to do this, open a command prompt (Windows example) and type the following for each library:
pip install <library name>
Required Libraries
- requests
- lxml
- bs4
Once we’ve installed these we’re good to go!
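If you want to double check the installs before going any further, here is a minimal sketch that only uses Python's standard library. The helper name missing_libraries is my own and not part of any of the three packages.

```python
import importlib.util

# The three libraries listed above.
REQUIRED = ["requests", "lxml", "bs4"]

def missing_libraries(names):
    """Return the names that Python cannot find on this machine."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_libraries(REQUIRED)
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All libraries found.")
```

Anything listed as still to install just needs another pip install.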
EXAMPLE
For this example we’re going to use the Wikipedia page for Burnley F.C. Why? Because it’s the mighty Clarets 🙂 that’s why.
To start, take a look at the web page https://en.wikipedia.org/wiki/Burnley_F.C. Looking at the page, let’s say we want to extract all of the section titles, such as ‘History’.
Scroll down to that area of the page, left-click on the History title, then immediately right-click and select the ‘Inspect’ option from the menu. This opens the browser’s developer tools with the HTML for that title highlighted.
Now the aim of this task is to extract those headings.
First we’ll extract the HTML from the website as a whole using a GET request from the (excellent) requests library:
import requests
r = requests.get('https://en.wikipedia.org/wiki/Burnley_F.C.')
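The request above assumes everything goes well. As a slightly more defensive sketch, the version below adds a timeout and raises an error if the server returns a failure status. The function name fetch_html and the 10 second timeout are my own choices for illustration, not part of the original script.

```python
import requests

URL = "https://en.wikipedia.org/wiki/Burnley_F.C."

def fetch_html(url, timeout=10):
    """Return the page HTML as text, raising an error if the request fails."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Usage (needs an internet connection):
# html = fetch_html(URL)
# print(html[:200])
```

The usage lines are commented out so the sketch can be loaded without hitting the network.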
Now if you were to look at this data it would be a little overwhelming at this point, just a bowl of text soup. That’s where the perfectly named Beautiful Soup and lxml libraries come into play.
Using Beautiful Soup with the lxml parser, we’ll load all the elements into a variable, webitemsburnley, which makes the data much easier to read and search.
import bs4
webitemsburnley = bs4.BeautifulSoup(r.text, 'lxml')
Now if you look back above, you’ll remember that we’re after the section titles. Looking at the HTML tags in the Inspect panel, we can see the class is called mw-headline, so we can use the code below to loop through and print all the mw-headline text.
for items in webitemsburnley.select('.mw-headline'):
    print(items.text)
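If you’d like to try the selector without downloading anything, here is a small offline sketch. The SAMPLE_HTML snippet is made up by me, loosely modelled on the mw-headline structure described above, and the parse helper (which falls back to Python’s built-in parser if lxml isn’t installed) is my own addition rather than part of the original script.

```python
import bs4

# Made-up stand-in for the downloaded page, so this runs offline.
SAMPLE_HTML = """
<html><body>
  <h2><span class="mw-headline" id="History">History</span></h2>
  <p>Some text.</p>
  <h2><span class="mw-headline" id="Stadium">Stadium</span></h2>
</body></html>
"""

def parse(html):
    """Parse with lxml if it is installed, otherwise use the built-in parser."""
    try:
        return bs4.BeautifulSoup(html, "lxml")
    except bs4.FeatureNotFound:
        return bs4.BeautifulSoup(html, "html.parser")

soup = parse(SAMPLE_HTML)
# Collect the text of every element with the mw-headline class.
headings = [item.get_text(strip=True) for item in soup.select(".mw-headline")]
print(headings)  # ['History', 'Stadium']
```

Collecting the text into a list rather than printing each item makes it easier to reuse the headings later on.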
However, if you just wanted the first value in the list, as an example, you could use the below:
items = webitemsburnley.select('.mw-headline')
print(items[0].getText())
Using this you will get the value ‘History’, which is the first item in the list. Change the index value from 0 to 1 and so on to move through the values returned.
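One thing to watch with indexing is that items[0] fails with an IndexError if the selector matched nothing, for example if the page markup differs from what you expected. A minimal sketch of a safer lookup follows. The helper name heading_at and the tiny demo document are hypothetical, and the demo uses Python’s built-in html.parser purely to keep it offline.

```python
import bs4

def heading_at(soup, index, selector=".mw-headline"):
    """Return the text of the match at `index`, or None if there is no such match."""
    items = soup.select(selector)
    if 0 <= index < len(items):
        return items[index].get_text()
    return None

# Tiny made-up document, parsed with the built-in parser to stay offline.
demo = bs4.BeautifulSoup(
    '<span class="mw-headline">History</span>'
    '<span class="mw-headline">Stadium</span>',
    "html.parser",
)
print(heading_at(demo, 0))  # History
print(heading_at(demo, 1))  # Stadium
print(heading_at(demo, 5))  # None
```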
This script is available at the following GitHub location:
https://github.com/DataDoor/Code/blob/master/Python/Webscraping.py