The ultimate guide for Web Scraping with Selenium
As data scientists, one of the biggest challenges we face is acquiring real-time data from a myriad of sources.
This is something I struggled with a lot before I learned to do it right and fast. So take your time and build projects to consolidate what you learn from this article.
When using Python for web scraping, there are a few ways to go about it:
- Using the requests library, which offers very few easy ways to acquire data without running into problems when requesting a website’s HTML.
- Using Scrapy, which is great for acquiring large datasets, since the requests and Selenium libraries can take a long time to run. It is a bit more challenging to learn and harder to run in an online notebook (such as Colab), which can be a problem if you are not the only one responsible for getting the data.
- Using Selenium, which is what we are going to discuss in this article. Selenium is basically a robot that drives the browser and gets the data item by item, just like a person would. It can be very slow to run, but it’s very easy to learn and very fast to develop with. This makes Selenium great for acquiring data for specific analyses that are not going to feed a dashboard where data must be refreshed frequently. I personally like to use it when I’m researching a specific topic in order to make a presentation or write an article about it.
Before we start…
First you need to install the Selenium library and a driver, which is basically the bot that controls your browser. You can find instructions for that in Selenium’s official installation article. Go to section 1.5 to find instructions for downloading your driver.
It is also really useful to install the Pandas library and get used to it, since it is a very handy tool for creating and managing databases in Python.
Initializing Selenium
To initialize our program, we always write the same code:
#This is the base code for any time we wish to run a bot in Selenium
#(written for Selenium 4, where the driver path is passed through a Service object)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
#add the url of the website we wish to scrape
website='https://www.adamchoi.co.uk/overs/detailed'
#declare the location of your Chrome driver executable
path='/Users/sophi/OneDrive/Documentos/chromedriver.exe'
#start a driver object by calling the Chrome driver executable
driver=webdriver.Chrome(service=Service(executable_path=path))
#start a session in the website
driver.get(website)
#add code here
#end the session by calling the quit method
driver.quit()
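One thing that bit me early on: if the scraping code in the middle raises an error, the quit line never runs and the browser window stays open. Here is a minimal sketch of one way around that, assuming a small helper of my own called `run_session` (it is not part of Selenium) that receives a function which builds the driver and a function which does the scraping.

```python
def run_session(make_driver, url, scrape):
    """Open url, run scrape(driver), and always close the browser afterwards."""
    driver = make_driver()
    try:
        driver.get(url)
        return scrape(driver)
    finally:
        # Runs even when scrape() raises, so no browser window is left behind
        driver.quit()


# Example call (needs Selenium and a Chrome driver installed):
# from selenium import webdriver
# page_title = run_session(
#     webdriver.Chrome,
#     'https://www.adamchoi.co.uk/overs/detailed',
#     lambda driver: driver.title,
# )
```

The helper only relies on the driver having `get` and `quit`, so it also works with a stand-in object when you want to try the logic without opening a browser.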
Getting data from website
Before you start scraping a website, it is important to learn how HTML files work. For that I recommend reading the W3Schools documentation. It is a fast and practical way to understand how a website’s structure is built.
After you understand HTML, you will learn how to scrape using paths through the HTML. This allows us to interact with the website like a human would.
Get your XPATH
Here we will use YouTube’s website as an example. Let’s try to get the name of a video in a channel. To do that, right-click the video’s name and then click “Inspect”. This will show where the name of our video is located in the HTML file, as shown below.
The next step is to find out what our XPATH looks like. To do that, right-click the part of the HTML file we want to get the information from. With that, you will get the following menu:
Now we just need to click “Copy XPATH”.
I particularly like to check whether this path will actually work. To do that, press “Ctrl + F” in the inspector and a search bar will appear. Paste your XPATH into it. If it finds the same element again, you are good to use it in your code.
Get data by XPATH
To get the data we want, in this case the title of the video, we send our driver to the XPATH we just found, which should look like this:
from selenium.webdriver.common.by import By
video_title=driver.find_element(By.XPATH, '//*[@id="video-title"]/yt-formatted-string').text
We add “.text” to the end so that we get the text inside this element.
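If you want to see what an XPATH like this one selects without opening a browser, here is a small sketch using Python’s built-in ElementTree. The markup is made up and heavily simplified (it is not YouTube’s real HTML), and ElementTree only understands a subset of XPATH and only relative paths, which is why the path starts with a dot.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup standing in for a channel page (not real YouTube HTML)
SAMPLE_PAGE = """
<div id="contents">
  <a id="video-title" href="/watch?v=abc">
    <yt-formatted-string>First video</yt-formatted-string>
  </a>
  <a id="video-title" href="/watch?v=def">
    <yt-formatted-string>Second video</yt-formatted-string>
  </a>
</div>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# ElementTree only accepts relative paths, hence the leading "."
TITLE_PATH = ".//*[@id='video-title']/yt-formatted-string"

first = root.find(TITLE_PATH)     # first match only
every = root.findall(TITLE_PATH)  # all matches, in document order

print(first.text)                 # First video
print([e.text for e in every])    # ['First video', 'Second video']
```

The same idea carries over to Selenium: asking for a single element gives you the first match, and `.text` gives you what is written inside it.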
It is important to note that there are other ways to look for an element, such as by a class or a specific id. The good thing about XPATHs is that they are great for hitting one specific target.
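To make those other strategies concrete, here is a rough sketch that tries several locators in order and returns the first text it finds. The helper `first_text` and the list `TITLE_LOCATORS` are my own names for illustration. The three strings are the values behind Selenium’s `By.ID`, `By.CLASS_NAME` and `By.XPATH` constants, written out so the snippet has no imports; in real code I would import `By` and use the constants.

```python
# String values behind Selenium's By.ID, By.CLASS_NAME and By.XPATH constants
BY_ID = "id"
BY_CLASS_NAME = "class name"
BY_XPATH = "xpath"


def first_text(driver, locators):
    """Try each (strategy, value) pair in order; return the first match's text."""
    for by, value in locators:
        found = driver.find_elements(by, value)  # empty list when nothing matches
        if found:
            return found[0].text
    return None


# Hypothetical locators for the same title, from most to least specific
TITLE_LOCATORS = [
    (BY_XPATH, '//*[@id="video-title"]/yt-formatted-string'),
    (BY_ID, "video-title"),
]

# Example call with a real Selenium driver:
# video_title = first_text(driver, TITLE_LOCATORS)
```

I use `find_elements` (plural) here because it returns an empty list when nothing matches, so no exception handling is needed to fall through to the next locator.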
Create a database
Notice that if we run this command in a loop, it can become a great database builder.
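As a sketch of what that loop could look like, the helper below (my own `collect_texts`, not a Selenium function) asks the driver for every element matching an XPATH and keeps the non-empty texts. The string `"xpath"` is the value behind Selenium’s `By.XPATH` constant, written out here so the snippet has no imports.

```python
XPATH = "xpath"  # the same string Selenium exposes as By.XPATH


def collect_texts(driver, xpath):
    """Return the text of every element matching xpath, skipping empty ones."""
    texts = []
    for element in driver.find_elements(XPATH, xpath):
        text = element.text.strip()
        if text:
            texts.append(text)
    return texts


# Example call with a real Selenium driver:
# titles = collect_texts(driver, '//*[@id="video-title"]/yt-formatted-string')
```

The resulting list has one entry per matching element, in the order the elements appear on the page.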
For any web scraping project it is really useful to create a CSV or XML file to store the data you are getting. You can easily do that by importing Pandas and using it to export to CSV, for example:
import pandas as pd
#adding the item we found to a list
arr=[]
arr.append(video_title)
#creating a pandas dataframe
df=pd.DataFrame(arr)
#transferring the dataframe to a new csv file
df.to_csv('VideoName.csv')
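One thing I would tweak once the basics work: by default the CSV gets a header called 0 and an extra column with the row index. Here is a small sketch, assuming a helper of my own called `titles_to_csv` and a column name, `title`, that I picked for illustration.

```python
import pandas as pd


def titles_to_csv(titles, path=None):
    """Write titles to a one-column CSV; return the CSV text when path is None."""
    df = pd.DataFrame({"title": titles})
    # index=False leaves out the automatic row-number column
    return df.to_csv(path, index=False)


csv_text = titles_to_csv(["First video", "Second video"])
print(csv_text)

# To write a file instead of getting the text back:
# titles_to_csv(["First video", "Second video"], "VideoName.csv")
```

A named column makes the file much easier to read back later with `pd.read_csv`.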
In the end the complete code will look like this:
#This is the base code for any time we wish to run a bot in Selenium
#(written for Selenium 4, where the driver path is passed through a Service object)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
#importing pandas
import pandas as pd
#add the url of the website we wish to scrape
#(the XPATH used below was copied from a YouTube channel page, so point this at the page you copied yours from)
website='https://www.adamchoi.co.uk/overs/detailed'
#declare the location of your Chrome driver executable
path='/Users/sophi/OneDrive/Documentos/chromedriver.exe'
#start a driver object by calling the Chrome driver executable
driver=webdriver.Chrome(service=Service(executable_path=path))
#start a session in the website
driver.get(website)
#get the text of the element found by our XPATH
video_title=driver.find_element(By.XPATH, '//*[@id="video-title"]/yt-formatted-string').text
#adding the item we found to a list
arr=[]
arr.append(video_title)
#creating a pandas dataframe
df=pd.DataFrame(arr)
#transferring the dataframe to a new csv file
df.to_csv('VideoName.csv')
#end the session by calling the quit method
driver.quit()
Interesting projects for practice and portfolio building
Certainly, the best way to practice web scraping techniques is to create projects of your own. I usually practice by building tools that are useful in my day-to-day life.
I’ll leave some ideas here:
- Scrape Google Classroom to get the latest assignments into a spreadsheet (great for any student)
- LinkedIn Job Hunter: get the latest jobs from an area into a spreadsheet
- Create a book list by scraping Amazon’s website
- Scrape YouTube to get the most recent videos from your niche
And don’t forget, the key to perfecting any technique is to persist until you get it right! If you have any problems or questions, let me know!