Overview
HTML is almost intuitive. CSS is a great advancement that cleanly separates the structure of a page from its look and feel. JavaScript adds some pizazz. That’s the theory. The real world is a little different.
In this tutorial, you’ll learn how the content you see in the browser actually gets rendered and how to go about scraping it when necessary. In particular, you’ll learn how to count Disqus comments. Our tools will be Python and awesome packages like requests, BeautifulSoup, and Selenium.
When Should You Use Web Scraping?
Web scraping is the practice of automatically fetching the content of web pages designed for interaction with human users, parsing them, and extracting some information (possibly navigating links to other pages). It is sometimes necessary when there is no other way to extract the information you need. Ideally, the application provides a dedicated API for accessing its data programmatically. There are several reasons why web scraping should be your last resort:
- It is fragile (the web pages you’re scraping might change frequently).
- It might be forbidden (some web apps have policies against scraping).
- It might be slow and expensive (if you need to fetch and wade through a lot of noise).
Understanding Real-World Web Pages
Let’s understand what we are up against by looking at the output of some common web application code. In the article Introduction to Vagrant, there are some Disqus comments at the bottom of the page.
In order to scrape these comments, we need to find them on the page first.
View Page Source
Every browser since the dawn of time (the 1990s) has supported the ability to view the HTML of the current page. If you view the source of Introduction to Vagrant, it starts with a huge chunk of minified and uglified JavaScript that is unrelated to the article itself.
Further down, there is some actual HTML from the page.
This looks pretty messy, but what is surprising is that you will not find the Disqus comments in the source of the page.
The Mighty Inline Frame
It turns out that the page is a mashup, and the Disqus comments are embedded as an iframe (inline frame) element. You can verify this by right-clicking on the comments area, where you’ll see frame information and source options.
That makes sense. Embedding third-party content as an iframe is one of the primary reasons to use iframes. Let’s find the iframe tag in the main page source, then. Foiled again! There is no iframe tag in the main page source.
JavaScript-Generated Markup
The reason for this omission is that view page source shows you the content that was fetched from the server. But the final DOM (document object model) that gets rendered by the browser may be very different. JavaScript kicks in and can manipulate the DOM at will. The iframe can’t be found, because it wasn’t there when the page was retrieved from the server.
Static Scraping vs. Dynamic Scraping
Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in “view page source”, and then you slice and dice it. If the content you’re looking for is available, you need to go no further. However, if the content is something like the Disqus comments iframe, you need dynamic scraping.
Dynamic scraping uses an actual browser (or a headless browser) and lets JavaScript do its thing. Then, it queries the DOM to extract the content it’s looking for. Sometimes you need to automate the browser by simulating a user to get the content you need.
Static Scraping With Requests and BeautifulSoup
Let’s see how static scraping works using two awesome Python packages: requests for fetching web pages and BeautifulSoup for parsing HTML pages.
Installing Requests and BeautifulSoup
Install pipenv first, and then:

pipenv install requests beautifulsoup4
This will create a virtual environment for you too. If you’re using the code from GitLab, you can just run pipenv install.
Fetching Pages
Fetching a page with requests is a one-liner:

r = requests.get(url)
The response object has a lot of attributes. The most important ones are ok and content. If the request fails, then r.ok will be False and r.content will contain the error. The content is a stream of bytes. It is usually better to decode it to UTF-8 when dealing with text:
>>> r = requests.get('http://www.c2.com/no-such-page')
>>> r.ok
False
>>> print(r.content.decode('utf-8'))
404 Not Found
Not Found
The requested URL /ggg was not found on this server.
Apache/2.0.52 (CentOS) Server at www.c2.com Port 80
If everything is OK, then r.content will contain the requested web page (the same as view page source).
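For example, here is a quick check that fetches a live page (python.org is just a stand-in for whatever URL you care about) and prints the beginning of its source:

import requests

r = requests.get('https://www.python.org')
if r.ok:
    html = r.content.decode('utf-8')
    print(html[:100])  # the first 100 characters of the page source
else:
    print('Request failed with status', r.status_code)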
Finding Elements With BeautifulSoup
The get_page() function below fetches a web page by URL, decodes it to UTF-8, and parses it into a BeautifulSoup object using the HTML parser.
import requests
from bs4 import BeautifulSoup

def get_page(url):
    r = requests.get(url)
    content = r.content.decode('utf-8')
    return BeautifulSoup(content, 'html.parser')
Once we have a BeautifulSoup object, we can start extracting information from the page. BeautifulSoup provides many find functions to locate elements inside the page and drill down into deeply nested elements.
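As a quick illustration with a made-up HTML fragment (the class names and links here are invented for the example), find() returns the first matching element and find_all() returns all of them:

from bs4 import BeautifulSoup

html = '''
<div class="post"><h2><a href="/articles/first">First article</a></h2></div>
<div class="post"><h2><a href="/articles/second">Second article</a></h2></div>
'''

soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div', class_='post')  # first match only
print(first.a.attrs['href'])             # /articles/first

for div in soup.find_all('div', class_='post'):  # every match
    print(div.h2.a.text)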
Tuts+ author pages contain multiple tutorials. Here is my author page. On each page, there are up to 12 tutorials. If you have more than 12 tutorials, then you can navigate to the next page. The HTML for each article is enclosed in an article tag. The following function finds the article elements on the page and extracts the link from each one:
def get_page_articles(page):
    elements = page.findAll('article')
    articles = [e.a.attrs['href'] for e in elements]
    return articles
The following code gets all the articles from my page and prints them (without the common prefix):
page = get_page('https://tutsplus.com/authors/gigi-sayfan')
articles = get_page_articles(page)
prefix = 'https://code.tutsplus.com/tutorials'
for a in articles:
    print(a[len(prefix):])

Output:

building-games-with-python-3-and-pygame-part-5--cms-30085
building-games-with-python-3-and-pygame-part-4--cms-30084
building-games-with-python-3-and-pygame-part-3--cms-30083
building-games-with-python-3-and-pygame-part-2--cms-30082
building-games-with-python-3-and-pygame-part-1--cms-30081
mastering-the-react-lifecycle-methods--cms-29849
testing-data-intensive-code-with-go-part-5--cms-29852
testing-data-intensive-code-with-go-part-4--cms-29851
testing-data-intensive-code-with-go-part-3--cms-29850
testing-data-intensive-code-with-go-part-2--cms-29848
testing-data-intensive-code-with-go-part-1--cms-29847
make-your-go-programs-lightning-fast-with-profiling--cms-29809
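Earlier I mentioned that an author with more than 12 tutorials gets additional pages. Here is a small sketch of how you could page through the listing; the ?page=N query parameter is an assumption, so adapt it to the pagination links the site actually uses:

# Sketch: collect articles across pages, assuming a ?page=N parameter.
def get_all_articles(author_url, max_pages=10):
    articles = []
    for page_number in range(1, max_pages + 1):
        page = get_page('{}?page={}'.format(author_url, page_number))
        page_articles = get_page_articles(page)
        if not page_articles:
            break  # no articles on this page, so stop
        articles += page_articles
    return articles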
Dynamic Scraping With Selenium
Static scraping was good enough to get the list of articles, but as we saw earlier, the Disqus comments are embedded as an iframe element by JavaScript. In order to harvest the comments, we will need to automate the browser and query the rendered DOM. One of the best tools for the job is Selenium.
Selenium is primarily geared towards automated testing of web applications, but it is great as a general-purpose browser automation tool.
Installing Selenium
Type this command to install Selenium:

pipenv install selenium
Choose Your Web Driver
Selenium needs a web driver (the browser it automates). For web scraping, it usually doesn’t matter which driver you choose. I prefer the Chrome driver. Follow the instructions in this Selenium guide.
Chrome vs. PhantomJS
In some cases you may prefer to use a headless browser, which means no UI is displayed. Theoretically, PhantomJS is just another web driver. But, in practice, people reported incompatibility issues where Selenium works properly with Chrome or Firefox and sometimes fails with PhantomJS. I prefer to remove this variable from the equation and use an actual browser web driver.
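If you do want to run Chrome without a visible window, here is a minimal sketch of headless mode; depending on your Selenium version, the keyword argument may be options or chrome_options:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without a visible UI
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on your PATH
driver.get('https://www.python.org')
print(driver.title)
driver.quit()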
Counting Disqus Comments
Let’s do some dynamic scraping and use Selenium to count Disqus comments on Tuts+ tutorials. Here are the necessary imports.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import (
    presence_of_element_located)
from selenium.webdriver.support.wait import WebDriverWait
The get_comment_count() function accepts a Selenium driver and a URL. It uses the get() method of the driver to fetch the URL. This is similar to requests.get(), but the difference is that the driver object manages a live representation of the DOM.
Then it gets the title of the tutorial and locates the Disqus iframe using its parent id, disqus_thread, and then the iframe itself:
def get_comment_count(driver, url):
    driver.get(url)
    class_name = 'content-banner__title'
    name = driver.find_element_by_class_name(class_name).text
    e = driver.find_element_by_id('disqus_thread')
    disqus_iframe = e.find_element_by_tag_name('iframe')
    iframe_url = disqus_iframe.get_attribute('src')
The next step is to fetch the contents of the iframe itself. Note that we wait for the comment-count element to be present because the comments are loaded dynamically and are not necessarily available yet.
    driver.get(iframe_url)
    wait = WebDriverWait(driver, 5)
    commentCountPresent = presence_of_element_located(
        (By.CLASS_NAME, 'comment-count'))
    wait.until(commentCountPresent)

    comment_count_span = driver.find_element_by_class_name(
        'comment-count')
    comment_count = int(comment_count_span.text.split()[0])
The last part is to return the last comment if it wasn’t made by me. The idea is to detect comments I haven’t responded to yet.
    last_comment = {}
    if comment_count > 0:
        # Find the last author element and read the username from the
        # link inside it.
        e = driver.find_elements_by_class_name('author')[-1]
        last_author_link = e.find_element_by_tag_name('a')
        last_author = last_author_link.get_attribute('data-username')
        if last_author != 'the_gigi':
            e = driver.find_elements_by_class_name('post-meta')
            meta = e[-1].find_element_by_tag_name('a')
            last_comment = dict(author=last_author,
                                title=meta.get_attribute('title'),
                                when=meta.text)
    return name, comment_count, last_comment
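Here is a minimal usage sketch, assuming the functions above are defined and the chromedriver executable is on your PATH; the tutorial URL is a placeholder to replace with a real one:

driver = webdriver.Chrome()  # assumes chromedriver is installed and on your PATH
try:
    url = 'https://code.tutsplus.com/tutorials/some-tutorial'  # placeholder URL
    name, comment_count, last_comment = get_comment_count(driver, url)
    print(name, comment_count)
    if last_comment:
        print('Last comment by {author} at {when}: {title}'.format(**last_comment))
finally:
    driver.quit()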
Conclusion
Web scraping is a useful practice when the information you need is accessible through a web application that doesn’t provide an appropriate API. It takes some non-trivial work to extract data from modern web applications, but mature and well-designed tools like requests, BeautifulSoup, and Selenium make it worthwhile.