Basic Web Scraping with Python bs4 and urllib

Web scraping, also called web harvesting or web data extraction, is the automated extraction of data from websites. Specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.



One of the most popular tools for this is the Beautiful Soup module (bs4) in Python.
The following is a simple way to parse HTML pages with Python's built-in html.parser. The same code can use the lxml or html5lib parser instead, provided the corresponding package is installed.
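To illustrate switching parsers, here is a minimal sketch that tries lxml, then html5lib, and finally falls back to the built-in html.parser. The helper name make_soup and the sample markup are assumptions made for this example; Beautiful Soup raises FeatureNotFound when the requested parser is not installed, and the sketch relies on that.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, used only for this illustration
SAMPLE_HTML = "<html><body><p>Hello, <b>world</b></p></body></html>"


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup(SAMPLE_HTML)
print("Parsed with:", parser_used)
print(soup.b.get_text())
```

The optional parsers are installed separately (pip install lxml or pip install html5lib); html.parser ships with Python, so the fallback should always be available.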

Installation

pip install beautifulsoup4

Usage

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
# (this disables certificate verification, so use it only for testing)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Download the page and parse it with Python's built-in HTML parser
url = input('Enter the url- ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')


Once the page is parsed, the next step is to extract the required information from the soup object.

For example:
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))

tags is a list-like result holding every <a> element found on the page, and it can be looped through. In this case, the loop prints the href of each link on the page (or None for an <a> tag that has no href attribute).
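Many of the hrefs collected this way are relative paths. A possible refinement, sketched below, is to skip anchors without an href and resolve the rest against the page URL with urllib.parse.urljoin. The function name absolute_links, the base URL, and the sample HTML are assumptions made for this example, and it parses an inline string so that it runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, used only for this illustration
BASE_URL = "https://example.com/blog/"
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="post-1.html">Post 1</a>
<a name="top">No href here</a>
<a href="https://other.example.org/">External</a>
"""


def absolute_links(html, base_url):
    """Return every link in html as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only the <a> tags that actually have an href attribute
    for tag in soup.find_all("a", href=True):
        links.append(urljoin(base_url, tag["href"]))
    return links


links = absolute_links(SAMPLE_HTML, BASE_URL)
for link in links:
    print(link)
```

With a live page, the html argument would be the bytes returned by urlopen(url).read() and base_url would be the URL that was fetched.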

The final part of the code can be modified to extract whatever the user is interested in.
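As one illustration of such a modification, the sketch below pulls the page title, the headings, and a paragraph selected by its CSS class instead of the links. The sample markup and the class name intro are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Main heading</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the <title> element
page_title = soup.title.string

# Text of every <h1> and <h2> element, with surrounding whitespace removed
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]

# First <p> whose class attribute is "intro"
intro = soup.find("p", class_="intro").get_text()

print(page_title)
print(headings)
print(intro)
```

Note that find returns None when nothing matches, so on real pages it is safer to check the result before calling get_text on it.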



This article was contributed by Joel Joshua. If you would like to contribute to Hackzism, you can write an article and mail it to hackzism.hack@gmail.com
Your article will be published on our home page.
