Basic Web Scraping with Python bs4 and urllib

June 30, 2020

Web scraping, also called web harvesting or web data extraction, is the automated extraction of data from websites. Specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

One of the most popular tools for this is the Beautiful Soup module in Python.

The following is a simple way to parse HTML pages using Python's built-in html.parser. The same code can use the lxml or html5lib parsers instead, provided the corresponding modules are installed.
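As a rough sketch of how the parser choice could be made at runtime, the hypothetical helper below (make_soup is not part of Beautiful Soup) tries lxml, then html5lib, and falls back to the built-in html.parser. It assumes the optional parsers, if wanted, were installed separately with pip install lxml html5lib.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper written for this example.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            # Beautiful Soup raises FeatureNotFound when the requested
            # parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, parser = make_soup("<p>Hello</p>")
print(parser, soup.p.get_text())
```

html.parser ships with Python, so the fallback should always succeed. The parsers can build slightly different trees from malformed markup, so it is worth sticking to one parser within a project.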

Installation

pip install beautifulsoup4

Usage

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

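# Ignore SSL certificate errors (this disables certificate verification,
# so use it only for testing, not for sites you need to trust)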
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter the url- ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

This is followed by the code that extracts the required information from the parsed HTML.

For example:

tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))

tags is a list-like result holding every matching a element, and it can be looped through like a normal list. In this case, the loop prints the href of every link on the page.
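One thing to watch for with the loop above: tag.get('href') returns None for an a element that has no href attribute, and many hrefs are relative paths. A minimal sketch of one way to handle both is shown below. The function name extract_links and the example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> element that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only the <a> elements that carry an href attribute.
    for tag in soup.find_all("a", href=True):
        # urljoin resolves relative paths against the page's own URL.
        links.append(urljoin(base_url, tag["href"]))
    return links


sample = (
    '<a href="/about">About</a>'
    '<a name="top">No href here</a>'
    '<a href="https://example.org/">External</a>'
)
print(extract_links(sample, "https://example.com/blog/"))
# ['https://example.com/about', 'https://example.org/']
```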

The final part of the code can be modified to extract whatever information the user is interested in.
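As one possible variation, the sketch below pulls the page title and the text of every paragraph instead of the links. The function name summarize_page and the sample HTML are assumptions made for this example. In a real script the html value would come from urlopen as in the Usage section.

```python
from bs4 import BeautifulSoup


def summarize_page(html):
    """Return the page title and the text of each <p> element."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the page has no <title> element.
    title = soup.title.get_text(strip=True) if soup.title else None
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return {"title": title, "paragraphs": paragraphs}


sample = (
    "<html><head><title> Demo </title></head>"
    "<body><p>First.</p><p> Second. </p></body></html>"
)
print(summarize_page(sample))
# {'title': 'Demo', 'paragraphs': ['First.', 'Second.']}
```

The same pattern applies to other tags: change the name passed to find_all and pick the attribute or text you need from each result.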

This article is contributed by Joel Joshua. If you would like to contribute to Hackzism you can also write an article and mail to hackzism.hack@gmail.com.

Your article will be published on our home page.
