Flask is a popular Python micro web framework that helps you develop lightweight web applications and APIs quickly and easily.
As a micro framework, it stays flexible and easy to customize for small applications while integrating well with modern libraries and tools.
Websites contain a wealth of valuable information. When gathering that information, you will almost certainly find yourself manually copying and pasting. You need a simpler, more automated method, which is where web scraping comes in.
Web scraping is the automated extraction of a web page's unstructured HTML content: the data is pulled out in a specified format, structured, and then stored in a database or saved as a CSV file for later use.
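For instance, once scraped data has been structured, Python's built-in csv module can persist it. A minimal sketch, using made-up product rows standing in for scraped results:

```python
import csv

# hypothetical rows extracted from a product page
rows = [
    {"name": "Laptop", "price": "999.99"},
    {"name": "Monitor", "price": "199.50"},
]

with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()    # column headers
    writer.writerows(rows)  # one line per scraped item
```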
Common data types organizations collect include images, videos, text, product information, customer sentiments and reviews (on sites like Twitter, Yell), and pricing from comparison websites.
Market research companies use scrapers to pull data from social media or online forums for things like customer sentiment analysis. Others scrape data from product sites like Amazon or eBay to support competitor analysis.
Google itself relies on web scraping: its crawlers extract content from third-party websites so that it can be analyzed, ranked, and indexed in search results.
Other applications include aggregating real estate listings, collecting weather data, and carrying out SEO audits.
A typical web scraping workflow looks like this:

- Find the URLs you want to scrape
- Inspect the page
- Identify the data you want to extract and find the appropriate nested tags
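The steps above can be sketched with Beautiful Soup. To keep the example self-contained, the page content here is a hard-coded HTML string; in a real run it would come from `requests.get(url).text`:

```python
from bs4 import BeautifulSoup

# stand-in for the HTML fetched from a target URL
html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item">Laptop</li>
    <li class="item">Monitor</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# the nested tags identified while inspecting the page
items = [li.get_text() for li in soup.find_all("li", class_="item")]
print(items)  # ['Laptop', 'Monitor']
```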
Popular tools for the job include:

- Beautiful Soup
- Scrapy
- Pandas
- Parsehub
- Load the application
- Provide a target URL and a tag to be fetched, for example img, p, or title
- Receive a response containing the requested element(s)
- For images, a download option saves the files to your Downloads directory
We will use the Flask, Beautiful Soup, and Requests libraries. First, we'll import some functionality from Flask and Beautiful Soup into the app.py file.
We need to validate and parse the URLs we receive, so we also import urllib, Python's URL handling module, along with a few other helper libraries.
from flask import (
    Flask,
    render_template,
    request,
    redirect,
    flash,
    url_for,
)
import urllib.request
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import requests, validators, uuid, pathlib, os
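The standard-library helpers imported above do the URL work: urlparse splits a URL into its components, and urljoin resolves relative links (such as image src attributes) against the page's address. A quick illustration:

```python
from urllib.parse import urlparse, urljoin

parts = urlparse("https://example.com/gallery/page.html")
print(parts.scheme)  # https
print(parts.netloc)  # example.com

# resolve a relative image path against the page URL
absolute = urljoin("https://example.com/gallery/page.html", "images/cat.png")
print(absolute)  # https://example.com/gallery/images/cat.png
```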
app = Flask(__name__)
app.secret_key = os.urandom(24)  # flash() requires a secret key

@app.route("/", methods=("GET", "POST"), strict_slashes=False)
def index():
    if request.method == "POST":
        # validate the submitted URL and parse the requested tag
        # (full POST handling, code lines 39-..., in the complete app.py)
        ...
    return render_template("index.html")
@app.route("/download", methods=("GET", "POST"), strict_slashes=False)
def downloader():
    try:
        # tag, specific_element, and requested_url come from the
        # form submitted to index() above
        for img in image_handler(tag, specific_element, requested_url):
            image_url = img
            filename = str(uuid.uuid4())  # unique name for each file
            file_ext = pathlib.Path(image_url).suffix
            picture_filename = filename + file_ext
            downloads_path = str(pathlib.Path.home() / "Downloads")
            picture_path = os.path.join(downloads_path, picture_filename)
            # fetch the image and write it to the Downloads directory
            urllib.request.urlretrieve(image_url, picture_path)
        flash("Images saved in your Downloads directory", "success")
    except Exception as e:
        flash(str(e), "danger")
    return redirect(url_for('index'))
The download function above uses the uuid library to generate a unique filename for each downloaded image.
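The image_handler helper that downloader() iterates over is not shown above. One way to write it, assuming the signature used there, is to fetch the page, collect each matching element's src attribute, and resolve it to an absolute URL with urljoin; this is a sketch, not the article's actual implementation:

```python
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_image_urls(html, base_url, tag="img", attr="src"):
    """Yield an absolute URL for every `tag` element with an `attr`."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all(tag):
        src = element.get(attr)
        if src:
            yield urljoin(base_url, src)

def image_handler(tag, specific_element, requested_url):
    # hypothetical implementation: fetch the page, then delegate;
    # specific_element (e.g. a CSS class) could further filter find_all
    html = requests.get(requested_url).text
    return extract_image_urls(html, requested_url, tag=tag)
```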
py -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python app.py
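The requirements.txt installed above is not listed in this article; judging from the imports, it would contain at least:

```
flask
beautifulsoup4
requests
validators
```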