Python: Is Web Scraping Really That Easy?
By ChatGPT, proofread by AMO
Note: The responses in the following post were written by ChatGPT after it was given a basic script.
To create a Python program that scrapes the content of the specified web page and alerts you in case of changes, you can follow these steps:
- Install the necessary libraries:
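Make sure the requests and beautifulsoup4 libraries are installed; they perform the HTTP requests and parse the HTML content, respectively. The smtplib library, used to send emails, is part of the Python standard library and needs no installation. You can install the first two using pip: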
pip install requests beautifulsoup4
- Write the scraping code: Use the requests library to retrieve the HTML content of the web page, then use beautifulsoup4 to extract the information you want to compare for detecting changes.
- Set up email sending: Use the smtplib library to send an email alert when changes are detected.
Here’s some example code to help you get started:
import smtplib
import time
from email.mime.text import MIMEText

import requests
from bs4 import BeautifulSoup


def get_page_content(url):
    # A timeout keeps a stalled request from blocking the loop forever.
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        return response.content
    else:
        raise Exception("Failed to fetch the page content.")


def extract_info_from_page(content):
    soup = BeautifulSoup(content, 'html.parser')
    # Here, add the code to extract relevant information from the page,
    # which is likely to change, such as specific tags, CSS classes, etc.
    # Return the extracted information as text or another data structure.
    # Placeholder until you narrow it down: return all visible text.
    return soup.get_text(separator=' ', strip=True)


def send_email_alert(message):
    # Configure your SMTP server details and authentication information.
    smtp_server = 'smtp.example.com'
    smtp_port = 587
    smtp_username = 'your_email_address@example.com'
    smtp_password = 'your_password'
    sender_email = 'your_email_address@example.com'
    receiver_email = 'recipient_email_address@example.com'

    msg = MIMEText(message)
    msg['From'] = sender_email
    msg['To'] = receiver_email
    msg['Subject'] = 'Alert: Page Content Has Changed!'

    try:
        with smtplib.SMTP(smtp_server, smtp_port) as server:
            server.starttls()
            server.login(smtp_username, smtp_password)
            server.sendmail(sender_email, receiver_email, msg.as_string())
        print("Alert email sent successfully.")
    except Exception as e:
        print(f"Error sending email: {e}")


if __name__ == "__main__":
    url = "https://www.urssaf.fr/portail/home/taux-et-baremes/plafonds.html"

    try:
        previous_info = extract_info_from_page(get_page_content(url))
    except Exception as e:
        # Without a baseline there is nothing to compare against, so stop.
        raise SystemExit(f"Error fetching page content: {e}")

    while True:
        # Wait for a certain time interval before checking again.
        # For example, every hour.
        time.sleep(3600)

        try:
            current_info = extract_info_from_page(get_page_content(url))
        except Exception as e:
            print(f"Error fetching page content: {e}")
            continue  # Skip this check and retry after the next interval

        if current_info != previous_info:
            message = "The content of the page has changed! Go check: " + url
            send_email_alert(message)
            break  # Stop the loop after sending the alert
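One possible refinement of the comparison step is to compare a short fingerprint of the normalized text rather than the full content, and to keep that fingerprint on disk so that restarting the script does not lose the baseline. The following is only a sketch using the standard library; the function names fingerprint and has_changed and the state file are hypothetical and not part of the script above.

```python
import hashlib
from pathlib import Path


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text with whitespace runs collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(text: str, state_file: Path) -> bool:
    """Compare the text's fingerprint with the stored one, then store the new one.

    The first call (no state file yet) only records a baseline and returns False.
    """
    new_digest = fingerprint(text)
    old_digest = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(new_digest)
    return old_digest is not None and old_digest != new_digest
```

Collapsing whitespace before hashing means that a page re-rendered with different indentation or line breaks is not reported as a change; whether that is desirable depends on what you are monitoring.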
Note that the part where you extract specific information depends on the HTML structure of the target page. You’ll need to inspect the page to find the tags and classes containing the data you need.
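As an illustration of that extraction step, here is a minimal sketch that keeps only the cells of one table and ignores the rest of the page. The sample HTML, the table id "rates", and the default selector are assumptions made up for this example; the real page will need its own selector, found by inspecting its source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (values are invented).
SAMPLE_HTML = """
<html><body>
  <nav>Menu</nav>
  <table id="rates">
    <tr><th>Period</th><th>Amount</th></tr>
    <tr><td>Monthly</td><td>1 000</td></tr>
  </table>
  <footer>Generated at 12:00</footer>
</body></html>
"""


def extract_info_from_page(content, selector="table#rates"):
    """Return the text of every cell under the first element matching `selector`."""
    soup = BeautifulSoup(content, "html.parser")
    target = soup.select_one(selector)
    if target is None:
        # Fail loudly: a missing element usually means the page layout changed.
        raise ValueError(f"No element matches selector {selector!r}")
    return [cell.get_text(strip=True) for cell in target.find_all(["th", "td"])]


if __name__ == "__main__":
    print(extract_info_from_page(SAMPLE_HTML))
```

Restricting the comparison to one element this way avoids alerts caused by parts of the page that change on every request, such as the timestamp in the sample footer.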