Information for students of A4M33VIA (Development of Internet Applications).
Tools
Anaconda (Recommended for Windows)
pip install beautifulsoup4
pip install requests
Run python script
python main.py
python3 main.py
py -3 main.py
python -3 main.py
One of these should work.
Latest articles from TechCrunch
Get the headline, content, and summary of the main articles from TechCrunch.
1) Get the main page
import requests
requests.get(url)
response.content
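Step 1 can be sketched as a small function. This is only an illustration: the helper name `fetch_html`, the `timeout` value, and the injectable `get` parameter are assumptions added here, not part of the original notes. `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    # `get` is injectable only so the function can be tried without a network.
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html("https://techcrunch.com/")` would return the bytes that step 2 hands to the parser.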
2) Parse page
from bs4 import BeautifulSoup
BeautifulSoup(content, "html.parser")
3) Element manipulation
soup.find_all(element, class_name)
soup.find(element, id=element_id)
tag.get("href")
tag.get_text()
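The calls above can be tried on a small inline page before pointing them at a real site. The HTML below is made up for illustration; the class names and the id mirror the ones the full script in these notes looks for on TechCrunch, which may have changed since.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<html><body>
  <h1 class="alpha">Example headline</h1>
  <p id="speakable-summary">Short summary.</p>
  <a class="read-more" href="https://example.com/a">Read more</a>
  <a class="read-more" href="https://example.com/b">Read more</a>
  <a class="other" href="https://example.com/c">Other</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The second positional argument of find/find_all filters by CSS class.
links = [tag.get("href") for tag in soup.find_all("a", "read-more")]
headline = soup.find("h1", "alpha").get_text()
summary = soup.find("p", id="speakable-summary").get_text()
# find returns None when nothing matches, so check before calling methods on it.
missing = soup.find("div", "article-entry")

print(links)
print(headline, "|", summary, "|", missing)
```

Because `find` returns `None` for a missing element, calling `get_text()` on the result without a check raises `AttributeError` when a page does not have the expected structure.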
4) JSON
with open("articles.json", "w", encoding="utf-8") as file:
    json.dump(articles, file, indent=4)
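One detail worth knowing: by default `json.dump` escapes non-ASCII characters even when the file is opened as UTF-8. The sketch below uses made-up sample data and a temporary file (both assumptions for illustration) and passes `ensure_ascii=False`, an optional choice that keeps such characters readable in the output.

```python
import json
import os
import tempfile

# Sample data standing in for the scraped articles (made up for illustration).
articles = [{"headline": "Café opens", "content": "Body text", "summary": "Short"}]

path = os.path.join(tempfile.mkdtemp(), "articles.json")

with open(path, "w", encoding="utf-8") as file:
    # ensure_ascii=False keeps "é" as-is instead of writing "\u00e9".
    json.dump(articles, file, indent=4, ensure_ascii=False)

with open(path, encoding="utf-8") as file:
    raw_text = file.read()

# Parsing the file content gives back the same Python objects.
loaded = json.loads(raw_text)
```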
5) Sleep
from time import sleep
sleep(seconds)
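A pause between requests keeps the scraper from hitting the site too quickly. A minimal sketch, assuming a hypothetical helper `fetch_all` that takes any `fetch` function; the `pause` parameter exists only so the delay can be swapped out when experimenting.

```python
from time import sleep


def fetch_all(urls, fetch, delay_seconds=3, pause=sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            pause(delay_seconds)
        results.append(fetch(url))
    return results
```

Pausing before each request rather than after avoids a pointless wait once the last page has been downloaded.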
Latest movies from TMDB
Get the name, release date, description, rating, and “adults_only” flag of the latest movies from the TMDB API.
1) Register for the API
Create an account
Settings – API – Register application – API Key (it should look like this: bea1bcfeeg54erg4reg4d24b6fc895a3; this one will not work)
2) API requests examples
https://www.themoviedb.org/documentation/api/discover
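A request URL for the discover endpoint might be built as follows. This is a sketch under assumptions: the base URL and the `api_key`, `sort_by`, and `page` query parameters are taken from memory of the TMDB v3 API and should be checked against the documentation linked above. `build_discover_url` is a hypothetical helper name and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

# Assumed base URL of the TMDB v3 discover endpoint; verify it in the docs.
DISCOVER_URL = "https://api.themoviedb.org/3/discover/movie"


def build_discover_url(api_key, sort_by="release_date.desc", page=1):
    """Return a discover URL with the query string properly encoded."""
    params = {"api_key": api_key, "sort_by": sort_by, "page": page}
    return DISCOVER_URL + "?" + urlencode(params)


url = build_discover_url("YOUR_API_KEY")
print(url)
```

The resulting string can be passed to `requests.get(url)` in the same way as in the TechCrunch example.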
3) Load JSON response
response.content.decode("utf-8")
json.loads(response_content_string)
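The two calls above can be tried offline on a hard-coded payload. The bytes below are made up, and the field names (`results`, `title`, `release_date`, `overview`, `vote_average`, `adult`) are assumptions about the TMDB response format that should be compared with a real reply.

```python
import json

# Made-up bytes standing in for response.content.
response_content = (
    b'{"results": [{"title": "Example Movie", "release_date": "2017-11-01", '
    b'"overview": "A short description.", "vote_average": 7.5, "adult": false}]}'
)

response_content_string = response_content.decode("utf-8")
data = json.loads(response_content_string)

movies = []
for item in data.get("results", []):
    # .get returns None instead of raising if a field is missing.
    movies.append({
        "name": item.get("title"),
        "release_date": item.get("release_date"),
        "description": item.get("overview"),
        "rating": item.get("vote_average"),
        "adults_only": item.get("adult"),
    })
```

The `movies` list has the same shape as `articles` in the first task, so it can be written to a file with `json.dump` in the same way.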
UPDATE:
import json
from time import sleep

import requests  # library for making http requests
from bs4 import BeautifulSoup  # html parser


# Get headline, content and summary from article
def parse_article(article_url):
    # Make GET request on web page
    response = requests.get(article_url)
    # Get HTML content from response
    content = response.content
    # Load HTML into parser
    soup = BeautifulSoup(content, "html.parser")
    # Find headline tag with class "alpha"
    headline = soup.find("h1", "alpha")
    # Get text of headline
    headline_text = headline.get_text()
    # Get content of article
    article = soup.find("div", "article-entry")
    # Get text of article without tags
    article_text = article.get_text()
    # Get speakable summary
    speakable_summary = soup.find("p", id="speakable-summary")
    # Get text of speakable summary without tags
    speakable_summary_text = speakable_summary.get_text()
    return headline_text, article_text, speakable_summary_text


url = "https://techcrunch.com/"
# Make GET request on web page
response = requests.get(url)
# Get HTML content from response
content = response.content
# Load HTML into parser
soup = BeautifulSoup(content, "html.parser")
# Find all <a> tags with class "read-more"
links_to_articles = soup.find_all("a", "read-more")
articles = []
# Go through all <a>
for link in links_to_articles:
    # Get link to article from <a>
    article_url = link.get("href")
    # Get headline, content and summary from article
    headline, content, summary = parse_article(article_url)
    article = {"headline": headline, "content": content, "summary": summary}
    # Append article to list of articles
    articles.append(article)
    print(headline)
    # Sleep for 3 seconds to not trigger captcha
    sleep(3)

# Open json file in write mode and utf-8 encoding
with open("articles.json", "w", encoding="utf-8") as file:
    # write articles to file
    json.dump(articles, file, indent=4)
UPDATE 16. 11. 2017
I loved the way you wrote.
I have just one little thing regarding BeautifulSoup. It is not capable of scraping Ajax & Javascript web pages.
That’s why I’m using Selenium. It has many other advantages.
Anyway, thanks for the article.