Information for students of A4M33VIA (Development of Internet Applications).

Tools

Python 3

Anaconda (Recommended for Windows)

pip install beautifulsoup4

pip install requests

Run a Python script

python main.py

python3 main.py

py -3 main.py

One of these should work.

Latest articles from TechCrunch

Get the headline, content, and summary of the main articles from TechCrunch.

1) Get the main page

import requests

response = requests.get(url)

content = response.content

2) Parse page

from bs4 import BeautifulSoup

BeautifulSoup(content, "html.parser")

3) Element manipulation

soup.find_all(element, class_=class_name)

soup.find(element, id=element_id)

tag.get("href")

tag.get_text()

4) JSON

import json

with open("articles.json", "w", encoding="utf-8") as file:
    json.dump(articles, file, indent=4)

5) Sleep

from time import sleep

sleep(seconds)

Latest movies from TMDB

Get the name, release date, description, rating, and "adults_only" flag of the latest movies from the TMDB API.

1) Register for the API

Create an account

Settings – API – Register an application – API Key (it should look like this: bea1bcfeeg54erg4reg4d24b6fc895a3; this one will not work)

2) API request examples

https://www.themoviedb.org/documentation/api/discover

3) Load JSON response

response.content.decode("utf-8")

json.loads(response_content_string)

UPDATE:

import json
from time import sleep
import requests  # library for making http requests
from bs4 import BeautifulSoup  # html parser


# Get headline, content and summary from article
def parse_article(article_url):
    # Make GET request on web page
    response = requests.get(article_url)
    # Get HTML content from response
    content = response.content
    # Load HTML into parser
    soup = BeautifulSoup(content, "html.parser")

    # Find headline tag with class "alpha"
    headline = soup.find("h1", "alpha")
    # Get text of headline
    headline_text = headline.get_text()

    # Get content of article
    article = soup.find("div", "article-entry")
    # Get text of article without tags
    article_text = article.get_text()

    # Get speakable summary
    speakable_summary = soup.find("p", id="speakable-summary")
    # Get text of speakable summary without tags
    speakable_summary_text = speakable_summary.get_text()

    return headline_text, article_text, speakable_summary_text


url = "https://techcrunch.com/"
# Make GET request on web page
response = requests.get(url)
# Get HTML content from response
content = response.content
# Load HTML into parser
soup = BeautifulSoup(content, "html.parser")

# Find all <a> tags with class "read-more"
links_to_articles = soup.find_all("a", "read-more")

articles = []
# Go through all <a>
for link in links_to_articles:
    # Get link to article from <a>
    article_url = link.get("href")
    # Get headline, content and summary from article
    headline, content, summary = parse_article(article_url)
    article = {"headline": headline, "content": content, "summary": summary}
    # Append article to list of articles
    articles.append(article)
    print(headline)
    # Sleep for 3 seconds to not trigger captcha
    sleep(3)

# Open json file in write mode and utf-8 encoding
with open("articles.json", "w", encoding="utf-8") as file:
    # write articles to file
    json.dump(articles, file, indent=4)

UPDATE 16. 11. 2017

Intent recognition

One response

  1. I loved the way you wrote.
     I have just one little thing regarding BeautifulSoup: it is not capable of scraping Ajax and JavaScript web pages.
     That's why I'm using Selenium. It has many other advantages.
     Anyway, thanks for the article.
