
Python and BeautifulSoup: how to fetch and parse data on European volunteering services with a tiny scraper that collects opportunities from an EU site?

2024-03-14 00:30:08

I am looking for a public list of volunteering services in Europe. I don't need full addresses, only the name and the website. I am thinking of data in a format such as XML or CSV with these fields: name and country, plus any additional fields that are available, with one record per country of presence. By the way, the European volunteering services are great options for young people.

I have found a very comprehensive page, and I want to gather data on the European volunteering services that are hosted on this European site:

https://youth.europa.eu/go-abroad/volunteering/opportunities_en

There are several hundred volunteering opportunities there, which are stored on pages like the following:

 https://youth.europa.eu/solidarity/placement/39020_en 

https://youth.europa.eu/solidarity/placement/38993_en 

https://youth.europa.eu/solidarity/placement/38973_en 

https://youth.europa.eu/solidarity/placement/38972_en 

https://youth.europa.eu/solidarity/placement/38850_en 

https://youth.europa.eu/solidarity/placement/38633_en

Idea:

I think it would be great to gather the data with a scraper based on BS4 and requests: fetch each page, parse it, and then print the results as a DataFrame.

I think we could iterate over all the URLs:

placement/39020_en 
placement/38993_en 
placement/38973_en 
placement/38850_en 

I think we could iterate over the IDs from 0 to 100,000 to fetch all the results that are stored under placement/. But I have no code for this idea yet. In other words, at the moment I do not know how to iterate over such a large range.
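The ID iteration described above could be sketched roughly as follows. This is only an illustration under assumptions: the names placement_url, fetch_html and iter_existing are hypothetical, and the sketch assumes that an ID without a placement answers with a non-200 status code, which has not been verified against the site. Requesting 100,000 pages puts real load on the server, so the delay parameter is there to throttle the loop.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple

BASE_URL = "https://youth.europa.eu/solidarity/placement/{pid}_en"


def placement_url(pid: int) -> str:
    """Build the English placement URL for a numeric ID."""
    return BASE_URL.format(pid=pid)


def fetch_html(url: str) -> Optional[str]:
    """Return the page HTML, or None if the request fails or is not a 200."""
    import requests  # imported lazily so the URL helpers work without it

    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    # Assumption: IDs without a placement do not answer with status 200.
    return response.text if response.status_code == 200 else None


def iter_existing(
    ids: Iterable[int],
    fetch: Callable[[str], Optional[str]] = fetch_html,
    delay: float = 0.0,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) for every ID whose page could be fetched."""
    for pid in ids:
        url = placement_url(pid)
        html = fetch(url)
        if html is not None:
            yield url, html
        time.sleep(delay)  # pause between requests to go easy on the server
```

A call such as iter_existing(range(38000, 39100), delay=1.0) would then walk a small slice of the range. Because fetch is a parameter, the loop can be tried out with a fake fetcher before any real request is sent.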

For now, I think a basic approach to start with is this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# List of URLs to scrape
urls = [
    "https://youth.europa.eu/solidarity/placement/39020_en",
    "https://youth.europa.eu/solidarity/placement/38993_en",
    "https://youth.europa.eu/solidarity/placement/38973_en",
    "https://youth.europa.eu/solidarity/placement/38972_en",
    "https://youth.europa.eu/solidarity/placement/38850_en",
    "https://youth.europa.eu/solidarity/placement/38633_en"
]

# Function to scrape data from a single URL
def scrape_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extracting relevant data
    title = soup.find("h2").text.strip()
    location = soup.find("span", class_="field--name-field-placement-location").text.strip()
    start_date = soup.find("span", class_="field--name-field-placement-start-date").text.strip()
    end_date = soup.find("span", class_="field--name-field-placement-end-date").text.strip()
    
    # Returning data as dictionary
    return {
        "Title": title,
        "Location": location,
        "Start Date": start_date,
        "End Date": end_date,
        "URL": url
    }

# Scrape data from all URLs
data = []
for url in urls:
    data.append(scrape_data(url))

# Convert data to DataFrame
df = pd.DataFrame(data)

# Print DataFrame
print(df)

Which gives me back the following:

AttributeError                            Traceback (most recent call last)

<ipython-input-1-e65c612df65e> in <cell line: 37>()
     36 data = []
     37 for url in urls:
---> 38     data.append(scrape_data(url))
     39 
     40 # Convert data to DataFrame

<ipython-input-1-e65c612df65e> in scrape_data(url)
     20     # Extracting relevant data
     21     title = soup.find("h2").text.strip()
---> 22     location = soup.find("span", class_="field--name-field-placement-location").text.strip()
     23     start_date = soup.find("span", class_="field--name-field-placement-start-date").text.strip()
     24     end_date = soup.find("span", class_="field--name-field-placement-end-date").text.strip()

AttributeError: 'NoneType' object has no attribute 'text'

Solution:

First check whether the elements you want to select are actually contained in the response / soup; the ones you are addressing do not appear to be present. So, as mentioned by @John Gordon, your selection is not finding anything: soup.find(...) returns None, and calling .text on None raises the AttributeError.
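One quick way to carry out that check is to search the raw HTML for the class names before writing any selectors. The following is a minimal sketch; the helper missing_markers is a hypothetical name, and sample_html is a made-up fragment standing in for response.text, not the real page.

```python
from typing import Iterable, List


def missing_markers(html: str, markers: Iterable[str]) -> List[str]:
    """Return the markers (class names, ids, ...) that never occur in the raw HTML."""
    return [marker for marker in markers if marker not in html]


# Class names taken from the selectors in the question.
QUESTION_CLASSES = [
    "field--name-field-placement-location",
    "field--name-field-placement-start-date",
    "field--name-field-placement-end-date",
]

# Stand-in for response.text; a made-up fragment, not the real page.
sample_html = (
    '<h1>Demo placement</h1>'
    '<p><i class="fa fa-location-arrow"></i> Gandia, Spain</p>'
)

# Every class that is reported here cannot be matched by soup.find().
print(missing_markers(sample_html, QUESTION_CLASSES))
```

If a class name is reported as missing, no selector built on it can match, and soup.find() will return None for it.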

You could select your elements like this, using CSS selectors:

# Extracting relevant data
title = soup.h1.get_text(', ', strip=True)
location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
  | Title | Location | Start Date | End Date | URL
0 | Supporting GOB's Sustainable Gardening Project "Es Viver" | c/ Camí des Castell, 53, 07702 Maó, Menorca, Spain | 01/06/2024 | 31/05/2025 | https://youth.europa.eu/solidarity/placement/39020_en
1 | EUROPEAN VOLUNTEERING VS DEPOPULATION 3.0 | 47400 Medina del Campo (VALLADOLID), Spain | 31/05/2024 | 30/03/2025 | https://youth.europa.eu/solidarity/placement/38993_en
2 | SUPPORTING LOCAL COMMUNITIES: ASMISAF/AUNA Inclusión | Gandia, Spain | 01/06/2024 | 30/06/2025 | https://youth.europa.eu/solidarity/placement/38973_en
3 | SUPPORTING LOCAL COMMUNITIES: Caritas Gandia | Gandia, Spain | 01/06/2024 | 30/06/2025 | https://youth.europa.eu/solidarity/placement/38972_en
4 | Pedagogic farm based on equine assisted interventions + social service | Masía Cal Taulé s/n, 08673 Serrateix, Spain | 01/03/2024 | 31/03/2025 | https://youth.europa.eu/solidarity/placement/38850_en
5 | Suporting in a Rural Area | Plaza de Tuy, 6, 34440 Frómista, Spain | 04/03/2024 | 04/11/2024 | https://youth.europa.eu/solidarity/placement/38633_en
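To keep a long scraping run from stopping at the first page with a different layout, the CSS selectors from the answer could be wrapped in a function that returns None for anything it cannot find. This is a sketch, not a tested scraper for the live site: parse_placement is a hypothetical name, and the selectors are simply the ones given in the answer.

```python
from typing import Dict, Optional

from bs4 import BeautifulSoup


def parse_placement(html: str, url: str) -> Dict[str, Optional[str]]:
    """Extract title, location and dates from one placement page.

    Fields that cannot be located are returned as None instead of raising.
    """
    soup = BeautifulSoup(html, "html.parser")

    title_el = soup.h1
    location_el = soup.select_one("p:has(i.fa-location-arrow)")
    # The answer takes the last two matches as start and end date.
    date_els = soup.select("span.extra strong")[-2:]

    # Only unpack the dates when both were found.
    if len(date_els) == 2:
        start_date, end_date = (e.get_text(strip=True) for e in date_els)
    else:
        start_date = end_date = None

    return {
        "Title": title_el.get_text(", ", strip=True) if title_el is not None else None,
        "Location": (
            location_el.get_text(", ", strip=True) if location_el is not None else None
        ),
        "Start Date": start_date,
        "End Date": end_date,
        "URL": url,
    }
```

Inside the original loop this would be called as data.append(parse_placement(requests.get(url).text, url)), and pd.DataFrame(data) then builds the table as before, with None marking the fields that were not found.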