I am looking for a public list of volunteering services in Europe. I don't need full addresses, just the name and the website. I am thinking of structured data (XML, CSV, ...) with these fields: name and country; some additional fields would be nice, with one record per country of presence. By the way: the European volunteering services are great options for young people.
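For illustration, a hypothetical record in the target CSV layout (the service name and URL below are made up):

name,country,website
Example Volunteering Service,DE,https://example.org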
I have found a page that is very comprehensive: I want to gather data on the European volunteering services hosted on this European site:
https://youth.europa.eu/go-abroad/volunteering/opportunities_en
There are several hundred volunteering opportunities there, stored on pages like the following:
https://youth.europa.eu/solidarity/placement/39020_en
https://youth.europa.eu/solidarity/placement/38993_en
https://youth.europa.eu/solidarity/placement/38973_en
https://youth.europa.eu/solidarity/placement/38972_en
https://youth.europa.eu/solidarity/placement/38850_en
https://youth.europa.eu/solidarity/placement/38633_en
My idea: it would be great to gather the data with a scraper based on bs4 (BeautifulSoup) and requests, parse it, and then print it as a DataFrame. I think we could iterate over all the URLs:
placement/39020_en
placement/38993_en
placement/38973_en
placement/38850_en
I think we could iterate over the IDs from zero to 100,000 to fetch all the results stored under placement/. But this idea is not backed by code yet; in other words, at the moment I do not know how to iterate over such a large range. A rough sketch of what I have in mind follows.
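A minimal sketch of that ID sweep, assuming a missing placement returns a non-200 HTTP status (if the site serves a 200 "not found" page instead, the check would have to inspect the page content); the test window and the delay are arbitrary choices:

import time
import requests

BASE_URL = "https://youth.europa.eu/solidarity/placement/{}_en"

valid_urls = []
for placement_id in range(38600, 39100):  # small test window; widen towards 0..100000 once it works
    url = BASE_URL.format(placement_id)
    response = requests.get(url, timeout=30)
    if response.status_code == 200:  # keep only IDs that resolve to a real placement page
        valid_urls.append(url)
    time.sleep(0.5)  # be polite: pause between requests

print(f"Found {len(valid_urls)} live placement pages")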
As a basic starting point, I currently have this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# List of URLs to scrape
urls = [
    "https://youth.europa.eu/solidarity/placement/39020_en",
    "https://youth.europa.eu/solidarity/placement/38993_en",
    "https://youth.europa.eu/solidarity/placement/38973_en",
    "https://youth.europa.eu/solidarity/placement/38972_en",
    "https://youth.europa.eu/solidarity/placement/38850_en",
    "https://youth.europa.eu/solidarity/placement/38633_en"
]

# Function to scrape data from a single URL
def scrape_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extracting relevant data
    title = soup.find("h2").text.strip()
    location = soup.find("span", class_="field--name-field-placement-location").text.strip()
    start_date = soup.find("span", class_="field--name-field-placement-start-date").text.strip()
    end_date = soup.find("span", class_="field--name-field-placement-end-date").text.strip()

    # Returning data as dictionary
    return {
        "Title": title,
        "Location": location,
        "Start Date": start_date,
        "End Date": end_date,
        "URL": url
    }

# Scrape data from all URLs
data = []
for url in urls:
    data.append(scrape_data(url))

# Convert data to DataFrame
df = pd.DataFrame(data)

# Print DataFrame
print(df)
Which gives me back the following traceback:
AttributeError                            Traceback (most recent call last)
<ipython-input-1-e65c612df65e> in <cell line: 37>()
     36 data = []
     37 for url in urls:
---> 38     data.append(scrape_data(url))
     39
     40 # Convert data to DataFrame

<ipython-input-1-e65c612df65e> in scrape_data(url)
     20     # Extracting relevant data
     21     title = soup.find("h2").text.strip()
---> 22     location = soup.find("span", class_="field--name-field-placement-location").text.strip()
     23     start_date = soup.find("span", class_="field--name-field-placement-start-date").text.strip()
     24     end_date = soup.find("span", class_="field--name-field-placement-end-date").text.strip()

AttributeError: 'NoneType' object has no attribute 'text'
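The AttributeError means that soup.find() returned None for the location span: no element with class field--name-field-placement-location exists in the fetched HTML, so the class names in my snippet apparently do not match the live page and would need to be verified with the browser's developer tools. A defensive variant of scrape_data that returns None instead of crashing when an element is missing, keeping the unverified class names from above as placeholders:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_text(soup, *find_args, **find_kwargs):
    # Return the stripped text of the first match, or None if nothing matches
    tag = soup.find(*find_args, **find_kwargs)
    return tag.get_text(strip=True) if tag else None

def scrape_data(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.content, "html.parser")
    return {
        "Title": get_text(soup, "h1") or get_text(soup, "h2"),
        # Unverified class names carried over from the snippet above;
        # confirm the real selectors in the browser's developer tools
        "Location": get_text(soup, "span", class_="field--name-field-placement-location"),
        "Start Date": get_text(soup, "span", class_="field--name-field-placement-start-date"),
        "End Date": get_text(soup, "span", class_="field--name-field-placement-end-date"),
        "URL": url,
    }

df = pd.DataFrame([scrape_data(u) for u in urls])  # reuses the urls list from above
print(df)

Columns that come back entirely as None would point to exactly the selectors that are wrong. Once the DataFrame is filled, df.to_csv("placements.csv", index=False) would produce the CSV output described at the top.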