
Python - How to iterate a list from a to z - to scrape data and transform it into dataframe?

2024-03-12 22:30:08

I am currently working on a scraper that collects data on German insurance companies - the site provides a comprehensive list of member companies from A to Z.

Our members:

https://www.gdv.de/gdv/der-gdv/unsere-mitglieder - the overview lists 478 results:

For the letter A: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=A
For the letter B: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=B

and so forth. By the way, see for example the page of a single company:
https://www.gdv.de/gdv/der-gdv/unsere-mitglieder/ba-die-bayerische-allgemeine-versicherung-ag-47236
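The full list of letter URLs to iterate over can be built from that pattern, for example like this (a minimal sketch):

import string

letter_urls = [f"https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter={c}"
               for c in string.ascii_uppercase]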

From each company page we need the contact data and the address.

Well, I think this task is best done with a small bs4 scraper using Requests, putting all the data into a DataFrame: I use BeautifulSoup for parsing the HTML and Requests for making the HTTP requests. First we define a function scrape_insurance_company that takes a letter-page URL as input, sends an HTTP GET request to it, and collects the links to the individual companies; a second function, scrape_company_data, extracts the contact data and address from each company page using BeautifulSoup.

Each call returns a dictionary containing the extracted data. Since we need to cover the letters from A to Z, we iterate through the list of letter URLs and call the scraping function for each one to collect the data. Subsequently we use Pandas to organize the data into a DataFrame.

Note: I run this on Google Colab:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_insurance_company(url):
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find all the links to insurance companies
        company_links = soup.find_all('a', class_='entry-title')
        
        # List to store the data for all insurance companies
        all_data = []
        
        # Iterate through each company link
        for link in company_links:
            company_url = link['href']
            company_data = scrape_company_data(company_url)
            if company_data:
                all_data.append(company_data)
        
        return all_data
    else:
        print("Failed to fetch the page:", response.status_code)
        return None

def scrape_company_data(url):
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # DEBUG: Print HTML content of the page
        print(soup.prettify())
        
        # Find the relevant elements containing contact data and address
        contact_info = soup.find('div', class_='contact')
        address_info = soup.find('div', class_='address')
        
        # Extract contact data and address if found
        contact_data = contact_info.text.strip() if contact_info else None
        address = address_info.text.strip() if address_info else None
        
        return {'Contact Data': contact_data, 'Address': address}
    else:
        print("Failed to fetch the page:", response.status_code)
        return None

# now we list to store data for all insurance companies
all_insurance_data = []

# and now we iterate through the alphabet
for letter in range(ord('A'), ord('Z') + 1):
    letter_url = f"https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter={chr(letter)}"
    print("Scraping page:", letter_url)
    data = scrape_insurance_company(letter_url)
    if data:
        all_insurance_data.extend(data)

# subsequently we convert the data to a Pandas DataFrame
df = pd.DataFrame(all_insurance_data)

# and finally - we save the data to a CSV file
df.to_csv('insurance_data.csv', index=False)

print("Scraping completed and data saved to 'insurance_data.csv'.")

At the moment it all looks like this - in the Google Colab terminal I get:

Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=A
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=B
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=C
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=D

Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=Z

Scraping completed and data saved to 'insurance_data.csv'.

But the list is still empty... I am still struggling here a bit.

Solution:

First, check whether the elements you want to select are actually contained in the response / soup; the ones you are addressing do not appear to be present. As a result, your ResultSet soup.find_all('a', class_='entry-title') is empty and the for loop never starts to fill all_data.
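A quick way to verify this is to fetch one letter page and compare the two selectors (a minimal sketch; the counts are what I would expect given the markup described below):

import requests
from bs4 import BeautifulSoup

url = "https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=A"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# The selector from the question matches nothing on this page
print(len(soup.find_all('a', class_='entry-title')))    # 0

# The selector below should return one link per listed company
print(len(soup.select('ul.ibmix-download-teaser__list li a')))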

Scrape the links of the companies on each letter page to fill your all_data list:

# Find all the links to insurance companies
company_links = soup.select('ul.ibmix-download-teaser__list li a')
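One thing to watch when following these links: the href attributes may be relative paths rather than absolute URLs, so it is safer to resolve them before requesting the company pages - a small sketch of the loop inside scrape_insurance_company, assuming relative hrefs:

from urllib.parse import urljoin

for link in company_links:
    # urljoin resolves relative paths and leaves absolute URLs untouched
    company_url = urljoin("https://www.gdv.de", link['href'])
    company_data = scrape_company_data(company_url)
    if company_data:
        all_data.append(company_data)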

Find address and contact info on each company site:

# Find the relevant elements containing contact data and address
contact_info = soup.select_one('[href^="mailto"]')
address_info = soup.select_one('.ibmix-article__rte p')
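Put together, the company-page function could look roughly like this (a sketch based on the selectors above; the None checks are kept because the markup may differ between pages):

def scrape_company_data(url):
    response = requests.get(url)
    if response.status_code != 200:
        print("Failed to fetch the page:", response.status_code)
        return None

    soup = BeautifulSoup(response.content, 'html.parser')

    # E-mail link and first paragraph of the article body (address block)
    contact_info = soup.select_one('[href^="mailto"]')
    address_info = soup.select_one('.ibmix-article__rte p')

    # contact_info['href'] would give the raw mailto: address instead of the link text
    contact_data = contact_info.text.strip() if contact_info else None
    address = address_info.text.strip() if address_info else None

    return {'Contact Data': contact_data, 'Address': address}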