
Web-scraping - How to scrape a list from Wikipedia?

2024-03-14 04:30:08

I am facing a problem similar to the one in the question "How can I scrape a list from wikipedia and transfer to a dataframe". I want to create a dataframe from the list 'Modern wars with fewer than 25,000 deaths by death toll' on the Wikipedia page, with the column names 'Death toll', 'War', and 'Date'.

The solution proposed in the other post does not work for me because the Wikipedia markup is different, and I can't find a class name for the list.

I am using BeautifulSoup as follows:

url = "https://en.wikipedia.org/wiki/List_of_wars_by_death_toll"

import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

I have tried several things, including soup.find_all('lu') and soup.find_all("div", {"class": "mw-content-ltr"})[0].find_all("li"), as well as:

ul_elements = soup.find_all("ul")
for ul in ul_elements:
    # Find all <li> elements within the <ul> element
    li_elements = ul.find_all("li")
    for li in li_elements:
        # Print the text content of each <li> element
        print(li.get_text())

Nothing seems to work. The last option, for instance, prints far more than just my list, and I don't know how to limit the result to only that list.

Thanks!

Solution:

Try:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_wars_by_death_toll"

soup = BeautifulSoup(requests.get(url).content, "html.parser")


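data = []
# Select the first <ul> sibling that follows the <h3> containing the section
# title, then walk over its <li> items.
for li in soup.select_one(
    'h3:-soup-contains("Modern wars with fewer than 25,000 deaths by death toll") ~ ul'
).find_all("li"):
    # Most items look like "<death toll> – <war name>"; otherwise split on the first space.
    if " – " in li.text:
        a, b = li.text.split(" – ", maxsplit=1)
    else:
        a, b = li.text.split(" ", maxsplit=1)

    # Drop everything from the first "[" onward (footnote markers such as "[1]").
    if "[" in b:
        b = b.split("[")[0]

    # Keep the href of the first link in the item, if there is one.
    link = None
    if li.a:
        link = li.a["href"]

    data.append((a, b, link))

df = pd.DataFrame(data, columns=["Deathtoll", "Name", "Link"])
print(df.head(10))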

Prints:

  Deathtoll                                       Name                                         Link
0    22,211   Croatian War of Independence (1991–1995)           /wiki/Croatian_War_of_Independence
1   22,000+      Dominican Restoration War (1863–1865)              /wiki/Dominican_Restoration_War
2   21,000+                         Six-Day War (1967)                            /wiki/Six-Day_War
3    20,068                     Reform War (1857–1860)                             /wiki/Reform_War
4   20,000+                     Yaqui Wars (1533–1929)                             /wiki/Yaqui_Wars
5   20,000+  War of the Quadruple Alliance (1718–1720)          /wiki/War_of_the_Quadruple_Alliance
6   20,000+                 Ragamuffin War (1835–1845)                         /wiki/Ragamuffin_War
7   20,000+              Italo-Turkish War (1911–1912)                      /wiki/Italo-Turkish_War
8    20,000              Anglo-Spanish War (1727–1729)  /wiki/Anglo-Spanish_War_(1727%E2%80%931729)
9   19,619+             Rhodesian Bush War (1964–1979)                     /wiki/Rhodesian_Bush_War
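
The question asked for the columns 'Death toll', 'War', and 'Date', while the output above keeps the war name and the date together in "Name". A minimal sketch of one way to separate them, assuming every name ends with a parenthesised date as in the rows printed above; split_name_and_date is a hypothetical helper, not part of the original answer:

```python
import re

# Assumption: a name looks like "War name (1991–1995)" or "War name (1967)",
# i.e. it ends with a parenthesised part that contains at least one digit.
_NAME_DATE = re.compile(r"^\s*(?P<war>.*?)\s*\((?P<date>[^()]*\d[^()]*)\)\s*$")


def split_name_and_date(name):
    """Return (war, date); date is None when no trailing "(...)" is found."""
    match = _NAME_DATE.match(name)
    if match is None:
        return name.strip(), None
    return match.group("war"), match.group("date")


# Sample rows copied from the output above: (death toll, name).
rows = [
    ("22,211", "Croatian War of Independence (1991–1995)"),
    ("21,000+", "Six-Day War (1967)"),
]

# Build (death toll, war, date) tuples.
records = [(toll, *split_name_and_date(name)) for toll, name in rows]
print(records)
```

The resulting tuples can then be passed to pd.DataFrame(records, columns=["Death toll", "War", "Date"]). Names that do not match the assumed pattern are kept whole, with None as the date.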