Web scraping with Python
Web-scraping-with-Python is a project that extracts the table of participants in the 2019-2020 French football championship from Wikipedia and processes the data using Python.

Technologies and Editors Used

Python Libraries
- Requests: Used to send HTTP requests and retrieve the content of the web page.
- BeautifulSoup: Used to parse HTML documents and extract the data.
- Pandas: Used to manipulate and analyze the data by converting it into a DataFrame.
Development Environment
- Python: Main programming language used to develop the web scraping script.
- Jupyter Notebook: Used for interactive development and testing of the Python scripts.
- Visual Studio Code: Code editor used to write and manage the Python scripts.
Project Objectives
The main goal of this project is to automate the process of retrieving specific data from a web page and exporting it to an Excel file. The steps involved are as follows:
Work to Do
- Import Necessary Libraries: Import all the Python libraries required for web scraping and data processing.
- Connect to the Website: Access the URL https://fr.m.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020 and retrieve the content of the page.
- Retrieve All Tables: Extract all the tables present on the Wikipedia page.
- Identify and Locate the Participants Table: Identify the specific table that contains the list of participants in the championship.
- Retrieve the Table and Load into a DataFrame: Extract the identified table and load it into a Pandas DataFrame for easy manipulation and analysis.
- Export the Table to Excel: Export the DataFrame containing the participants' data to an Excel file for further use and analysis.
Steps in Detail
1. Import Necessary Libraries
To start, the project imports the essential Python libraries: `requests`, `BeautifulSoup`, and `pandas`:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
2. Connect to the Website
Using the `requests` library, the script connects to the Wikipedia page and retrieves its HTML content:
```python
url = "https://fr.m.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
```
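As written, a network failure or an HTTP error page would go unnoticed and the script would simply parse whatever came back. A slightly more defensive variant of the same request (a sketch, not part of the original script; the 10-second timeout is an arbitrary choice) is:

```python
# raise_for_status() turns HTTP errors such as 404 or 503 into exceptions
# instead of letting the script silently parse an error page; the timeout
# keeps the request from hanging indefinitely.
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
```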
3. Retrieve All Tables
The `BeautifulSoup` object is used to extract all tables from the page:
```python
tables = soup.find_all('table')
```
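Wikipedia articles typically contain many tables (infoboxes, standings, results grids), so it can help to see what was actually captured. The inspection loop below is illustrative and not part of the original script; whether a given table carries a `<caption>` element depends on the page:

```python
# List the tables found, using each table's <caption> (when present)
# as a hint about its contents.
print(f"{len(tables)} tables found")
for i, table in enumerate(tables):
    caption = table.find('caption')
    print(i, caption.get_text(strip=True) if caption else "(no caption)")
```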
4. Identify and Locate the Participants Table
Among the extracted tables, the script identifies the one containing the list of participants:
```python
participants_table = None
for table in tables:
    if "Participants" in str(table):
        participants_table = table
        break
```
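The substring test above matches "Participants" anywhere in the table's HTML, which could pick up the wrong table if the word appears in an unrelated cell. An alternative sketch is to match on the table caption instead; whether the target table on this particular page actually carries a caption is an assumption to verify:

```python
# Match on the <caption> text rather than the raw HTML string,
# avoiding false positives from cell contents.
participants_table = None
for table in tables:
    caption = table.find('caption')
    if caption and "Participants" in caption.get_text():
        participants_table = table
        break
```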
5. Retrieve the Table and Load into a DataFrame
The identified table is then converted into a Pandas DataFrame:
```python
from io import StringIO

# Wrapping the HTML in StringIO keeps newer pandas versions (2.1+) from
# warning that passing a raw HTML string to read_html is deprecated.
df = pd.read_html(StringIO(str(participants_table)))[0]
```
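A quick sanity check (illustrative, not part of the original script) confirms the table parsed as expected; the exact columns depend on how the Wikipedia table is structured:

```python
print(df.shape)    # (number of rows, number of columns)
print(df.head())   # first few rows as parsed from the HTML table
```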
6. Export the Table to Excel
Finally, the DataFrame is exported to an Excel file:
```python
# Writing .xlsx files requires an Excel engine; pandas uses openpyxl by default.
df.to_excel("participants_championship_2019_2020.xlsx", index=False)
```
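Note that `to_excel` needs an Excel writer engine installed; for `.xlsx` files pandas uses `openpyxl` (`pip install openpyxl`). If that dependency is unwanted, a CSV export is a dependency-free alternative (a sketch; the file name mirrors the Excel one and is this write-up's choice):

```python
# CSV export works with pandas alone, no Excel engine required.
df.to_csv("participants_championship_2019_2020.csv", index=False)
```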
This project showcases the capabilities of web scraping with Python to automate data extraction and processing tasks. It is particularly useful for collecting structured data from web pages and exporting it into a more usable format like Excel.
By following these steps, users can easily adapt the script to scrape different types of data from other web pages as well.
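As a concrete illustration of that adaptability, the whole pipeline can be folded into one reusable helper. The function below is a sketch invented for this write-up, not part of the project; its name and signature are assumptions:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

def scrape_table(url: str, keyword: str) -> pd.DataFrame:
    """Return the first table at `url` whose HTML contains `keyword`.

    The keyword match is as loose as in the original script, so choose a
    string unique to the target table.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    for table in soup.find_all('table'):
        if keyword in str(table):
            return pd.read_html(StringIO(str(table)))[0]
    raise ValueError(f"no table containing {keyword!r} found at {url}")
```

Calling `scrape_table(url, "Participants")` with the championship URL would then reproduce the DataFrame built step by step above.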