13.2 C
London
Tuesday, July 2, 2024
HomePandas in PythonInput/Output in PythonHow to Read HTML Tables with Pandas (Including Example)

How to Read HTML Tables with Pandas (Including Example)

Related stories

Learn About Opening an Automobile Repair Shop in India

Starting a car repair shop is quite a good...

Unlocking the Power: Embracing the Benefits of Tax-Free Investing

  Unlocking the Power: Embracing the Benefits of Tax-Free Investing For...

Income Splitting in Canada for 2023

  Income Splitting in Canada for 2023 The federal government’s expanded...

Can I Deduct Home Office Expenses on my Tax Return 2023?

Can I Deduct Home Office Expenses on my Tax...

Canadian Tax – Personal Tax Deadline 2022

  Canadian Tax – Personal Tax Deadline 2022 Resources and Tools...

You can use the pandas read_html() function to read HTML tables into a pandas DataFrame.

This function uses the following basic syntax:

df = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')

The following example shows how to use this function to read in a table of NBA team names from this Wikipedia page.

Example: Read HTML Table with Pandas

Before using the read_html() function, you’ll likely have to install lxml:

pip install lxml

Note: If you’re using a Jupyter notebook, you need to restart the kernel after performing this installation.

Next, we can use the read_html() function to read every HTML table on this Wikipedia page:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

#read all HTML tables from specific URL
tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')

#display total number of tables read
len(tabs)

44

We can see that a total of 44 HTML tables were found on this page.

I know that the table I’m interested in has the word “Division” in it, so I can use the match argument to only retrieve HTML tables that contain this word:

#read HTML tables from specific URL with the word "Division" in them
tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association',
                    match='Division')

#display total number of tables read
len(tabs)

1

I can then list the names of the columns of the table:

#define table
df = tabs[0]

#list all column names of table
list(df)

[('Division', 'Eastern Conference'),
 ('Team', 'Eastern Conference'),
 ('Location', 'Eastern Conference'),
 ('Arena', 'Eastern Conference'),
 ('Capacity', 'Eastern Conference'),
 ('Coordinates', 'Eastern Conference'),
 ('Founded', 'Eastern Conference'),
 ('Joined', 'Eastern Conference'),
 ('Unnamed: 8_level_0', 'Eastern Conference')]

I’m only interested in the first two columns, so I can filter the DataFrame to only contain these columns:

#filter DataFrame to only contain first two columns
df_final = df.iloc[:, 0:2]

#rename columns
df_final.columns = ['Division', 'Team']

#view first few rows of final DataFrame
print(df_final.head())

   Division                Team
0  Atlantic      Boston Celtics
1  Atlantic       Brooklyn Nets
2  Atlantic     New York Knicks
3  Atlantic  Philadelphia 76ers
4  Atlantic     Toronto Raptors

The final table contains only the ‘Division’ and ‘Team’ columns.

Additional Resources

The following tutorials explain how to read other types of files in pandas:

How to Read a Text File with Pandas
How to Read Excel Files with Pandas
How to Read CSV Files with Pandas

Subscribe

- Never miss a story with notifications

- Gain full access to our premium content

- Browse free from up to 5 devices at once

Latest stories