- author: The PyCoach
Scraping Data with Polars
Polars is a Python library that is gaining popularity in the data analysis and manipulation community. It is often a faster alternative to Pandas, and it can be used to scrape websites and extract tables from HTML pages. Although Polars, unlike Pandas, does not currently have a read_html function, there is a straightforward workaround to turn tables from HTML pages into Polars data frames. In this article, we will cover how to scrape CSV files and HTML pages with Polars, including how to work with data extracted with Polars in the spreadsheet tool Quadstat.
Scrape CSV Files with Polars
Scraping CSV files from a website with Polars is simple. First, import Polars in your Jupyter notebook or Python script. If Polars is not installed, use the command pip install polars. To scrape a specific CSV file, use the pl.read_csv function and provide the link to the file, as you would with Pandas.
    import polars as pl

    link = "https://www.example.com/file.csv"
    df = pl.read_csv(link)
    print(df)
This gives you a Polars data frame which, unlike a Pandas data frame, has no index. As you can see, you can extract data from a website without first downloading the file and then reading it in.
Scrape HTML Pages with Polars
Although Polars does not have a read_html function, there is a workaround that you can use to turn tables from HTML pages into Polars data frames. First, install the libraries pandas, lxml, and pyarrow. Then import pandas and use the pd.read_html function to extract the tables from the HTML page.
    import pandas as pd

    link = "https://en.wikipedia.org/wiki/The_Simpsons"
    pandas_list = pd.read_html(link)
    print(pandas_list)
This will return a list of Pandas data frames, one for each table found on the HTML page. To convert these Pandas data frames into Polars data frames, use the pl.from_pandas function.
    import pandas as pd
    import polars as pl

    link = "https://en.wikipedia.org/wiki/The_Simpsons"
    pandas_list = pd.read_html(link)
    polars_list = [pl.from_pandas(pandas_df) for pandas_df in pandas_list]
This code will give you a list of Polars data frames equivalent to the original Pandas data frames.
Work with Polars data in Quadstat
Once you have extracted data with Polars, you can work with that data in the spreadsheet tool, Quadstat. Quadstat is a powerful tool that combines the familiarity of a spreadsheet and the power of code.
To try everything Quadstat has to offer for free, visit quadstat.com or click on the link in the description of this article.
Conclusion
Scraping data with Polars is straightforward. With pl.read_csv, you can extract data from a website without first downloading the file and then reading it in. Although Polars does not currently have a read_html function, you can extract tables from HTML pages with pd.read_html and turn them into Polars data frames with pl.from_pandas. If you're looking for a powerful tool to work with data extracted with Polars, Quadstat is an excellent option.