My topic for the final project in DATS6103 is data mining on U.S. wildlife.
The project contains three main parts: scraping, data preprocessing, and data analysis.
In this section, I perform the scraping process. The website I selected to extract data from is https://www.fws.gov/ (U.S. Fish and Wildlife Service).
# import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
The link to the species search page is: https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report-input
After we select query options on the search page, we are taken to this base URL: https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report
No matter which query options we choose, the result is the page above. Therefore, when we request this URL directly, the only information we can obtain is the scientific name and the common name. To retrieve other attributes such as location, status, group, family, and date listed, we need to figure out another way.
After exploring the search page, I found that we can switch each attribute we want to retrieve "on" by appending its parameter to the base URL. The resulting page then includes the information we need in its table.
base_url = "https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report"
listUSA = "&mapstatus=1"  # restrict to species in the United States
status = "?fstatus=on"  # show listing status
group = "&fgroup=on"  # show taxonomic group
family = "&ffamily=on"  # show family
region = "&fcurrdist=on"  # show location in the U.S.
date = "&flistingdate=on"  # show first listed date
grouptype = "&fvip=on"  # show group type
# add all filter options to the base url; status goes first because it carries the "?"
url = base_url + status + group + family + region + date + grouptype + listUSA
url  # the new url that returns all the information we need for analysis
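As an aside, concatenating query strings by hand is easy to get wrong (a misplaced "?" or "&" silently breaks the request). The same URL can be assembled from a dict using the standard library; a minimal sketch of that alternative (params and url_alt are illustrative names):
from urllib.parse import urlencode

# Build the same query string from a dict of parameter names to values.
params = {"fstatus": "on", "fgroup": "on", "ffamily": "on", "fcurrdist": "on",
          "flistingdate": "on", "fvip": "on", "mapstatus": "1"}
url_alt = base_url + "?" + urlencode(params)  # equivalent to the hand-built url above
requests can also take the dict directly via requests.get(base_url, params=params), skipping manual URL assembly altogether.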
In the next part, we scrape the table on the result page and save it to a dataframe.
The table contains 8 columns corresponding to 8 attributes: Scientific Name, Common Name, Region/State, Family, First Listed Date, Taxonomic Group, Listing Status, and Group Type.
html = requests.get(url)
html
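Before parsing, it is worth confirming that the request actually succeeded; a minimal check using requests' built-in helper:
html.raise_for_status()  # raises an exception if the server returned a 4xx/5xx status
print(html.status_code)  # should print 200 on success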
soup = BeautifulSoup(html.content, 'lxml')
table = soup.find("table") #find the table
rows = table.find_all("tr")  # find all the rows in the table; each data row represents one species
# create an empty list to hold one dataframe per species
species = []
for row in rows:
    temp = []
    cells = row.find_all("td")  # find all the cells in this row; a data row has 8 cells
    if len(cells) == 8:  # only process rows with exactly 8 cells
        # this skips the header row (which uses <th> tags) and any malformed rows
        temp.append([item.text for item in cells])  # save the cell texts as a single list
        # temp holds one item, itself a list of 8 strings, so the dataframe built from it
        # has 8 columns and 1 row
        species.append(pd.DataFrame(temp))  # convert temp to a dataframe and store it
# after the for-loop, species is a list of many one-row dataframes
species[:5]  # let's have a look at the first few elements of the list
species[26]  # an arbitrary element of the list: it is a one-row dataframe
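As a side note, building thousands of one-row dataframes and concatenating them is comparatively slow; collecting the rows as plain lists and constructing a single dataframe at the end gives the same result. A minimal sketch of that variant (records and df_alt are illustrative names):
# Collect each valid row's cell texts as a plain list of 8 strings.
records = [[cell.text for cell in row.find_all("td")]
           for row in rows if len(row.find_all("td")) == 8]
df_alt = pd.DataFrame(records)  # one dataframe built in a single step, equivalent to df below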
df = pd.concat(species)  # combine all the dataframes in the list into one dataframe called df
df.head(10)  # let's see our dataframe df
# df has 8732 rows representing 8732 species, the same count shown on the result page
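Before dropping duplicates, we can check how many exact duplicate rows the page produced; a quick sanity check:
print(df.duplicated().sum())  # number of rows that are exact copies of an earlier row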
df.drop_duplicates(inplace = True) #the result page still contains some duplicates and we need to remove them
colnames = ["Scientific Name", "Common Name", "Region", "Family", "First Listed",
            "Taxonomic Group", "Status", "Type"]  # the column names we will use
df.columns = colnames  # name our columns
df.reset_index(drop=True, inplace=True)  # rebuild the index as 0..n-1 after dropping duplicates
df.head(10)
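For reference, pandas can also parse HTML tables directly, which would condense the loop above into a single call; a sketch under the assumption that the species table is the first <table> on the page (df_quick is an illustrative name):
from io import StringIO

tables = pd.read_html(StringIO(html.text))  # parse every <table> on the page into a dataframe
df_quick = tables[0]  # assumes the species table comes first; read_html also picks up the header row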
In the final step, I save my scraped data to a CSV file.
df.to_csv("US.Wildlife.csv", sep='\t')  # save it to a file; note sep='\t' makes the values tab-separated despite the .csv extension
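To confirm the file round-trips cleanly, it can be read back with the same separator; the saved index comes back as the first column (check is an illustrative name):
check = pd.read_csv("US.Wildlife.csv", sep='\t', index_col=0)
print(check.shape)  # should match df.shape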