My topic for the final project in DATS6103 is data mining on U.S. wildlife.
The project contains three main parts: scraping, data preprocessing, and data analysis.
In this section, I perform the scraping process. The website I selected to extract data from is https://www.fws.gov/ (U.S. Fish and Wildlife Service).
# import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
The link to the species search page is: https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report-input
After we select query options on the search page, we are taken to this base URL: https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report
No matter which query options we choose, the result is the page above. Therefore, when we request this URL directly, the only information we can obtain is the scientific name and the common name. To retrieve other attributes such as location, status, group, family, and date listed, we need to figure out another way.
After exploring the search page, I found that we can switch each attribute we want to retrieve "on" by appending its parameter to the base URL. The resulting page then includes the information we need in its table.
base_url = "https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report"
listUSA = "&mapstatus=1"  # restrict to species in the United States
status = "?fstatus=on"  # show listing status
group = "&fgroup=on"  # show taxonomic group
family = "&ffamily=on"  # show family
region = "&fcurrdist=on"  # show location in the U.S.
date = "&flistingdate=on"  # show first listed date
grouptype = "&fvip=on"  # show group type
# add all filter options to the base url; status goes first because it carries the "?"
url = base_url + status + group + family + region + date + grouptype + listUSA
url  # the new url that returns all the information we need for analysis
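As an aside, concatenating query strings by hand is easy to get wrong (a misplaced "?" or "&" silently breaks the request). The same URL can be assembled from a dict using the standard library; a minimal sketch of that alternative (params and url_alt are illustrative names):
from urllib.parse import urlencode

# Build the same query string from a dict of parameter names to values.
params = {"fstatus": "on", "fgroup": "on", "ffamily": "on", "fcurrdist": "on",
          "flistingdate": "on", "fvip": "on", "mapstatus": "1"}
url_alt = base_url + "?" + urlencode(params)  # equivalent to the hand-built url above
requests can also take the dict directly via requests.get(base_url, params=params), skipping manual URL assembly altogether.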
In the next part, we scrape the table on the result page and save it to a dataframe.
The table contains 8 columns corresponding to 8 attributes: Scientific Name, Common Name, Region/State, Family, First Listed Date, Taxonomic Group, Listing Status, and Group Type.
html = requests.get(url)
html
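Before parsing, it is worth confirming that the request actually succeeded; a minimal check using requests' built-in helper:
html.raise_for_status()  # raises an exception if the server returned a 4xx/5xx status
print(html.status_code)  # should print 200 on success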
soup = BeautifulSoup(html.content, 'lxml')
table = soup.find("table") #find the table
rows = table.find_all("tr")  # find all the rows in the table; each data row represents one species
# create an empty list to hold one dataframe per species
species = []
for row in rows:
    temp = []
    cells = row.find_all("td")  # find all the cells in this row; a data row has 8 cells
    if len(cells) == 8:  # only process rows with exactly 8 cells
        # this skips the header row (which uses <th> tags) and any malformed rows
        temp.append([item.text for item in cells])  # save the cell texts as a single list
        # temp holds one item, itself a list of 8 strings, so the dataframe built from it
        # has 8 columns and 1 row
        species.append(pd.DataFrame(temp))  # convert temp to a dataframe and store it
# after the for-loop, species is a list of many one-row dataframes
species[:5]  # let's have a look at the first few elements of the list
species[26]  # an arbitrary element of the list: it is a one-row dataframe
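As a side note, building thousands of one-row dataframes and concatenating them is comparatively slow; collecting the rows as plain lists and constructing a single dataframe at the end gives the same result. A minimal sketch of that variant (records and df_alt are illustrative names):
# Collect each valid row's cell texts as a plain list of 8 strings.
records = [[cell.text for cell in row.find_all("td")]
           for row in rows if len(row.find_all("td")) == 8]
df_alt = pd.DataFrame(records)  # one dataframe built in a single step, equivalent to df below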
df = pd.concat(species)  # combine all the dataframes in the list into one dataframe called df
df.head(10)  # let's see our dataframe df
# df has 8732 rows representing 8732 species, the same count shown on the result page
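Before dropping duplicates, we can check how many exact duplicate rows the page produced; a quick sanity check:
print(df.duplicated().sum())  # number of rows that are exact copies of an earlier row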
df.drop_duplicates(inplace = True) #the result page still contains some duplicates and we need to remove them
colnames = ["Scientific Name", "Common Name", "Region", "Family", "First Listed",
            "Taxonomic Group", "Status", "Type"]  # the column names we will use
df.columns = colnames  # name our columns
df.reset_index(drop=True, inplace=True)  # rebuild the index as 0..n-1 after dropping duplicates
df.head(10)
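For reference, pandas can also parse HTML tables directly, which would condense the loop above into a single call; a sketch under the assumption that the species table is the first <table> on the page (df_quick is an illustrative name):
from io import StringIO

tables = pd.read_html(StringIO(html.text))  # parse every <table> on the page into a dataframe
df_quick = tables[0]  # assumes the species table comes first; read_html also picks up the header row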
In the final step, I save my scraped data to a CSV file.
df.to_csv("US.Wildlife.csv", sep='\t')  # save it to a file; note sep='\t' makes the values tab-separated despite the .csv extension
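To confirm the file round-trips cleanly, it can be read back with the same separator; the saved index comes back as the first column (check is an illustrative name):
check = pd.read_csv("US.Wildlife.csv", sep='\t', index_col=0)
print(check.shape)  # should match df.shape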