United States Wildlife

My topic for the final project in DATS6103 is data mining on U.S. wildlife.

The project consists of three main parts: scraping, data preprocessing, and data analysis.

In this section, I perform the scraping step. The website I selected to extract data from is https://www.fws.gov/ (U.S. Fish and Wildlife Service).

In [1]:
#import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

Scrape the data from https://www.fws.gov/

The link to search species is: https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report-input

After we select query options on the search page, we are taken to this base URL: https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report

No matter which query options we choose, the results are served from the URL above. When we request that URL as-is, the only information we obtain is the scientific name and the common name. To retrieve other attributes such as location, status, group, family, and date listed, we need another approach.

After exploring the search page, I found that we can set each attribute we want to retrieve to "on" as a query parameter and append it to the base URL. The resulting page then includes that information in its table.

In [2]:
base_url ="https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report"

listUSA = "&mapstatus=1"  #choose species in United States
status = "?fstatus=on"    #show status
group = "&fgroup=on"      #show taxonomic group
family = "&ffamily=on"    #show family
region = "&fcurrdist=on"  #show location in U.S
date = "&flistingdate=on" #show first date listed
grouptype = "&fvip=on"    #show group type


#add all filter options to the base url
url = base_url + status + group + family + region + date + grouptype + listUSA 
In [3]:
url #here is the new url that contains all information we need for analysis
Out[3]:
'https://ecos.fws.gov/ecp0/reports/ad-hoc-species-report?fstatus=on&fgroup=on&ffamily=on&fcurrdist=on&flistingdate=on&fvip=on&mapstatus=1'
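As a side note, the same URL can be assembled with urllib.parse.urlencode, which handles the "?" and "&" separators automatically. This is only a sketch for comparison (url_alt is an illustrative name); the hand-built url above is what the rest of the notebook uses.

from urllib.parse import urlencode

params = {
    "fstatus": "on",       # show status
    "fgroup": "on",        # show taxonomic group
    "ffamily": "on",       # show family
    "fcurrdist": "on",     # show location in U.S.
    "flistingdate": "on",  # show first date listed
    "fvip": "on",          # show group type
    "mapstatus": "1",      # species in the United States
}

url_alt = base_url + "?" + urlencode(params)
url_alt == url  # True on Python 3.7+, where dicts keep insertion order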

Scrape the table on the result page

In the next steps, we scrape the table on the result page and save it to a dataframe.

The table will contain 8 columns corresponding to 8 attributes: Scientific Name, Common Name, Region/State, Family, First Listed Date, Taxonomic Group, Listing Status, Group Type.
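As an aside, pandas can parse HTML tables in a single call. The sketch below assumes lxml (or html5lib) is installed and that the report page serves a plain <table>; I keep the step-by-step BeautifulSoup approach below as the main path.

tables = pd.read_html(url)   # returns a list of DataFrames, one per <table> on the page
len(tables), tables[0].shape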

In [4]:
html = requests.get(url) 
In [5]:
html
Out[5]:
<Response [200]>
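Before parsing, it is worth failing fast if the request did not succeed (for example, if the report URL changes). A minimal check:

html.raise_for_status()   # raises requests.HTTPError on a 4xx/5xx response
html.status_code          # 200 means the page was returned successfully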
In [6]:
soup = BeautifulSoup(html.content, 'lxml')
In [7]:
table = soup.find("table") #find the table
In [8]:
rows = table.find_all("tr") #find all the rows in the table. Each row represents a species
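As a quick sanity check, we can print the header row to confirm the column order assumed above. This sketch assumes the first row is the header and that it uses <th> cells, which is also why it gets skipped by the 8-cell test in the next step.

header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
header  # expected to match the 8 attributes listed above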
In [9]:
#create empty list to store dataframe as an element  
species = []

for row in rows:
    temp = []
    cells = row.find_all("td") #find all columns in each row. We will have 8 columns for each row
    
    if len(cells) == 8: #only process rows with exactly 8 data cells
                        #this skips the header row (which uses <th> cells) and guards against any malformed rows

        temp.append([item.text for item in cells]) #save the cell texts into a list called temp

    #temp holds at most one item, and that item is itself a list with one element per column
    #when we convert temp to a dataframe below, the dataframe has 8 columns and 1 row (or no rows, for skipped rows)

    species.append(pd.DataFrame(temp)) #convert temp to a one-row dataframe
                                       #and append it to the list species
                                       #after the loop, species is a list of small dataframes
In [10]:
species[:5] #let's have a look at the first few elements of the list species
Out[10]:
[Empty DataFrame
 Columns: []
 Index: [],
                0           1        2         3   4                    5  \
 0  Abies fraseri  Fraser fir  NC, VA;  Pinaceae  NA  Conifers and Cycads   
 
             6  7  
 0  Not Listed  P  ,
                      0                     1    2         3   4        5  \
 0  Ablautus schlingeri  Oso Flaco robber fly  CA;  Asilidae  NA  Insects   
 
             6  7  
 0  Not Listed  I  ,
                 0                             1    2              3   4  \
 0  Abronia alpina  Ramshaw Meadows sand-verbena  CA;  Nyctaginaceae  NA   
 
                   5           6  7  
 0  Flowering Plants  Not Listed  P  ,
                    0                         1 2              3   4  \
 0  Abronia ammophila  Yellowstone Sand Verbena    Nyctaginaceae  NA   
 
                   5           6  7  
 0  Flowering Plants  Not Listed  P  ]
In [11]:
species[26] #here is a random element in the list and it is a dataframe
Out[11]:
0 1 2 3 4 5 6 7
0 Accipiter cooperii Cooper's hawk CA; Accipitridae NA Birds Not Listed V

Save the scraped data into a dataframe

In [12]:
df = pd.concat(species) #combine all the dataframes in the list into a dataframe called df
In [13]:
df.head(10)   #let's see our dataframe df
              #df has 8732 rows, one per species, which matches the count shown on the result page
Out[13]:
0 1 2 3 4 5 6 7
0 Abies fraseri Fraser fir NC, VA; Pinaceae NA Conifers and Cycads Not Listed P
0 Ablautus schlingeri Oso Flaco robber fly CA; Asilidae NA Insects Not Listed I
0 Abronia alpina Ramshaw Meadows sand-verbena CA; Nyctaginaceae NA Flowering Plants Not Listed P
0 Abronia ammophila Yellowstone Sand Verbena Nyctaginaceae NA Flowering Plants Not Listed P
0 Abronia ammophila var. No common name Nyctaginaceae NA Flowering Plants Not Listed P
0 Abronia ammphila [Unnamed] sand-verbena WY; Nyctaginaceae NA Flowering Plants Not Listed P
0 Abronia bigelovii [Unnamed] sand-verbena NM; Nyctaginaceae NA Flowering Plants Not Listed P
0 Abronia macrocarpa Large-fruited sand-verbena TX; U.S.A. (TX) Nyctaginaceae Sep 28, 1988 Flowering Plants Endangered P
0 Abronia turbinata [Unnamed] sand-verbena NV; Nyctaginaceae NA Flowering Plants Not Listed P
0 Abronia umbellata acutalata Rose-purple sand-verbena WA; Possibly extinct,last observed in 1940 Nyctaginaceae NA Flowering Plants Not Listed P
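A quick shape check confirms the raw row count before removing duplicates (the exact number will drift as the FWS database is updated):

df.shape   # (8732, 8) at the time of scraping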
In [14]:
df.drop_duplicates(inplace = True) #the result page still contains some duplicates and we need to remove them

colnames = ["Scientific Name","Common Name","Region","Family","First Listed",
            "Taxonomic Group","Status","Type"] #a list of column names that we will use

df.columns = colnames  #name our columns
df.index = range(len(df.index))  #reindex
In [15]:
df.head(10)
Out[15]:
Scientific Name Common Name Region Family First Listed Taxonomic Group Status Type
0 Abies fraseri Fraser fir NC, VA; Pinaceae NA Conifers and Cycads Not Listed P
1 Ablautus schlingeri Oso Flaco robber fly CA; Asilidae NA Insects Not Listed I
2 Abronia alpina Ramshaw Meadows sand-verbena CA; Nyctaginaceae NA Flowering Plants Not Listed P
3 Abronia ammophila Yellowstone Sand Verbena Nyctaginaceae NA Flowering Plants Not Listed P
4 Abronia ammophila var. No common name Nyctaginaceae NA Flowering Plants Not Listed P
5 Abronia ammphila [Unnamed] sand-verbena WY; Nyctaginaceae NA Flowering Plants Not Listed P
6 Abronia bigelovii [Unnamed] sand-verbena NM; Nyctaginaceae NA Flowering Plants Not Listed P
7 Abronia macrocarpa Large-fruited sand-verbena TX; U.S.A. (TX) Nyctaginaceae Sep 28, 1988 Flowering Plants Endangered P
8 Abronia turbinata [Unnamed] sand-verbena NV; Nyctaginaceae NA Flowering Plants Not Listed P
9 Abronia umbellata acutalata Rose-purple sand-verbena WA; Possibly extinct,last observed in 1940 Nyctaginaceae NA Flowering Plants Not Listed P

Save the data into a CSV file

In the final step, I save the scraped data to a file named US.Wildlife.csv. Note that it is written with a tab separator, so despite the .csv extension the file is tab-delimited.

In [16]:
df.to_csv("US.Wildlife.csv", sep='\t') #write the dataframe to US.Wildlife.csv (tab-separated)
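To load the file back later (for example, in the preprocessing part), the same tab separator has to be passed to read_csv. A minimal sketch (df_check is an illustrative name):

df_check = pd.read_csv("US.Wildlife.csv", sep='\t', index_col=0)
df_check.head()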