Preprocessing US.Wildlife.csv data

Before analysis, I will clean the data to make it look better.

Data Preprocessing

In [1]:
from pandas import DataFrame
import pandas as pd
from functools import reduce
import numpy as np
import urllib.parse
from urllib.parse import urljoin
import warnings
warnings.filterwarnings("ignore")
In [2]:
species = pd.read_csv('US.Wildlife.csv', '\t', index_col=0) # read csv file and save it in species dataframe
In [3]:
species.head(10)
Out[3]:
Scientific Name Common Name Region Family First Listed Taxonomic Group Status Type
0 Abies fraseri Fraser fir NC, VA; Pinaceae NaN Conifers and Cycads Not Listed P
1 Ablautus schlingeri Oso Flaco robber fly CA; Asilidae NaN Insects Not Listed I
2 Abronia alpina Ramshaw Meadows sand-verbena CA; Nyctaginaceae NaN Flowering Plants Not Listed P
3 Abronia ammophila Yellowstone Sand Verbena NaN Nyctaginaceae NaN Flowering Plants Not Listed P
4 Abronia ammophila var. No common name NaN Nyctaginaceae NaN Flowering Plants Not Listed P
5 Abronia ammphila [Unnamed] sand-verbena WY; Nyctaginaceae NaN Flowering Plants Not Listed P
6 Abronia bigelovii [Unnamed] sand-verbena NM; Nyctaginaceae NaN Flowering Plants Not Listed P
7 Abronia macrocarpa Large-fruited sand-verbena TX; U.S.A. (TX) Nyctaginaceae Sep 28, 1988 Flowering Plants Endangered P
8 Abronia turbinata [Unnamed] sand-verbena NV; Nyctaginaceae NaN Flowering Plants Not Listed P
9 Abronia umbellata acutalata Rose-purple sand-verbena WA; Possibly extinct,last observed in 1940 Nyctaginaceae NaN Flowering Plants Not Listed P
In [4]:
species.dtypes  #check data types
Out[4]:
Scientific Name    object
Common Name        object
Region             object
Family             object
First Listed       object
Taxonomic Group    object
Status             object
Type               object
dtype: object

There are five columns that need to be preprocessed.

At first, I will keep the U.S state codes in the Region column, other texts which are the description contents will be removed. In the First Listed column, I only use the year data and remove day and month. Then, I will change the data type of this column to integer. In the Common Name column, any species with no common name will be assigned common name as NaN and I will remove the brackets [Unnamed]. In the Family column, I will make all the family names to lowercase. Finally, in the Status column, any status listed as Original Data in Error will be changed to NaN.

Region Column

In [5]:
split_char = ";" #define a split character. 
                  #In Region column, everything before ";" is U.S state code(s). The texts after ";" will be removed
split_char1 = ","

for i in range (0,len(species)):
    st0 = species.ix[i,"Region"] #get the string data in Region column at row i
    try:
        temp = st0.split(split_char)    #try to split the string
        st1 = split_char.join(temp[:1]) #get the texts before ";" which are U.S state codes
        
    #After split the string by ";", we split it again by ","
    #If the string before commna has length = 2, then st1 will contain the state codes that we need to assign
    #Else, st1 is the description and there is no state code we need for region column, we set the column as nan
    
        temp1 = st1.split(split_char1)    #split st1 by ","
        st2 = split_char1.join(temp1[:1]) #get the text before ","
        if len(st2) == 2:
            species.ix[i,"Region"] = st1 #if length is 2 then st1 contains state codes, we assign st1 to column
        else:
            species.ix[i,"Region"] = np.nan #if not then st1 is the description, we set region as nan
    except:
        pass    #if the column is NaN then we cannot perform split, hence, pass
    
    

First Listed Column

In [6]:
split_char = "," #the year data is after the commna


for i in range (0,len(species)):
    st0 = species.ix[i,"First Listed"]
    try:
        temp = st0.split(split_char)      #try to split the string
        st1 = split_char.join(temp[1:])    #get the year after the commna
        species.ix[i,"First Listed"] = st1 #assign year to the column
    except:
        pass  #if there is no listing date then we cannot split the string so we pass and let it NaN as normal
   
In [7]:
#convert NaN to 0 and change "First Listed" to integer type
species["First Listed"] = species["First Listed"].fillna("0").astype(int) 
species["First Listed"].replace(0, np.nan, inplace=True) #reasign NaN 
pd.set_option('precision', 0) #make sure the year is shown as integer without additional decimal (.0)
In [8]:
species.dtypes #check the data types now
Out[8]:
Scientific Name     object
Common Name         object
Region              object
Family              object
First Listed       float64
Taxonomic Group     object
Status              object
Type                object
dtype: object

Common Name Column

In [9]:
split_char = "]"


for i in range (0,len(species)):
    st0 = species.ix[i,"Common Name"]
    temp = st0.split(split_char)
    st1 = split_char.join(temp[1:]) #select the texts after the bracket "]"
    if (len(st1) == 0): #if the length of texts after the bracket "]" is 0, then we do not need to change anything
        species.ix[i,"Common Name"] = st0
    else:   #else, we select the texts after the bracket "]" and the brackets before it will be removed
        species.ix[i,"Common Name"] = st1
In [10]:
species["Common Name"].replace("No common name", np.nan, inplace=True) #species without common name is assigned as NaN

Family Column

In [11]:
for i in range(0,len(species)):
    text = species.ix[i,"Family"].lower() 
    species.ix[i,"Family"] = text   #change each data in the Family column to lowercase

Status Column

In [12]:
#The column data which state Original Data in Error will have descriptions behind character "-"
#To identify those columns, I split the status data in each row by "-"
#If there is error description behind "-", we change the data in that column to NaN; else we pass

split_char = "-"

for i in range(0,len(species)):
    temp = species.ix[i,"Status"].split(split_char)
    error = st1 = split_char.join(temp[1:])
    if (len(error) != 0):  #if len is not 0, we have texts behind "-" and it is an error description
        species.ix[i,"Status"] = np.nan  #change the column value to NaN
    else: #else, nothing changed
        pass
        
             
   
In [13]:
#remove the whitespace before the Status in each column using lstrip()
for i in range(0,len(species)):
    try:
        species.ix[i,"Status"] = species.ix[i,"Status"].lstrip()
    except:
        pass
In [14]:
species.head(20) #check our dataframe before saving it as CSV file
Out[14]:
Scientific Name Common Name Region Family First Listed Taxonomic Group Status Type
0 Abies fraseri Fraser fir NC, VA pinaceae NaN Conifers and Cycads Not Listed P
1 Ablautus schlingeri Oso Flaco robber fly CA asilidae NaN Insects Not Listed I
2 Abronia alpina Ramshaw Meadows sand-verbena CA nyctaginaceae NaN Flowering Plants Not Listed P
3 Abronia ammophila Yellowstone Sand Verbena NaN nyctaginaceae NaN Flowering Plants Not Listed P
4 Abronia ammophila var. NaN NaN nyctaginaceae NaN Flowering Plants Not Listed P
5 Abronia ammphila sand-verbena WY nyctaginaceae NaN Flowering Plants Not Listed P
6 Abronia bigelovii sand-verbena NM nyctaginaceae NaN Flowering Plants Not Listed P
7 Abronia macrocarpa Large-fruited sand-verbena TX nyctaginaceae 1988 Flowering Plants Endangered P
8 Abronia turbinata sand-verbena NV nyctaginaceae NaN Flowering Plants Not Listed P
9 Abronia umbellata acutalata Rose-purple sand-verbena WA nyctaginaceae NaN Flowering Plants Not Listed P
10 Abronia umbellata breviflora Pink sand-verbena CA, OR nyctaginaceae NaN Flowering Plants Not Listed P
11 Abutilon eremitopetalum NaN HI malvaceae 1991 Flowering Plants Endangered P
12 Abutilon hulseanum Terciopelo PR, VI malvaceae NaN Flowering Plants Not Listed P
13 Abutilon menziesii Ko`oloa`ula HI malvaceae 1986 Flowering Plants Endangered P
14 Abutilon parishii NaN AZ malvaceae NaN Flowering Plants Not Listed P
15 Abutilon sandwicense NaN HI malvaceae 1991 Flowering Plants Endangered P
16 Abutilon virginianum NaN PR, VI malvaceae NaN Flowering Plants Not Listed P
17 Acacia koaia Koai`e HI fabaceae NaN Flowering Plants Not Listed P
18 Acaena exigua Liliwai HI rosaceae 1992 Flowering Plants Endangered P
19 Acalyptera susanae lace bug AR tingidae NaN Insects Not Listed I
In [15]:
species.to_csv("CleanWildlife.csv", sep='\t') # save cleaned data in a CSV file