Pandas

Pandas

  1. Pandas is for fast analysis and data cleaning and preparation.
  2. Built on top of Numpy.
  3. Work with data from a wide variety of sources.

1. Pandas Series

Just like dictionary of Python:

import pandas as pd

pd.Series(data = mylist, index = labelList)
pd.Series(myDict)

2. Pandas DataFrames

df = pd.DataFrame(randn(5,4),index=['A','B','C','D','E'],columns=['W','X','Y','Z'])
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!

import pandas as pd
import numpy as np

from numpy.random import randn
np.random.seed(101)

df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

Selection and Indexing

Let’s learn the various methods to grab data from a DataFrame

Get Column by Index

df.columns[0:3]
>> Index(['W', 'X', 'Y'], dtype='object')

Get Column by Column Name

df['W']
# SQL Syntax (NOT RECOMMENDED!)
df.W
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
# Pass a list of column names
df[['W','Z']]
W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509
type(df['W'])
>> pandas.core.series.Series

Creating a new column:

df['new'] = df['W'] + df['Y']

W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762

Insert A New Column

X_train.insert(0, "x0", [0] * (len(y_train)), False)

Removing Columns

Features of drop function:

  1. drop method’s default axis = 0 ( = the labels), so if we want to delete a column, we shall set axis = 1
    Beacuse Pandas is built on Numpy, so it use Numpy to save the data:
    for df.shape = (5,4), we have df.shape[axis = 1] = Column.
  2. the drop function isn’t inplace df. If we want to do inplace, we need:
    df.drop(‘new’,axis=1,inplace=True)
df.drop('new',axis=1)

W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df.drop('new',axis=1,inplace=True)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

Can also drop rows this way:

df.drop('E',axis=0)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057

Selecting Rows

df.loc['A']
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select based off of position instead of label: iloc = Index Location (0 Base)

df.iloc[2]
W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Selecting subset of rows and columns

df.loc['B','Y']
>> -0.84807698340363147
df.loc[['A','B'],['W','Y']]
W Y
A 2.706850 0.907969
B 0.651118 -0.848077

Conditional Selection

The conditional selection usage is similar to Numpy.

cond = df > 0

  1. for all data, if data > 0 then data = data, or data = NaN
  2. ‘condition’ can be taken as filter
df>0

W X Y Z new
A True True True True True
B True False False True False
C False True True False False
D True False False True False
E True True True True True
booldf = df > 0
df[booldf]
#or
df[df>0]

W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 NaN NaN 0.605965 NaN
C NaN 0.740122 0.528813 NaN NaN
D 0.188695 NaN NaN 0.955057 NaN
E 0.190794 1.978757 2.605967 0.683509 2.796762
  1. Select data after droped data of !cond(W < 0)
    df[df['W']>0]

W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
  1. Select the Y, X columns after droped data of !cond(W < 0)
df[df['W']>0][['Y','X']]

Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757
  1. For two conditions you can use | and & with parenthesis:
    The conditions can be seen as
    1. condition table 1 | condition table 2 
    2. condition table 1 & condition table 2 
#Select labels 
df[(df['W']>0) & (df['Y'] > 1)]

W X Y Z new
E 0.190794 1.978757 2.605967 0.683509 2.796762

Index

Reset Index

# Reset to default 0,1...n index
# Note : 1. the current index will be listed in a new column
# : 2. if we want to do it inplace, use df.reset_index(inplace = True)
df.reset_index()

index W X Y Z new
0 A 2.706850 0.628133 0.907969 0.503826 3.614819
1 B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
2 C -2.018168 0.740122 0.528813 -0.589001 -1.489355
3 D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
4 E 0.190794 1.978757 2.605967 0.683509 2.796762

Set New Index by A Column in Table

newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df.set_index('States',inplace=True)

W X Y Z new
States
CA 2.706850 0.628133 0.907969 0.503826 3.614819
NY 0.651118 -0.319318 -0.848077 0.605965 -0.196959
WY -2.018168 0.740122 0.528813 -0.589001 -1.489355
OR 0.188695 -0.758872 -0.933237 0.955057 -0.744542
CO 0.190794 1.978757 2.605967 0.683509 2.796762