Pandas

Pandas is for fast analysis and data cleaning and preparation.
Built on top of Numpy.
Work with data from a wide variety of sources.

1. Pandas Series

Just like dictionary of Python:

import pandas as pd

pd.Series(data = mylist, index = labelList)
pd.Series(myDict)

2. Pandas DataFrames

df = pd.DataFrame(randn(5,4),index=['A','B','C','D','E'],columns=['W','X','Y','Z'])
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!

import pandas as pd
import numpy as np

from numpy.random import randn
np.random.seed(101)

df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

	W	X	Y	Z
A	2.706850	0.628133	0.907969	0.503826
B	0.651118	-0.319318	-0.848077	0.605965
C	-2.018168	0.740122	0.528813	-0.589001
D	0.188695	-0.758872	-0.933237	0.955057
E	0.190794	1.978757	2.605967	0.683509

Selection and Indexing

Let’s learn the various methods to grab data from a DataFrame

Get Column by Index

df.columns[0:3]
>> Index(['W', 'X', 'Y'], dtype='object')

Get Column by Column Name

df['W']
# SQL Syntax (NOT RECOMMENDED!)
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

# Pass a list of column names
df[['W','Z']]

	W	Z
A	2.706850	0.503826
B	0.651118	0.605965
C	-2.018168	-0.589001
D	0.188695	0.955057
E	0.190794	0.683509

type(df['W'])
>> pandas.core.series.Series

Creating a new column:

df['new'] = df['W'] + df['Y']

	W	X	Y	Z	new
A	2.706850	0.628133	0.907969	0.503826	3.614819
B	0.651118	-0.319318	-0.848077	0.605965	-0.196959
C	-2.018168	0.740122	0.528813	-0.589001	-1.489355
D	0.188695	-0.758872	-0.933237	0.955057	-0.744542
E	0.190794	1.978757	2.605967	0.683509	2.796762

Insert A New Column

X_train.insert(0, "x0", [0] * (len(y_train)), False)

Removing Columns

Features of drop function:

drop method’s default axis = 0 ( = the labels), so if we want to delete a column, we shall set axis = 1
Beacuse Pandas is built on Numpy, so it use Numpy to save the data:
for df.shape = (5,4), we have df.shape[axis = 1] = Column.
the drop function isn’t inplace df. If we want to do inplace, we need:
df.drop(‘new’,axis=1,inplace=True)

df.drop('new',axis=1)

	W	X	Y	Z
A	2.706850	0.628133	0.907969	0.503826
B	0.651118	-0.319318	-0.848077	0.605965
C	-2.018168	0.740122	0.528813	-0.589001
D	0.188695	-0.758872	-0.933237	0.955057
E	0.190794	1.978757	2.605967	0.683509

df.drop('new',axis=1,inplace=True)

	W	X	Y	Z
A	2.706850	0.628133	0.907969	0.503826
B	0.651118	-0.319318	-0.848077	0.605965
C	-2.018168	0.740122	0.528813	-0.589001
D	0.188695	-0.758872	-0.933237	0.955057
E	0.190794	1.978757	2.605967	0.683509

Can also drop rows this way:

df.drop('E',axis=0)

	W	X	Y	Z
A	2.706850	0.628133	0.907969	0.503826
B	0.651118	-0.319318	-0.848077	0.605965
C	-2.018168	0.740122	0.528813	-0.589001
D	0.188695	-0.758872	-0.933237	0.955057

Selecting Rows

df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select based off of position instead of label: iloc = Index Location (0 Base)

df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Selecting subset of rows and columns

df.loc['B','Y']
>> -0.84807698340363147
df.loc[['A','B'],['W','Y']]

	W	Y
A	2.706850	0.907969
B	0.651118	-0.848077

Conditional Selection

The conditional selection usage is similar to Numpy.

cond = df > 0

for all data, if data > 0 then data = data, or data = NaN
‘condition’ can be taken as filter

df>0

	W	X	Y	Z	new
A	True	True	True	True	True
B	True	False	False	True	False
C	False	True	True	False	False
D	True	False	False	True	False
E	True	True	True	True	True

booldf = df > 0
df[booldf]
#or
df[df>0]

	W	X	Y	Z	new
A	2.706850	0.628133	0.907969	0.503826	3.614819
B	0.651118	NaN	NaN	0.605965	NaN
C	NaN	0.740122	0.528813	NaN	NaN
D	0.188695	NaN	NaN	0.955057	NaN
E	0.190794	1.978757	2.605967	0.683509	2.796762

Select data after droped data of !cond(W < 0)
df[df['W']>0]

	W	X	Y	Z	new
A	2.706850	0.628133	0.907969	0.503826	3.614819
B	0.651118	-0.319318	-0.848077	0.605965	-0.196959
D	0.188695	-0.758872	-0.933237	0.955057	-0.744542
E	0.190794	1.978757	2.605967	0.683509	2.796762

Select the Y, X columns after droped data of !cond(W < 0)

df[df['W']>0][['Y','X']]

	Y	X
A	0.907969	0.628133
B	-0.848077	-0.319318
D	-0.933237	-0.758872
E	2.605967	1.978757

For two conditions you can use | and & with parenthesis:
The conditions can be seen as
```
1. condition table 1 | condition table 2 
2. condition table 1 & condition table 2 
```

#Select labels 
df[(df['W']>0) & (df['Y'] > 1)]

	W	X	Y	Z	new
E	0.190794	1.978757	2.605967	0.683509	2.796762

Index

Reset Index

# Reset to default 0,1...n index
# Note : 1. the current index will be listed in a new column
#      : 2. if we want to do it inplace, use df.reset_index(inplace = True)
df.reset_index()

	index	W	X	Y	Z	new
0	A	2.706850	0.628133	0.907969	0.503826	3.614819
1	B	0.651118	-0.319318	-0.848077	0.605965	-0.196959
2	C	-2.018168	0.740122	0.528813	-0.589001	-1.489355
3	D	0.188695	-0.758872	-0.933237	0.955057	-0.744542
4	E	0.190794	1.978757	2.605967	0.683509	2.796762

Set New Index by A Column in Table

newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df.set_index('States',inplace=True)

	W	X	Y	Z	new
States
CA	2.706850	0.628133	0.907969	0.503826	3.614819
NY	0.651118	-0.319318	-0.848077	0.605965	-0.196959
WY	-2.018168	0.740122	0.528813	-0.589001	-1.489355
OR	0.188695	-0.758872	-0.933237	0.955057	-0.744542
CO	0.190794	1.978757	2.605967	0.683509	2.796762