Pandas
- Pandas is for fast analysis and data cleaning and preparation.
- Built on top of Numpy.
- Work with data from a wide variety of sources.
1. Pandas Series
Just like dictionary of Python:
import pandas as pd |
2. Pandas DataFrames
df = pd.DataFrame(randn(5,4),index=['A','B','C','D','E'],columns=['W','X','Y','Z']) |
DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!
import pandas as pd |
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Selection and Indexing
Let’s learn the various methods to grab data from a DataFrame
Get Column by Index
df.columns[0:3] |
Get Column by Column Name
df['W'] |
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
# Pass a list of column names |
W | Z | |
---|---|---|
A | 2.706850 | 0.503826 |
B | 0.651118 | 0.605965 |
C | -2.018168 | -0.589001 |
D | 0.188695 | 0.955057 |
E | 0.190794 | 0.683509 |
type(df['W']) |
Creating a new column:
df['new'] = df['W'] + df['Y'] |
W | X | Y | Z | new | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
Insert A New Column
X_train.insert(0, "x0", [0] * (len(y_train)), False) |
Removing Columns
Features of drop function:
- drop method’s default axis = 0 ( = the labels), so if we want to delete a column, we shall set axis = 1
Beacuse Pandas is built on Numpy, so it use Numpy to save the data:
for df.shape = (5,4), we have df.shape[axis = 1] = Column. - the drop function isn’t inplace df. If we want to do inplace, we need:
df.drop(‘new’,axis=1,inplace=True)
df.drop('new',axis=1) |
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df.drop('new',axis=1,inplace=True) |
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Can also drop rows this way:
df.drop('E',axis=0) |
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
Selecting Rows
df.loc['A'] |
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
Or select based off of position instead of label: iloc = Index Location (0 Base)
df.iloc[2] |
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
Selecting subset of rows and columns
df.loc['B','Y'] |
W | Y | |
---|---|---|
A | 2.706850 | 0.907969 |
B | 0.651118 | -0.848077 |
Conditional Selection
The conditional selection usage is similar to Numpy.
cond = df > 0
- for all data, if data > 0 then data = data, or data = NaN
- ‘condition’ can be taken as filter
df>0 |
W | X | Y | Z | new | |
---|---|---|---|---|---|
A | True | True | True | True | True |
B | True | False | False | True | False |
C | False | True | True | False | False |
D | True | False | False | True | False |
E | True | True | True | True | True |
booldf = df > 0 |
W | X | Y | Z | new | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | NaN | NaN | 0.605965 | NaN |
C | NaN | 0.740122 | 0.528813 | NaN | NaN |
D | 0.188695 | NaN | NaN | 0.955057 | NaN |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
- Select data after droped data of !cond(W < 0)
df[df['W']>0]
W | X | Y | Z | new | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
- Select the Y, X columns after droped data of !cond(W < 0)
df[df['W']>0][['Y','X']] |
Y | X | |
---|---|---|
A | 0.907969 | 0.628133 |
B | -0.848077 | -0.319318 |
D | -0.933237 | -0.758872 |
E | 2.605967 | 1.978757 |
- For two conditions you can use | and & with parenthesis:
The conditions can be seen as1. condition table 1 | condition table 2 2. condition table 1 & condition table 2
#Select labels |
W | X | Y | Z | new | |
---|---|---|---|---|---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
Index
Reset Index
# Reset to default 0,1...n index |
index | W | X | Y | Z | new | |
---|---|---|---|---|---|---|
0 | A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
Set New Index by A Column in Table
newind = 'CA NY WY OR CO'.split() |
W | X | Y | Z | new | |
---|---|---|---|---|---|
States | |||||
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |