Do you feel stuck in removing data from DataFrame in pandas? If you do, read this article, I will show you how to drop columns of DataFrame in pandas step-by-step.
Article Contents
Drop columns of DataFrame in pandas
In order to start the tutorial, let’s use following simple dataset to manipulate.
id,name,job,score
1,'Pete Houston','Software Engineer',92
2,'John Wick','Assassin',95
3,'Bruce Wayne','Batman',99
4,'Clark Kent','Superman',96
4,'Clark Kent','Superman',96
5,,'Superman',83
6,,,72
It is a CSV-format file, you can save and load CSV file into pandas’ DataFrame.
>>> import pandas
>>> df = pandas.read_csv('data/sample.csv')
>>> df
id name job score
0 1 'Pete Houston' 'Software Engineer' 92
1 2 'John Wick' 'Assassin' 95
2 3 'Bruce Wayne' 'Batman' 99
3 4 'Clark Kent' 'Superman' 96
4 4 'Clark Kent' 'Superman' 96
5 5 NaN 'Hacker' 83
6 6 NaN NaN 72
Drop one or more columns
At first glance, it looks like we don’t have interest in the id
column, so we change move it to index.
>>> df.set_index('id', drop=False, inplace=True)
>>> df
id name job score
id
1 1 'Pete Houston' 'Software Engineer' 92
2 2 'John Wick' 'Assassin' 95
3 3 'Bruce Wayne' 'Batman' 99
4 4 'Clark Kent' 'Superman' 96
4 4 'Clark Kent' 'Superman' 96
5 5 NaN 'Superman' 83
6 6 NaN NaN 72
I use set_index()
to make id
column indexed but with drop=False
the original id
column is still being kept.
Okay, we need to drop id
column now. Use drop()
on DataFrame to remove it.
When dropping columns from DataFrame, we need to specify the axis, which is the direction of data table in DataFrame. The axis can be:
0
: which is in row index direction.1
: which is in column direction.
In this case, the axis is 1
to drop columns.
Let’s try it.
>>> df.drop('id', axis=1)
name job score
id
1 'Pete Houston' 'Software Engineer' 92
2 'John Wick' 'Assassin' 95
3 'Bruce Wayne' 'Batman' 99
4 'Clark Kent' 'Superman' 96
4 'Clark Kent' 'Superman' 96
5 NaN 'Superman' 83
6 NaN NaN 72
It looks nice but it returns a newly-modified DataFrame object.
However, pandas also provide a handy option to make modification change directly into current DataFrame, inplace=True
. It is inplace=False
by default.
>>> df.drop('id', axis=1, inplace=True)
>>> df
name job score
id
1 'Pete Houston' 'Software Engineer' 92
2 'John Wick' 'Assassin' 95
3 'Bruce Wayne' 'Batman' 99
4 'Clark Kent' 'Superman' 96
4 'Clark Kent' 'Superman' 96
5 NaN 'Superman' 83
6 NaN NaN 72
We can drop more columns by putting them in a list.
Let’s drop name
and score
.
>>> df.drop(['name', 'score'], axis=1)
job
id
1 'Software Engineer'
2 'Assassin'
3 'Batman'
4 'Superman'
4 'Superman'
5 'Superman'
6 NaN
Remove the duplication
At current point, our data have some duplicated rows, we can filter them out by using drop_duplicates()
.
>>> df.drop(['name', 'score'], axis=1).drop_duplicates('job')
job
id
1 'Software Engineer'
2 'Assassin'
3 'Batman'
4 'Superman'
6 NaN
No more duplication presents.
But still having one problem, please continue the next section.
Remove missing data
Any row that doesn’t have value, pandas presents them as NaN
. To remove this kind of data, use dropna()
.
Try it out.
>>> df.drop(['name', 'score'], axis=1).drop_duplicates('job').dropna()
job
id
1 'Software Engineer'
2 'Assassin'
3 'Batman'
4 'Superman'
Cool! We’ve just enumerated all unique job
from CSV file.
However, there are more options to the dropna()
.
Get back to the original dataset from beginning
>>> df
id name job score
0 1 'Pete Houston' 'Software Engineer' 92
1 2 'John Wick' 'Assassin' 95
2 3 'Bruce Wayne' 'Batman' 99
3 4 'Clark Kent' 'Superman' 96
4 4 'Clark Kent' 'Superman' 96
5 5 NaN 'Superman' 83
6 6 NaN NaN 72
We can specify how to drop columns having missing data with how=
option. There are two valid values for how
.
any
(default): drop row if any column of row isNaN
.all
: drop row if all fields areNaN
.
If there requires at least some fields being valid to keep, use thresh=
option. It is the at least number of non-NA fields a row should have to be kept. Otherwise, it will be removed.
>>> df.dropna(thresh=3)
id name job score
0 1 'Pete Houston' 'Software Engineer' 92
1 2 'John Wick' 'Assassin' 95
2 3 'Bruce Wayne' 'Batman' 99
3 4 'Clark Kent' 'Superman' 96
4 4 'Clark Kent' 'Superman' 96
5 5 NaN 'Superman' 83
Another nice option is subset=[]
, to select list of columns to look for missing data to drop.
>>> df.dropna(subset=['job'])
id name job score
0 1 'Pete Houston' 'Software Engineer' 92
1 2 'John Wick' 'Assassin' 95
2 3 'Bruce Wayne' 'Batman' 99
3 4 'Clark Kent' 'Superman' 96
4 4 'Clark Kent' 'Superman' 96
5 5 NaN 'Superman' 83
>>> df.dropna(subset=['name'])
id name job score
0 1 'Pete Houston' 'Software Engineer' 92
1 2 'John Wick' 'Assassin' 95
2 3 'Bruce Wayne' 'Batman' 99
3 4 'Clark Kent' 'Superman' 96
4 4 'Clark Kent' 'Superman' 96
Last option is to reflect changes of drop into requested DataFrame object, so we don’t have to re-assign into another variable, inplace=True
.
Conclusion
Hope this article provides you some sorts of understanding how to drop columns from DataFrame in pandas.