Do you feel stuck in removing data from DataFrame in pandas? If you do, read this article, I will show you how to drop columns of DataFrame in pandas step-by-step.
Drop columns of DataFrame in pandas
In order to start the tutorial, let’s use following simple dataset to manipulate.
id,name,job,score
1,'Pete Houston','Software Engineer',92
2,'John Wick','Assassin',95
3,'Bruce Wayne','Batman',99
4,'Clark Kent','Superman',96
4,'Clark Kent','Superman',96
5,,'Superman',83
6,,,72It is a CSV-format file, you can save and load CSV file into pandas’ DataFrame.
>>> import pandas
>>> df = pandas.read_csv('data/sample.csv')
>>> df
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96
5   5             NaN             'Hacker'     83
6   6             NaN                  NaN     72Drop one or more columns
At first glance, it looks like we don’t have interest in the id column, so we change move it to index.
>>> df.set_index('id', drop=False, inplace=True)
>>> df
    id            name                  job  score
id                                                
1    1  'Pete Houston'  'Software Engineer'     92
2    2     'John Wick'           'Assassin'     95
3    3   'Bruce Wayne'             'Batman'     99
4    4    'Clark Kent'           'Superman'     96
4    4    'Clark Kent'           'Superman'     96
5    5             NaN           'Superman'     83
6    6             NaN                  NaN     72I use set_index() to make id column indexed but with drop=False the original id column is still being kept.
Okay, we need to drop id column now. Use drop() on DataFrame to remove it.
When dropping columns from DataFrame, we need to specify the axis, which is the direction of data table in DataFrame. The axis can be:
- 0: which is in row index direction.
- 1: which is in column direction.
In this case, the axis is 1 to drop columns.
Let’s try it.
>>> df.drop('id', axis=1)
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96
4     'Clark Kent'           'Superman'     96
5              NaN           'Superman'     83
6              NaN                  NaN     72It looks nice but it returns a newly-modified DataFrame object.
However, pandas also provide a handy option to make modification change directly into current DataFrame, inplace=True. It is inplace=False by default.
>>> df.drop('id', axis=1, inplace=True)
>>> df
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96
4     'Clark Kent'           'Superman'     96
5              NaN           'Superman'     83
6              NaN                  NaN     72We can drop more columns by putting them in a list.
Let’s drop name and score.
>>> df.drop(['name', 'score'], axis=1)
                    job
id                     
1   'Software Engineer'
2            'Assassin'
3              'Batman'
4            'Superman'
4            'Superman'
5            'Superman'
6                   NaNRemove the duplication
At current point, our data have some duplicated rows, we can filter them out by using drop_duplicates().
>>> df.drop(['name', 'score'], axis=1).drop_duplicates('job')
                    job
id                     
1   'Software Engineer'
2            'Assassin'
3              'Batman'
4            'Superman'
6                   NaNNo more duplication presents.
But still having one problem, please continue the next section.
Remove missing data
Any row that doesn’t have value, pandas presents them as NaN. To remove this kind of data, use dropna().
Try it out.
>>> df.drop(['name', 'score'], axis=1).drop_duplicates('job').dropna()
                    job
id                     
1   'Software Engineer'
2            'Assassin'
3              'Batman'
4            'Superman'Cool! We’ve just enumerated all unique job from CSV file.
However, there are more options to the dropna().
Get back to the original dataset from beginning
>>> df
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96
5   5             NaN           'Superman'     83
6   6             NaN                  NaN     72We can specify how to drop columns having missing data with how= option. There are two valid values for how.
- any(default): drop row if any column of row is- NaN.
- all: drop row if all fields are- NaN.
If there requires at least some fields being valid to keep, use thresh= option. It is the at least number of non-NA fields a row should have to be kept. Otherwise, it will be removed.
>>> df.dropna(thresh=3)
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96
5   5             NaN           'Superman'     83Another nice option is subset=[], to select list of columns to look for missing data to drop.
>>> df.dropna(subset=['job'])
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96
5   5             NaN           'Superman'     83
>>> df.dropna(subset=['name'])
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96Last option is to reflect changes of drop into requested DataFrame object, so we don’t have to re-assign into another variable, inplace=True.
Conclusion
Hope this article provides you some sorts of understanding how to drop columns from DataFrame in pandas.
 
			 
		