Drop columns of DataFrame in pandas

0
3803
Drop columns of DataFrame in pandas

Do you feel stuck in removing data from DataFrame in pandas? If you do, read this article, I will show you how to drop columns of DataFrame in pandas step-by-step.

Drop columns of DataFrame in pandas

In order to start the tutorial, let’s use following simple dataset to manipulate.

id,name,job,score
1,'Pete Houston','Software Engineer',92
2,'John Wick','Assassin',95
3,'Bruce Wayne','Batman',99
4,'Clark Kent','Superman',96
4,'Clark Kent','Superman',96
5,,'Superman',83
6,,,72

It is a CSV-format file, you can save and load CSV file into pandas’ DataFrame.

>>> import pandas
>>> df = pandas.read_csv('data/sample.csv')
>>> df
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96
5   5             NaN             'Hacker'     83
6   6             NaN                  NaN     72

Drop one or more columns

At first glance, it looks like we don’t have interest in the id column, so we change move it to index.

>>> df.set_index('id', drop=False, inplace=True)
>>> df
    id            name                  job  score
id                                                
1    1  'Pete Houston'  'Software Engineer'     92
2    2     'John Wick'           'Assassin'     95
3    3   'Bruce Wayne'             'Batman'     99
4    4    'Clark Kent'           'Superman'     96
4    4    'Clark Kent'           'Superman'     96
5    5             NaN           'Superman'     83
6    6             NaN                  NaN     72

I use set_index() to make id column indexed but with drop=False the original id column is still being kept.

Okay, we need to drop id column now. Use drop() on DataFrame to remove it.

When dropping columns from DataFrame, we need to specify the axis, which is the direction of data table in DataFrame. The axis can be:

  • 0: which is in row index direction.
  • 1: which is in column direction.

In this case, the axis is 1 to drop columns.

Let’s try it.

>>> df.drop('id', axis=1)
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96
4     'Clark Kent'           'Superman'     96
5              NaN           'Superman'     83
6              NaN                  NaN     72

It looks nice but it returns a newly-modified DataFrame object.

However, pandas also provide a handy option to make modification change directly into current DataFrame, inplace=True. It is inplace=False by default.

>>> df.drop('id', axis=1, inplace=True)
>>> df
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96
4     'Clark Kent'           'Superman'     96
5              NaN           'Superman'     83
6              NaN                  NaN     72

We can drop more columns by putting them in a list.

Let’s drop name and score.

>>> df.drop(['name', 'score'], axis=1)
                    job
id                     
1   'Software Engineer'
2            'Assassin'
3              'Batman'
4            'Superman'
4            'Superman'
5            'Superman'
6                   NaN

Remove the duplication

At current point, our data have some duplicated rows, we can filter them out by using drop_duplicates().

>>> df.drop(['name', 'score'], axis=1).drop_duplicates('job')
                    job
id                     
1   'Software Engineer'
2            'Assassin'
3              'Batman'
4            'Superman'
6                   NaN

No more duplication presents.

But still having one problem, please continue the next section.

Remove missing data

Any row that doesn’t have value, pandas presents them as NaN. To remove this kind of data, use dropna().

Try it out.

>>> df.drop(['name', 'score'], axis=1).drop_duplicates('job').dropna()
                    job
id                     
1   'Software Engineer'
2            'Assassin'
3              'Batman'
4            'Superman'

Cool! We’ve just enumerated all unique job from CSV file.

However, there are more options to the dropna().

Get back to the original dataset from beginning

>>> df
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96
5   5             NaN           'Superman'     83
6   6             NaN                  NaN     72

We can specify how to drop columns having missing data with how= option. There are two valid values for how.

  • any(default): drop row if any column of row is NaN.
  • all: drop row if all fields are NaN.

If there requires at least some fields being valid to keep, use thresh= option. It is the at least number of non-NA fields a row should have to be kept. Otherwise, it will be removed.

>>> df.dropna(thresh=3)
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96
5   5             NaN           'Superman'     83

Another nice option is subset=[], to select list of columns to look for missing data to drop.

>>> df.dropna(subset=['job'])
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96
5   5             NaN           'Superman'     83
>>> df.dropna(subset=['name'])
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96
4   4    'Clark Kent'           'Superman'     96

Last option is to reflect changes of drop into requested DataFrame object, so we don’t have to re-assign into another variable, inplace=True.

Conclusion

Hope this article provides you some sorts of understanding how to drop columns from DataFrame in pandas.