Set index for DataFrame in pandas

0
146
Set index for DataFrame in pandas

In order to improve data searching, we always need to create indexes for data lookup purpose. I will show you how to set index for DataFrame in pandas.

Set index for DataFrame in pandas

To get started, I will use the following data set, or you could use any dataset you have to experiment with.

>>> import pandas as pd

>>> df = pandas.read_csv('data/sample.csv')
>>> df
             name                  job  score
0  'Pete Houston'  'Software Engineer'     92
1     'John Wick'           'Assassin'     95
2   'Bruce Wayne'             'Batman'     99
3    'Clark Kent'           'Superman'     96

As you can see, I read CSV file and save into DataFrame. It is just to avoid lots of typing on Python REPL.

Data with no index

Do you notice the leftmost column?

It is auto-generated index column, because pandas always tries to optimize every dataset it handles, so it generated.

However, that auto-generated index field starts from 0 and unnamed. We need to update it.

It can be done by manipulating the DataFrame.index property.

Okay, let’s update the index field with number starting from 1.

>>> df.index = [x for x in range(1, len(df.values)+1)]
>>> df
             name                  job  score
1  'Pete Houston'  'Software Engineer'     92
2     'John Wick'           'Assassin'     95
3   'Bruce Wayne'             'Batman'     99
4    'Clark Kent'           'Superman'     96

The DataFrame.index is a list, so we can generate it easily via simple Python loop.

For your info, len(df.values) will return the number of pandas.Series, in other words, it is number of rows in current DataFrame.

We set name for index field through simple assignment:

>>> df.index.name = 'id'
>>> df
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

Now, the index column name is shown as id.

There is another case, check out following section.

Data with pre-defined index

See this dataset:

>>> import pandas
>>> df = pandas.read_csv('data/sample.csv')
>>> df
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96

It is no difference from previous one but with one extra column id, that is from the CSV file.

It is pretty much obvious that we would like to have that id column is the index field instead of the auto-generated index field.

Through assignment as stated in previous section, we can assign the index field to use id column. Let’s try it.

>>> df.index = df['id']
>>> df
    id            name                  job  score
id                                                
1    1  'Pete Houston'  'Software Engineer'     92
2    2     'John Wick'           'Assassin'     95
3    3   'Bruce Wayne'             'Batman'     99
4    4    'Clark Kent'           'Superman'     96

So we have a duplicate situation with two id columns. Dropping the id column will solve the problem.

>>> df.drop(['id'], axis=1)
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

However, drop will return new modified DataFrame, so we need to store it again in order to save changes.

>>> df = df.drop(['id'], axis=1)
>>> df
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

Okay, it looks very fine-tuning for now.

Use set_index

It is very common to see data engineers to set index for DataFrame in pandas; so, a function is made to help with this situation, set_index().

Let’s try it.

>>> df.set_index('id')
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

We still need to assign to a variable to save changes. However, there is an option for this to directly modifying the current DataFrame without re-assignment, inplace=True.

>>> df.set_index('id', inplace=True)
>>> df
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

With inplace=False, it returns a new modified DataFrame.

If you want to keep the id field in column list, the add this one drop=False. By default, the column to be indexed will be drop, drop=True.

>>> df.set_index('id', drop=False, inplace=True)
>>> df
    id            name                  job  score
id                                                
1    1  'Pete Houston'  'Software Engineer'     92
2    2     'John Wick'           'Assassin'     95
3    3   'Bruce Wayne'             'Batman'     99
4    4    'Clark Kent'           'Superman'     96

Conclusion

It looks too long for a simple tutorial about indexing in pandas. Well, I just to explain in details for you to understand what are possible ways to handle it.

Anyway, have fun with data science!