Set index for DataFrame in pandas

13168

In order to improve data searching, we always need to create indexes for data lookup purpose. I will show you how to set index for DataFrame in pandas.

Contents hide

Data with no index

Data with pre-defined index

Use set_index

Conclusion

Set index for DataFrame in pandas

To get started, I will use the following data set, or you could use any dataset you have to experiment with.

>>> import pandas as pd

>>> df = pandas.read_csv('data/sample.csv')
>>> df
             name                  job  score
0  'Pete Houston'  'Software Engineer'     92
1     'John Wick'           'Assassin'     95
2   'Bruce Wayne'             'Batman'     99
3    'Clark Kent'           'Superman'     96

As you can see, I read CSV file and save into DataFrame. It is just to avoid lots of typing on Python REPL.

Data with no index

Do you notice the leftmost column?

It is auto-generated index column, because pandas always tries to optimize every dataset it handles, so it generated.

However, that auto-generated index field starts from 0 and unnamed. We need to update it.

It can be done by manipulating the DataFrame.index property.

Okay, let’s update the index field with number starting from 1.

>>> df.index = [x for x in range(1, len(df.values)+1)]
>>> df
             name                  job  score
1  'Pete Houston'  'Software Engineer'     92
2     'John Wick'           'Assassin'     95
3   'Bruce Wayne'             'Batman'     99
4    'Clark Kent'           'Superman'     96

The DataFrame.index is a list, so we can generate it easily via simple Python loop.

For your info, len(df.values) will return the number of pandas.Series, in other words, it is number of rows in current DataFrame.

We set name for index field through simple assignment:

>>> df.index.name = 'id'
>>> df
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

Now, the index column name is shown as id.

There is another case, check out following section.

Data with pre-defined index

See this dataset:

>>> import pandas
>>> df = pandas.read_csv('data/sample.csv')
>>> df
   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     96

It is no difference from previous one but with one extra column id, that is from the CSV file.

It is pretty much obvious that we would like to have that id column is the index field instead of the auto-generated index field.

Through assignment as stated in previous section, we can assign the index field to use id column. Let’s try it.

>>> df.index = df['id']
>>> df
    id            name                  job  score
id                                                
1    1  'Pete Houston'  'Software Engineer'     92
2    2     'John Wick'           'Assassin'     95
3    3   'Bruce Wayne'             'Batman'     99
4    4    'Clark Kent'           'Superman'     96

So we have a duplicate situation with two id columns. Dropping the id column will solve the problem.

>>> df.drop(['id'], axis=1)
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

However, drop will return new modified DataFrame, so we need to store it again in order to save changes.

>>> df = df.drop(['id'], axis=1)
>>> df
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

Okay, it looks very fine-tuning for now.

Use `set_index`

It is very common to see data engineers to set index for DataFrame in pandas; so, a function is made to help with this situation, set_index().

Let’s try it.

>>> df.set_index('id')
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

We still need to assign to a variable to save changes. However, there is an option for this to directly modifying the current DataFrame without re-assignment, inplace=True.

>>> df.set_index('id', inplace=True)
>>> df
              name                  job  score
id                                            
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     96

With inplace=False, it returns a new modified DataFrame.

If you want to keep the id field in column list, the add this one drop=False. By default, the column to be indexed will be drop, drop=True.

>>> df.set_index('id', drop=False, inplace=True)
>>> df
    id            name                  job  score
id                                                
1    1  'Pete Houston'  'Software Engineer'     92
2    2     'John Wick'           'Assassin'     95
3    3   'Bruce Wayne'             'Batman'     99
4    4    'Clark Kent'           'Superman'     96

Conclusion

It looks too long for a simple tutorial about indexing in pandas. Well, I just to explain in details for you to understand what are possible ways to handle it.

Anyway, have fun with data science!

How to Use PHPacker Step-by-Step

Build a Weather CLI App in Go: Step-by-Step Guide for Beginners

Top 5 Hidden Gin Features Every Go Developer Should Know

Mastering Custom Middleware in Gin: A Deep Dive into Extending Your…

Marketing explanation to your friends

2018 Recap and Happy New Year 2019

Life is a Game as We Know It

Automate everything with Katalon Studio – Udemy Free Course

Set index for DataFrame in pandas

Set index for DataFrame in pandas

Data with no index

Data with pre-defined index

Use `set_index`

Conclusion

Build FastCGI development kit

Fix issue missed schedule in WordPress

Top 5 Hidden Gin Features Every Go Developer Should Know

Eclipse IDE for Beginners: Increase Your Java Productivity – Udemy Course...