Read CSV file using pandas

2
6415
Read CSV file using pandas

In this post, we will discuss about how to read CSV file using pandas, an awesome library to deal with data written in Python.

Read CSV file using pandas

First, make sure you have pandas installed in your system, and use Python 3.

Let say we have to deal with this CSV file sample.csv.

id,name,job,score
1,'Pete Houston','Software Engineer',92
2,'John Wick','Assassin',95
3,'Bruce Wayne','Batman',99
4,'Clark Kent','Superman',95

Load CSV file

To read CSV using pandas, we will use read_csv function, and it’s like this:

import pandas

df = pandas.read_csv('data/sample.csv')

print(df)

The output of execution will be following:

$ python sample.py

   id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     95

The command will output the CSV file content nicely.

Indicate the separator

CSV file doesn’t necessarily use the comma , character for field separation, it could be any character like \t, | … etc.

Fortunately, there is one option in read_csv to help loading CSV file with specific field separator character.

df = pandas.read_csv('data/sample.csv', sep='\t')

An alias for sep is delimiter. The result output will be the same.

df = pandas.read_csv('data/sample.csv', delimiter='\t')

Separator can be a sequence of characters.

df = pandas.read_csv('data/sample.csv', sep='||')

Control the header row

CSV file might not have the header row, so if you execute the script in this case:

df = pandas.read_csv('data/sample.csv')

The output looks interesting.

1 'Pete Houston' 'Software Engineer'  92
0  2    'John Wick'          'Assassin'  95
1  3  'Bruce Wayne'            'Batman'  99
2  4   'Clark Kent'          'Superman'  95

pandas will assume the first row is the header.

For this, we can pass header=None option to tell pandas that there is no header row in this CSV file.

df = pandas.read_csv('data/sample.csv', header=None)

and we have this output.

0               1                    2   3
0  1  'Pete Houston'  'Software Engineer'  92
1  2     'John Wick'           'Assassin'  95
2  3   'Bruce Wayne'             'Batman'  99
3  4    'Clark Kent'           'Superman'  95

Since we don’t specify any header, pandas use integer value to index the header row.

We can also set name for each column through names=[]option.

df = pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'])

It will show the expected output.

id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
1   2     'John Wick'           'Assassin'     95
2   3   'Bruce Wayne'             'Batman'     99
3   4    'Clark Kent'           'Superman'     95

Set index column

There is one catch from the output. The very first column on the left, it is the auto-index column generated by pandas. We can opt it out of the DataFrame with index_col=.

df = pandas.read_csv('data/sample.csv', header=None, names=['id', 'name', 'job', 'score'], index_col=0)
df = pandas.read_csv('data/sample.csv', header=None, names=['id', 'name', 'job', 'score'], index_col=['id'])

You can use either column index value or a sequence of column names to make index. The results of both are the same.

name                  job  score
id
1   'Pete Houston'  'Software Engineer'     92
2      'John Wick'           'Assassin'     95
3    'Bruce Wayne'             'Batman'     99
4     'Clark Kent'           'Superman'     95

Skip some rows

To skip some rows when reading CSV, you can use skiprows=N, where N is the number of rows to skip from start of the file.

df = pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'], skiprows=1)

Limit number of rows to process

Have you only have interest in processing some rows, you can use nrows=N, where N is the number of rows to process.

df = pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'], nrows=2)

Certainly, you can combine skiprows and nrows to process files from a specific position with a limit number of rows.

# df = pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'], skiprows=1, nrows=2)

Handling large CSV files

The CSV file could be really large in some cases. To handle this large file, we need to cut CSV file into many chunks and process each of them.

The chunk process in pandas can be done by using this option chunksize=.

for df_chunk in pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'], chunksize=1):
    print(df_chunk)

The code produces this output.

id            name                  job  score
0   1  'Pete Houston'  'Software Engineer'     92
   id         name         job  score
1   2  'John Wick'  'Assassin'     95
   id           name       job  score
2   3  'Bruce Wayne'  'Batman'     99
   id          name         job  score
3   4  'Clark Kent'  'Superman'     95

About from_csv()

If you read any tutorial about reading CSV file using pandas, they might use from_csv function. However, as indicating from pandas official documentation, it is deprecated. Hence, it is recommended to use read_csv instead.

Conclusion

So, I have introduced with you how to read CSV file in pandas in short tutorial, along with common-use parameters.

If you need to know more options for read_csv function, check out its documentation.