In this post, we will discuss about how to read CSV file using pandas, an awesome library to deal with data written in Python.
Article Contents
Read CSV file using pandas
First, make sure you have pandas installed in your system, and use Python 3.
Let say we have to deal with this CSV file sample.csv
.
id,name,job,score
1,'Pete Houston','Software Engineer',92
2,'John Wick','Assassin',95
3,'Bruce Wayne','Batman',99
4,'Clark Kent','Superman',95
Load CSV file
To read CSV using pandas, we will use read_csv
function, and it’s like this:
import pandas
df = pandas.read_csv('data/sample.csv')
print(df)
The output of execution will be following:
$ python sample.py
id name job score
0 1 'Pete Houston' 'Software Engineer' 92
1 2 'John Wick' 'Assassin' 95
2 3 'Bruce Wayne' 'Batman' 99
3 4 'Clark Kent' 'Superman' 95
The command will output the CSV file content nicely.
Indicate the separator
CSV file doesn’t necessarily use the comma ,
character for field separation, it could be any character like \t
, |
… etc.
Fortunately, there is one option in read_csv
to help loading CSV file with specific field separator character.
df = pandas.read_csv('data/sample.csv', sep='\t')
An alias for sep
is delimiter
. The result output will be the same.
df = pandas.read_csv('data/sample.csv', delimiter='\t')
Separator can be a sequence of characters.
df = pandas.read_csv('data/sample.csv', sep='||')
Control the header row
CSV file might not have the header row, so if you execute the script in this case:
df = pandas.read_csv('data/sample.csv')
The output looks interesting.
1 'Pete Houston' 'Software Engineer' 92
0 2 'John Wick' 'Assassin' 95
1 3 'Bruce Wayne' 'Batman' 99
2 4 'Clark Kent' 'Superman' 95
pandas will assume the first row is the header.
For this, we can pass header=None
option to tell pandas that there is no header row in this CSV file.
df = pandas.read_csv('data/sample.csv', header=None)
and we have this output.
0 1 2 3
0 1 'Pete Houston' 'Software Engineer' 92
1 2 'John Wick' 'Assassin' 95
2 3 'Bruce Wayne' 'Batman' 99
3 4 'Clark Kent' 'Superman' 95
Since we don’t specify any header, pandas use integer value to index the header row.
We can also set name for each column through names=[]
option.
df = pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'])
It will show the expected output.
id name job score
0 1 'Pete Houston' 'Software Engineer' 92
1 2 'John Wick' 'Assassin' 95
2 3 'Bruce Wayne' 'Batman' 99
3 4 'Clark Kent' 'Superman' 95
Set index column
There is one catch from the output. The very first column on the left, it is the auto-index column generated by pandas. We can opt it out of the DataFrame with index_col=
.
df = pandas.read_csv('data/sample.csv', header=None, names=['id', 'name', 'job', 'score'], index_col=0)
df = pandas.read_csv('data/sample.csv', header=None, names=['id', 'name', 'job', 'score'], index_col=['id'])
You can use either column index value or a sequence of column names to make index. The results of both are the same.
name job score
id
1 'Pete Houston' 'Software Engineer' 92
2 'John Wick' 'Assassin' 95
3 'Bruce Wayne' 'Batman' 99
4 'Clark Kent' 'Superman' 95
Skip some rows
To skip some rows when reading CSV, you can use skiprows=N
, where N
is the number of rows to skip from start of the file.
df = pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'], skiprows=1)
Limit number of rows to process
Have you only have interest in processing some rows, you can use nrows=N
, where N
is the number of rows to process.
df = pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'], nrows=2)
Certainly, you can combine skiprows
and nrows
to process files from a specific position with a limit number of rows.
# df = pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'], skiprows=1, nrows=2)
Handling large CSV files
The CSV file could be really large in some cases. To handle this large file, we need to cut CSV file into many chunks and process each of them.
The chunk process in pandas can be done by using this option chunksize=
.
for df_chunk in pandas.read_csv('data/sample.csv', names=['id', 'name', 'job', 'score'], chunksize=1):
print(df_chunk)
The code produces this output.
id name job score
0 1 'Pete Houston' 'Software Engineer' 92
id name job score
1 2 'John Wick' 'Assassin' 95
id name job score
2 3 'Bruce Wayne' 'Batman' 99
id name job score
3 4 'Clark Kent' 'Superman' 95
About from_csv()
If you read any tutorial about reading CSV file using pandas, they might use from_csv
function. However, as indicating from pandas official documentation, it is deprecated. Hence, it is recommended to use read_csv
instead.
Conclusion
So, I have introduced with you how to read CSV file in pandas in short tutorial, along with common-use parameters.
If you need to know more options for read_csv
function, check out its documentation.