Shankar Iyer's answer to In what situations should you use SQL instead of Pandas as a data scientist?

What is the use of Pandas instead of SQL?

Core Data Scientist at Facebook · Author has 316 answers and 1.7M answer views · 7y

I think of Pandas as a toolkit for performing SQL-like manipulations on “relatively small” datasets entirely within Python. The meaning of “relatively small” here depends upon the memory limits of the machine on which Python is running.

A very common workflow for me is to query a data warehouse or distributed database using a SQL-like language (e.g., the dialects of Hive or Redshift) and pull all of the data that I could conceivably need for my analysis into Python. In the process, I do all of the large-scale aggregations, joins, and sampling needed to reduce the large amount of data sitting in the database to an amount that fits in the memory of the machine on which I’m running Python. A general principle here is that I don’t pull data into Python at a finer granularity than my analysis needs: for example, if I’m studying country-by-country time series of how often some event occurs, I definitely don’t need data at the individual event level, so I will aggregate by day and country before pulling the data into Python.
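To make the "aggregate before you pull" principle concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for a real warehouse such as Hive or Redshift, and the `events` table and its columns are hypothetical names invented for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, country TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-01-01", "US", 1),
        ("2024-01-01", "US", 2),
        ("2024-01-01", "IN", 3),
        ("2024-01-02", "US", 1),
        ("2024-01-02", "IN", 3),
        ("2024-01-02", "IN", 4),
    ],
)
conn.commit()

# Aggregate to (day, country) inside the database so that only the
# summary rows, not the individual events, are pulled into Python.
query = """
    SELECT event_date, country, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date, country
    ORDER BY event_date, country
"""
daily_counts = pd.read_sql_query(query, conn)
conn.close()

print(daily_counts)
```

With a real warehouse the connection object would come from that system's own driver, but the shape of the workflow is the same: the `GROUP BY` runs where the data lives, and Pandas only ever sees one row per day and country.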

After pulling the necessary data into Python, I then do further analysis of portions of the complete dataset using Pandas. The Python side of the analysis might involve smaller scale aggregations and joins, which are easy to accomplish using Pandas, because the package includes equivalents of all common SQL operations.
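As a rough illustration of those SQL equivalents, the sketch below pairs a few Pandas calls with the SQL they resemble. The two small DataFrames and their column names are made up for the example.

```python
import pandas as pd

# Hypothetical data that has already been pulled into Python.
daily_counts = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "country": ["IN", "US", "IN", "US"],
    "n_events": [1, 2, 2, 1],
})
regions = pd.DataFrame({
    "country": ["IN", "US"],
    "region": ["APAC", "NA"],
})

# SQL: SELECT country, SUM(n_events) AS total_events
#      FROM daily_counts GROUP BY country
totals = daily_counts.groupby("country", as_index=False).agg(
    total_events=("n_events", "sum")
)

# SQL: SELECT * FROM totals LEFT JOIN regions USING (country)
joined = totals.merge(regions, on="country", how="left")

# SQL: ... WHERE total_events >= 3 ORDER BY total_events DESC
filtered = joined[joined["total_events"] >= 3].sort_values(
    "total_events", ascending=False
)

print(filtered)
```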

In reasoning about which parts of the analysis to do in Pandas, I take into account the following disadvantages of doing things Python-side:

- the need to store data in the memory of a single machine
- the lack of the parallel computing advantages that are built into queries to modern distributed databases

I also take into account the following advantages of manipulating data in Python:

- lower I/O cost and latency than repeatedly querying a database
- the ability to apply arbitrary functions to rows and columns of data, instead of just the functions available in Hive, Redshift, etc.
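The "arbitrary functions" point can be sketched as follows. The `robust_score` helper and the sample data are hypothetical, chosen only as an example of a custom per-group calculation that is straightforward in Python but tends to be clumsier to express in a database's built-in functions.

```python
import pandas as pd

# Hypothetical per-country daily counts already sitting in memory.
daily_counts = pd.DataFrame({
    "country": ["IN", "IN", "IN", "US", "US", "US"],
    "n_events": [10, 12, 40, 100, 90, 110],
})


def robust_score(counts: pd.Series) -> pd.Series:
    """Distance from the group median, in units of the median absolute deviation."""
    median = counts.median()
    mad = (counts - median).abs().median()
    # Note: a group whose MAD is zero would divide by zero here;
    # this sketch does not handle that case.
    return (counts - median) / mad


# Apply the custom function within each country; transform() returns a
# result aligned with the original rows.
daily_counts["score"] = (
    daily_counts.groupby("country")["n_events"].transform(robust_score)
)

print(daily_counts)
```

Because the function is ordinary Python, it can be swapped for anything else that maps a Series to a Series without touching the query that produced the data.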

Over time, I’ve thought through these issues enough that I’ve built a pretty solid intuition for which parts of an analysis to do in Hive / Redshift vs. Python. I’m sure it’s the same for other working data scientists.

I’ll close by saying that Pandas is easily one of the most useful Python packages that I use in my day-to-day work. If you’re a current or aspiring data scientist who works in Python and you haven’t yet learned Pandas, I highly recommend that you do so.
