Shankar Iyer

I think of Pandas as a toolkit for performing SQL-like manipulations on “relatively small” datasets entirely within Python. The meaning of “relatively small” here depends upon the memory limits of the machine on which Python is running.

A very common workflow for me is to query a data warehouse or distributed database through an SQL-like interface (e.g., Hive or Redshift) and pull into Python all of the data that I could conceivably need for my analysis. In the process, I do all of the large-scale aggregations, joins, and sampling in the database, which reduces the large amount of data sitting there to an amount that fits in the memory of the machine on which I’m running Python. A general principle here is that I don’t pull data into Python at finer granularity than my analysis needs. For example, if I’m studying country-by-country time series of how often some event occurs, I definitely don’t need data at the individual event level, so I aggregate by day and country before pulling the data into Python.
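
As a minimal sketch of this workflow, here is the "aggregate first, then pull" step with an in-memory SQLite database standing in for the warehouse. The table name, column names, and rows are all made up for illustration; a real Hive or Redshift setup would need its own connection object and SQL dialect.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for a data warehouse: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, country TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-01-01", "US", 1),
        ("2024-01-01", "US", 2),
        ("2024-01-01", "IN", 3),
        ("2024-01-02", "US", 1),
        ("2024-01-02", "IN", 3),
        ("2024-01-02", "IN", 4),
        ("2024-01-02", "IN", 5),
    ],
)
conn.commit()

# Aggregate to (day, country) inside the database, so that only the summary
# rows, not the individual events, are pulled into Python.
query = """
    SELECT event_date, country, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date, country
    ORDER BY event_date, country
"""
daily_counts = pd.read_sql_query(query, conn)
conn.close()

print(daily_counts)
```

Seven event rows go in and only four summary rows come out, which is the whole point: the DataFrame holds one row per day and country rather than one row per event.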

After pulling the necessary data into Python, I then do further analysis of portions of the complete dataset using Pandas. The Python side of the analysis might involve smaller-scale aggregations and joins, which are easy to accomplish using Pandas, because the package includes equivalents of all common SQL operations.
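
To make the SQL analogy concrete, here is a rough sketch of a filter, a join, and a group-by in Pandas. The two small DataFrames and their column names are invented for the example, and the SQL in the comments is only an approximate equivalent.

```python
import pandas as pd

# Hypothetical data that has already been pulled into memory.
daily_counts = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "country": ["IN", "US", "IN", "US"],
    "n_events": [1, 2, 3, 1],
})
regions = pd.DataFrame({
    "country": ["IN", "US"],
    "region": ["APAC", "NA"],
})

# Roughly: SELECT * FROM daily_counts WHERE n_events >= 2
busy_days = daily_counts[daily_counts["n_events"] >= 2]

# Roughly: SELECT * FROM daily_counts LEFT JOIN regions USING (country)
joined = daily_counts.merge(regions, on="country", how="left")

# Roughly: SELECT region, SUM(n_events) FROM joined GROUP BY region
by_region = joined.groupby("region", as_index=False)["n_events"].sum()

print(by_region)
```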

In reasoning about which parts of the analysis to do in Pandas, I take into account the following disadvantages of doing things Python-side:

  • the need to store data in the memory of a single machine
  • the lack of the parallel computing advantages that are built into queries to modern distributed databases
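
The memory constraint is at least easy to measure. As a rough sketch with made-up data, `DataFrame.memory_usage` reports how many bytes a DataFrame occupies, and converting a low-cardinality text column to a categorical is one common way (my suggestion here, not the only option) to shrink it:

```python
import pandas as pd

# Hypothetical data: a text column with only two distinct values.
df = pd.DataFrame({
    "country": ["US", "IN", "US", "IN"] * 1000,
    "n_events": range(4000),
})

# deep=True also counts the memory held by the string values themselves.
bytes_before = df.memory_usage(deep=True).sum()

# Store each distinct country once and keep small integer codes per row.
df["country"] = df["country"].astype("category")
bytes_after = df.memory_usage(deep=True).sum()

print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```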

I also take into account the following advantages of manipulating data in Python:

  • lower I/O cost and latency than repeatedly querying a database
  • the ability to apply arbitrary functions to rows and columns of data, instead of just the functions available in Hive, Redshift, etc.
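
As a small illustration of the second advantage, here is a sketch of applying ordinary Python functions to a DataFrame. The data, the `share_of_country_peak` helper, and the "high"/"low" threshold are all hypothetical; the point is only that any Python callable can be used.

```python
import pandas as pd

# Hypothetical data that has already been pulled into memory.
daily_counts = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "country": ["IN", "US", "IN", "US"],
    "n_events": [1, 2, 3, 1],
})


def share_of_country_peak(counts: pd.Series) -> pd.Series:
    """Made-up metric: each day's count as a fraction of the group's peak."""
    return counts / counts.max()


# Apply the custom function within each country, keeping the original row order.
daily_counts["share_of_peak"] = (
    daily_counts.groupby("country")["n_events"].transform(share_of_country_peak)
)

# Apply an arbitrary Python function to every value in a column.
daily_counts["level"] = daily_counts["n_events"].apply(
    lambda n: "high" if n >= 2 else "low"
)

print(daily_counts)
```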

Over time, I’ve thought through these issues enough to build a pretty solid intuition for which parts of an analysis to do in Hive / Redshift vs. Python. I’m sure it’s the same for other working data scientists.

I’ll close by saying that Pandas is easily one of the most useful Python packages that I use in my day-to-day work. If you’re a current or aspiring data scientist who works in Python and you haven’t yet learned Pandas, I highly recommend that you do so.
