An Overview of Modern Python Dataframe Libraries



Python has gained immense popularity in the data stack space, largely due to its rich ecosystem of libraries. Among these, Python Dataframe libraries play a pivotal role in simplifying the process of working with data. Dataframes are one of the most common data structures used in modern data analytics – they organize data into 2-dimensional tables of rows and columns, similar to spreadsheets or SQL tables.

Here we will explore some of the most common Python Dataframe libraries, their key features, and use cases. Whether you are a data scientist, data engineer, or Python enthusiast, this guide will help you navigate through the myriad options available to efficiently handle your data.

Pandas

We cannot talk about Dataframes and Python without starting with Pandas. It was developed in 2008 by Wes McKinney and is today one of the preeminent open-source data science libraries. Pandas gives us fast, flexible, and expressive data structures, designed to make working with relational or labeled data easy and intuitive.

Pandas is great! It does a lot of things well:

  • Automatic and explicit data alignment, handling of missing data for many data types, size mutability
  • Groupby and powerful split-apply-combine functionality for aggregating and transforming data
  • Intelligent indexing features, intuitive and flexible joining, reshaping, and pivoting operations
  • I/O tools – reading and writing from and to different sources
  • Time series – window statistics, date shifting, etc.

As the amount of data to analyze has grown, Pandas' old architecture has started to show its limits. Even its creator has admitted as much. After nearly a decade of development and lessons learned, in 2017 Wes McKinney wrote about 11 things he hates about Pandas:

  1. Internals too far from “the metal”
  2. No support for memory-mapped datasets
  3. Poor performance in database and file ingest/export
  4. Warty missing data support
  5. Lack of transparency into memory use, RAM management
  6. Weak support for categorical data
  7. Complex groupby operations awkward and slow
  8. Appending data to a Dataframe tedious and very costly
  9. Limited, non-extensible type metadata
  10. Eager evaluation model, no query planning
  11. “Slow”, limited multicore algorithms for large datasets

Table 1: Speed comparison by Marc Garcia

To solve these issues, Pandas development has shifted towards Apache Arrow as a key technology for the next generation of data science tools. Pandas 2.0, which can use Arrow as its memory backend, fixes a lot of those points (notably 3, 4, 5, 6, 8, 9) and continues to develop. Version 2.0 brings better interoperability, faster speeds, and a proper representation of missing data to Pandas. For single-machine workloads up to roughly 10 GB, Pandas is still a great choice, and it has the biggest community.
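
As a quick illustration, here is a minimal sketch of opting into the Arrow backend in Pandas 2.0. The file name is made up; dtype_backend="pyarrow" is the option that selects Arrow-backed dtypes:

```python
import pandas as pd

# Requires Pandas >= 2.0; "measurements.csv" is a hypothetical file.
df = pd.read_csv("measurements.csv", dtype_backend="pyarrow")

# Columns are now Arrow-backed, e.g. int64[pyarrow] instead of int64.
print(df.dtypes)
```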

Side note: Apache Arrow

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process, and move data fast. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The project is developing a multi-language collection of libraries for solving problems related to in-memory analytical data processing. This enables us to have a unified translation layer between different implementations of common data structures.

PyArrow, the Python binding for Arrow, is built on the C++ implementation and has first-class integration with NumPy, Pandas, and built-in Python objects.
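
A minimal sketch of round-tripping data between PyArrow and Pandas (the table contents are illustrative):

```python
import pandas as pd
import pyarrow as pa

# Build an Arrow table from plain Python objects.
table = pa.table({"city": ["Berlin", "Zagreb"], "temp": [18.5, 24.0]})

# Convert to Pandas and back; for many types this is (close to) zero-copy.
df = table.to_pandas()
table_again = pa.Table.from_pandas(df)
```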

Figure 1: Arrow ecosystems and standardization

While Arrow has become the de facto standard for representing tabular data, other projects have risen to standardize data handling in numerical computing, data science, machine learning, and deep learning. One of those projects is Substrait – a universal standard for representing relational operations that will complement Arrow in the future interoperability of data systems.

Most other Python Dataframe libraries mimic the Pandas API, due to its popularity, while others adopt new, distributed concepts to help with the data processing logic. To tackle the fragmentation in Array and Python Dataframe libraries, the Consortium for Python Data API Standards was established. It is an industry-wide partnership to establish cross-project and cross-ecosystem alignment on APIs and data exchange mechanisms, and to facilitate coordination and communication.

Pandas inspired a whole ecosystem of libraries, and here we present some of the noteworthy ones.

Figure 2: Star history of GitHub repositories
Polars

Polars is a Dataframe library built in Rust on top of an OLAP query engine, using Apache Arrow as the memory model. It has no indexes, supports both eager and lazy evaluation, offers a powerful expression API, enables query optimization, is multi-threaded, and much more. Polars is how Pandas would look if it were implemented today. The cons of Polars are exactly that – it is still young, so it does not yet have all the features of Pandas, nor the ecosystem that grew out of it. Some notable missing features are a visualization API, the absence of Pandas-style dot notation, dtype efficiency, as well as general compatibility issues.
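
To make the lazy API concrete, here is a minimal sketch; the file and column names are made up, and the group_by spelling assumes a recent Polars release (older versions used groupby):

```python
import polars as pl

# scan_csv builds a lazy query plan instead of reading the file eagerly;
# Polars can then optimize it (e.g. predicate pushdown) before execution.
result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 0)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()  # nothing is read or computed until collect()
)
```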

What Polars lacks in features it makes up for in speed. Benchmarked against competitors in reading and analytical processing, Polars is a clear winner. Although it is young, for non-critical projects Polars would today be my choice for Pandas use cases.

Figure 3: Reading and transforming parquet
Dask

Dask is a flexible parallel computing library, a task scheduler, that seamlessly integrates with Pandas. It extends Pandas' capabilities by enabling it to handle datasets that don't fit into memory. Dask Dataframe provides a familiar Pandas-like interface while distributing computations across multiple cores or even clusters. It allows you to scale data analysis tasks, making it an excellent choice for handling large datasets and performing distributed computing. Under the hood it's still Pandas, but optimized for distributed workloads. Depending on the algorithm and use case, it can be a better choice than Spark – and definitely cheaper and easier to maintain.
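
A minimal sketch of the Pandas-like interface; the file pattern and column names are illustrative:

```python
import dask.dataframe as dd

# One logical Dataframe over many files; reading is partitioned and parallel.
df = dd.read_csv("logs-2023-*.csv")

# Operations build a task graph; nothing executes until .compute().
daily_mean = df.groupby("day").duration.mean().compute()
```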

Vaex

Vaex is a high-performance Dataframe library designed for handling large-scale datasets. It is specifically optimized for out-of-core computations, allowing you to work with datasets that are larger than your available memory. Vaex is built to efficiently execute calculations on disk-resident data, making it significantly faster than Pandas 1.x. It uses HDF5 to create memory maps that avoid loading datasets into memory, and implements some parts in faster languages. It is mostly used to visualize and explore big tabular datasets.
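
A minimal sketch, assuming an existing HDF5 file with a column x (both are made up here):

```python
import vaex

# Opens the file as a memory-mapped Dataframe instead of loading it into RAM.
df = vaex.open("big.hdf5")

# Aggregations stream over the memory map.
print(df.count(), df.mean(df.x))
```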

Modin

Modin is a modular Dataframe library that aims to provide a faster Pandas replacement by utilizing distributed computing frameworks like Dask or Ray. It allows you to seamlessly switch between using a single machine or a distributed cluster without changing your code. While Pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling Pandas so it uses all of your cores. Modin retains the ease of use and flexibility of Pandas while providing significant performance improvements, especially when dealing with large datasets. The API is so similar to Pandas that Modin calls itself a drop-in replacement for Pandas.
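
In practice the swap is a single import line; the file name below is illustrative:

```python
# The only change from a plain Pandas script is this import.
import modin.pandas as pd

df = pd.read_csv("large_file.csv")        # partitioned across all cores
summary = df.groupby("category").sum()    # familiar Pandas syntax
```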

cuDF

cuDF is a GPU Dataframe library built on top of Apache Arrow. It provides an API similar to Pandas, so it can be easily used without any CUDA programming knowledge. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads.
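
A minimal sketch, assuming an NVIDIA GPU and a RAPIDS installation; file and column names are made up:

```python
import cudf

# Same call shape as Pandas, but the data lives and is processed on the GPU.
gdf = cudf.read_csv("transactions.csv")
totals = gdf.groupby("account").amount.sum()
```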

PySpark

PySpark is the Python API for Apache Spark, the analytics engine for large-scale data processing. It supports a rich set of higher-level tools including Spark SQL for SQL and Dataframes, the Pandas API on Spark for Pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

The Pandas API on Spark provides a Pandas-like interface on top of Apache Spark, allowing you to leverage the scalability and distributed computing capabilities of Spark while enjoying the ease of working with Dataframes. PySpark has for years been the king of distributed data processing.
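
A minimal sketch of the Pandas API on Spark, which ships with Spark 3.2+; the file and column names are illustrative:

```python
import pyspark.pandas as ps

# Looks like Pandas, but the work is distributed across the Spark cluster.
psdf = ps.read_csv("events.csv")
counts = psdf.groupby("user_id").size()
```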

DataFusion

DataFusion is a very fast, extensible query engine for building high-quality data-centric systems, using the Apache Arrow in-memory format and offering SQL and Dataframe APIs. It is more popular in the Rust community, and as such is best used when interoperability between Python and Rust is needed.
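
A minimal sketch with the datafusion Python package; table, file, and column names are made up:

```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("trips", "trips.csv")

# Both SQL and Dataframe APIs are available; results are Arrow-backed.
df = ctx.sql("SELECT vendor, COUNT(*) AS n FROM trips GROUP BY vendor")
print(df.to_pandas())
```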

Ibis

Ibis is a high-level Python library that provides a universal interface for data wrangling and analysis. It has a Pandas-like Dataframe syntax and can express any SQL query, supporting modular backends for query systems (Spark, Snowflake, DuckDB, Pandas, Dask, etc.). Another important part of it is deferred execution: execution of code is pushed down to the query engine, boosting performance.

It is especially good as an interface for multiple query engines, since it unifies the syntax. As such, it shines when you want to use the same code with different backends.
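
A minimal sketch against the DuckDB backend, assuming a recent Ibis version; file, table, and column names are illustrative:

```python
import ibis

con = ibis.duckdb.connect()                          # in-memory DuckDB
t = con.read_csv("orders.csv", table_name="orders")

# The expression is compiled to the backend's SQL only on execute().
expr = t.group_by("customer").aggregate(total=t.amount.sum())
print(expr.execute())
```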

DuckDB

DuckDB is an in-process SQL OLAP database management system. It is mentioned in the same category because it can efficiently run SQL queries on Pandas Dataframes, can query Arrow datasets directly, and can stream query results back to Arrow. This lets us combine the benefits of Arrow with DuckDB's parallel, vectorized execution. It is a no-hassle DBMS and a clear choice for prototyping with SQL.
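
A minimal sketch of querying a Pandas Dataframe in place (DuckDB picks up the local variable df by name):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"item": ["a", "b", "a"], "price": [1.0, 2.5, 3.0]})

# DuckDB scans the Dataframe directly, with no loading step;
# .df() materializes the result back into Pandas.
result = duckdb.sql("SELECT item, SUM(price) AS total FROM df GROUP BY item").df()
```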

Conclusion

Dataframe libraries in Python play a crucial role in simplifying the process of working with structured data, offering powerful tools and efficient data manipulation capabilities. Whether you prefer the versatility of Pandas, the scalability of Dask and Vaex, the distributed computing capabilities of Modin, or the seamless integration with Apache Spark provided by PySpark, Python offers a diverse range of options to suit your data analysis needs.

Understanding the strengths and use cases of these libraries can significantly enhance your productivity and allow you to handle datasets of varying sizes effectively. With the rise of Arrow-backed types, converting between these libraries will be almost immediate, with little metadata overhead. Happy datalyzing!

For more insights and resources, visit our web­site at synvert.com.