Pandas 3.0
Posted by jonbaer 6 days ago
Comments
Comment by edschofield 1 day ago
It’s probably not worth incurring the pain of a compatibility-breaking Pandas upgrade. Switch to Polars instead for new projects and you won’t look back.
Comment by data-ottawa 1 day ago
Pandas created the modern Python data stack when there were not really any alternatives (except R and closed-source tools). The original split-apply-combine paradigm was well thought out, simple, and effective, and the built-in tools to read pretty much anything (including all of your awful CSV files and Excel tables) and deal with timestamps easily made it fit into tons of workflows. It pioneered a lot, and basically still serves as the foundation and common format for the industry.
I always recommend every member of my teams read Modern Pandas by Tom Augspurger when they start, as it covers all the modern concepts you need to get data work done fast and with high quality. The concepts carry over to polars.
And I have to thank the pandas team for being a very open and collaborative bunch. They’re humble and smart people, and every PR or issue I’ve interacted with them on has been great.
Polars is undeniably great software; it's my standard tool today. But it did benefit from the failures and hard edges of pandas, PySpark, Dask, the tidyverse, and xarray. That's an advantage pandas didn't have, and one it still pays for.
I’m not trying to take away from polars at all. It’s damn fast — the benchmarks are hard to beat. I’ve been working on my own library and basically every optimization I can think of is already implemented in polars.
I do have a concern with their VC funding and cloud commercialization. The core library is MIT-licensed, but knowing there will always be that feature wall when you want to scale is not ideal. I think it limits the future of the library a lot, and long term I think someone will fill that niche and the users will leave.
Comment by sampo 1 day ago
For better or worse, like Excel and like the simpler programming languages of old, Pandas lets you overwrite data in place.
Prepare some data:

import pandas as pd
import polars as pl

df_pandas = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})
df_polars = pl.from_pandas(df_pandas)
And then:

df_pandas.loc[1:3, 'b'] += 1
df_pandas
a b
0 1 10
1 2 21
2 3 31
3 4 41
4 5 50
Polars comes from a more modern data engineering philosophy, where data is immutable. In Polars, if you ever wanted to do such a thing, you'd write a pipeline to process and replace the whole column:

df_polars = df_polars.with_columns(
    pl.when(pl.int_range(0, pl.len()).is_between(1, 3))
    .then(pl.col("b") + 1)
    .otherwise(pl.col("b"))
    .alias("b")
)
If you are just interactively playing around with your data, and want to do it in Python and not in Excel or R, Pandas might still hit the spot. Or use Polars, and if need be temporarily convert the data to Pandas or even to a NumPy array, manipulate it, and then convert back.
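For example, a minimal sketch of that round trip (reusing df_polars from above):

tmp = df_polars.to_pandas()       # copy out to a mutable pandas frame
tmp.loc[1:3, "b"] += 1            # pandas-style in-place edit
df_polars = pl.from_pandas(tmp)   # and back to a fresh Polars frame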
P.S. Polars has an optimization to overwrite a single value:

df_polars[4, 'b'] += 5
df_polars
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 10 │
│ 2 ┆ 21 │
│ 3 ┆ 31 │
│ 4 ┆ 41 │
│ 5 ┆ 55 │
└─────┴─────┘
But as far as I know, it doesn't allow slicing or anything.
Comment by richardbachman 17 hours ago
df.with_columns(pl.col.b + pl.row_index().is_between(1, 3))
# shape: (5, 2)
# ┌─────┬─────┐
# │ a ┆ b │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1 ┆ 10 │
# │ 2 ┆ 21 │
# │ 3 ┆ 31 │
# │ 4 ┆ 41 │
# │ 5 ┆ 50 │
# └─────┴─────┘
> Polars has an optimization to overwrite a single value

I believe it is just "syntax sugar" for calling `Series.scatter()` [1]
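Roughly, a sketch of what that sugar might expand to (not the actual implementation):

s = df_polars["b"]                                              # pull the column out as a Series
df_polars = df_polars.with_columns(s.scatter([4], [s[4] + 5]))  # write position 4; the Series keeps its name "b"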
> it doesn't allow slicing
I believe you are correct:
df_polars[1:3, "b"] += 1
# TypeError: cannot use "slice(1, 3, None)" for indexing

You can do:

df_polars[list(range(1, 4)), "b"] += 1
Perhaps nobody has requested slice syntax? It seems like it would be easy to add.

[1]: https://github.com/pola-rs/polars/blob/9079e20ae59f8c75dcce8...
Comment by thijsn 1 day ago
pandas is write-optimized: you can quickly and powerfully transform your data, and once you're used to it, it lets you get your work done fast. But figuring out what is happening in that code when you return to it a while later is a lot harder than with Polars, which is more read-optimized. That read-optimized API coincidentally allows the engine to perform more optimizations, because all implicit knowledge about the data must be typed out instead of kept in your head.
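A toy contrast, with a hypothetical frame df and a "score" column:

# write-optimized pandas: terse, mutates in place, context lives in your head
df["score"] = df["score"].fillna(0)
df = df[df["score"] > 0]

# read-optimized Polars: everything is spelled out, which also hands the engine a full plan
df = df.with_columns(pl.col("score").fill_null(0)).filter(pl.col("score") > 0)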
Comment by goatlover 1 day ago
No doubt some of this comes down to preference as to what's considered readable. I never really bought that argument that regular expressions create more problems than they're worth. Perhaps I side on the expressivity end of the readability debate.
Comment by satvikpendem 1 day ago
Polars is great, but it is better precisely because it learned from all the mistakes of Pandas. Don't besmirch the latter just because it now has to deal with the backwards compatibility of those mistakes, because when it first started, it was revolutionary.
Comment by crystal_revenge 1 day ago
I (and many others) hated Pandas long before Polars was a thing. The main problem is that it's a DSL that doesn't really work well with the rest of Python (that and multi-index is awful outside of the original financial setting). If you're doing pure data science work it doesn't really come up, but as soon as you need to transform that work into a production solution it starts to feel quite gross.
Before Polars my solution was (and still largely remains) to do most of the relational data transformations in the data layer, and then use dicts, lists, and NumPy for all the additional downstream transformations. This made it much easier to break out of the "DS bubble" and incorporate solutions into main products.
Comment by data-ottawa 1 day ago
To say pandas just copied it but worse is overly dismissive. The core of pandas has always been indexing/reindexing, split-apply-combine, and slicing views.
It’s a different approach than R’s data tables or frames.
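For example, split-apply-combine is still the classic pandas one-liner (hypothetical "key"/"value" columns):

df.groupby("key")["value"].mean()  # split on key, apply mean per group, combine the results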
Comment by aidos 1 day ago
Can you show an example? Seems interesting considering that code knowing about external context is not generally a good pattern when it comes to maintainability (security, readability).
I’ve lived through some horrific 10M line coldfusion codebases that embraced this paradigm to death - they were a whole other extreme where you could _write_ variables in the scope of where you were called from!
Comment by condwanaland 1 day ago
I can write code like:

penguin_sizes <- select(penguins, weight, height)

Here, weight and height are columns inside the dataframe. But I can refer to them as if they were objects in the environment (i.e. without quotes) because the select function looks for them inside the penguins dataframe (its first argument).
This is a very simple example but it's used extensively in some R paradigms
Comment by data-ottawa 1 day ago
And it's why you can do plot(x, sin) and get properly labelled graphs. It also powers the formula API that made caret and glm so easy to use.
Comment by sampo 1 day ago
Dataframes first appeared in S-PLUS in 1991-1992. Then R copied S, and from the mid-1990s onwards R grew in popularity in statistics. As free and open-source software, R started to take over the market among statisticians and other people who had been using other statistical software, mainly SAS, SPSS and Stata.

Given that S and R existed, why were they mostly not picked up by data analysts and programmers between 1995 and 2008, and why did only Python and Pandas make dataframes popular from 2008 onwards?
Comment by BeetleB 1 day ago
(Yes, yes - I know some people wish that were the case!)
Comment by gkbrk 1 day ago
They get forked and stay open source? At least this is what happens to all the popular ones. You can't really un-open-source a project if users want to keep it open-source.
Comment by rdedev 1 day ago
I work with chemical datasets, and this always involves converting SMILES strings to RDKit Molecule objects. Polars cannot do this as simply as calling .map in pandas.

Pandas is also much better for EDA. So calling it worse in every instance is not true. If you are doing pure data manipulation, then go ahead with Polars.
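For comparison, a sketch of the two, assuming a "smiles" column (the Polars route works, but with more ceremony, since Python objects need an explicit Object dtype):

from rdkit import Chem

# pandas: one call
mols = df_pd["smiles"].map(Chem.MolFromSmiles)

# polars: wrap the same function in map_elements
df_pl = df_pl.with_columns(
    pl.col("smiles")
    .map_elements(Chem.MolFromSmiles, return_dtype=pl.Object)
    .alias("mol")
)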
Comment by data-ottawa 1 day ago
When it feels like you're writing some external UDF that's executed in another environment, it does not feel as nice as throwing in a lambda, even if the lambda is not ideal.
Comment by vegabook 1 day ago
https://docs.pola.rs/api/python/dev/reference/expressions/ap...
You can also iter_rows into a lambda if you really want to.
https://docs.pola.rs/api/python/stable/reference/dataframe/a...
Personally I find it extremely rare that I need to do this given Polars expressions are so comprehensive, including when.then.otherwise when all else fails.
Comment by data-ottawa 1 day ago
It also does batches when you declare scalar outputs, but you can't control the batch size. That usually isn't an issue, but I've run into situations where it is.
Comment by rich_sasha 1 day ago
Where I certainly disagree is on the "frame as a dict of time series" use case, and general time series analysis.
The feel is also different. Pandas is an interactive data analysis container, poorly suited for production use. Polars I feel is the other way round.
Comment by cruffle_duffle 1 day ago
Like jQuery, which hasn't fundamentally changed since I was a wee lad doing web dev. They didn't make major changes despite their approach to web dev being replaced by newer concepts found in Angular, Backbone, Mustache, and eventually React. And that is a good thing.
What I personally don’t want is something like angular that basically radically changed between 1.0 and 2.0. Might as well just call 2.0 something new.
Note: I’ve never heard of polars until this comment thread. Can’t wait to try it out.
Comment by lairv 1 day ago
import polars as pl
from concurrent.futures import ProcessPoolExecutor
pl.DataFrame({"a": [1,2,3], "b": [4,5,6]}).write_parquet("test.parquet")
def read_parquet():
    x = pl.read_parquet("test.parquet")
    print(x.shape)

with ProcessPoolExecutor() as executor:
    futures = [executor.submit(read_parquet) for _ in range(100)]
    r = [f.result() for f in futures]
Using a thread pool or the "spawn" start method works, but it makes Polars a pain to use inside e.g. PyTorch dataloaders.
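For reference, a sketch of the spawn workaround:

import multiprocessing as mp

# "spawn" starts clean interpreter processes instead of forking the multi-threaded parent
with ProcessPoolExecutor(mp_context=mp.get_context("spawn")) as executor:
    futures = [executor.submit(read_parquet) for _ in range(100)]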
Comment by skylurk 1 day ago
import polars as pl

pl.DataFrame({"a": [1, 2, 3]}).write_parquet("test.parquet")

def print_shape(df: pl.DataFrame) -> pl.DataFrame:
    print(df.shape)
    return df

lazy_frames = [
    pl.scan_parquet("test.parquet").map_batches(print_shape)
    for _ in range(100)
]

pl.collect_all(lazy_frames, comm_subplan_elim=False)
(comm_subplan_elim is important)
Comment by ritchie46 1 day ago
However, this is not a Polars issue. Using "fork" can leave ANY MUTEX in the process invalid (a multi-threaded query engine has plenty of mutexes). It is highly unsafe and rests on the assumption that none of the libraries in your process hold a lock at that time. That's not an assumption PyTorch dataloaders get to make.
Comment by lairv 1 day ago
That said, for the PyTorch DataLoader specifically, switching from fork to spawn removes copy-on-write, which can significantly increase startup time and, more importantly, memory usage. It often requires non-trivial refactors; many training codebases aren't designed for this and will simply OOM. So in practice, for this use case, I've found it more practical to just use pandas rather than do a full refactor.
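For the DataLoader itself, the switch is one argument, e.g. (dataset being whatever you already have):

from torch.utils.data import DataLoader

# "spawn" sidesteps the fork-while-holding-a-mutex hazard,
# at the cost of losing copy-on-write memory sharing
loader = DataLoader(dataset, num_workers=4, multiprocessing_context="spawn")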
Comment by schmidtleonard 1 day ago
Do they really still not have a good mechanism to toss a flag on a for loop to capture embarrassing parallelism easily?
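The closest thing I know of is still wrapping the loop in an executor; a sketch with stand-in process_one/items names:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as ex:
    results = list(ex.map(process_one, items))  # the parallel version of [process_one(i) for i in items]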
Comment by datsci_est_2015 1 day ago
Edit: hah, based on the sibling comment, I stand corrected
Comment by bovermyer 1 day ago
The professor doesn't actually care which tool we use as long as we produce nice graphs, so this is as good a time as any to experiment.
Comment by __mharrison__ 1 day ago
Pandas is better for plotting and third party integration.
Comment by vaylian 1 day ago
I used Pandas a lot with Jupyter notebooks. I don't have any experience with Polars. Is it also possible to work with Polars dataframes in Jupyter notebooks?
Comment by data-ottawa 1 day ago
UDFs in most dataframe libraries tend to feel better than writing UDFs for a SQL engine as well.

Polars specifically has a lazy mode which enables a query optimizer, so you get predicate pushdown and all the goodies of SQL, with extra control/primitives (sane pivoting, group_by_dynamic, etc.).
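A sketch of that lazy mode, with a hypothetical events.parquet:

import datetime as dt

lazy = (
    pl.scan_parquet("events.parquet")                  # nothing is read yet
    .filter(pl.col("ts") >= dt.datetime(2024, 1, 1))   # predicate pushed into the scan
    .sort("ts")                                        # group_by_dynamic needs a sorted index
    .group_by_dynamic("ts", every="1h")
    .agg(pl.col("value").mean())
)
df = lazy.collect()                                    # the optimizer runs here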
I do use ibis on top of DuckDB sometimes, but the UDF situation persists, and the way they organize their docs makes them very difficult to use.
Comment by data-ottawa 1 day ago
But I do agree with you.
Comment by noo_u 1 day ago
Unfortunately, there are a lot of third party libraries that work with Pandas that do not work with Polars, so the switch, even for new projects, should be done with that in mind.
Comment by skylurk 1 day ago
I maintain one of those libraries and everything is polars internally.
Comment by lvl155 1 day ago
OT, but I can't imagine data science being a job category for too long. It's got to be one of the first to go in the AI age, especially since the market is so saturated with mediocre talent.
Comment by data-ottawa 1 day ago
I don't think the field will go away with AI. Frankly, with LLMs I've automated the bottom 80% of queries I used to have to run for other users, and now I just focus on actual hard problems.

That "build a self-serve dashboard" or number-fetching work is now an agentic tool I built.

But the real meat of "my business specializes in X, we need models to do this well" has not yet been replaceable. I think most hard DS work is internal, so it isn't in training sets (yet).
Comment by claytonjy 1 day ago
Data Engineers took over the plumbing once they moved on from Scala and Spark. ML Engineers took over the modeling (and LLMs are now killing this job too, as it’s rare to need model training outside of big labs). Data analysts have to know SQL and python these days, and most DS are now just this, but with a nicer title and higher pay.
Once upon a time I thought DS would be much more about deeper statistics and causal inference, but those have proven to be rare, niche needs outside soft science academia.
Comment by datsci_est_2015 1 day ago
> as it’s rare to need model training outside of big labs
Do you think there are pre-trained models for, e.g., process optimization in primary metallurgy for steel manufacturing? Industrial engineers don't know anything about machine learning (by trade), and there are companies that bring specialized data science know-how to that industry to improve processes using modern data-driven methods, especially model building.
It’s almost like 99% of comments on this topic think that DS begins at image classification and ends at LLMs, with maybe a little bit of landing page A/B testing or something. Wild.
> Once upon a time I thought DS would be much more about deeper statistics and causal inference, but those have proven to be rare, niche needs outside soft science academia.
This is my entire career lol.
Comment by datsci_est_2015 1 day ago
Depends what your definition of “to go” means. Responsibilities swallowed by peers? Sure, and new job titles might pop up like Research & Development Engineer or something.
The discipline of creating automated systems to extract insights from data to create business value? I can’t really see that going anywhere. I mean, why tf would we be building so many data centers if there’s no value in the data they’re storing.
Comment by iugtmkbdfil834 1 day ago
This is interesting. I wanted to dig into it a little since I am not sure I am following the logic of that statement.
Do you mean that AI would take over the field, because by default most people there are already not producing anything that a simple 'talk to data' LLM won't deliver?
Comment by mynameisash 1 day ago
I used to work on teams where DS would put a ton of time into building quality models, gating production with defensible metrics. Now, my DS counterparts are writing prompts and calling it a day. I'm not at all convinced that the results are better, but I guess if you don't spend time (=money) on the work, it's hard to argue with the ROI?
Comment by datsci_est_2015 1 day ago
> writing prompts and calling it a day
What does this mean? They’re not creating pull requests and maintaining learning / analytics systems?
This kind of vagueposting gets on my nerves.
Comment by mynameisash 1 day ago
Sure, they check prompts into git. And there are a few notebooks that have been written and deployed, but most of that is collecting data and handing it off to ChatGPT. No, they're not maintaining learning/analytics systems. My team builds our data processing pipelines, and we support everything in production.
> This kind of vagueposting gets on my nerves.
What is vague about my comment?
Whereas in the past, the DS teams I worked with would do feature engineering and rigorous evaluation of models with retraining based on different criteria, now I'm seeing that teams are being lazy and saying, "We'll let the LLM do things. It can handle unstructured data, and we can give it new data without additional work on our part." Hence, they're simply writing a prompt and not doing much more.
Comment by datsci_est_2015 1 day ago
So many questions. That’s why I called it vague. I don’t know how any data scientist could read this and not have a million follow up questions. Is this offline learning? Online learning? What are the guardrails? Are there guardrails? Mostly, wtf?
Comment by mritchie712 1 day ago
It's funny to look back at the tricks that were needed to get gpt3 and 3.5 to write SQL (e.g. "you are a data analyst looking at a SQL database with table [tables]"). It's almost effortless now.
Comment by postalcoder 1 day ago
Like you said, perhaps the demise of phind was inevitable, with large models displacing them kind of like how Spotify displaced music piracy.
Comment by thibaut_barrere 1 day ago
I have integrated Explorer (https://github.com/elixir-explorer/explorer), which leverages it, into many Elixir apps, so I'm happy to have this.
Comment by thegabriele 1 day ago
Is this everyone's experience?
Comment by teekert 1 day ago
From there, of course, you slowly start to learn about types etc., and slowly you start to appreciate libraries and IDEs. But I knew tables, and statistics, and graphs, and Pandas (with the visual style of notebooks) led me to programming via that familiar world. At first with some frustration about Pandas and needing to write to Excel, do stuff, and read it back in again, but quickly moving into the opposite flow, where Excel itself became the limiting factor and I was annoyed whenever I had to use it.
I offered some "Programming for Biologists" courses, to teach people like me to do programming in this way, because it would be much less "dry" (pd.read_excel(...).plot.bar() and now you're programming). So far, wherever I offered the courses they said they preferred to teach programming "from the base up". Ah well! I've been told I'm not a programmer; I don't care. I solve problems (and that is the only way I am motivated enough to learn; I can't sit down solving LeetCode problems for hours, building exactly nothing).
(To be clear, I now do the Git, the Vim, the CI/CD, the LLM, the Bash, the Linux, the Nix, the Containers... just like a real programmer. My journey was simply different, and it suited me well; I believe others can repeat it and find joy in programming via a different route.)
Comment by kayson 1 day ago
I tried pandera and it left a lot to be desired. Static frame [1] seems promising but doesn't appear to be popular for some reason.
Comment by OutOfHere 1 day ago
The main exception is legacy code requiring maintenance whose owners are unwilling to upgrade Pandas.