Scientific datasets are riddled with copy-paste errors

Posted by jruohonen 1 day ago

Comments

Comment by TrackerFF 1 day ago

It wouldn't surprise me one bit if many of these things can be attributed to Excel usage. I'm a "power user" of excel, and when working on larger problems with tens of sheets, smaller mistakes can easily carry on. Even more so if you're not a proficient user.

One of my first jobs as an analyst was to clean up messy spreadsheets made by people, even very senior employees, who never bothered to learn excel properly.

Comment by jcattle 1 day ago

Yes, immediately thought the same. CSV alone is a footgun and a half on any computer which doesn't have . as the decimal separator.

Let alone column sorting and joining of data.

Comment by IanCal 23 hours ago

CSV occupies, even years after moving away from more raw data work, way too much of my brain is still dedicated to "ways of dealing with CSV from random places".

I can already hear people who like CSV coming in now, so to get some of my bottled up anger about CSV out and to forestall the responses I've seen before

* It's not standardised

* Yes I know you found an RFC from long after many generators and parsers were written. It's not a standard, is regularly not followed, doesn't specify allowing UTF-8 (lmao, in 2005 no less) or other character sets as just files. I have learned about many new character sets from submitted data from real users. I have had to split up files written in multiple different character sets because users concatenated files.

* "You can edit it in a text editor" which feels like a monkeys-paw wish "I want to edit the file easily" "Granted - your users can now edit the files easily". Users editing the files in text editors results in broken CSV files because your text editor isn't checking it's standards compliant or typed correctly, and couldn't even if it wanted to.

* Errors are not even detectable in many cases.

* Parsers are often either strict and so fail to deal with real world cases or deal with real world cases but let through broken files.

* Literally no types. Nice date field you have there, shame if someone were to add a mixture of different dd/mm/yy and mm/dd/yy into it.

* You can blame excel for being excel, but at some point if that csv file leaves an automated data handling system and a user can do something to it, it's getting loaded into excel and rewritten out. Say goodbye to prefixed 0s, a variety of gene names, dates and more in a fully unrecoverable fashion.

* "ah just use tabs" no your users will put tabs in. "That's why I use pipes" yes pipes too. I have written code to use actual data separators and actual record separators that exist in ASCII and still users found some way of adding those in mid word in some arbitrary data. The only three places I've ever seen these characters are 1. lists of ascii characters where I found them, 2. my code, 3. this users data. It must have been crafted deliberately to break things.

This, excel and other things are enormous issues. The fact that there any are manual steps along the path for this introduces so many places for errors. People writing things down then entering them into excel/whatever. Moving data between files. You ran some analysis and got graphs, are those the ones in the paper? Are they based on the same datasets? You later updated something, are all the downstream things updated?

This occurs in all kinds of papers, I've seen clear and obvious issues over datasets covering many billions of spending, in aggregate trillions. I can only assume the same is true in many other fields as well as those processes exist there too.

There is so much scope to improve things, and yet so much of this work is done by people who don't know what the options are and often are working late hours in personal time to sort that it's rarely addressed. My wife was still working on papers for a research position she left and was not being paid for any more years after, because the whole process is so slow for research -> publication. What time is there then for learning and designing a better way of tracking and recording data and teaching all the other people how to update & generate stats? I built things which helped but there's only so much of the workflow I could manage.

Comment by xorcist 21 hours ago

While I appreciate a good rant just as much as the next person, most of these points have nothing to do with CSV. They are a general problem with underspecifying data, which is exactly what happens when you move data between systems.

The amount of hours I have wasted on unifying character sets across single database tables is horrifying to even think about. And the months it took before an important national dataset that supposedly many people use across several types of businesses was staggering. That fact that that XML came with a DTD was apparently not a hindrance to doing unspeakable horrors with both attributes and cdata constructs.

Sure, you can specify MM/DD/YY in a table, but it people put DD/MM/YY in there, what are you going to do about it? And that's exactly what happens in the real world when people move data across systems. That's why mojibake is still a thing in 2026.

Comment by mcdonje 22 hours ago

You're blaming a lot of normal ETL problems on DSVs.

Like, specifying date as a type for a field in JSON isn't going to ensure that people format it correctly and uniformly. You still have parsing issues, except now you're duplicating the ignored schema for every data point. The benefit you get for all of that overhead is more useful for network issues than ensuring a file is well formed before sending it. The people who send garbage will be more likely to send garbage when the format isn't tabular.

There are types and there is a spec WHEN YOU DEFINE IT.

You define a spec. You deal with garbage that doesn't match the spec. You adjust your tools if the garbage-sending account is big. You warn or fire them if they're small. You shit-talk the garbage senders after hours to blow off steam. That's what ETL is.

DSVs aren't the problem. Or maybe they are for you because you're unable to address problems in your process, so you need a heavy unreadable format that enforces things that could be handled elsewhere.

Comment by jcattle 22 hours ago

I would kind of disagree.

We are talking here in the context of scientific datasets. Of course ETL plays a part here. However here it is really more the interplay of Excel with CSV which is often outputted by scientific instruments or scientific assistants.

You get your raw sensor data as a csv, just want to take a look in excel, it understandably mangles the data in attempt to infer column types, because of course it does, its's CSV! Then you mistakenly hit save and boom, all your data on disk is now an unrecoverable mangled mess.

Of course this is also the fault of not having good clean data practices, but with CSV and Excel it is just so, so easy to hold it wrong, simply because there is no right.

> so you need a heavy unreadable format

I prefer human unreadable if it means I get machine readable without any guesswork.

Comment by mcdonje 16 hours ago

That's Excel's type inference causing problems. Not an issue with CSV or any other type of DSV.

It is possible to import a CSV into Excel without type conversion. I just tested it two different ways.

While possible, it's not Excel's default way of doing things. Not always obvious or easy. Not enough people who use Excel really know how to use it.

Regardless, Excel mangling files via type inference is an Excel problem. It's not the fault of the file formats Excel reads in.

Comment by jcattle 5 hours ago

If you get an .xls which doesn't have very esoteric functions, I expect it to open about the same way in any Excel program and any other office suite.

With CSV I do not have that expectation. I know that for some random user-submitted CSVs, I will have to fiddle. Even if that means finding the one row in thousand rows which has some null value placeholder, messing up the whole automatic inference.

Comment by thunderfork 13 hours ago

The file format being ambiguous and underspecified enough to mangle is, though.

Comment by db48x 18 hours ago

> "You can edit it in a text editor" which feels like a monkeys-paw wish

Yes :) Although I will note that some editors are good enough to maintain the structure as the user edits. Consider Emacs with `csv-mode`, for example. Of course most users don’t have Emacs so they’ll just end up using notepad (or worse, Word).

Comment by adrianN 21 hours ago

This is an excellent rant, thanks for sharing. I didn’t have to work with csv s mich as you, but what experience I had I share your sentiment.

Comment by jcattle 23 hours ago

Completely agree. If CSVs stay read only and are not user-submitted but computer generated they can be okay at best.

Anything else? Nope nope nope!

Comment by abstractdev 17 hours ago

[flagged]

Comment by tempaccount5050 23 hours ago

I think that's a little unfair. It really comes down to parsing text and you'll have similar issues even if you use a database or whatever you think the "real" solution is. I have a project I'm working on right now that stores dates, phone numbers, and website links. Cleaning/parsing has been 90% of the work and I still have edge cases that aren't fully solved. Every time I think I'm done, I find something else I haven't thought about. Local AI models have been a huge help though for sanitizing.

Comment by tremon 23 hours ago

that's a little unfair. It really comes down to parsing text and you'll have similar issues even if you use a database or whatever

Feel free to show a real-world example of a database or whatever that takes the input string "IGF1 SEPT2 PRX3 MARCH1" and writes that into storage as ["IGF1", "2026-09-02", "PRX3", "2026-03-01"].

Also with Excel, an inadvertent click+drag can move data between cells, and since the cells are uniform it's hard to see that anything unintended happened. I've seen people lose files in Windows Explorer the same way: double-click with a shaky hand can easily move a file into a subdirectory.

Comment by tempaccount5050 21 hours ago

You still have to push and pull from the db. Meaning transforms still need to happen in either direction. I get what you're saying but it's just as easy to screw up a regex in either direction. Or making assumptions about how your language of choice will handle dates etc.

Comment by theshrike79 4 hours ago

Excel is the only "language" that just silently transforms anything that might be a date into a date. In such a frequency that SCIENTISTS HAVE RENAMED GENES to prevent it: https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

Comment by theshrike79 4 hours ago

Excel doesn't have unit testing or validation built in.

That's my biggest problem in the world right now. SO MANY BIG THINGS in the world are running on random Excel sheets created by FSM knows who full of formulas and shit that have zero validation.

...and people are worried about "vibe coded slop" - at least that stuff is made with actual programming languages with unit testing frameworks.

Nobody has ever went through an inherited Excel sheet can confirmed that every field in column CB has the same formula and no-one in the 42 people long inheritance chain has accidentally fat-fingered a static number in there.

Comment by coppsilgold 1 day ago

What should give people pause is how not complicated (I'd hesitate to say easy) it would be to create a python script that would generate fake data such that it would be all but impossible to determine whether it's real or not. You just need to model the measuring device and hypothesis you want to support, then sample away.

The people who get caught red handed like this are lazy, incompetent and stupid. Makes you wonder what about the ones not getting caught.

Comment by IanCal 23 hours ago

The other explanation is that often these are just mistakes that occur with a team of experts in their field but not data management, without a budget for building a more robust system, manually doing a lot of things with data. It's so easy to copy and paste something into the wrong place, to sort by a field and get things out of order, all kinds of issues like that.

Comment by kqr 23 hours ago

On the other hand, any time a hypothesis appears significant, the first reaction should be to verify that all the data going into the calculation is correct, rather than just assume it is. In my day-to-day industry experience, significant results come far more often from incorrect data than an actual discovery.

Comment by tempaccount5050 23 hours ago

Reminds me of a job I had doing QA at a pill factory in high school in the 90s. Weight, size, crush tests. After 3 months seeing 99.99% pass rates, I pulled out my statistics book and started generating fake data and just read books. If I could do it at 16...

Comment by ansgri 1 day ago

It's likely easier to publish genuine research if you have knowledge and rigor to properly synthesize the data. It's not as common as you'd think, and a good simulator is easily publishable, at least was some years ago in domains I'm familiar with.

Comment by kqr 23 hours ago

The conspiratory reason would be that copy-paste errors give plausible deniability of ill intent.

Comment by scotty79 1 day ago

> The people who get caught red handed like this are lazy, incompetent and stupid.

Being a cheat significantly correlates with laziness, incompetence and stupidity so there are probably very few cheats smart and diligent enough to not get caught.

Comment by isolli 1 day ago

The sample of cheaters that we know about is biased towards cheaters who get caught.

Comment by BobaFloutist 13 hours ago

To those, yes, but also to the weirdly high number of cheaters that can't help bragging to everyone that'll listen about their cheating.

Comment by tpoacher 1 day ago

Based on my experience of the field, I very much challenge that assumption.

Comment by coppsilgold 1 day ago

Indeed, the niches of smart cheats or smart criminals have a lot of room. Because the trajectories to reach that stage without being caught by a legal (good legal work) or moral (good person) attractor are sparse and that makes them somewhat rare.

Handwaving correlations between cheating/criminality and most personality/intelligence aspects is an error, not least because there is a selection bias problem (eg. who gets caught).

Comment by thaumasiotes 1 day ago

> Being a cheat significantly correlates with laziness, incompetence and stupidity

There is no evidence for this.

Comment by ethan_smith 1 day ago

[dead]

Comment by jcattle 1 day ago

Just a thought: This data engineering can only really occur in sciences with a significant "moat".

Expensive tools, expensive test setups, live, gene-altered animals, etc.

In fields such as deep learning or other more digital fields (my field is using a lot of freely available satellite data) replication is often cheaper and actual application of research outcomes is a lot more common.

Comment by mattkrause 20 hours ago

I used to think that but....

I've reviewed for a few "replication tracks" at ML Conferences and there are a surprising number of reports where people are simply unable to replicate published results. The reasons are all over the map: sometimes the original authors' code just needs to be fixed (new libraries, different environments), but other results simply don't seem to hold up.

Comment by steve_adams_86 1 day ago

This is legitimately so challenging to avoid, because loads of scientific processes are—to some degrees or others—bespoke and difficult to fully streamline and introduce efficient, well-structured, comprehensive QA.

A LOT of labour goes into making it work. Most scientists I know and work with are very diligent people who care a lot about the outputs being as correct as possible, but wow, their workflows aren't great.

My job is to try and address this in whatever ways are practical for the data and the people doing the science, and it's kind of like Saas in that you think it should be easy enough to spot problems, solve them, and carry on/become a billionaire, but... The world is much more complicated than that, and it's easier to fail in this endeavour than it is to break even.

The classic "DropBox is just rsync" or "I could build Airbnb in a weekend" sentiments have their commonalities and counterparts in science, and the reality is similarly defeating and punishing on both sides. Making science go faster while maintaining correctness is exceedingly difficult. There are so many moving parts. So many disparate participants who are wildly technical and capable, or brilliant at studying bacteria in starfish yet terrified to run a command in a terminal. Your user base has virtually nothing in common in terms of ability and willingness to do anything other than get their own work done. It's brutal.

So, I sympathize with the authors of these papers and I hope readers don't assume they're bad at what they do or that it's done in bad faith. It's genuinely difficult.

An anecdote: I created a tool for validating biodiversity data against a specification called Darwin Core. Initially our published data was failing to validate so much that I thought I'd made the tool wrong. Rather, the spec is so complex and vast that the people I work with were unable to manage to get valid data into the public repositories. And yet! They were able to publish, because the public repositories' own validation is... Invalid. That's the state of things.

Granted, the data is still correct enough to be useful, and the errors don't cause the results to indicate anything that they shouldn't. It's more like minor metadata issues, failures to maintain referential integrity across different datasets, etc. But it's a very real, very difficult problem.

Science isn't easy at all. So many hoops to jump through, so much rigor, so much data. Mistakes are inevitable.

Comment by SubiculumCode 1 day ago

One Offs. A lot of research results in one-off code. You may not go back to this dataset, these ideas again. When you do, sometimes years, later, you go, oh shit, this is hard to work with. So then you begin to build better structures, do the extra work it takes to make things easy to apply to new purposes or to accept new (but slightly different) datasets. It takes time and effort, and money. And that is where it all breaks down. Most scientists have to be jacks of many trades to get by.

Comment by nippoo 1 day ago

It's hard to avoid, but there are steps we can make towards fixing it. I spent years in academia building open-source data processing pipelines for neuroscience data and helping other researchers do the same. Most quantitative research goes through "lossy" steps between raw data and final results involving Excel spreadsheets, one-off MATLAB commands, copy pasting the results, etc.

In a lot of cases (where data is being collected by humans with a tape measure, say) there is room for error. But one of the things that's getting traction in some fields is open-source publication of both raw datasets and the evaluation/processing methods (in a Jupyter Notebook, say) in a way that lets other people run their analysis on your data, your analysis on their data, or at least re-run your start-to-finish pipeline and look for errors!

As is often the case, the holdups are mostly political: methods papers are less prestigious than the "real science" ones, and it takes journals / funders to mandate these things and provide funding/hosting for datasets for 10+ years, etc - researchers are a time-poor bunch and often won't do things unless there's an incentive to!

Comment by jiggunjer 1 day ago

Taking notebooks to a production environment isn't fun either. With ai there's no more excuse for using that coding crutch.

Comment by TheTaytay 1 day ago

Yes…mistakes are inevitable, and I get not expecting or demanding perfection. But the subtext here is that this is unlikely to be a mistake, and much more likely to be fraud.

There are incentives for these spreadsheets having the values that they do, and also there is no conceivable way that the values are correct, and on top of that, the most likely ways to get these values are to copy and paste large amounts of numbers, and even perturb some of them manually.

If you see this in accounting,(where there are also mistakes), it’s definitely fraud. (Awww man - we accidentally inflated our revenue and profit to meet expectations by accidentally duplicating numerous revenue lines and no one internally caught it! Dang interns!) If you see it in science, you ask the authors about it and they shrug and mumble a semi plausible explanation if you’re lucky? I can totally imagine a lab tech or grad student making a large copy paste mistake. I can’t imagine them making a series of them in such a way that it bolsters or proves the author’s claim AND goes completely undetected by everyone involved.

Comment by energy123 1 day ago

> I can’t imagine them making a series of them in such a way that it bolsters or proves the author’s claim AND goes completely undetected by everyone involved.

The small minority of cases that do fit this pattern get selected to be on the front page of HN. So we aren't drawing from a random sample of mistakes. All the selection effects work against the more common categories of mistakes showing up on the HN front page, such as author disinterest, reader disinterest, to rejection by the journal, to a lack of publicity if the null result is published. The more reliable tell that it's a fraud is that the authors didn't respond when the errors were discovered.

Comment by SubiculumCode 1 day ago

well, in that case, its bad. Obviously.

Comment by dataviz1000 1 day ago

> their workflows aren't great

Sounds like a startup idea.

Comment by analog31 1 day ago

Spend a few years working in the target environment. It will disabuse you of the idea that science research can be regularized with technology.

Comment by adampunk 1 day ago

You'll want to sit down when I tell you the budget these folks have for workflow solutions. Ain't gonna take long but might be shocking if you've got big startup hopes. ;)

Comment by steve_adams_86 1 day ago

For sure. These are often people who want better equipment to do their research, not software subscriptions that promise to force them to work in unfamiliar and uncompelling ways. You'd need a fantastic, game-changing idea to get meaningful traction.

One example of these might be systems like S3 and distributed computing in AWS. Like, huge ideas that take massive initiatives to implement, but make science meaningfully easier. I can't think of many other modern technologies we use that the team doesn't mostly resent (like Slack or Google Drive). They're largely interested in just doing the science, the rest eats into funding (which is increasingly sparse these days).

Comment by Starman_Jones 20 hours ago

This was almost two decades ago, but I worked in a lab running particle detection experiments from an “internet-capable”computer that started life with “Windows 98 already installed- no upgrade needed.” Any “workflow solutions” talk started and ended with “Can we get undergrads to do it for class credit?”

Comment by devmor 1 day ago

If you want to make no money, sure.

The solutions these scientists need are bespoke and share little in common. They also have fixed grant funding.

In 2009 I made $15/hr working with some PhDs and grad students in a couple different labs to automate their workflows - I was the highest paid person in the room most of the time.

Comment by devmor 1 day ago

A lot of the work I have done for scientists when I was a contractor (and a bit while working for bespoke software consultancies) was quite literally just making programmatic applications out of Excel sheets.

In one case, we used mdftools to literally use the original excel spreadsheet as our logic engine.

Comment by cyanydeez 1 day ago

just imagine you scan private insustry. this is a generic problem that LLMs wont solve in generative capabilities.

Comment by l5870uoo9y 1 day ago

> It could be either a fat-finger mistake when editing the Excel file or deliberate tampering to cover up real data that didn't tell the right story.

I can easily imagine after spending years or decades devoted to discovering a scientific breakthrough that some could be tempted to slightly alter the data. I believe there was some scandal about this a few years back with climate data. Fixing this is however something that AI would do fairly well.

Comment by CWwdcdk7h 20 hours ago

> some could be tempted to slightly alter the data

We even suspect the G. Mendel of altering his pea data, it is unlikely he got results so close to predicted ratios. So it is not "some", everyone is tempted to "clean up" data, be it by removing outliers or by duplicating "good" rows.

Comment by shevy-java 1 day ago

> AI would do fairly well

But AI can also hallucinate data. I am not sure this is an area for an automatic "AI is better than humans". Honesty is very important in science. There were even fake articles generated:

https://www.thelancet.com/journals/lancet/article/PIIS0140-6...

And some other article I forgot, about arsene or some other ion being used in/for DNA or so. Turned out to be totally fabricated. Right now I don't remember the name of the article; was from some years ago.

Comment by devmor 1 day ago

I don’t believe fixing this is not something AI would do well.

Identifying it is something AI could do well, though. It’s very good at finding patterns - that’s kind of essential to how it works.

Comment by kriro 22 hours ago

Innocent mistakes and frustrating back and forth are also very common, especially for interdisciplinary teams. The mismatch in tooling and workflows and manual copy & paste conversion is a thing to behold. Add multiple countries and Excel to the mix (dot vs comma, formulas being language specific), maybe have a couple of Chinese, Japanese, Russian or Arabic speking researchers in the group for some extra UTF-8 magic. Line endings on Linux vs. OSX vs. Windows.

Comment by cowartc 20 hours ago

The real rate is certainly higher because this only catches the laziest form of error. The harder problem is the same one we see in production ML. Your pipeline can produce confident results on garbage data and nothing in the system tells you. The first step isn't better models or better tools, its profiling the input before you trust anything downstream of it.

Comment by hackeraccount 16 hours ago

I always think about that Simpsons quote about alcohol "The cause of... and solution to... all of life's problems"

That's how I feel about copy-paste. Nothing is ever so janus-faced.

Comment by mmooss 6 hours ago

Note this exchange with the OP author:

> [Paper author:] Englund's claim that the Model 680 "records raw light measurements as it sees them without any post-processing" is incorrect. [...] It converts analog intensity signals to absorbance values using Beer's law, rounding results to the nearest 0.001 OD.

> [OP author:] Ok, that was incorrect on my part and shows my lack of knowledge about photometers. I was attempting to paraphrase an email from Bio-rad where they said: “The system [Bio-rad 680] was very basic and recorded OD as seen, it did not do any onboard manipulation of the data, it gave raw results for the user to interpret.”

For someone focusing on accuracy and sloppy work, that's a significant problem. How much of the OP is based on their "lack of knowledge" and reckless application of ignorance?

For the first issue, the section titled "Verdict" is followed by,

> the authors have so far not responded.

I would not issue verdicts without making sure I understand the other's perspective (note that it is a requirement in courts of law). That applies especially when I lack direct knowledge or expertise.

When I haven't taken that step, I've learned a thousand times that my certainty usually reflects my lack of imagination, knowledge, or careful thought. Even when I'm 'right', I'm 'wrong': even if the 'verdict' is the same, the truth differs from what I was so certain about. I used to express my certainty prematurely; now I know to keep it to myself until I know what I'm talking about, which frequently saves me from major and/or embarassing errors.

Comment by shevy-java 1 day ago

Not only that but sometimes at universities, they use AI to generate descriptions.

Recent example I found (semi-accidentally, I was only looking for microscopy related courses):

https://ufind.univie.ac.at/de/course.html?lv=301053&semester...

At the end of the description it has:

"Übersetzt mit DeepL.com (kostenlose Version)"

This means, in english, "translated via DeepL.com (free version)" aka the not-paid-for version. What I found baffling is that even for a single paragraph, some are too lazy to write stuff on their own - or, at the least, remove that disclaimer. Other people also pointed out that they saw this in autogenerated brochures/booklets, in the USA for instance; think I saw this about 3 months ago but I forgot which booklet it was. But the whole booklet was AI-autogenerated. To me this is all spam. I can not want to be bothered to read AI "content" when it is really just glorified slop-spam.