To most people, the world of scientific literature is a vast black box, an impenetrable mess of millions upon millions of jargon-filled studies. The idea of asking a scientific question of that black box — and getting a straight answer — is obviously appealing. A new AI-powered search engine called Consensus purports to do just that — but experts say that in its current beta version, this AI provides answers that can range from wrong to incoherent.

And in some cases, Grid’s own tests show, the results are not only misleading but potentially dangerous: they suggest, for instance, that the anti-parasitic drug ivermectin is effective against covid or that vaccines cause autism, despite strong, reputable, peer-reviewed evidence against both propositions.

Sponsored by the Allen Institute for AI, a prominent Seattle lab, and several venture-capital firms, Consensus joins a growing list of supposedly AI-driven tools in the public spotlight; this week, OpenAI’s ChatGPT sent many heads spinning with its apparent ability to engage in dialogue, compose essays or poetry, and even write code. But just as users quickly exposed gaps in that AI’s abilities, Consensus’ promised power to offer up science’s best answer to any given question has some gaping holes.

“I don’t think there is major ethics concerns here, as much as the question is the ‘Does this actually work?’ one,” said Seth Baum, executive director of the Global Catastrophic Risk Institute, who studies AI safety. “Something like this might be a valuable tool if it helps people dig deeper into scientific literature. The concern is that it might just make them stop.”

But other experts suggested that directing users to disproved, dubious or outdated results could mislead them.

Co-founder and CEO Eric Olson told Grid that Consensus is still only 10 weeks old with much to improve upon. “Right now we show conclusions that are a reflection of the relevant research papers,” he said, adding that some problematic results will appear in response to queries using the current version of the AI app. Future iterations should guide users to a more thorough answer to most questions, he said, but “we have a long way to go.”

In the meantime, Consensus is publicly available through a website that promises “Verified Scientific Results” and claims to make “getting information from peer-reviewed research as easy as a Google search.”

“Is Earth warming?”

Unveiled in September to at least some good reviews, Consensus works by combing through more than 400 million peer-reviewed papers from thousands of peer-reviewed scientific journals. The idea is to take key conclusions from those papers and offer them up in response to a question; according to the site, the app “extracts and surfaces the most relevant findings.” So, how does it do?

“I am not impressed with this, and I don’t think it’s ready for prime-time,” said Andrew Dessler, a professor and director of the Texas Center for Climate Studies at Texas A&M University, after testing out the site. Dessler asked the question, “Is the Earth warming?” which he said should be “a layup” for Consensus; for decades, climate scientists have warned that human activities are heating up the planet. Yet, Dessler said the first answer he got from Consensus blamed the sun for such a rise in planetary temperature, inadvertently echoing one of the oldest climate change-denier talking points.

“That’s not just wrong, but it didn’t answer the question I asked,” he said. The concept of the tool “would be very helpful” if its performance could be significantly improved, he added.

Grid asked the Consensus search engine several questions ranging in topic and complexity, and received some questionable results. Two of the first three results to the query “do vaccines cause autism” suggested they do, while the actual scientific consensus is the exact opposite. “How old is the Earth?” offered a top result from a paper published in 1946, with an age that differs from the accepted figure by 1.5 billion years. Meanwhile, asking Google the same thing produces the actual scientific consensus number, 4.54 billion years, in large bold print at the top of the page, with multiple results below explaining how science arrived at it.

Consensus results for the query "Do vaccines cause autism."

“Unfortunately, in the case of something like the vaccine autism question, there are flawed papers that claim there is a connection. Right now, those will make their way into the results,” acknowledged Olson. In the future, his team hopes to add “quality indicators” to results, showing how many papers agree with a given conclusion compared to another.

Glenn Branch, the deputy director of the National Center for Science Education (NCSE), also put Consensus through its paces. He compared it to another tool for querying the scientific literature, Google Scholar, and suggested that while Consensus may offer some advantages (better use of natural-language processing, for example), it can also overreach, suggesting science has reached conclusions even on topics where essentially no research has been done. (“Is kale unhealthy for cats” was Branch’s test.)

“Certainly neither can be uncritically relied upon to deliver the correct verdict,” he told Grid. “Both provide at best a starting point for investigation.”

Consensus results for the query "how old is earth."

Peer review is no guarantee

Consensus, the search engine, notes that it relies on peer-reviewed studies, in which a journal publishes a paper only after two or three outside researchers (the “peers”) have vetted it, often reworking its wording or demanding additional experiments before publication. But that is only the start of real-life “consensus” on scientific questions, which emerges from long review, often bitter disagreement, replication of experiments and an acknowledgment that scientific questions often have no final answer.

“Peer review is no guarantee of quality,” said Eric Topol, a cardiologist who founded and directs the Scripps Research Translational Institute. “You often see problems with papers published in very reputable journals, where they are detected only after 100 or so people on Twitter have taken a hard look at them.”

In other words, peer review is better than going without — but it’s not infallible.

Topol noted that the quality of journals is very uneven. Some “predatory” journals publish findings with little oversight in return for author fees. “There’s a lot of crap out there.”

Nicholas Christakis, a sociologist and physician at Yale University and an adviser to Consensus, acknowledged that a lot of scientific papers are wrong “and there are a lot of things where scientists disagree, which is the result you’ll get from Consensus in that case.” But he added that science is supposed to be a self-correcting enterprise where new results correct errors and lead to more cohesive answers: “This is almost a ‘philosophy of science’ point here, all knowledge is provisional, but we can feel more confident about conclusions over time.”

Scientists have tried to tackle study quality in several ways, from ranking journals and individual papers by the number of citations they get — presuming that good papers get more head nods — to developing lists of predatory journals to warn off scholars.

Consensus results for the query "how effective is ivermectin for covid."

It’s not clear whether Consensus’ AI takes the perceived quality of a journal into account. The top result for Grid’s “do vaccines cause autism” query was sourced from a 2006 paper in the Egyptian Journal of Immunology, rather than from major medical journals such as the Lancet or the New England Journal of Medicine.

“Nor are all papers created equal,” said Jena Barchas-Lichtenstein, who leads media research at Knology, a scientific collective that transmits social science findings to the public and news outlets. “How well does the technology differentiate between a review article which itself seeks to identify consensus and a single research study?”

Epistemic trespassing

More fundamentally, some experts saw scientific literature as an unrealistic source of search engine answers. Scientific studies are an always-changing kaleidoscope of experimental and natural observations aimed at proving and disproving hypotheses of interest to scientific fields, said Topol. Older papers are meant to be superseded by newer, better ones. And review articles aside, they are not intended as encyclopedia entries.

Consensus raises issues of “epistemic trespassing” in science, added Barchas-Lichtenstein, where expertise and judgment in one field are simply not transferable to another (like a radiologist trying to fix a transmission or an economist weighing in on climate science). That leaves even an expert in one field “simply not knowing how to assess whether work is good or bad” in other, unrelated areas of study.

“Within my own field, I know the reputations of various journals and scholars. I know these folks’ work. I can look at a bibliography and see if there are glaring red flags in terms of what is and isn’t cited,” she said. “As soon as I move to another field, even a closely related one, I simply can’t do that.”

Consensus results for the query "is climate change real."

Scholars might rely on cues such as a journal’s “impact factor” (a measure of how often its papers are cited by other scientists) and citation counts, on the presumption that the best papers get cited most often. However, those measures are controversial within science. “You might have one terrible paper cited 1,000 times because it’s provocative, and wrong,” said Topol. “That outweighs the 10 other papers cited 10 times that are right.”
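Topol’s arithmetic can be checked with a rough sketch. The citation counts below come straight from his example; the function and variable names are illustrative assumptions, and nothing here reflects how Consensus or any citation index actually ranks papers.

```python
# Citation counts taken from Topol's example; names are hypothetical.
wrong_paper_citations = [1000]       # one provocative but wrong paper
right_paper_citations = [10] * 10    # ten sound papers, ten citations each


def citation_weighted_share(group, everything):
    """Fraction of all citations that belong to the papers in `group`."""
    return sum(group) / sum(everything)


all_citations = wrong_paper_citations + right_paper_citations

# Weighting by citations: the single wrong paper holds 1,000 of 1,100 citations.
wrong_share_by_citations = citation_weighted_share(wrong_paper_citations, all_citations)

# Counting papers equally: the wrong paper is 1 of 11.
wrong_share_by_paper_count = len(wrong_paper_citations) / len(all_citations)

print(f"Wrong paper's share by citations: {wrong_share_by_citations:.0%}")
print(f"Wrong paper's share by paper count: {wrong_share_by_paper_count:.0%}")
```

Weighted by citations, the lone wrong paper accounts for roughly nine-tenths of the total, even though it is only one of 11 papers.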

Moreover, the scientific jargon in which conclusions are couched, and which Consensus reproduces, is hard for general readers to understand and can change in meaning from one scientific field to another. “You almost have to be an expert to understand what the sentences that are pulled out mean,” said Baum.

Still, that isn’t to say that there isn’t a need here as science gets increasingly complicated over time. “Peer-reviewed research is always going to be distilled for the general public — that’s what teachers and professors, science journalists, Wikipedia editors and science communicators in general have been doing all along,” said Branch, of the NCSE. “The ever-increasing volume of scientific research is probably the chief argument for automating the distillation process as far as possible.”

Inherent limitations

Although AI is a trendy term, at bottom it is simply a statistical approach to finding links between data points in a pile of data. In the case of Consensus, those data points are conclusive scientific sentences pulled out of much longer, dense scientific papers. An AI doesn’t “know” what it is talking about any more than a calculator understands the concept of infinity.

One of the biggest debates in AI right now is whether this raw statistical approach to finding linkages — which powers translation, chatbots, art and much else in “machine learning” studies — is up to cracking open tasks that require some nuance and actual understanding of the subject matter. “My intuition is that the statistical extrapolation approach seems inherently limited,” said Baum. “But what do I know?”

In the case of the ChatGPT chatbot, at least, critics have noticed this tripping up an AI asked simple questions, such as whether 10 kilograms of iron is heavier than 10 kilograms of cotton (you may recall this one from junior high), or to explain why churros are so important in home surgery. “At least for now, it still takes a human to know which plausible bits actually belong together,” wrote New York University AI scientist Gary Marcus on Substack.

Unlike ChatGPT, said Olson, “our models are extractive and not generative (meaning the results are word-for-word quotes) and that protects us from hallucinating answers” with citations to sources. Eventually, the Consensus team wants to implement “generative AI into our product,” he added, but with “guardrails,” where papers are summarized with results tied back to the underlying sources.
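To make the extractive-versus-generative distinction concrete, here is a toy sketch of extraction. It is not Consensus’ method: the word-overlap scoring, the function name and the two sample “papers” are all invented for illustration. The only property it is meant to show is that every returned sentence is a word-for-word quote tied to its source.

```python
import re


def extract_sentences(query, papers, top_k=2):
    """Return verbatim (sentence, title) pairs ranked by word overlap with the query."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    scored = []
    for title, text in papers.items():
        # Split the text into sentences at ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            overlap = len(query_words & words)
            if overlap:
                scored.append((overlap, sentence, title))
    # Highest overlap first; the sort is stable, so ties keep document order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(sentence, title) for _, sentence, title in scored[:top_k]]


# Hypothetical sample texts, invented for this sketch.
papers = {
    "Hypothetical Paper A": "We measured surface temperature. The Earth is warming rapidly.",
    "Hypothetical Paper B": "Cats dislike kale. No warming effect was tested.",
}

results = extract_sentences("Is the Earth warming?", papers)
for sentence, title in results:
    print(f"{sentence} [{title}]")
```

Because nothing is rewritten, such a system cannot invent a sentence. It can still surface a quote from a flawed or irrelevant paper, which is the weakness the tests described earlier ran into.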

Are you feeling lucky?

Several search engines, notably PubMed (operated by the National Institutes of Health) and Google Scholar, already allow searches of scientific literature. However, those rely heavily on the user knowing the keywords, often jargon, central to a research question to produce sensible results. Consensus is, at least, aimed at overcoming that hurdle to give the public more access to scientific research, which is often taxpayer funded.

“To their credit, the concept of Consensus seems to be that if you use their tool, you are supposed to look deep and think about the results to judge if you are seeing a real consensus,” said Baum, noting the search engine gives multiple answers to queries. It’s not the “I’m feeling lucky” one-shot answer from the Google search engine, he noted.

Dessler, of Texas A&M, said that the rapid advance of AI makes a tool like Consensus seem inevitable. “I hope they can improve its performance,” he said. “If they could get it working well, it would be very helpful.”

“What else do we have except science?” asked Christakis, the Consensus adviser. “It really is the best way we have to getting at the truth.”

Thanks to Lillian Barkley for copy editing this article.