Technology

Gemini’s data analysis capabilities aren’t as good as Google claims

Published

10 months ago

June 29, 2024

IAM

One of the selling points of Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their “long context,” such as summarizing multiple hundred-page documents or searching across scenes in video footage.

But recent research suggests that the models aren’t, in fact, very good at those things.

Two separate studies examined how well Google’s Gemini models and others make sense of large amounts of data (think the length of “War and Peace”). Both find that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large data sets correctly; in one series of document-based tests, the models gave the right answer only 40% and 50% of the time.

“While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don’t actually ‘understand’ the content,” Marzena Karpińska, a postdoc at UMass Amherst and a co-author of one of the studies, told TechCrunch.

Gemini’s context window falls short

A model’s context, or context window, refers to the input data (e.g. text) that the model considers before generating output (e.g. additional text). A simple question such as “Who won the 2020 US presidential election?” can serve as context, as can a movie script, a show or an audio clip. And as context windows grow, so does the size of the documents that fit into them.

The latest versions of Gemini can take in upward of 2 million tokens as context. (“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas” and “tic” in the word “fantastic.”) That’s roughly equivalent to 1.4 million words, two hours of video or 22 hours of audio, the largest context of any commercially available model.
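To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that the conversion quoted above (2 million tokens to roughly 1.4 million words) holds as an average ratio; real tokenizers vary with language and content, and the function names are hypothetical, chosen only for illustration.

```python
# Rough ratio implied by the figures above: 2 million tokens ~ 1.4 million words.
# This is an assumed average, not a property of any particular tokenizer.
CONTEXT_TOKENS = 2_000_000
CONTEXT_WORDS = 1_400_000

WORDS_PER_TOKEN = CONTEXT_WORDS / CONTEXT_TOKENS  # 0.7 words per token


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a text of `word_count` words would use."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count: int, context_tokens: int = CONTEXT_TOKENS) -> bool:
    """Check whether the estimated token count fits in the given context window."""
    return estimate_tokens(word_count) <= context_tokens


if __name__ == "__main__":
    book_words = 260_000  # a long novel
    print(estimate_tokens(book_words))             # estimated tokens for the book
    print(fits_in_context(book_words, 1_000_000))  # fits a 1-million-token window?
```

By this estimate, a 260,000-word book comes to roughly 371,000 tokens, which fits comfortably inside a 1-million-token window. Fitting in the window, of course, says nothing about how well the model reasons over what it has been given.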

In a briefing earlier this year, Google showed off several pre-recorded demos meant to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing broadcast (some 402 pages) for quotes containing jokes, and then find a scene in the broadcast that resembled a pencil sketch.

Google DeepMind’s VP of research Oriol Vinyals, who led the briefing, described the model as “magical.”

“(1.5 Pro) does these kinds of reasoning tasks on every page, on every word,” he said.

That might have been an exaggeration.

In one of the aforementioned studies benchmarking these capabilities, Karpińska and researchers from the Allen Institute for AI and Princeton asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn’t “cheat” by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to understand without reading the books in their entirety.

Given a statement such as “With her Apoth abilities, Nusis is able to reverse engineer a type of portal opened using the reagent key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash, having ingested the relevant book, had to say whether the statement was true or false and explain their reasoning.

Tested on one book of about 260,000 words (~520 pages), the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flip is significantly better at answering questions about the book than Google’s latest machine learning model. Averaged across all the benchmark results, neither model managed to beat random chance in question-answering accuracy.

“We have noticed that models have greater difficulty verifying claims that require considering larger sections of a book, or even the entire book, compared to claims that can be solved by taking evidence at the sentence level,” Karpińska said. “Qualitatively, we also observed that models have difficulty validating claims about implicit information that is clear to a human reader but not explicitly stated in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason” over videos, that is, to search through them and answer questions about their content.

The co-authors created a data set of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in them (e.g., “What cartoon character is on this cake?”). To evaluate the models, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like footage.

Flash didn’t perform all that well. In a test that had the model transcribe six handwritten digits from a “slideshow” of 25 images, Flash got around 50% of the transcriptions right. Accuracy dropped to around 30% with eight digits.

“On real question-answering tasks over images, this seems particularly difficult for all the models we tested,” Michael Saxon, a doctoral student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That little bit of reasoning, recognizing that a number is in a box and reading it, might be what breaks the model.”

Google is overpromising with Gemini

Neither study has been peer-reviewed, nor does either examine the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn’t meant to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.

Still, both add fuel to the fire: Google has been overpromising and underdelivering with Gemini from the beginning. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google is the only model provider that gives the context window top billing in its advertising.

“There is nothing wrong with simply saying, ‘Our model can accept X tokens,’ based on objective technical details,” Saxon said. “But the question is: What useful thing can be done with it?”

More broadly, generative AI is coming under increasing scrutiny as businesses (and investors) grow frustrated with the technology’s limitations.

In two recent Boston Consulting Group surveys, about half of the respondents, all CEOs, said they don’t expect generative AI to deliver significant productivity gains and that they are worried about potential errors and data breaches arising from generative AI tools. PitchBook recently reported that early-stage generative AI dealmaking has declined for two consecutive quarters, down 76% from its peak in Q3 2023.

Faced with meeting-summarizing chatbots that conjure up fictitious details about people and AI search platforms that are essentially plagiarism generators, customers are on the hunt for promising differentiators. Google, which has been racing, sometimes clumsily, to catch up with its generative AI rivals, desperately wanted Gemini’s context window to be one of those differentiators.

But the bet, it seems, was premature.

“We haven’t figured out how to really show that ‘reasoning’ or ‘understanding’ is happening across long documents, and basically every group publishing these models is just pulling together their own ad hoc assessments to make these claims,” Karpińska said. “Without knowing how long-context processing is implemented, and the companies don’t share that detail, it’s hard to say how realistic these claims are.”

Google didn’t reply to a request for comment.

Both Saxon and Karpińska believe that the antidote to grandiose claims about generative AI is better benchmarks and, in the same vein, greater emphasis on third-party critique. Saxon notes that one of the more common long-context tests (heavily cited by Google in its marketing materials), the “needle in a haystack,” measures only a model’s ability to retrieve specific pieces of information, such as names and numbers, from data sets, not how well it answers complex questions about that information.

“All scientists and most engineers using these models generally agree that our current benchmarking culture is broken,” Saxon said, “so it’s important that the public understands to take these giant reports containing numbers like ‘general intelligence across benchmarks’ with an enormous pinch of salt.”

This article was originally published on : techcrunch.com

Technology

Palantir exec defends the company’s immigration surveillance work

Published

8 hours ago

April 21, 2025

IAM

One of the founders of startup accelerator Y Combinator this weekend sharply criticized the controversial data analytics company Palantir, prompting a company executive to offer a broad defense of Palantir’s work.

The exchange came after federal filings showed that US Immigration and Customs Enforcement (ICE), the agency tasked with carrying out the Trump administration’s aggressive deportation strategy, is paying Palantir $30 million to create what it calls an immigration operating system, or ImmigrationOS, to help ICE decide whom to target for deportation and to offer “real-time visibility” into self-deportation.

Y Combinator founder Paul Graham shared headlines about the Palantir contract on X, writing: “It is a very exciting time in technology right now. If you are a first-rate programmer, there is a huge number of other places where you can work rather than at a company building the infrastructure of a police state.”

In response, Palantir’s global business head Ted Mabrey wrote that he is “looking forward to the next set of employees who decided to apply to Palantir after reading your post.”

Mabrey didn’t discuss the details of Palantir’s current work with ICE, but said that the company began working with the Department of Homeland Security (under which ICE operates) “in an immediate response to the assassination of Agent Jaime Zapata by the Zetas in an effort called Operation Fallen Hero.”

“When people are alive because of what you built, and others are not alive because what you built was not yet good enough, you develop a completely different view of the meaning of your work,” said Mabrey.

He also compared Graham’s criticism to the 2018 protests over Google’s Project Maven, which ultimately prompted the company to stop its work analyzing drone imagery for the military. (Google has since signaled that it is once again more open to defense work.)

Mabrey urged anyone interested in working for Palantir to read CEO Alexander Karp’s latest book, “The Technological Republic,” which argues that the software industry must rebuild its relationship with the government. (The company has been recruiting on university campuses with signs declaring that “a moment of reckoning has arrived for the West.”)

“We hire believers,” Mabrey continued. “Not in the sense of homogeneity of belief, but in the intrinsic capacity to believe in something greater than yourself.”

Graham then pressed Mabrey “to publicly commit on behalf of Palantir not to build things that help the government violate the US constitution,” although he acknowledged in another post that such a commitment “would not have legal force.”

“However, I hope that if [they make that commitment] and a Palantir employee is one day asked to do something illegal, he will say ‘I didn’t sign up for this’ and refuse,” wrote Graham.

Mabrey in turn compared Graham’s question to the “will you promise to stop beating your wife” courtroom trick, but he added that the company “has made this promise so many ways from Sunday,” starting with its commitment to the “3,500 thoughtful people who grind only because they believe that they make the world a better place every day, because they see it firsthand.”

This article was originally published on : techcrunch.com

Technology

Congress has questions about 23andMe’s bankruptcy

Published

2 days ago

April 20, 2025

IAM

Three leaders of the House Energy and Commerce Committee said that they’re investigating how 23andMe’s bankruptcy could affect customer data.

Representatives Brett Guthrie, Gus Bilirakis and Gary Palmer (all Republicans) sent a letter on Thursday to 23andMe’s Joe Selsavage, asking a number of questions about how 23andMe will handle customer data if the company is sold.

The letter also says that some customers have reported problems deleting their data from the 23andMe website, and notes that direct-to-consumer companies such as 23andMe are generally not covered by the Health Insurance Portability and Accountability Act (HIPAA).

“Considering the lack of HIPAA protections, the patchwork of state laws covering genetic privacy, and the uncertainty surrounding customer information in the event of a sale of the company or of customer data, we are concerned that this highly sensitive information is at risk of being compromised,” the representatives wrote.

23andMe, which settled a data breach case for $30 million last year, filed for Chapter 11 bankruptcy in March, and co-founder and CEO Anne Wojcicki said she was resigning to become a private bidder for the company.

This article was originally published on : techcrunch.com

Technology

The White House replaces the Covid.gov site with a “lab leak” theory page

Published

3 days ago

April 18, 2025

IAM

The government’s Covid.gov website used to house information about Covid-19, testing and treatment. Now, under President Trump’s watch, the site redirects to a White House page promoting the unverified theory that Covid-19 originated in a Chinese laboratory.

The theory, which many virologists dispute, was put forward in a report by House Republicans last year, which concluded that the pandemic began with a laboratory leak in China. House Democrats issued a rebuttal at the time, stating that the probe had failed to determine Covid’s real origin.

The Covidtests.gov website, where people could previously order free coronavirus tests, also redirects to this new page.

The new White House website also includes medical misinformation about the treatment of the virus, falsely claiming that social distancing, masks and lockdowns were not effective in limiting the spread of Covid-19. However, hundreds of studies have shown that these preventive measures do in fact reduce respiratory infections such as Covid-19.

In the months since Trump returned to the US presidency, many websites have been edited to reflect his administration’s agenda. With the help of Elon Musk’s DOGE, the government has tried to remove hundreds of words related to diversity from government documents. These include words such as “black,” “disability,” “diversity,” “sex,” “racism,” “women” and many more. The government has also removed mentions of scientifically proven climate change from environmental websites.

This article was originally published on : techcrunch.com