Gemini’s data analysis capabilities aren’t as good as Google claims

In this photo illustration a Gemini logo and a welcome message on Gemini website are displayed on two screens.

One of the selling points of Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that, thanks to their "long context," these models can accomplish previously impossible tasks, such as summarizing multiple 100-page documents or searching across scenes in video footage.

But recent research suggests that these models actually aren’t very good at this stuff.

Two separate studies examined how well Google's Gemini models and others make sense of enormous amounts of data (think the length of "War and Peace"). Both studies find that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large data sets accurately; in one series of document-based tests, the models gave the right answer only 40% to 50% of the time.

"While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don't actually 'understand' the content," Marzena Karpińska, a postdoc at UMass Amherst and co-author on one of the studies, told TechCrunch.

Gemini's context window falls short

A model's context, or context window, refers to the input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question ("Who won the 2020 US presidential election?") can serve as context, as can a movie script, a show, or an audio clip. As context windows grow, so does the size of the documents that can fit into them.

The latest versions of Gemini can take in upwards of 2 million tokens as context. ("Tokens" are subdivided chunks of raw data, like the syllables "fan," "tas," and "tic" in "fantastic.") That's roughly equivalent to 1.4 million words, two hours of video, or 22 hours of audio, the largest context of any commercially available model.
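Using only the ratios stated above (2 million tokens to about 1.4 million words), a back-of-the-envelope conversion looks like this. The ~0.7 words-per-token figure is derived from the article's own numbers, not from any official tokenizer, so treat it as a rough sketch:

```python
# Rough conversion between tokens and English words, using the ratio
# implied above: 2,000,000 tokens ~= 1,400,000 words (about 0.7 words/token).
WORDS_PER_TOKEN = 1_400_000 / 2_000_000  # an approximation, not a tokenizer fact

def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a text of a given word count consumes."""
    return round(words / WORDS_PER_TOKEN)

print(tokens_to_words(2_000_000))  # 1400000
print(words_to_tokens(260_000))    # 371429 -- the ~520-page test book below
```

By this estimate, the ~260,000-word book used in the UMass study fills well under half of even a 1-million-token context window.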

At a briefing earlier this year, Google showed several pre-recorded demos meant to illustrate the potential of Gemini's long-context capabilities. One had Gemini 1.5 Pro comb through the transcript of the Apollo 11 moon landing broadcast (around 402 pages) searching for quotes containing jokes, then find a scene in the broadcast that resembled a pencil sketch.

Google DeepMind VP of research Oriol Vinyals, who led the briefing, described the model as "magical."

"[1.5 Pro] does these kinds of reasoning tasks on every page, on every word," he said.

That might have been an exaggeration.

In one of the aforementioned studies, Karpińska, together with researchers from the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn't "cheat" by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to grasp without reading the books in their entirety.

Given a statement like "With her Apoth abilities, Nusis is able to reverse engineer a type of portal opened using the reagent key found in Rona's wooden chest," Gemini 1.5 Pro and 1.5 Flash, after ingesting the relevant book, had to determine whether the statement was true or false and explain their reasoning.

Image Credits: University of Massachusetts at Amherst

Tested on one book of around 260,000 words (~520 pages), the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flip would answer questions about the book significantly better than Google's latest machine learning model. Averaging across all the benchmark results, neither model managed better-than-chance accuracy at question answering.
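The "coin flip" comparison follows from the structure of the task: on a balanced true/false benchmark, blind guessing averages 50%. A minimal simulation makes the baseline concrete (the simulation itself is illustrative, not part of the study):

```python
import random

def random_truefalse_accuracy(n_questions: int = 100_000, seed: int = 0) -> float:
    """Simulate guessing on true/false statements with random ground truth.

    Expected accuracy is 0.5, since guess and answer agree half the time.
    """
    rng = random.Random(seed)
    correct = sum(
        rng.choice([True, False]) == rng.choice([True, False])
        for _ in range(n_questions)
    )
    return correct / n_questions

chance = random_truefalse_accuracy()
print(round(chance, 2))  # ~0.5
print(0.467 < chance)    # 1.5 Pro's reported accuracy sits below chance
print(0.20 < chance)     # Flash's reported accuracy sits well below chance
```

Against that 50% baseline, 1.5 Pro's 46.7% and Flash's 20% both fall short of simply guessing.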

"We have noticed that models have greater difficulty verifying claims that require considering larger sections of a book, or even the entire book, compared to claims that can be solved by taking evidence at the sentence level," Karpińska said. "Qualitatively, we also observed that models have difficulty validating claims for implicit information that are clear to a human reader but not explicitly stated in the text."

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to "reason over" videos, that is, to search through them and answer questions about their content.

The co-authors created a data set of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in the images (e.g., "What cartoon character is on this cake?"). To evaluate the models, they picked one of the images at random and inserted "distractor" images before and after it to create slideshow-like footage.

Flash didn't perform so well. In a test in which the model had to transcribe six handwritten digits from a "slideshow" of 25 images, Flash got around 50% of the transcriptions right. Accuracy dropped to around 30% with eight digits.
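Those whole-sequence numbers imply the per-digit failure rate compounds. If every digit had to be read correctly and errors were independent (an assumption for illustration; the study reports only sequence-level accuracy), 50% success on six digits corresponds to roughly 89% accuracy per digit:

```python
# Hypothetical back-calculation: infer per-digit accuracy from whole-sequence
# accuracy, assuming digits are transcribed independently (an assumption not
# stated in the study). sequence_accuracy = per_digit_accuracy ** n_digits.
def per_item_accuracy(sequence_accuracy: float, n_items: int) -> float:
    """Solve p**n = sequence_accuracy for p."""
    return sequence_accuracy ** (1 / n_items)

print(round(per_item_accuracy(0.50, 6), 3))  # ~0.891 for the six-digit test
print(round(per_item_accuracy(0.30, 8), 3))  # ~0.860 for the eight-digit test
```

Even a seemingly decent per-digit read rate collapses quickly as the sequence grows, which is consistent with the drop from 50% to 30% reported above.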

"On real question-answering tasks over images, it appears to be particularly hard for all the models we tested," Michael Saxon, a PhD student at UC Santa Barbara and one of the study's co-authors, told TechCrunch. "That small amount of reasoning, recognizing that a number is in a frame and reading it, might be what's breaking the model."

Google is overpromising with Gemini

Neither of the studies has been peer-reviewed, nor do they probe the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn't meant to be as capable as Pro performance-wise; Google advertises it as a low-cost alternative.

Still, both add fuel to the fire that Google has been overpromising, and underdelivering, with Gemini from the start. None of the models the researchers tested, including OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, performed well. But Google is the only model provider that gives the context window top billing in its advertisements.

“There is nothing wrong with simply saying, ‘Our model can accept X tokens,’ based on objective technical details,” Saxon said. “But the question is: What useful thing can be done with it?”

Overall, generative AI is coming under increased scrutiny as businesses (and investors) become increasingly frustrated with the technology's limitations.

In two recent Boston Consulting Group surveys, about half of the respondents, all CEOs, said that they didn't expect generative AI to deliver substantial productivity gains and that they were worried about the potential for mistakes and data compromises arising from generative AI tools. PitchBook recently reported that early-stage generative AI dealmaking has declined for two consecutive quarters, plummeting 76% from its Q3 2023 peak.

With meeting-summarizing chatbots conjuring fictitious details about people and AI search platforms that are essentially plagiarism generators, customers are on the hunt for promising differentiators. Google, which has raced, at times clumsily, to catch up with its generative AI rivals, was desperate to make Gemini's context one of those differentiators.

But the bet appears to have been premature.

"We haven't settled on a way to really show that 'reasoning' or 'understanding' over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evals to make these claims," Karpińska said. "Without knowing how long-context processing is implemented (and companies don't share these details), it's hard to say how realistic these claims are."

Google didn’t reply to a request for comment.

Both Saxon and Karpińska believe the antidote to the hyped-up claims around generative AI is better benchmarks and, along the same lines, greater emphasis on third-party critique. Saxon notes that one of the more common tests for long context (liberally cited by Google in its marketing materials), "needle in a haystack," only measures a model's ability to retrieve particular pieces of information, like names and numbers, from data sets, not how well it answers complex questions about that information.
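The "needle in a haystack" setup Saxon describes can be sketched in a few lines. This toy version (the needle text and filler sentence are invented for illustration) shows why passing it only demonstrates retrieval, not reasoning: the model merely has to echo back one planted fact.

```python
import random

def build_haystack_prompt(needle: str, filler_sentence: str,
                          n_filler: int = 1000, seed: int = 42) -> str:
    """Bury a single 'needle' fact at a random position inside filler text."""
    rng = random.Random(seed)
    sentences = [filler_sentence] * n_filler
    sentences.insert(rng.randrange(n_filler), needle)
    return " ".join(sentences)

# Hypothetical example needle; a model "passes" by quoting it back verbatim.
needle = "The secret launch code is 7-4-1-9."
prompt = build_haystack_prompt(needle, "The sky was a flat, uneventful gray.")
print(needle in prompt)         # True: the fact is present exactly once
print(prompt.count(needle))     # 1
```

Nothing in this test requires synthesizing information across the document, which is exactly the gap between what the benchmark measures and what the long-context marketing implies.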

"All scientists and most engineers using these models essentially agree that our existing benchmark culture is broken," Saxon said, "so it's important that the public understands to take these giant reports containing numbers like 'general intelligence across benchmarks' with an enormous grain of salt."

This article was originally published on techcrunch.com.


MIT Develops Recyclable 3D-Printed Glass Blocks for Construction Applications


The use of 3D printing has been praised as an alternative to traditional construction, promising faster build times, creative designs, and fewer construction errors, all while reducing the carbon footprint. New research from MIT points to an intriguing new take on the concept, involving 3D-printed glass blocks in the shape of a figure eight that can be snapped together like Lego bricks.

The team points to glass's optical properties and "infinite recyclability" as reasons for pursuing the material. "As long as it's not contaminated, you can recycle glass almost infinitely," says Kaitlyn Becker, an assistant professor of mechanical engineering.

The team relied on 3D printers designed by Evenline, itself an MIT spin-off.



Introducing the Next Wave of Startup Battlefield Judges at TechCrunch Disrupt 2024


Startup Battlefield 200 is the highlight of every Disrupt, and we can't wait to find out which of the thousands of startups that applied will have the chance to pitch to top venture capitalists at TechCrunch Disrupt 2024. Join us at Moscone West in San Francisco from October 28-30 for an epic showdown where startups will have the chance to make a major impact.

Get insight into what the judges look for in a winning company as they provide detailed feedback on the evaluation criteria. Don't miss the chance to learn from their expert insights and discover the key traits that lead to startup success, only at Disrupt 2024.

We’re excited to introduce our next group of investors who will evaluate startups and dive into each pitch in an in-depth and insightful Q&A session. Stay tuned for more big names coming soon!

Alice Brooks, Partner, Khosla Ventures

Alice is a partner at Khosla Ventures, where she invests in sustainability, food, agriculture, and manufacturing/supply chain. She has worked with multiple startups in robotics, IoT, retail, consumer goods, and STEM education, and has led mechanical, electrical, and app development teams in the US and Asia. She also founded and managed manufacturing operations in factories in China and Taiwan. Prior to KV, Alice was the founder and CEO of Roominate, a STEM education company that helps girls learn engineering concepts through play.

Mark Crane, Partner, General Catalyst

Mark Crane is a partner at General Catalyst, a venture capital firm that works with founders from seed to growth stage to help them build companies that can stand the test of time. He focuses on sourcing and investing in later-stage opportunities such as AuthZed, Bugcrowd, Resilience, and TravelPerk. Prior to joining General Catalyst, Mark was a vice president at Cove Hill Partners in Massachusetts. Before that, he was a senior associate at JMI Equity and an associate at North Bridge Growth Equity.

Sofia Dolfe, Partner, Index Ventures

Sofia partners with founders who use their unique perspective and personal understanding of a problem to build companies that drive behavioral change, create powerful network effects, and transform entire industries, from grocery and e-commerce to financial services and healthcare. Sofia is also one of Index Ventures' gaming leads, working with some of Europe's best gaming companies to create a new generation of iconic gaming titles. She spends most of her time in the Nordics but works with entrepreneurs across the continent.

Christine Esserman, Partner, Accel

Christine Esserman joined Accel in 2017 and focuses on software, internet, and mobile technology companies. Since joining, Christine has helped lead Accel's investments in Blackpoint Cyber, Linear, Merge, ThreeFlow, Bumble, Remote, Dovetail, Ethos, Guru, and Headway. Prior to Accel, Christine worked in product and operations roles at multiple startups. A Bay Area native, Christine graduated from the Wharton School at the University of Pennsylvania with a degree in Finance and Operations.

Haomiao Huang, Founding Partner, Matter Venture Partners

Haomiao Huang of Matter Venture Partners is a robotics researcher turned founder turned investor. He is especially passionate about companies that bring digital innovation to physical-economy businesses, with a focus on sectors such as logistics, manufacturing, and transportation, and on advanced technologies such as robotics and AI. Haomiao spent four years investing in hard tech alongside Wen Hsieh at Kleiner Perkins. He previously founded the smart home security startup Kuna, built autonomous cars at Caltech and, as part of his PhD research at Stanford, pioneered the aerodynamics and control of multi-rotor unmanned aerial vehicles. Kuna was part of the Y Combinator Winter '14 cohort.

Don’t miss it!

The Startup Battlefield winner, who will walk away with a $100,000 cash prize, will be announced at Disrupt 2024, the epicenter of startups. Join 10,000 attendees to witness this breakthrough moment and see the next wave of tech innovation.

Register here and secure your spot to witness this epic battle of startups.



India Considers Easing Market Share Caps for UPI Payments Operators


PhonePe UPI being used to accept payments at a roadside sunglasses stall.

The regulator that oversees India's popular UPI payments rail is considering relaxing a proposed market share cap for operators like Google Pay, PhonePe, and Paytm as it grapples with enforcing the restrictions, two people familiar with the matter told TechCrunch.

The National Payments Corporation of India (NPCI), which is regulated by the Indian central bank, is considering raising the market share cap for UPI operators to more than 40%, said two of the people, who requested anonymity because the information is confidential. The regulator had earlier proposed a 30% market share limit to encourage competition in the space.

UPI has become the most popular way to send and receive money in India, with the mechanism processing over 12 billion transactions monthly. Walmart-backed PhonePe holds about 48% market share by volume and 50% by value, while Google Pay holds a 37.3% share by volume.

Once an industry heavyweight, Paytm has seen its market share fall to 7.2% from 11% late last year amid regulatory challenges.

According to several industry executives, raising the market share cap is likely to be a controversial move by the NPCI, as many UPI providers were counting on regulatory action to curb the dominance of PhonePe and Google Pay.

NPCI, which has previously declined to comment on market share, didn’t reply to a request for comment on Thursday.

The regulator originally planned to implement the market share caps in January 2021 but extended the deadline to January 1, 2025. It has struggled to find a workable way to enforce the proposed caps.

The stakes are high, especially for PhonePe, India's most valuable fintech startup, valued at $12 billion.

Sameer Nigam, co-founder and CEO of PhonePe, said last month that the startup cannot go public “if there is uncertainty on regulatory issues.”

“If you buy a share at Rs 100 and value it assuming we have 48-49% market share, there is uncertainty whether it will come down to 30% and when,” Nigam told a fintech conference last month. “We are reaching out to them (the regulator) whether they can find another way to at least address any concerns they have or tell us what the list of concerns is,” he added.
