Another eventful day in the world of artificial intelligence. From a massive academic integrity scandal at Brown University to new benchmarks showing Chinese open-source models outperforming Western frontier labs, and growing concerns about AI’s reliability in hiring and medicine — here are the top five AI stories making headlines on June 30, 2026.
1. GLM 5.2 Beats Claude in Cybersecurity Benchmarks
Chinese AI model GLM 5.2 has outperformed Anthropic’s Claude on Semgrep’s “Mythos” cybersecurity benchmark, sparking intense discussion across the AI community. The model, developed by Zhipu AI (zai-org), is a 753-billion-parameter open-weight model available on Hugging Face. It scored higher than Claude at identifying security vulnerabilities in code, with commenters on Hacker News noting that GLM 5.2 is “extremely good at finding vulnerabilities” and, notably, “unlike Opus, I’ve never seen it refuse a command.”
The benchmark tests whether models can identify security bugs that Semgrep’s Mythos static analysis tool already finds — essentially measuring how well LLMs replicate existing tooling. While Semgrep’s results show GLM 5.2 leading, independent developer SwellJoe reports that DeepSeek V4 Pro remains the strongest open model in broader security testing, with “extreme caching performance” making it cheaper than even much smaller models. GLM 5.2’s API pricing is approximately $4 per million output tokens, undercutting Anthropic’s Claude Opus by a wide margin. Multiple HN commenters observed that Chinese models are increasingly competitive at a fraction of the training and inference cost of their US counterparts.
2. HackerRank’s Open-Source ATS: A Resume Screening Lottery
HackerRank open-sourced its AI-powered Applicant Tracking System (ATS) on GitHub, and developer Dan Kinsky put it to the test with alarming results. Running the same resume through the system 100 times produced scores ranging from 66 to 99 out of 100 — a 33-point spread caused entirely by LLM nondeterminism. “If your company’s cutoff sits at 85, I fail 65% of the time. Same exact resume, different luck,” Kinsky wrote.
The tool uses a local Gemma 3:4b model running at temperature 0.1, though even at temperature 0, scores remained inconsistent — a GitHub issue from October 2025 documented scores of 27, 34, 32, 34, 34, and 30 across six consecutive runs at zero temperature. Kinsky identified a deeper structural flaw: 65% of the score depends on open-source contributions and personal projects, heavily favoring candidates with free time over experienced engineers with family obligations. The “experience” category awards 25/25 regardless of seniority — a junior intern and a 30-year principal engineer both max out. “A tool that can’t differentiate isn’t filtering for quality, it’s just filtering. You might as well throw out half the resumes and tell the applicants you don’t fuck with bad luck,” Kinsky concluded. The piece reignited debate about whether LLM-based resume screening violates EU anti-discrimination laws.
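To make the variability concrete, here is a minimal sketch that summarizes the six zero-temperature scores quoted from the GitHub issue. The `score_stability` helper and the cutoff of 33 are hypothetical, chosen only for illustration; neither comes from HackerRank's tool or from Kinsky's write-up.

```python
from statistics import mean, pstdev


def score_stability(scores, cutoff):
    """Summarize how much repeated scores for one identical resume vary."""
    failures = sum(1 for s in scores if s < cutoff)
    return {
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
        "mean": mean(scores),
        "stdev": pstdev(scores),  # population standard deviation of the runs
        "fail_rate": failures / len(scores),  # share of runs below the cutoff
    }


# Six consecutive zero-temperature runs, as reported in the GitHub issue.
zero_temp_runs = [27, 34, 32, 34, 34, 30]

# Assumption: an illustrative cutoff, not one used by any real employer.
HYPOTHETICAL_CUTOFF = 33

summary = score_stability(zero_temp_runs, HYPOTHETICAL_CUTOFF)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these six runs the spread is 7 points, and at the hypothetical cutoff of 33 the same resume would fail in three of the six runs. The same helper could be pointed at a longer series, such as Kinsky's 100 runs, if the raw scores were available.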
3. Using Claude Code for a Second Opinion on an MRI
A developer’s experiment using Claude Code (Anthropic’s coding tool, running an Opus model) to analyze their own MRI scan went viral, generating 685 comments on Hacker News. The author, writing at antoine.fi, uploaded their shoulder MRI images and asked Claude for an analysis after receiving what they felt was an inconclusive radiologist report. Claude identified a rotator cuff tear that the original report had not highlighted. The experience prompted a wide-ranging discussion about AI in medical diagnosis.
A practicing radiologist who commented on the piece pushed back sharply: “These models are generally terrible at reading medical images. The amount of public training data on the internet compared to the number of scans a radiologist reads in training is minuscule.” Another radiologist noted that ultrasound — used to check for calcification in the patient’s case — “isn’t a great way to assess for calcification. It’ll find large calcification but easily miss small ones.” The broader debate touched on the asymmetry of trust: patients feel more comfortable asking AI for clarifications than confronting a busy physician, but the risk of over-reliance on black-box models without proper validation remains significant. Several commenters shared personal stories of misdiagnosis, both by humans and by AI, underscoring that the path forward is likely human-in-the-loop rather than full automation.
4. Brown University Professor Exposes Mass AI Cheating Scandal
Professor Roberto Serrano, a 61-year-old blind economist and Harrison S. Kravis University Professor at Brown University, has publicly denounced what he calls “massive AI fraud” in his ECON 1170 mathematical economics course. The case, reported by El País English, is believed to be the largest known academic integrity scandal in Ivy League history. Serrano’s midterm exam — a take-home, closed-book format — yielded an average score of 96 out of 100. Forty students scored a perfect 100. Teaching assistants flagged irregularities: answers contained “unusual passages that coincided with results obtained after running the questions through ChatGPT.”
Serrano did not void the midterm but warned students the final would be in-person. The results were stark: the average dropped to 48 out of 100. Of the 86 students who took the midterm, only 59 showed up for the final. Among the 27 who skipped it, 22 had scored a perfect 100 on the midterm. “The empirical evidence of fraud is overwhelming,” Serrano said. When he reported the case to university leadership, the president offered “absolute silence” and the dean did not comment until Serrano brought it before the Academic Code Committee, where the administration acknowledged it was “a wake-up call.” Serrano, who lost his sight at 17 due to retinal dystrophy, has argued that universities must publicly confront the scale of the problem before AI signals “the end of higher education.” He has eliminated take-home exams and weekly exercises (which could be completed with AI) for the coming academic year.
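The attendance figures can be checked with a few lines of arithmetic. This is only a sketch using the numbers reported above (86 midterm takers, 59 final takers, 40 perfect midterm scores, 22 perfect scorers who skipped the final); the variable names are made up for illustration, and the ratios describe the reported data rather than proving anything about individual students.

```python
# Figures as reported in the article.
midterm_takers = 86
final_takers = 59
perfect_midterm_scores = 40
perfect_scorers_who_skipped = 22

# Students who sat the midterm but not the in-person final.
skipped_final = midterm_takers - final_takers

# Share of the skippers who had a perfect midterm score.
share_of_skippers_with_perfect = perfect_scorers_who_skipped / skipped_final

# Share of all perfect scorers who did not sit the final.
share_of_perfect_who_skipped = perfect_scorers_who_skipped / perfect_midterm_scores

print(f"Skipped the final: {skipped_final}")
print(f"Skippers with a perfect midterm: {share_of_skippers_with_perfect:.0%}")
print(f"Perfect scorers who skipped: {share_of_perfect_who_skipped:.0%}")
```

On these figures, 27 students skipped the final, roughly 81% of them had scored 100 on the midterm, and 55% of all perfect scorers did not sit the in-person exam.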
5. Google Restricts Meta’s Access to Gemini AI Models
Google has begun limiting Meta’s use of its Gemini AI models, according to a Financial Times report cited by CNBC. The restriction appears to be driven primarily by capacity constraints, as demand for Gemini’s inference infrastructure has surged, rather than by a specific policy dispute between the two tech giants. Meta had been using Gemini across a range of internal applications and product features.
Hacker News commenters noted the irony: Google’s Gemini is not considered state-of-the-art for coding tasks, yet Meta relies heavily on it, possibly for strategic or cost reasons rather than raw performance. Several commenters predicted this will become the norm for access to frontier models. “Computing capacity plus state restrictions plus KYC will be imposed on organizations to get access,” one wrote. “Individuals will be served last on the queue with degraded performance. Once the Chinese models catch up, nobody (at least individuals) will turn back again to frontier labs.” The move underscores the growing bottleneck in AI inference infrastructure, as even hyperscalers struggle to meet demand, and raises questions about how access to frontier AI capabilities will be allocated in an increasingly resource-constrained environment.
Closing Thoughts
From classroom integrity to resume screening, medical diagnosis to cybersecurity — these five stories paint a picture of an AI industry grappling with reliability, equity, and access. The gap between what AI can do and what it should be trusted to do remains the defining question of 2026. We’ll be watching how universities, regulators, and tech companies respond.