Assessing the Role of Intelligent Tutors in K-12 Education

Date: April 21, 2025
Topics: Education, Skills; Generative AI

Scholars discover that short-horizon data from edtech platforms can help predict long-term student performance.

Digital learning has become the norm in education, but evaluating its effectiveness remains a challenge. In K-12 settings, the results of a tutoring system or educational game often come long after students have engaged with the tool. Infrequent assessments, such as statewide exams at the end of a school year, make it difficult to identify students who may excel or fail, and slow researchers’ efforts to measure the impact of educational software.

Researchers from Stanford, War Child Alliance, and Carnegie Learning investigated whether machine learning models could use student logs from the first few hours of using an educational software tool to predict those students’ final external test outcomes after months of usage. The study, “Predicting Long-Term Student Outcomes from Short-Term EdTech Log Data,” presented at LAK '25, the 15th International Learning Analytics and Knowledge Conference, found that data from just two to five hours of activity with an intelligent tutor or learning game can yield valuable insight into how students will perform on standard external assessments several months later.

“In education, we often are interested in delayed outcomes like end-of-the-year assessments, but it would be useful if we could predict those outcomes using shorter amounts of data from educational software platforms,” says senior author Emma Brunskill, associate professor of computer science and a faculty affiliate of the Stanford Institute for Human-Centered Artificial Intelligence (HAI). “Informed by such predictions, teachers or the software itself could offer more personalized support to students who are struggling and pose new challenges to those who are thriving.”

Finding the Common Features

Prior research in this area has focused on using yearlong data from edtech products to predict outcomes on year-end exams, an approach that takes a long time to generate results. Other studies have used a few minutes of student activity to predict short-term outcomes an hour or so later, which doesn’t provide insight into the longer-term impact of the technology tools. Supported in part by a Stanford HAI Seed Research Grant, the Stanford researchers wanted to see whether a short amount of edtech platform usage, on the order of a few hours, could be fed to machine learning algorithms to predict students’ external test outcomes after multiple months of usage. The team used data from three different educational technology tools and student groups to assess how well their approach generalized across a variety of platforms and populations.

The first dataset came from a collaboration with Can’t Wait to Learn (CWTL Reading), a literacy game product designed to support children living in conflict-affected areas. For this study, the parent organization, War Child, shared data from its students in Uganda. The second and third datasets came from two math tutoring systems, iReady and MATHia, both used by middle school students in the United States.

Ge Gao, a postdoctoral scholar in computer science affiliated with the AI for Human Impact (AI4HI) group and the Stanford AI Lab (SAIL), says it was important to identify common features that could be extracted from each platform’s log data without requiring specific domain knowledge of the tool, student demographics, or prior performance data for the students. Using a decision tree methodology, a technique that splits information into smaller and more telling groups to reveal which factors matter most for students’ performance, the scholars found that features such as the percentage of times a student succeeded at a problem (their success rate) and the average number of times a student attempted a problem ranked as top criteria that generalized across platforms.
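To make the idea concrete, here is a minimal sketch of this kind of platform-agnostic feature extraction. The log schema (student_id, problem_id, correct) and the function name are illustrative assumptions, not taken from the paper:

```python
import pandas as pd

# Hypothetical log schema: one row per problem attempt.
log = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2],
    "problem_id": ["a", "a", "b", "a", "b"],
    "correct":    [0, 1, 1, 1, 0],
})

def platform_agnostic_features(log: pd.DataFrame) -> pd.DataFrame:
    """Aggregate early log data into per-student features that require
    no domain knowledge of the specific tutor or game."""
    per_problem = log.groupby(["student_id", "problem_id"]).agg(
        attempts=("correct", "size"),  # tries on each problem
        solved=("correct", "max"),     # eventually answered correctly?
    )
    return per_problem.groupby("student_id").agg(
        success_rate=("solved", "mean"),    # share of problems solved
        avg_attempts=("attempts", "mean"),  # average tries per problem
    )

print(platform_agnostic_features(log))
```

Fitting a shallow scikit-learn DecisionTreeClassifier on such features and inspecting its feature_importances_ attribute is one standard way to rank which features drive the splits, in the spirit of the analysis described above.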

“By focusing on broadly similar features that are likely to be present across many educational platforms, we can evaluate the similarities and differences across settings,” she explains. With the feature set determined, the team was ready to compare the performance of three popular machine learning models in predicting student outcomes for the given time horizon and across the three learning contexts.

Identifying the Extremes

Gao and colleagues highlight several key findings from this study. First, the results show that data from two to five hours of edtech software usage is enough to predict whether a student is likely to fall at the extremes, in either the bottom or the top quintile, on a delayed assessment taken after months of using the tool. Although the machine learning models were not able to predict precise outcomes, such as an exact score or percentile placement on an exam, the researchers suggest that being able to identify the lowest and highest performers is valuable information for software developers and educators alike. “Our findings suggest interesting future directions for creating systems that could alert educators or provide additional support or challenge in the tutoring system,” Gao says.
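A minimal sketch of this extreme-performer framing, with synthetic data, appears below. The article does not name the models the team compared, so a random forest stands in purely for illustration; only the bottom-quintile labeling mirrors the setup described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # placeholder for early-usage log features
# Synthetic delayed exam score, loosely driven by the features.
exam = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=500)

# Binary target: is the student in the bottom quintile of the delayed exam?
# (The same construction works for the top quintile.)
bottom_quintile = (exam <= np.quantile(exam, 0.2)).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
auc = cross_val_score(clf, X, bottom_quintile, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {auc.mean():.2f}")
```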

The team also compared these results to scenarios where students take a pre-assessment. Here, they found that short-term log data offers predictive power similar to that of a separate pre-test score, and that the combination of both occasionally offers additional benefit. This matters because, in many settings, it’s not always practical or desirable to give every student a pre-assessment before they start using an edtech tool.
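One way to run that comparison, again with synthetic data and an illustrative logistic regression rather than the paper’s actual models, is to score feature sets with and without the pre-test:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
exam = rng.normal(size=400)                               # delayed exam score
logs = exam[:, None] * np.array([0.6, 0.3]) + rng.normal(size=(400, 2))
pretest = (0.5 * exam + rng.normal(size=400))[:, None]    # noisy early signal
bottom = (exam <= np.quantile(exam, 0.2)).astype(int)     # bottom quintile

for name, X in [("logs only", logs),
                ("pre-test only", pretest),
                ("logs + pre-test", np.hstack([logs, pretest]))]:
    auc = cross_val_score(LogisticRegression(), X, bottom,
                          cv=5, scoring="roc_auc")
    print(f"{name:16s} AUC {auc.mean():.2f}")
```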

Implications for Developers and Educators

Gao says the team is pleased that their approach achieved fairly high accuracy in predicting the highest and lowest performers. Although further exploration is needed, she believes the predictive models enabled by this study could be used to improve personalization in edtech software, as well as to help teachers make better decisions about how they use classroom resources.

“Our goal is to help teachers intervene at just the right time to encourage and celebrate progress,” she says. “If we can collaborate with the developers of intelligent tutoring systems, they can enhance their products to give teachers more clues about the performance of their students.”

Paper authors: Ge Gao, Amelia Leon, Andrea Jetten, Jasmine Turner, Husni Almoubayyed, Stephen Fancsali, Emma Brunskill

Contributor: Nikki Goth Itoi
