How to Use Machine Learning in Your Thesis at the Undergraduate or Graduate Level

A valid machine learning thesis isn't the most complex one. It's the one with a clear contribution, a well-scoped problem, and deliberate design reasoning behind every decision.

20 min read · Undergraduate · Master's · Doctorate · April 21, 2026

A machine learning thesis is valid when it meets three conditions: it has a clear and measurable contribution, it applies design reasoning — not just technical execution — and it is grounded in academic literature. Beyond the model itself, what matters is defining a well-scoped problem, validating feasibility with real data, and demonstrating how your solution generates value in a specific context.

This applies whether you're at the undergraduate, master's, or doctoral level. The level changes the depth of the contribution, but the design decisions that determine whether a project is viable are the same.

This guide won't just tell you what to do — it will show you how to make concrete decisions: how to choose a feasible topic, how to validate your idea before writing any code, how to structure the project, and how to defend your technical choices in front of a committee.

What makes a machine learning thesis valid

This is where most people get confused. There is no single universal definition. In practice, the validity of a thesis depends on the university, the country, and — most importantly — the advisor.

Experience with students from different contexts (Peru, Colombia, and Chile, among others) shows that the criteria vary, but common patterns come up again and again.

Your professor's criteria matter more than you think

Beyond the university or country, what truly defines whether your thesis is valid is the specific judgment of the professor you're working with. Before getting too far along, it's essential to understand what they consider sufficient, both academically and practically. Two advisors at the same university can have very different expectations about what makes a valid machine learning thesis.

The 3 conditions that do repeat across contexts

1. A clear and measurable contribution

Your thesis must answer one question: what exactly am I contributing? Simply applying a model is not enough. You need to compare against other approaches, define metrics, and demonstrate improvement or value. Resources such as Google's machine learning guides for developers describe machine learning as a complete process that includes problem definition, data, and evaluation, not just the model itself.

2. Design reasoning — what they call innovation

When committees ask for innovation, they usually mean there should be genuine engineering behind your solution. Applying a model directly to a dataset is not enough. What a committee wants to see is that you designed a pipeline, justified your decisions, and adapted the system to the problem. Saying "I'm going to apply YOLO to detect equipment failures" isn't enough if there's no design reasoning behind it.

3. Grounding in the literature

You need one or two reference articles at minimum to validate your approach and compare results. These articles don't just provide support; they also tell you what has been done before and where the opportunity to contribute lies.

Before going too far, align these criteria with your advisor. The final evaluation is still contextual, and there is no substitute for that conversation.


Finish your project already

You've taken courses… but don't know how to apply it

92% of data professionals unblock their projects by seeing complete solved examples.

53 projects · 8 sectors · 100% applied

No sign-up · Instant access

How to choose a machine learning thesis topic without making mistakes

The most common mistake: choosing a problem that's too big

What often happens is that a student reads a paper, wants to replicate it, and also improve on it. But that paper was written by a team of two, three, or five people who spent months or years working in that research area. You don't know how long they've been at it. You're just starting out.

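The cost of that overreach grows quickly when the problem isn't well defined from the start. Hidden Technical Debt in Machine Learning Systems (Sculley et al., 2015) makes a related point about production systems: the complexity surrounding the model, not the model itself, is what becomes expensive to maintain.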

The key skill: scoping

A good thesis isn't the most complex one. It's the one that's well-defined, executable, and shows deliberate judgment. Before choosing your topic, answer these questions: What can I realistically accomplish in 6 months? What resources do I have? What can I validate quickly? Scoping down doesn't reduce the project's value — quite the opposite: the more tightly scoped the problem, the deeper you can go and the more meaningful your contribution.

When your advisor also wants a large scope

Sometimes the problem doesn't come from the student alone. Some advisors who aren't ML specialists also propose overly ambitious scopes. In those cases, the key skill is negotiation. It's not about contradicting the professor — it's about building together a scope that is rigorous and achievable within the available time. A well-scoped contribution is more convincing than an unfulfilled promise.

A small problem, well solved, is worth more than a large problem left incomplete.


How to know if your thesis idea is viable

An idea is not enough. You need to validate it before committing.

1. Does it have support in the literature?

Search for similar problems in the literature, datasets that have been used, and existing approaches. This gives you context, shows you the boundaries of what has already been done, and helps you identify where there is room to contribute. It also gives you the reference articles you'll need to compare your results against.

2. Is it a problem that can be modeled with ML?

A problem is a good fit for machine learning if it has no deterministic solution, depends on multiple variables, and can be expressed as inputs → model → outputs with a defined target. If the problem can be solved with a deterministic algorithm, machine learning is not the right tool.
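
The three conditions above can be written down as a small checklist. This is only a sketch: the `ProblemFraming` class, its field names, and the two example problems are hypothetical and chosen for illustration, not taken from any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemFraming:
    """Hypothetical checklist for the inputs -> model -> outputs framing."""
    inputs: List[str]                 # variables the model would receive
    target: str                       # the output you want to predict
    has_deterministic_solution: bool  # True if a fixed rule or formula already solves it

    def is_ml_candidate(self) -> bool:
        # All three conditions from the text must hold at once.
        return (
            not self.has_deterministic_solution
            and len(self.inputs) > 1
            and bool(self.target.strip())
        )


# Illustrative example: several variables, a defined target, no closed-form rule.
failure_prediction = ProblemFraming(
    inputs=["temperature", "vibration", "hours_since_maintenance"],
    target="fails_within_7_days",
    has_deterministic_solution=False,
)

# Illustrative counterexample: a formula already solves it.
unit_conversion = ProblemFraming(
    inputs=["celsius"],
    target="fahrenheit",
    has_deterministic_solution=True,
)

print(failure_prediction.is_ml_candidate())  # True
print(unit_conversion.is_ml_candidate())     # False
```

Writing the framing down this explicitly is mostly useful as a conversation starter with your advisor: if you can't fill in `target`, the problem isn't scoped yet.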

3. Validate with an MVP before committing

Before locking in your topic, run a quick one- to two-day test: use a dataset, train a simple model, and observe the results. Today you can do this with notebooks and AI tools. Technical documentation like scikit-learn's recommends evaluating models iteratively from early stages, rather than waiting for a final solution.

If you get a reasonably sensible result with the data you found, that tells you the path is viable. You're not validating the final outcome — you're validating that the problem is solvable with the approach you chose.
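
A quick test of this kind might look like the sketch below. It assumes scikit-learn is installed and uses its bundled breast cancer dataset purely as a stand-in for whatever data you found; the choice of logistic regression as the "simple model" is also an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; replace with the data you actually found.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Trivial reference point: always predict the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Simple model: scaled features plus logistic regression.
simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
simple.fit(X_train, y_train)

dummy_acc = dummy.score(X_test, y_test)
simple_acc = simple.score(X_test, y_test)
print(f"majority-class accuracy: {dummy_acc:.3f}")
print(f"simple model accuracy:   {simple_acc:.3f}")
```

If the simple model clearly beats the trivial reference, the data carries a usable signal. If it doesn't, that is worth knowing before you commit to the topic.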


Real viability: data, system, and context

This is where many projects fail without realizing it.

1. Data is the primary input

Before any other decision, answer three questions: Does the data exist? Can I access it? If not, can I generate or obtain it? Your options are open datasets, your own data, synthetic data, or proxies. Without data, there is no viable model. As the study Data Cascades in High-Stakes AI (Sambasivan et al., 2021) shows, many AI system failures don't stem from the model itself but from data problems that compound across the project lifecycle.
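
Once you have a candidate dataset, a first audit can be very small. The sketch below uses only the standard library; the `audit` function, the column names, and the four sample rows are hypothetical and exist only to show the kind of checks worth running on day one.

```python
from collections import Counter

MISSING = (None, "")  # values treated as missing in this sketch


def audit(rows, target):
    """Summarize size, missing values per column, and class balance."""
    if not rows:
        raise ValueError("no rows to audit")
    n = len(rows)
    columns = sorted({key for row in rows for key in row})
    missing_rate = {
        col: sum(1 for row in rows if row.get(col) in MISSING) / n
        for col in columns
    }
    class_counts = Counter(
        row[target] for row in rows if row.get(target) not in MISSING
    )
    return {
        "n_rows": n,
        "missing_rate": missing_rate,
        "class_counts": dict(class_counts),
    }


# Made-up sensor records, for illustration only.
sample_rows = [
    {"temperature": 71.2, "vibration": 0.31, "failure": "no"},
    {"temperature": None, "vibration": 0.29, "failure": "no"},
    {"temperature": 88.5, "vibration": 0.77, "failure": "yes"},
    {"temperature": 70.1, "vibration": "", "failure": "no"},
]

data_report = audit(sample_rows, target="failure")
print(data_report)
```

Missing values and class imbalance found here shape every later decision, from the metric you report to how much preparation time you need to budget.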

How to find datasets effectively

A concrete tactic: search Google Scholar for the type of classification or problem you're interested in, together with the word "dataset." Read the articles that come up and check which ones have open data. Prioritize publications from the last three to five years so the datasets haven't been exhausted in the literature. If you find an older dataset but a recent article is still using and citing it, that's also valid — it means it's still relevant.

2. Your thesis is a system, not just a model

Today, a strong thesis means solving a real problem with a concrete use case. Training a model alone isn't enough. For your contribution to be solid, you need to give it an application. For example, two theses might use similar architectures for detecting safety equipment in video, but one could be a real-time alert system and the other a deferred monitoring system. The model may be similar; the contribution is different because the application is different.

This aligns with how machine learning systems are understood in practice, where — as discussed in Hidden Technical Debt in Machine Learning Systems — the model is just one part of a much larger system.
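
To make the safety-equipment example concrete, here is a sketch of one detector wrapped in two different applications. Everything in it is hypothetical: `detect_missing_helmet` is a stand-in rule, not a trained model, and the frame records are made up.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Frame = Dict[str, int]


def detect_missing_helmet(frame: Frame) -> bool:
    """Stand-in for a trained detector: flags frames with fewer helmets than people."""
    return frame["helmets"] < frame["people"]


def realtime_alerts(frames: Iterable[Frame],
                    detector: Callable[[Frame], bool]) -> Iterator[str]:
    """Application 1: emit an alert as soon as a violating frame arrives."""
    for frame in frames:
        if detector(frame):
            yield f"ALERT at t={frame['t']}s"


def deferred_report(frames: Iterable[Frame],
                    detector: Callable[[Frame], bool]) -> Dict[str, object]:
    """Application 2: process a recording afterwards and summarize it."""
    frames = list(frames)
    flagged: List[int] = [f["t"] for f in frames if detector(f)]
    rate = len(flagged) / len(frames) if frames else 0.0
    return {"frames": len(frames), "violations": len(flagged),
            "violation_rate": rate, "timestamps": flagged}


# Made-up frames: t is the timestamp in seconds.
video = [
    {"t": 0, "people": 2, "helmets": 2},
    {"t": 1, "people": 3, "helmets": 2},
    {"t": 2, "people": 1, "helmets": 1},
    {"t": 3, "people": 2, "helmets": 0},
]

alerts = list(realtime_alerts(video, detect_missing_helmet))
summary = deferred_report(video, detect_missing_helmet)
print(alerts)
print(summary)
```

The detector is identical in both functions. What changes is the contract around it: latency matters for the alert system, while aggregate rates matter for the deferred report, and each leads to a different evaluation.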

Viability also depends on the infrastructure available to you: will you have access to a real environment to evaluate the system? Can you deploy it or evaluate it another way? Those questions are part of the design from the very beginning.

3. Why an MVP also protects you emotionally

One of the most common blockers in ML theses isn't technical — it's the frustration that comes when results aren't as good as expected. Running a quick test at the start not only validates the technical direction, it also sets realistic expectations. If you already know upfront that the data is messy or the problem requires more preparation work, it won't catch you off guard later when the project is already well underway.


Technical decisions: what they actually evaluate

How to justify your choice of model

It's not about picking the best model in the abstract. It's about comparing several options, establishing a baseline, and justifying your results. Reviewers always look for comparisons. Saying your model performs well isn't enough — you have to show it performs better than alternative approaches and explain why you chose that approach over others.

A simple model will be sufficient when the results are adequate, when the data is well-prepared, or when the problem doesn't require greater complexity. There's no reason to use a complex model if a simple one meets the objective with results you can defend.
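
A comparison against a baseline can be sketched as follows. It assumes scikit-learn and its bundled wine dataset as a placeholder; the three candidates and the balanced-accuracy metric are illustrative choices, not a recommendation for your problem.

```python
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder dataset; swap in your own features and labels.
X, y = load_wine(return_X_y=True)

candidates = {
    "majority_baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    # Same folds and same metric for every candidate keeps the comparison fair.
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

A table like this is what reviewers tend to look for: every candidate evaluated the same way, with the baseline in the first row. If the simple model lands within the spread of the complex one, that is an argument for keeping the simple one.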

Metrics: what actually matters

There is no universal metric. The right metric depends on the problem: in healthcare, what matters most might be minimizing false negatives; in industrial monitoring, it might be balancing precision and recall. The question you will always be asked is which metric is most important and why. Your answer needs to connect the metric to the impact of the problem your thesis is solving.
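
A small worked example shows why the choice of metric needs a justification. The labels and predictions below are made up for illustration, and `evaluate` is a hypothetical helper written with the standard library only.

```python
def evaluate(y_true, y_pred):
    """Return basic metrics for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_negatives": fn,
    }


# Ten made-up cases: four positives, six negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Model A is cautious: it never raises a false alarm but misses two positives.
pred_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Model B is sensitive: it catches every positive but raises two false alarms.
pred_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

metrics_a = evaluate(y_true, pred_a)
metrics_b = evaluate(y_true, pred_b)
print("A:", metrics_a)  # accuracy 0.8, precision 1.0, recall 0.5
print("B:", metrics_b)  # accuracy 0.8, precision ~0.667, recall 1.0
```

Both models have the same accuracy, so accuracy alone cannot choose between them. In a healthcare setting where a missed case is costly, B's recall would likely matter more; where false alarms are expensive, A's precision would.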

Interpretability as a competitive advantage

If you can explain your model, you earn academic points and strengthen your argument. If the most important variables in your model can be intuitively explained in terms of the problem, that's a major plus. It's not always possible, but when it is, it's worth a lot. Tools like SHAP or LIME can help you build that interpretation in a rigorous way.
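
As a lighter first step before SHAP or LIME, the sketch below uses permutation importance from scikit-learn. Note that this is a different technique from the two named above: it shuffles one feature at a time and measures how much the test score drops. The dataset and the random forest are placeholder assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder dataset with named features.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and record the drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

# Rank features from largest to smallest mean drop.
ranking = result.importances_mean.argsort()[::-1]
top_features = [
    (str(data.feature_names[i]), float(result.importances_mean[i]))
    for i in ranking[:5]
]
for name, drop in top_features:
    print(f"{name:25s} mean accuracy drop: {drop:.4f}")
```

The useful question is whether the top-ranked variables make sense in terms of the problem. Keep in mind that strongly correlated features can share or mask importance under this method, so treat the ranking as a starting point for discussion rather than a final explanation.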


How to avoid getting stuck in your thesis

Define an internal critical path

The thesis plan is an academic deliverable. At many universities it includes timelines, costs, and a consistency matrix that you're asked to submit before you've even fully understood the problem. You have to submit it, but it doesn't always function as a real operational guide for the technical work.

What you also need, in parallel, is an internal critical path: fewer steps, more concrete, focused on the technical decisions that will actually unblock you. That path gives you the big picture you need to move forward with clarity and make better development decisions. The coherence between your title, objectives, hypothesis, variables, and method of analysis is what holds a thesis together — when that coherence is missing, the project gets rejected at the idea stage.

Invest in your background and theoretical framework

This is one of the most time-consuming parts when done well, and also one of the most rewarding. The background and theoretical framework give you the arguments to answer committee questions, allow you to reference prior work and explain what makes yours different, and give you clarity about the problem before you write a single line of code. It's not wasted time — it's the foundation everything else rests on.

Validate before you scale

Many problems come from not testing quickly and from waiting to have a perfect solution before evaluating anything. Run the quick experiment first. With whatever data you can find, a notebook, and AI tools, you can have valuable information within a day or two to know whether the direction is viable. That's what lets you make informed decisions before committing weeks of work.


Quick decision guide

Depending on your situation, here's what matters most:

If you have limited time

Use datasets that already exist. Search Google Scholar for the type of problem you're interested in along with the word "dataset," filter for the last three to five years, and look for those with open data. If you find an older dataset but a recent article still uses it, that's also valid.

If you don't have data

Think about whether you can collect it, generate it synthetically, or find a proxy. Data is the primary input for the model. Without it there's no viable thesis, and rethinking your topic early is far better than discovering that too late.
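
If you go the synthetic route, a first step could look like the sketch below. It assumes scikit-learn's `make_classification`; the sample size, feature counts, and 90/10 class split are arbitrary illustration values. Data generated this way only tells you that your pipeline runs end to end; it says nothing about how a model would behave on the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 500 samples, 8 features, roughly 10% positives.
X_syn, y_syn = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=4,
    n_redundant=2,
    weights=[0.9, 0.1],
    random_state=0,
)

minority_share = y_syn.mean()
print(f"shape: {X_syn.shape}, minority share: {minority_share:.2f}")

# Check that the evaluation loop runs on data of this shape.
syn_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_syn, y_syn,
    cv=5, scoring="balanced_accuracy",
)
print(f"balanced accuracy: {syn_scores.mean():.3f}")
```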

If you're a beginner or intermediate-level

Avoid too broad a scope. Complexity you can't handle within the available time doesn't add to your thesis — it subtracts from it. A well-scoped project lets you go deeper into the problem and generate a real contribution.

If you want impact

First define where you want the impact to land: on the problem or on the models. You can have a simple model that solves a medical problem with high real-world impact, or you can invest more effort in the architecture and contribute more to the technical state of the art. Ideally both are possible, but they aren't always. Once you decide where the impact goes, review your reference articles, identify their contributions, and try to move in that direction with the scoped problem you've chosen.

Scoping a problem doesn't reduce its value. It makes it solvable.


Frequently asked questions about machine learning theses

How difficult is it to write a thesis using machine learning?

It depends on the scope. A well-scoped project is manageable even in 6 months. The most common mistake is choosing a problem that's too large without accounting for the time and resources available.

Do I need deep learning for my thesis?

No. A simple, well-applied model can be enough. What committees actually evaluate is design reasoning, comparison with other approaches, and clarity of contribution.

How do I choose a machine learning thesis topic?

The topic must be feasible in terms of data, time, and complexity. Before committing, verify that reference articles exist, that the problem can be modeled with ML, and that you can run a quick test with real data.

What matters more: the model or the data?

The data. Without adequate data there is no viable model. In addition, a strong thesis today means building a system with a concrete use case, not just training a model.

How long does a machine learning thesis take?

It depends on the level. An undergraduate thesis typically takes between 6 months and a year; master's and doctoral theses take longer. Scoping the project well from the start is the single most important factor in finishing on time.

How do I know if a thesis idea can be researched using machine learning?

An idea is researchable if the problem has no deterministic solution, depends on multiple variables, and can be expressed as inputs → model → outputs. You also need to find at least one article that validates the approach and data you can use for an initial test.



If you already have a thesis project underway and need help defining it or getting unstuck, feel free to reach out directly. Often, a single adjustment to the problem design changes everything.
