Data for Machine Learning Projects: How to find, evaluate, and work with it when it's not ideal
In machine learning, data is not just an input: it determines whether your project is viable or not. The real challenge is not finding datasets; it's knowing how to evaluate them quickly and make decisions when they're not enough.
This is not a list of datasets. It is a guide for making real decisions about data: how to find it, how to evaluate it in hours rather than weeks, what to do when it's not ideal, and when it makes sense to reframe the problem instead of continuing to search.
I have worked with data in projects spanning the cement industry, social sciences, healthcare, and retail. And in all those contexts, the pattern repeats: the problem is rarely the model. It's in how data is understood, prepared, and evaluated before ever reaching the model.
This guide documents what I've learned in that process, including mistakes, real cases, and a couple of my own tools I use before committing to any dataset.
Table of contents:
- When the problem is in the data, not the model
- EDA that actually works versus checklist EDA
- Feature engineering: the step almost everyone skips
- How to search for datasets strategically
- Available does not mean downloadable
- Dataset Novelty Score: how to avoid overused datasets
- How to evaluate a dataset in one or two hours
- The traffic light method for categorizing variables
- What to do when data doesn't exist
- Real case: influencer reach on social media
- Dataset first or problem first
- How to work with limited data
- Quick decision guide
- Frequently asked questions
When the problem is in the data, not the model
There is a very clear signal in projects where the real problem has not been correctly identified: you test one model, then another, you tune hyperparameters, you repeat the process for days, and nothing improves significantly. At that point you are very likely not facing an algorithm problem. You are facing an input problem.
The amount of data is a factor, but not the most important one. What most often signals that data will cause problems is how it is organized: if it requires many aggregations, complex filters, or deep transformations to make sense, that is an early warning that the path ahead will be difficult.
To detect this quickly, I use a simple tree-based model — not as a final model, but as a diagnostic tool. A well-trained decision tree on raw data immediately gives me two things: variable importance and a first consistency check. If variable importance roughly matches what correlations suggest and what domain common sense would indicate, the dataset probably has potential. If there is no consistency across those three sources, the problem is in the data.
This is not a long process. It takes a few hours of work. But those hours prevent weeks of optimization on an input that won't work regardless of which algorithm you use.
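A minimal sketch of that diagnostic, using scikit-learn's bundled breast cancer data purely as a stand-in for whatever raw table you just downloaded — the model settings are illustrative, not prescriptive:

```python
# Sketch of the few-hour tree diagnostic. The breast cancer dataset is only
# a stand-in for "whatever raw table you just downloaded".
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# A shallow tree trained on the raw data -- a diagnostic, not a final model.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)

# Simple linear correlations with the target, for comparison.
abs_corr = X.corrwith(y).abs()

report = pd.DataFrame({"tree_importance": importances, "abs_corr": abs_corr})
print(report.sort_values("tree_importance", ascending=False).head(10))
# If the top features agree across both columns -- and with domain common
# sense -- the dataset probably has potential.
```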
EDA that actually works versus checklist EDA
There is something almost no one says clearly about Exploratory Data Analysis: doing it is not enough if you don't know what you're looking for.
What frequently happens, in both academic and professional projects, is that EDA becomes a checklist of steps to complete: distributions, correlations, scatter plots, boxplots. The boxes get checked, results are reported, and you move to the next step. But the real question that should guide the analysis never gets answered: will this data cause problems later?
The difference between doing EDA as a requirement and doing EDA with judgment lies in the questions you ask yourself while doing it. It's not just generating distributions — it's asking why that variable has that shape and whether it makes sense given the problem. It's not just calculating correlations — it's asking whether the relationships that appear are consistent with what you know about the domain or whether something doesn't add up.
The time EDA consumes is not in the code. The code is fast. What takes time is interpretation: understanding what you're seeing, validating it against the problem's context, iterating until you have a clear picture of what you're working with. That part cannot be automated or delegated to a tool.
A well-done EDA does not end with a list of descriptive statistics. It ends with a clear hypothesis about the risks that dataset will present later in the project.
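To make the contrast concrete, here is a sketch of what question-driven EDA looks like in code — the dataset and the specific questions are illustrative stand-ins:

```python
# Sketch of question-driven EDA on a stand-in dataset (California housing).
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame

# Question 1: does each variable's shape make sense for this problem?
# Extreme percentiles are a starting point for "why does it look like this?".
print(df.describe(percentiles=[0.01, 0.5, 0.99]).T)

# Question 2: are the relationships consistent with the domain?
# House value should rise with income; a sign flip here would be a red flag.
print(df.corr(numeric_only=True)["MedHouseVal"].sort_values())

# The deliverable is not the printouts -- it is a written list of hypotheses,
# e.g. "AveRooms has extreme outliers that will need capping before modeling".
```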
Feature engineering: the step almost everyone skips
The most common mistake I've seen from students working with data is using the dataset exactly as it comes. They have the target identified, they train the model directly, and they wait for results. That works sometimes, but it almost always leaves value on the table.
The reason modern deep models have reduced the need for manual feature engineering is that they learn their own representations when they have a lot of data. But that condition — a lot of data — is not always met. And when working with classical machine learning models, feature engineering can be the difference between a mediocre result and one worth defending.
Real case: ProstateX dataset and the circle feature
I worked with the ProstateX dataset, a set of MRI images for prostate cancer detection. The problem with this dataset is that it contains more information than needed — the full image includes much more than the prostate: surrounding tissue, pelvic bone structure, and other elements that are noise for the model.
The prostate, in most images, appears in the center. What I did was create a feature that relates the pixels in a circle centered on the image to the pixels outside that circle. Instead of passing the full image to the model and asking it to find the relevant pattern on its own, I gave it a contrast measure between the center and the exterior.
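A hedged sketch of the idea — the ratio form and the radius fraction below are my illustrative choices, not the exact ProstateX implementation:

```python
# Sketch of the circle-contrast idea: compare mean intensity inside a
# centered circle against the mean outside it. `image` is any 2-D array;
# the radius fraction is a hypothetical choice, not the ProstateX value.
import numpy as np

def center_contrast(image: np.ndarray, radius_frac: float = 0.25) -> float:
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    # Boolean mask of pixels inside a circle centered on the image.
    inside = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= (radius_frac * min(h, w)) ** 2
    # One scalar feature: ratio of central to peripheral intensity.
    return image[inside].mean() / (image[~inside].mean() + 1e-8)

# Example on a synthetic image with a bright center.
img = np.ones((64, 64))
img[24:40, 24:40] = 3.0
print(center_contrast(img))
```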
The result was significant. That feature became one of the most important in the model according to variable importance — and it makes sense from the domain: when prostate cancer is present, the central structure of the image shows differences relative to the peripheral tissue that a physician would also observe.
What matters here is not the specific technique. It's the logic behind it: instead of asking the model to find the pattern on its own, you give it a clue grounded in domain knowledge. That is feature engineering.
There is an important caveat: feature engineering can work against you if it becomes too specific. A feature tightly fitted to one dataset's patterns may improve training results but destroy the model's ability to generalize. The question that should guide every such decision is: does this feature capture something real about the problem, or does it only capture noise from this particular dataset?
How to identify which features are worth creating
The process is straightforward in concept: think about what pattern you believe exists in the data that the model might not detect on its own with the available data, create the feature that represents that pattern, and evaluate whether it improves results. If it doesn't, discard it. If it improves results and makes sense in the domain, keep it.
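As a sketch of that loop — the candidate feature and the dataset below are illustrative stand-ins:

```python
# Sketch of the create -> evaluate -> keep/discard loop on a stand-in dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Hypothetical candidate feature: a ratio that domain intuition might suggest.
X_new = X.copy()
X_new["area_to_perimeter"] = X["mean area"] / X["mean perimeter"]

model = RandomForestClassifier(n_estimators=200, random_state=0)
baseline = cross_val_score(model, X, y, cv=5).mean()
candidate = cross_val_score(model, X_new, y, cv=5).mean()
print(f"baseline: {baseline:.3f}   with candidate feature: {candidate:.3f}")
# Keep the feature only if it improves the score *and* makes domain sense.
```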
Feature engineering is where domain knowledge becomes a technical advantage. And in academic projects it is exactly the kind of contribution a committee recognizes as a design criterion, not just model application.
How to search for datasets strategically
The approach that has worked best for me, and that I consistently recommend, is searching Google Scholar for the classification type or task you are interested in, combined with the word "dataset." Read the recent articles that appear, see what datasets they used, and verify which ones are openly available.
This strategy has an advantage that Kaggle or Hugging Face don't offer directly: the articles tell you what has been done with that dataset, what results were obtained, and what preparation it required. That gives you context to evaluate whether it's worth using and what you could contribute that is different.
Kaggle and Hugging Face are valid sources, but they require judgment. Before committing to any dataset from any source, you need to understand how it was built, who collected it, and under what conditions. A dataset is reliable when its documentation clearly explains the collection process, who validated it, and — even better — when there is a published article in an indexed journal backing its construction.
If the documentation is poor, if nobody discusses the dataset, or if the little you find raises doubts about its origin, that is enough of a warning to look elsewhere.
Available does not mean downloadable
This is one of the costliest mistakes in terms of time that I have seen repeated: a student or professional finds a dataset that looks perfect, proposes it as the basis of their project or thesis, spends weeks planning, and then discovers they cannot access the data.
The dataset "exists" on a page, in a repository, or in an article, but downloading it requires an institutional account, a formal request to the organization that generated it, an approval process that can take months, or a payment that was not anticipated.
The rule is simple: before proposing any dataset as the basis of your project, download it. Do not assume that because it appears on a list of open sources or because an article cites it as available, you will be able to access it under the conditions you need.
Once you have it downloaded, run a quick test: a basic EDA and a simple tree model. Those two hours of initial work will give you far more valuable information than weeks of planning based on assumptions. If that quick test yields results that are consistent and promising, then it is worth investing more time. If everything becomes complicated with no obvious reason why, it is better to know that before committing.
Dataset Novelty Score: how to avoid overused datasets
In academic projects, using a dataset that has already been worked on extensively in the literature is a real problem. Not because it cannot be used, but because it reduces the available contribution space: if that dataset already has dozens of papers with optimized models, what you contribute has to be sufficiently different to justify itself.
To estimate how saturated a dataset is in the literature, I use a metric I call the Dataset Novelty Score. It is a beta tool, but it works well as a first approximation for making this decision.
How the Dataset Novelty Score is calculated
The calculation is simple and done with Google Scholar searches:
- D: number of domain papers published in the last 3 years (e.g., "prostate cancer detection machine learning").
- DS: number of domain papers published in the last 3 years that specifically mention that dataset.
- DNS = 1 − (DS / D)
A DNS close to 1 indicates the dataset is rarely used in recent literature, representing more contribution opportunity. A DNS close to 0 indicates the dataset is very common in the domain and the space to differentiate is narrower.
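The calculation is trivial to script once you have the two counts; the numbers below are made-up examples:

```python
# Helper for the Dataset Novelty Score. D and DS come from manual Google
# Scholar counts over the last 3 years, as described above.
def dataset_novelty_score(domain_papers: int, dataset_papers: int) -> float:
    """DNS = 1 - (DS / D); closer to 1 means less used in recent literature."""
    return 1.0 - dataset_papers / domain_papers

# Made-up example: 1,200 recent domain papers, 240 mentioning the dataset.
print(dataset_novelty_score(1200, 240))  # 0.8 -> relatively unexplored
```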
How to interpret the result
The DNS does not automatically discard a dataset. A dataset with a low DNS can still be valid if your contribution depends not on dataset novelty but on novelty of approach, application, or engineering applied to it. What the DNS does give you is an early signal: if the DNS is low and your contribution is not particularly different technically, that is the moment to look for alternatives before moving forward.
In practice, I use the DNS alongside an evaluation of the most recent article using that dataset: if that article already achieved very high metrics with established techniques, the margin for improvement is small and that must be factored into the project design from the start.
How to evaluate a dataset in one or two hours
When I find a dataset on Google Scholar, Kaggle, or Hugging Face, the process I follow before making any decision about using it has four steps. All of them fit within one or two hours of work.
Step 1: read the description critically
The first step is not downloading the dataset. It is reading how it is described: who generated it, under what conditions it was collected, and what variables it contains. If the description is poor or does not mention the collection process, that is already information: it indicates that the dataset's traceability is low and that relying on it for an academic or professional project is risky.
I also check whether there is a reference article. A dataset backed by a paper in an indexed journal has a level of validation that a repository dataset with no documentation does not. It is not an absolute requirement, but it does change the level of confidence.
Step 2: review the structure
Before running any code, I check whether the dataset is tabular, image-based, or text; whether the target is clearly identified; and whether the variables have names and types that make it clear what they represent. I don't focus on size at first. If a dataset is published and open, it generally has enough data for at least an initial evaluation. What matters most in this step is whether I will quickly understand what the data is about, or whether I will need a lot of work just to get oriented.
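In practice, a handful of pandas calls answers most of these questions; `dataset.csv` below is a placeholder for whatever file you are inspecting:

```python
# Structure check in a few pandas calls; "dataset.csv" is a placeholder path.
import pandas as pd

df = pd.read_csv("dataset.csv")
print(df.shape)    # rows x columns
print(df.dtypes)   # do the types match what the names suggest?
print(df.head())   # do the values look like what the documentation describes?
print(df.isna().mean().sort_values(ascending=False).head())  # worst missingness
```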
Step 3: quick test with a tree model
With the dataset downloaded, I train a tree-based model — either a decision tree or a simple Random Forest — on the data as-is. I am not trying to optimize anything in this step. I am looking for three things: that the model trains without errors, that the variable importance is consistent with the domain, and that the initial results are not completely random.
If the most important variables make sense from the problem perspective, if the distributions show nothing unusual, and if the initial result — even if not great — is at least consistent, then the dataset has potential and it is worth continuing.
Step 4: verify consistency between correlations and variable importance
The last step of the quick evaluation is comparing what correlations suggest with what the tree's variable importance indicates. If both sources point in the same direction and that is consistent with domain common sense, I have enough confidence to invest more time in that dataset. If there are unexplained contradictions, I note them as a warning and evaluate whether they are solvable before moving on.
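One way to make this comparison concrete — an illustration, not the only valid procedure — is to check rank agreement between the two orderings with Spearman's rho:

```python
# Illustration of the consistency check: rank agreement (Spearman's rho)
# between tree importances and absolute correlations, on a stand-in dataset.
import pandas as pd
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
abs_corr = X.corrwith(y).abs()

rho, _ = spearmanr(importances, abs_corr)
print(f"rank agreement: rho = {rho:.2f}")
# Strong positive agreement supports investing more time; near-zero or
# negative agreement is the unexplained contradiction worth investigating.
```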
Signals that make me quickly discard a dataset
- Nonexistent or very poor documentation.
- A dataset in a GitHub repository with no clear README and no article or blog post backing it.
- Difficulty downloading it, or access contingent on approvals that could take weeks.
- No one in recent literature is using it, which may indicate it was already evaluated and discarded for reasons that are not obvious at first glance.
The traffic light method for categorizing variables when you define the problem first
When the problem is defined from the start and the challenge is understanding what data exists that could be useful, the process I follow is different from looking for a ready-made dataset. In that case, the first thing I do is ask that all data related to the problem be brought to the table: format doesn't matter, whether it's a PDF, an Excel file, a relational database, or unstructured text. I want to see everything before filtering.
Once I have that complete picture, I work together with the person who knows the problem or domain — whether that's the client, the advisor, or the subject matter expert — to categorize each data source into three levels using what I call the traffic light method:
- Green: data that is clearly relevant, available, and whose use makes sense for the defined problem.
- Yellow: data that could be useful but requires additional validation. It may be that the timeframe is not ideal, there are doubts about its quality, or it is not yet clear how to integrate it into the problem.
- Red: data that does not make sense to use. It may be because the service that generated it no longer exists, because the timeframe is incompatible with the problem, or because from domain knowledge it does not provide relevant information.
I don't do this process alone. The domain expert's participation is fundamental because there are reasons to classify a data source as red that are not technical: business reasons, historical context, changes in the process that generated the data. Without that perspective, I could end up using data that seems technically correct but does not represent the reality of the problem.
The result of the traffic light method is a clear categorization that reduces the workspace: instead of trying to use everything, you know exactly what you need to work with and why. That accelerates the entire subsequent process.
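Even as a plain data structure, the output of this exercise is worth writing down; the sources and reasons below are invented examples of what each entry should record:

```python
# The traffic light output as a plain data structure; sources and reasons
# are invented examples. Each entry records *why* it got its color -- that
# justification is what the domain expert supplies.
traffic_light = {
    "green":  [("sales_db.orders", "clean, matches the problem's timeframe")],
    "yellow": [("legacy_crm.csv", "quality unknown; validate with operations")],
    "red":    [("old_loyalty_app", "service discontinued; pre-dates the current process")],
}

for color, sources in traffic_light.items():
    for name, reason in sources:
        print(f"[{color.upper():>6}] {name}: {reason}")
```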
What to do when data doesn't exist
The absence of data is almost always an absence of access, not a real absence of information. Data generally exists. What does not exist is a direct way to reach it. That distinction completely changes how the problem is approached.
When you do not have access to the data you need, there are three options: find a proxy, generate synthetic data, or reframe the problem. The choice between these three depends on the project's context.
In an academic project, the first question is what the advisor thinks about the three options. That conversation is part of the design process and needs to happen before committing to a direction. In a business project, the question is which of the three options has the best ratio between implementation cost, time required, and expected impact on the objective.
A well-justified proxy is not a second-rate solution. It is a design decision. What it implies is that the original problem is reframed so that the proxy can answer it, without losing the main objective. That requires conceptual work, but it is work that gives the project solidity.
Real case: modeling influencer reach on social media when data is not accessible
The most difficult case I've had in terms of data access was a social sciences project where the objective was to model the influence power of influencers through their followers' behavior toward specific products.
The conceptual problem was well defined: influence is measured indirectly through the comments followers make on social media in response to product mentions. That type of data exists in enormous volumes. The problem is that it was not accessible under the project's conditions.
Meta does not allow comment data extraction for external projects, except through a very restricted API for political content. X, formerly Twitter, began charging for API access under conditions that made the project economically unviable. LinkedIn was not relevant for this type of analysis. The seemingly obvious options were closed.
What we did was redirect toward YouTube. The justification was solid: YouTube functions in many contexts as a purchase-decision search engine, especially in product categories where video content directly influences buying intent. The YouTube API allows access to comments, engagement metrics, and content data openly, within the project's limits.
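For reference, pulling top-level comments through the YouTube Data API v3 looks roughly like this with google-api-python-client — `API_KEY` and `VIDEO_ID` are placeholders, and quota limits plus the API's terms of use still constrain what a project can collect:

```python
# Hedged sketch of fetching top-level comments via the YouTube Data API v3.
# API_KEY and VIDEO_ID are placeholders.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"   # placeholder
VIDEO_ID = "VIDEO_ID"      # placeholder

youtube = build("youtube", "v3", developerKey=API_KEY)
request = youtube.commentThreads().list(
    part="snippet", videoId=VIDEO_ID, maxResults=100, textFormat="plainText"
)
response = request.execute()

for item in response["items"]:
    top = item["snippet"]["topLevelComment"]["snippet"]
    print(top["likeCount"], top["textDisplay"][:80])
```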
The result was a well-justified proxy: it was not exactly the original problem, but it addressed the main objective with real, verifiable data. The problem had been reframed in a way that was still academically valid and technically executable.
What this case illustrates is that the solution when data access is unavailable is not to give up or change the topic entirely. It is to understand what part of the problem you can still answer with the data that is available, and to be rigorous in justifying why that proxy is valid.
If this had been a business project instead of an academic one, the decision would have required evaluating whether the cost of obtaining access to the original data — whether by paying for APIs or establishing formal agreements with platforms — made sense relative to the expected impact. Sometimes it does. Sometimes it doesn't.
Dataset first or problem first: when each applies
This is not a question with a single answer. It depends on context and available resources.
Starting with the dataset makes sense when you have limited time, when you are in an academic project where the literature already supports the use of that dataset, or when you strategically want to choose a topic that already has quality data available. In those cases, identifying a well-documented dataset with promising results in recent articles and a high DNS is a legitimate starting point. From there you define what problem you can solve with that data in a way that makes your contribution valid.
Starting with the problem is the right path when you have a clear impact objective, when you are in a business project with a real user, or when the contribution you seek depends on solving a specific problem and not on exploring what can be done with an available dataset. In that case, the dataset comes later: first you understand what you need, then you look for whether it exists or how to build it.
The distinction that matters
What does not work is confusing finding a dataset with having a project. A dataset is an input. A project requires a defined problem, a metric aligned with that problem, and a way to demonstrate that the solution generates value. Having the data is the beginning of the process, not its result.
The reason this confusion is so costly is concrete: if you don't define the problem first, the model's metric and the objective's metric will be misaligned. You will be optimizing the how without being clear on the what. And that translates into duplicated work when you realize late that what you were measuring was not what you needed.
How to work with limited data
Working with limited data is not automatically a problem. It depends on the type of data and the model you want to use.
Tabular data, especially in domains like finance, operations, or structured business data, works well with relatively small datasets when the data is clean and well organized. With 150 well-prepared records you can evaluate classical machine learning models with sufficient confidence — not for deep learning, but for most approaches used in undergraduate and master's academic projects.
The most frequent mistake with limited data is trying to use overly complex models. A complex model with limited data tends to overfit: it learns the specific details of the training dataset instead of generalizable patterns. The result is good training metrics and poor validation metrics.
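The signature is easy to see directly. A sketch on 150 synthetic records, comparing a simple and a complex model:

```python
# The overfit signature on small data: train vs. cross-validated accuracy
# for a simple and a complex model on 150 synthetic records.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=20, n_informative=5,
                           random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("unconstrained random forest", RandomForestClassifier(random_state=0)),
]:
    train_acc = model.fit(X, y).score(X, y)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: train={train_acc:.2f}  cv={cv_acc:.2f}")
# A large train-cv gap is the overfitting signal described above.
```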
When results don't improve with limited data, the right question before looking for more data is whether the preprocessing can be improved. Adding more data often doesn't solve the problem if the problem is in how the existing data is prepared. Feature engineering, outlier cleaning, and proper normalization can have more impact than doubling the dataset size.
And if the available data is genuinely insufficient for any reasonable approach, the path is to evaluate whether more can be obtained, whether it can be generated synthetically, or whether the problem needs to be reframed so it is solvable with the data that exists.
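On the synthetic route, one deliberately naive option for tabular data is jittering existing numeric records with small Gaussian noise — shown here only as an illustration of the idea, since synthetic rows inherit every bias of the originals:

```python
# Naive synthetic augmentation for tabular data: noisy copies of existing
# numeric records. Illustration only -- the noise scale is arbitrary here.
import numpy as np

rng = np.random.default_rng(0)

def jitter_augment(X: np.ndarray, n_copies: int = 2, scale: float = 0.05) -> np.ndarray:
    """Stack noisy copies of X; noise is scaled per column by its std."""
    noise_scale = scale * X.std(axis=0, keepdims=True)
    copies = [X + rng.normal(size=X.shape) * noise_scale for _ in range(n_copies)]
    return np.vstack([X, *copies])

X = rng.normal(size=(150, 8))
print(jitter_augment(X).shape)  # (450, 8)
```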
Quick decision guide
If you have limited time
Search for existing datasets on Google Scholar using the type of problem you're interested in plus the word "dataset." Filter by the last three to five years. Prioritize those with a published reference article. Download them before committing. Run a quick tree test and verify consistency between correlations and variable importance.
If you have no data
Distinguish between "it doesn't exist" and "I don't have access." Data almost always exists. Look for proxies: what available data approximates the problem you want to solve and allows it to be reframed without losing the main objective. If the project is academic, discuss it with your advisor. If it is a business project, evaluate the cost-benefit of collecting data versus using a proxy.
If you are a beginner or intermediate level
Avoid datasets without clear documentation. Do not use sources you cannot verify. And do not assume a dataset is available until you have downloaded it and run a minimal test.
If results are not improving
Before trying another model, stop and review the data. Check whether there is feature engineering you can apply with domain knowledge. If you have already tested several models and none gives consistent results, the problem is in the input, not the algorithm.
If you want to know whether a project is worth pursuing
Even if results are not very good, if the quick test generates some insight consistent with the problem or with the contribution you are seeking, it is worth continuing to explore. If after three days of testing there is no way to get the data to answer any reasonable version of the problem, that is the moment to reconsider the topic.
Frequently asked questions about data in machine learning projects
How do you know when the problem is in the data and not the model?
When you test several models, tune hyperparameters, and nothing improves significantly. At that point the problem is almost always the input: poorly organized data, inconsistent variables, or a target that does not properly capture what you want to predict.
How can you tell in a few hours whether a dataset is useful for a machine learning project?
Read the dataset description, review the structure and target, download it, and run a simple tree model. If variable importance is consistent with correlations and with domain intuition, the dataset has potential. If there is no consistency and no obvious reason why, that is a warning sign.
What is the Dataset Novelty Score and what is it for?
It is a metric to estimate how saturated a dataset is in recent literature. It is calculated as DNS = 1 − (papers using the dataset in the last 3 years / total papers in the domain in the last 3 years). A high DNS indicates the dataset is rarely used and has more value for academic contributions.
What should you do when there is no dataset for your machine learning problem?
Look for a proxy: a dataset that allows the problem scope to be reframed without losing the main objective. In academic projects, discuss it with your advisor. In business projects, evaluate the cost of collecting data versus the impact of using a well-justified proxy.
Is it better to start a machine learning project with the data or with the problem?
It depends on context. If you have limited time or the project is academic, starting with an available dataset can be strategic. If you're seeking real impact or working in business, defining the problem first is the right path. In all cases, the problem defines the value of the project; data is the means.
What is the most common mistake when working with data in machine learning?
Using data as-is without asking what feature engineering could improve the result. And also: assuming a dataset is available without downloading it before committing to the project.
If you already have data but are not sure how to turn it into a complete project with a clear contribution, you can reach out to me directly. Often the difference between a project that moves forward and one that stalls is in how those first decisions about data are structured.