Article Type: Research Article
Introduction
The increasing scarcity of natural resources and the growing demand for food security have driven agriculture toward innovative production strategies. Among them, greenhouse cultivation has emerged as a sustainable solution, offering up to ten times higher productivity while cutting water consumption to roughly 10% of that used in open-field farming. However, its higher investment cost underscores the importance of resource optimization and accurate performance prediction. In recent years, the integration of big data and machine learning (ML) techniques has transformed agricultural management by enabling precise forecasting and decision-making. Unlike traditional statistical models, ML algorithms can handle nonlinear interactions and identify hidden patterns, thereby improving prediction accuracy. In greenhouse systems, models such as artificial neural networks, support vector machines, and random forests have been successfully applied to predict climatic factors and crop yields. This study aims to develop an ML-based model to predict the performance of greenhouse crops in Iran, thereby supporting sustainable agriculture and informed policy-making.
Method
This study developed regression-based machine learning models to predict greenhouse crop yields (tons per hectare) across provinces and counties, using crop type and cultivation area as inputs. Data were collected from official agricultural yearbooks (2002–2023) and comprised 12,426 records covering 32 provinces, 474 counties, and 33 crop types. The dataset was preprocessed by label-encoding the categorical variables and normalizing the numerical features using MinMaxScaler and manual scaling.
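The two preprocessing steps named above can be illustrated with a minimal pure-Python sketch. It is not the study's code: the records, province names, and helper functions `label_encode` and `min_max_scale` are hypothetical, and the integer codes are assigned in sorted order only as one reasonable convention.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted order, starting at 0)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def min_max_scale(values):
    """Rescale numbers linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical records (NOT from the yearbooks): (province, crop, area in hectares)
records = [
    ("Tehran", "cucumber", 12.0),
    ("Isfahan", "tomato", 4.0),
    ("Tehran", "tomato", 20.0),
    ("Yazd", "cucumber", 8.0),
]

province_codes, province_map = label_encode([r[0] for r in records])
crop_codes, crop_map = label_encode([r[1] for r in records])
area_scaled = min_max_scale([r[2] for r in records])

# One feature row per record: (province code, crop code, scaled area)
features = list(zip(province_codes, crop_codes, area_scaled))
```

Label encoding imposes an arbitrary numeric order on the categories. Tree-based models are largely insensitive to that order, whereas distance-based and neural models can be affected by it.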
Data were split into 90% training and 10% testing sets. Four regression algorithms were implemented: Support Vector Regression (SVR) with an RBF kernel, Decision Tree (max_depth=128, min_samples_split=64), Multi-layer Perceptron (two hidden layers with 64 and 32 neurons), and Random Forest (200 trees, max_depth=64). Hyperparameters were tuned manually. Performance was evaluated using R², MSE, MAE, RMSE, and standard deviation, and statistical significance was assessed by one-way ANOVA.
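As a rough illustration of this setup, the sketch below instantiates the four regressors with the stated hyperparameters. It assumes scikit-learn, which the text does not name explicitly. The synthetic data, `random_state` values, and `max_iter=500` are assumptions added for reproducibility. Any hyperparameter not listed in the text is left at the library default.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the study's dataset): 3 features in [0, 1].
rng = np.random.default_rng(0)
X = rng.random((400, 3))
y = 50.0 * X[:, 0] + 30.0 * X[:, 1] * X[:, 2] + rng.normal(0.0, 1.0, 400)

# 90% training / 10% testing, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Hyperparameters from the text; random_state and max_iter are assumptions.
models = {
    "SVR": SVR(kernel="rbf"),
    "DT": DecisionTreeRegressor(max_depth=128, min_samples_split=64, random_state=0),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
    "RF": RandomForestRegressor(n_estimators=200, max_depth=64, random_state=0),
}

r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2_scores[name] = model.score(X_test, y_test)  # R² on the held-out 10%
```

The scores on this toy data say nothing about the reported results. The block only shows how the described configuration maps onto library calls.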
Results
After training and evaluating the models, the Random Forest (RF) model demonstrated the best performance with an R² of 0.9598 and the lowest Mean Squared Error (MSE) of 15.6×10⁶, indicating highly accurate predictions. The RF model’s complexity, involving 200 trees and a max depth of 64, resulted in a large model size (184,488 KB) but ensured robust handling of complex, nonlinear relationships in the data. The prediction plot shows excellent alignment between actual and predicted values, confirming its superior accuracy.
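For reference, the error metrics used in this comparison follow standard definitions, sketched below in plain Python. The `regression_metrics` helper and the four toy values are hypothetical. Only the final line uses a figure from the text, converting the reported RF MSE into an RMSE in the yield's own unit.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, and R² for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


# Toy values for illustration only (NOT study data).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(actual, predicted)

# RMSE implied by the reported RF MSE of 15.6×10⁶ (same unit as the yield data).
rmse_rf = math.sqrt(15.6e6)
```

Because RMSE is the square root of MSE, it is expressed in the same unit as the yield itself, which makes it easier to interpret than MSE.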
The Multi-layer Perceptron (MLP) model achieved the second-best performance with an R² of 0.8878 and an MSE of 27.3×10⁶, outperforming SVR and Decision Tree (DT) in accuracy. Its architecture, with two hidden layers of 64 and 32 neurons, enabled it to learn nonlinear data patterns, although some prediction errors persisted, likely because of the relatively shallow network depth. The model size was moderate (85 KB), reflecting the trade-off between complexity and computational efficiency.
The DT model showed reasonable accuracy with an R² of 0.8864 but recorded the highest MSE of all four models (44.3×10⁶), indicating some large prediction errors. Owing to its simple tree-based structure, it had the smallest model size (55 KB) and the fastest computation times, making it suitable for scenarios that prioritize speed and low storage over absolute accuracy.
The Support Vector Regression (SVR) model had the lowest R² (0.8546) and an MSE of 33.6×10⁶, which is higher than that of RF and MLP but lower than that of DT. While SVR can model nonlinear data effectively, its performance in this case was limited by its sensitivity to hyperparameter tuning. The model size was moderate (182 KB), but the prediction plots showed greater divergence from actual values than those of the other models.
Error metrics, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and standard deviation (STD), further confirmed the RF model's strong predictive stability, with the lowest MAE and RMSE values. One-way ANOVA revealed no significant difference between the RF predictions, the DT predictions, and the actual values at the 90% confidence level, while SVR showed a statistically significant deviation, suggesting lower reliability. Two-way ANOVA confirmed that RF and DT had similarly high accuracy, statistically outperforming MLP and SVR.
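The one-way ANOVA comparison can be illustrated by computing the F statistic by hand. The sketch below is not the study's analysis: the `one_way_anova_f` helper and the toy groups are hypothetical, and it stops at the F statistic. Turning F into a p-value requires the F distribution, for example via `scipy.stats.f_oneway`.

```python
def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy values for illustration only (NOT study data).
actual = [10.0, 20.0, 30.0, 40.0]
close_model = [11.0, 19.0, 31.0, 39.0]    # same mean as the actual values
biased_model = [30.0, 40.0, 50.0, 60.0]   # shifted upward by 20

f_close, _, _ = one_way_anova_f(actual, close_model)
f_biased, df_b, df_w = one_way_anova_f(actual, biased_model)
```

A small F means the group means are statistically indistinguishable. This test compares means only, so a model can pass it while still making large errors on individual records.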
In conclusion, RF is the most accurate and stable model for predicting greenhouse crop yield, though at the cost of higher computational demands and a larger model size. DT offers a faster and smaller alternative with acceptable accuracy. MLP and SVR lag behind in performance, likely because of suboptimal model complexity and parameter tuning. These findings suggest that RF is suitable for detailed predictive tasks, while DT is preferable when computational resources are limited.
Conclusions
Greenhouse cultivation is recognized as a sustainable agricultural method, with yield being a key indicator of efficiency. This study evaluated four machine learning models, namely MLP, SVR, Decision Tree (DT), and Random Forest (RF), for predicting greenhouse crop yields across Iran from geographic, crop type, and cultivation area data. Among them, RF showed the best performance (R² = 0.9598, MSE = 15.6×10⁶) by effectively modeling complex relationships. DT also performed well, but it made some large prediction errors, resulting in a higher MSE. MLP and SVR exhibited weaker accuracy and significant deviations from actual values. These findings suggest that machine learning models, especially RF, can serve as powerful tools for policymakers and stakeholders in optimizing crop selection, resource management, and yield prediction. The developed RF model enables precise, location-based forecasting that can guide strategic agricultural planning and improve greenhouse productivity nationwide.
Acknowledgements
The authors sincerely extend their gratitude to the Sabz Payesh Afra Watergy Systems Optimizers Company for its kind cooperation and invaluable assistance in facilitating the collection of the data required for this study.
Author Contributions
Khashayar Zare: Methodology, Conceptualization, Writing, Data gathering, Data analysis
Ahmad Hosseinnejad: Methodology, Conceptualization, Supervision, Data analysis
Data Availability Statement
Not applicable.
Ethical Considerations
This section states ethical approval details (e.g., Ethics Committee, ethical code) and confirms adherence to ethical standards, including avoidance of data fabrication, falsification, plagiarism, and misconduct.
Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Funding Statement
The authors received no specific funding for this research.