Machine Learning / Statistical Data

Examples of machine learning datasets.

Machine learning is used as a general term for computational data analysis: using data to make inferences and predictions. Interpreted broadly, it includes computational statistics, data analytics, data mining and a good portion of data science.

Machine learning algorithms are often categorized as supervised or unsupervised ("data mining").

Supervised machine learning algorithms learn from labeled examples and apply what they have learned to new data, for instance to predict future events. Starting from a known training dataset, the learning algorithm produces an inferred function that predicts output values. After sufficient training, the system can provide a target for any new input. The learning algorithm can also compare its output with the correct, intended output and use the errors it finds to modify the model.
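The loop described above (predict, compare with the correct output, adjust the model) can be sketched with a perceptron, one of the simplest supervised learners. This is only an illustration: the toy points, the labels and the helper names `predict` and `train_perceptron` are assumptions made for this example and do not come from any dataset listed on this page.

```python
def predict(weights, bias, x):
    """Inferred function: return class 1 if the weighted sum is positive, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=50, lr=1.0):
    """Learn weights from labeled examples by correcting prediction errors."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # compare output with the intended output
            if error != 0:
                mistakes += 1
                # Modify the model in the direction that reduces the error.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias


# Hypothetical, linearly separable training set (made up for illustration).
train_x = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
train_y = [0, 0, 0, 1, 1, 1]

weights, bias = train_perceptron(train_x, train_y)
training_predictions = [predict(weights, bias, x) for x in train_x]
new_prediction = predict(weights, bias, (5, 5))  # a target for an unseen input
```

The perceptron only converges when the classes are linearly separable, which is why the toy data was chosen that way; real datasets usually call for a more robust model.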

In contrast, unsupervised machine learning algorithms are used when the training data is neither classified nor labeled. Unsupervised learning studies how systems can infer a function that describes hidden structure in unlabeled data. The system is not given a right output to reproduce; instead, it explores the data and draws inferences about that structure.
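As a rough illustration of finding hidden structure without labels, the sketch below runs a bare-bones k-means clustering on a few made-up 2D points. The data, the function names and the choice to initialise the centroids from the first `k` points are all assumptions for this example; practical implementations use more careful initialisation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=100):
    """Group unlabeled points into k clusters (simplified k-means)."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as seeds
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical unlabeled data with two visible groups (made up for illustration).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
```

No labels are supplied anywhere: the grouping into two clusters is inferred purely from the distances between points.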

Semi-supervised machine learning algorithms fall between supervised and unsupervised learning, since they use both labeled and unlabeled data for training: typically a small amount of labeled data and a large amount of unlabeled data. Systems that use this method can considerably improve learning accuracy. Semi-supervised learning is usually chosen when labeling data requires skilled and relevant resources, whereas acquiring unlabeled data generally does not require additional resources.
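One simple way to combine a few labeled points with many unlabeled ones is self-training: repeatedly give the most confidently classified unlabeled point a label and treat it as labeled from then on. The sketch below does this with a 1-nearest-neighbour rule on made-up one-dimensional data. It is one possible scheme among many, and the function name `self_train` and all values are assumptions for this example.

```python
def self_train(labeled, unlabeled):
    """Spread labels from a few labeled points to unlabeled ones (1-NN self-training)."""
    known = dict(labeled)  # point -> label
    remaining = list(unlabeled)
    while remaining:
        best = None  # (distance, unlabeled point, label of its nearest labeled point)
        for u in remaining:
            for p, lab in known.items():
                d = abs(u - p)
                if best is None or d < best[0]:
                    best = (d, u, lab)
        _, point, label = best
        known[point] = label  # the closest unlabeled point adopts its neighbour's label
        remaining.remove(point)
    return known


# Hypothetical data: two labeled points and six unlabeled ones (made up for illustration).
labeled = {0.0: "a", 10.0: "b"}
unlabeled = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]

result = self_train(labeled, unlabeled)
```

Each newly labeled point can in turn label its own neighbours, so the labels spread outward along the structure of the unlabeled data.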

Reinforcement learning is a learning method in which an agent interacts with its environment by producing actions and discovering errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. This method allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Simple reward feedback, known as the reinforcement signal, is required for the agent to learn which action is best.
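Trial-and-error search with a delayed reward can be sketched with tabular Q-learning on a tiny corridor: the agent starts at the left end and is rewarded only when it reaches the right end. The environment, the hyperparameters and the purely random exploration policy are all assumptions chosen to keep the example short, not a recommended setup.

```python
import random

N_STATES = 5           # corridor cells 0..4
GOAL = N_STATES - 1    # reward is given only on reaching the last cell
ACTIONS = (-1, +1)     # index 0 = move left, index 1 = move right


def step(state, action):
    """Hypothetical environment: move along the corridor, clipped at both ends."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0  # the reinforcement signal
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Estimate action values from random trial-and-error episodes."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _t in range(1000):  # safety cap on episode length
            a = rng.randrange(2)  # explore by acting uniformly at random
            next_state, reward, done = step(state, ACTIONS[a])
            # Delayed reward reaches earlier states through the discounted target.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


q_table = q_learning()
# Greedy policy: in each non-terminal cell, pick the action with the higher value.
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:GOAL]]
```

Only the final move into the goal cell is ever rewarded, yet the learned values make moving right preferable in every cell, because the discount factor `gamma` carries the reward back to earlier states.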

source

Datasets

There are a variety of machine-learning datasets on the DataHub under the @machine-learning account: https://datahub.io/machine-learning

Seismic Bumps: https://datahub.io/machine-learning/seismic-bumps. This is a classification problem. The data describe the problem of forecasting high-energy (higher than 10^4 J) seismic bumps in a coal mine. The data come from two longwalls located in a Polish coal mine.
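To make the classification framing concrete, the sketch below turns a sequence of per-shift bump energies into binary forecasting targets using the 10^4 J threshold. It assumes that the target for a shift is whether a high-energy bump occurs in the following shift; the energy values are invented for illustration and the function name is hypothetical, so none of this reflects the actual file layout on the DataHub.

```python
THRESHOLD_J = 10**4  # "high energy" means higher than 10^4 J


def next_shift_labels(energies_j):
    """Label each shift 1 if the following shift exceeds the threshold, else 0.

    The last shift has no successor, so it receives no label.
    """
    return [1 if nxt > THRESHOLD_J else 0 for nxt in energies_j[1:]]


# Invented per-shift maximum bump energies in joules (not real measurements).
energies = [0, 2000, 50000, 0, 8000, 30000]
labels = next_shift_labels(energies)
```

A classifier would then be trained to predict each label from measurements available during the current shift.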

Existing collections

UCI Machine Learning Repository. 404 datasets.
OpenML datasets
Kaggle datasets
Academic Torrents
TU Berlin / MLdata.org
AWS Public Datasets
BigQuery Public Datasets