Datasets for your next modelling project
Looking to do some data modelling? Perhaps you are searching for your next trading strategy, researching a new market for a product launch, or simply fine-tuning your data wrangling skills.
Whatever you are doing, you are probably going to need some data. So, to get you started, I have put together a brief list of datasets, grouped into a few broad categories including:
- Government
- Academic
- Finance and economics
- Social media
- Dataset search engines and lists
Governments around the world provide free access to public datasets covering areas such as health, education, finance and more.
Given the source of the data (and depending on your level of cynicism), these datasets can be considered trustworthy and credible, so they are a good fit for projects that rely on accurate data. How current the data is depends on the collection cycle of each government organisation, but most of it will be fairly up to date.
Universities, and other similar academic institutions, provide free datasets to help students, researchers and the general public with data modelling and machine learning projects. A few examples include:
- Stanford – Large Network Dataset Collection
- University of California Irvine – Machine Learning Repository
- Harvard – Dataverse
In addition to the publicly available government economic data, there are plenty of service providers that can supply you with finance and economics data (most for a fee):
- Quandl – A free and paid source of data covering stocks, futures, commodities, currencies as well as “alternative” datasets such as transactions from email-based receipts from major retailers.
- Trading Economics – A paid source of economics data from around the world.
- Asset Macro – Over 25 thousand datasets for macroeconomic indicators and financial data.
Looking to do some sentiment analysis or analyse demographics for a new product launch? In addition to mining their data to sell advertising, social networks may also provide access to their data (at some aggregated level).
- Gnip – Twitter’s “enterprise API platform”, providing access to real-time and historical data.
- Datasift – Provides APIs to access data from platforms such as Facebook, Instagram, YouTube, Reddit and more.
- Keyhole – Historical data from Twitter and Instagram for hashtags.
Generally, the services that offer API access to social media platforms are expensive and most likely intended for enterprise customers (e.g. the marketing department of Unilever) with a large social media presence and large budgets to match.
Dataset search and lists
There are a host of search engines and aggregators for datasets – some free, others paid – such as:
- Google Public Data Explorer – Google’s search tool for viewing and downloading public datasets.
- Datahub – A free service for searching for datasets on the “CKAN” platform.
- Enigma – Claims that they want to become the “Google of public data”.
- Statista – A searchable library of “statistics and facts” with licensing starting at around $50 per month.
- Zanran – Another search engine for data and statistics.
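Several of these catalogues can also be queried programmatically. As a rough illustration, CKAN-based portals expose an "Action API" with a `package_search` endpoint, and the sketch below builds a search URL and summarises the JSON response. The portal address `https://ckan.example.org` is a placeholder rather than a real service, the function names are my own, and whether a given site (Datahub included) still exposes this API is an assumption you would need to check.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, rows=5):
    """Build a CKAN Action API package_search URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarise_results(payload):
    """Extract name, title, licence and file formats from a package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    summaries = []
    for package in payload["result"]["results"]:
        # Collect the distinct, non-empty file formats offered by this dataset.
        formats = sorted(
            {r["format"].upper() for r in package.get("resources", []) if r.get("format")}
        )
        summaries.append(
            {
                "name": package.get("name"),
                "title": package.get("title"),
                "license_id": package.get("license_id"),
                "formats": formats,
            }
        )
    return summaries


def search_datasets(base_url, query, rows=5):
    """Query a CKAN portal over HTTP and return the summarised matches."""
    with urlopen(build_search_url(base_url, query, rows), timeout=30) as response:
        return summarise_results(json.load(response))
```

Calling `search_datasets("https://ckan.example.org", "air quality")` against a real portal address would return a short list of dictionaries. The `license_id` field is included because it gives a quick first hint about usage restrictions before you download anything.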
The following sites also offer a fantastic list of publicly available datasets:
- List of public datasets on GitHub
- KDNuggets – List of datasets for data mining and data science
- Amazon Web Services – List of public datasets
- Kaggle Datasets
These links will hopefully provide you with a starting point for sourcing data for your next project. Many of the datasets are free and available for personal use, though it is always a good idea to check what limitations apply to the data you intend to use (e.g. if you have a commercial purpose for the dataset).
PS: An honourable mention has to go to a dataset on GitHub listing over 80,000 UFO sighting reports!