Let’s explore other sites too and see whether there are more suitable options.
# 2 Kaggle
Another great place to find free data sets.
Kaggle is a multifunctional site, better described as a well-known data science community, that offers not only a variety of externally shared data sets but also material for learning new things and practicing your skills.
Because users can share code with one another, Kaggle also helps you pick up best practices in the data space.
Amazing combo, isn’t it?
The search here is just as simple.
Just open the homepage and look for the search box at the top of the page.
Then, use the “in: datasets” tag.
For example, to get data about medicine, enter “medicine in: datasets” into the search box.
One more thing to know: Kaggle also hosts competitions where you can win real money if your model ranks near the top.
You can download data for both regular data sets and competitions, but you have to sign up for Kaggle and, for competition data, accept that competition’s terms of service.
# 3 FiveThirtyEight
FiveThirtyEight is one of the places I would recommend most.
It’s a perfect mixture of a large store of free data sets and great, informative articles dedicated to data science.
Frankly speaking, you could stop reading my post now and use only this website.
I’m kidding, of course, because every site has its own features and possibilities.
So, all in all, FiveThirtyEight is a good source of interesting information and working material for aspiring data scientists.
They use hard data and statistical analysis to tell stories about politics, sports, societal matters and more.
What you need to know about FiveThirtyEight is that it makes the data sets used in its articles available online on GitHub and on its own data portal.
The data there ranges from information about which states have the worst drivers to the economic worth of different college majors.
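If you’d like to play with a downloaded CSV file right away, here is a minimal sketch using only Python’s standard library. The helper names (`parse_csv`, `fetch_csv`), the column names, and the sample rows are my own placeholders, not real FiveThirtyEight data or file layouts; swap in the raw URL of whichever file you pick.

```python
import csv
import io
from urllib.request import urlopen


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(url):
    """Download a CSV file from a URL supplied by the caller and parse it."""
    with urlopen(url) as response:
        return parse_csv(response.read().decode("utf-8"))


# Made-up sample standing in for a downloaded file (NOT real data).
SAMPLE = "state,score\nExampleland,10.5\nSampleville,7.25\n"

rows = parse_csv(SAMPLE)
# Pick the row with the highest value in the hypothetical "score" column.
worst = max(rows, key=lambda row: float(row["score"]))
print(worst["state"])
```

Once you have a real file URL, `fetch_csv(url)` should return rows in the same shape as `parse_csv(SAMPLE)`, with every value still a string until you convert it.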
They make a lot of their data open to the public, meaning you can download and play with the source data yourself!
# 4 BuzzFeed
You may be surprised to see this site here, because at first glance it has nothing to do with data science.
Well, yes, BuzzFeed is a cross-platform digital media company delivering news and entertainment content.
But the truth is that it is a multifunctional service with a whole spectrum of interesting and useful options, and, as you may guess, free data sets are no exception.
Personally, I find BuzzFeed a great source of public datasets for machine learning and data science on a wide range of topics, from top fitness trends and beer recipes to pesticide poisoning rates.
You can find all of this material on GitHub.
By the way, BuzzFeed also provides plenty of other material for aspiring data scientists, such as analyses, libraries, tools, guides and more.
In other words, you can use it for almost every occasion.
# 5 Data.gov
Another site that is fast and simple: Data.gov is a large dataset aggregator and the home of the US Government’s open data.
There are 14 different topics (from agriculture and public safety to local government), so you have a good chance of finding a data set that really interests you.
What is more, this is a great site for data-driven journalism and story-telling.
The search here is simple: you can browse the data sets directly, without registering.
You can also apply extra filters such as topic category, location, tags, file format and organization to make your search more effective.
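The same kind of filtered search can be scripted. The sketch below only builds a search URL; it assumes the Data.gov catalog exposes the standard CKAN `package_search` action at the path shown and that the `tags` and `res_format` filter fields apply there, so please verify both before relying on it. The function name `build_search_url` is my own.

```python
from urllib.parse import urlencode

# Assumption: the Data.gov catalog runs on CKAN, whose search action is
# usually exposed at this path. Check it against the site before using it.
SEARCH_ENDPOINT = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, tags=(), file_format=None, rows=10):
    """Build a search URL with optional tag and file-format filters."""
    # Each filter becomes one clause of the "fq" (filter query) parameter.
    filters = [f'tags:"{tag}"' for tag in tags]
    if file_format:
        filters.append(f'res_format:"{file_format}"')

    params = {"q": query, "rows": rows}
    if filters:
        params["fq"] = " AND ".join(filters)
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Hypothetical search: agriculture data sets tagged "farming", CSV only.
url = build_search_url("agriculture", tags=["farming"], file_format="CSV", rows=5)
print(url)
```

Opening the printed URL in a browser (or fetching it with `urllib.request`) should return JSON if the endpoint assumption holds.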
# 6 Socrata OpenData
Socrata OpenData is a portal that contains multiple data sets that can be explored in the browser or downloaded for visualization.
The broad range of information makes it an attractive resource for endlessly curious data science practitioners.
However, you need to keep one caveat in mind.
The material is poorly curated, which means you have to sort through what’s available to find data that’s clean and up to date.
That said, it’s not a big disadvantage, because you can always look at the data in table form right in the browser and use the built-in visualization tools as well.
# 7 Quandl
This one will probably be most valuable for those who want to try their hand at machine learning projects.
The thing is, when you work on an ML project you usually need to clean up a data set before you can predict one column from the information in its other columns.
Doing that cleanup on your own takes a lot of time.
Thankfully, Quandl is a repository of economic and financial data that comes already cleaned up.
What is more, there’s typically an interesting target column to predict, and the other variables have some explanatory power for that target.
So, all in all, Quandl is a great choice if you want to test your machine learning algorithms without wasting time on cleaning data.
Some of this information is free, but many data sets require purchase.
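To make “predicting a column from the other columns” concrete, here is a tiny sketch that fits a straight line from one feature column to a target column with ordinary least squares. The rows are synthetic numbers I made up (target = 2 × feature + 1), not Quandl data, and `fit_line` is my own helper rather than anything from a library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, already clean rows (NOT real financial data).
rows = [
    {"feature": 1.0, "target": 3.0},
    {"feature": 2.0, "target": 5.0},
    {"feature": 3.0, "target": 7.0},
    {"feature": 4.0, "target": 9.0},
]

slope, intercept = fit_line(
    [row["feature"] for row in rows],
    [row["target"] for row in rows],
)

# Predict the target column for an unseen feature value.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```

With a cleaned-up data set, the modelling step really can be this short; with messy data, most of your code would go into fixing missing or inconsistent values before you ever reach `fit_line`.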
# 8 Reddit or r/datasets
Everyone knows Reddit as a popular social news site, but there is also a section devoted to sharing interesting data sets.
Reddit’s discussion boards are called subreddits, and r/datasets is the one for sharing, finding and discussing datasets.
The scope and quality of these data sets vary a lot since they’re all user-submitted, but they are often very interesting and nuanced.
There are also other subreddits that I find interesting:
- r/dataisbeautiful: a beautiful name with a clear purpose, offering plenty of discussion of visualizations, be they charts, graphs or maps;
- r/learnpython: a good place to master Python gradually as you learn;
- r/learnmachinelearning: an obvious choice for keeping track of the latest information and discussions.
# 9 UCI Machine Learning Repository
UCI Machine Learning Repository is probably the most famous data repository.
It is usually the first place to go if you are looking for datasets for machine learning.
It hosts a diverse range of datasets, from popular ones like Iris and Titanic survival to more recent contributions such as Air Quality and GPS trajectories.
The repository contains more than 350 datasets, labeled with attributes such as domain and the type of problem (classification or regression).
You can use these labels as filters to identify good datasets for your needs.
# 10 Academic Torrents
Last but not least.
Academic Torrents is a less mainstream yet powerful platform for researchers to share data.
According to its creators, the site is an attempt to make academic datasets and papers available via BitTorrent.
And the truth is, they achieve that goal very well.
In short, it is a data aggregator focused mainly on sharing the data sets from scientific papers.
It consists of two pieces: a site where users can search for datasets, and a BitTorrent backbone which makes sharing data scalable and fast.
It has all kinds of unusual (and often large) data sets, although it can sometimes be tricky to get context on a particular data set without reading the original paper and/or having some expertise in the relevant domains of science.
# Wrapping It Up: The Importance of Data Sets
Becoming an expert in data science is a long road.
It’s not something you can learn overnight.
It’s not something you can learn even in a month!
But you can certainly speed up the process by doing a little more every day than you usually do.
Don’t be afraid to go a little bit further, and don’t be afraid to practice your skills here and now.
Rely on these websites when working on data-centric projects.
Much of their data is available for free, either through a trial period or as entirely open access.
It’s the easiest way to gain experience, so now it’s your turn to get cracking and do things right.
Always remember, the best way to learn data science is to apply data science!
Good luck!
Hope you liked this post.
Feel free to share your ideas, thoughts, and suggestions.
Inspired to learn more?
Check out my blog on Medium and Instagram.