From government, health, and finance data to weather, baseball, and Star Trek, countless collections of free data are available to scratch your analytical itch.
The FiveThirtyEight website is devoted to reporting stories with the support of a rich collection of data. When it can, it also shares these data sets so you can do your own research. There are past records of its predictions for the major sports leagues, explorations of social attitudes such as a survey asking men what it means to be a man, and, of course, endless polls about upcoming political votes.
UNICEF, the UN agency responsible for helping raise healthy children around the world, shares a wide variety of data sets that are useful to anyone with the same goals. The big picture can be found in marquee data sets like The State of the World’s Children 2019 Statistical Tables, for those who want to track change numerically. A narrower focus can be found in tables that explore how iodized salt affects disease or the success of primary education.
Ohio State’s library keeps a web page current with pointers to some of the biggest collections of economic and financial data. There are historical records of US data sets and also some data collected by the World Bank. Some require an academic account and some are free to the public.
America’s sport is blessed by some fans who are adept enough with computers to develop extensive collections of data about the players and the results of their games. Sean Lahman’s database, for instance, contains complete batting and pitching statistics from 1871 through 2019. There are also tables of other details like fielding statistics, managerial changes, and World Series results that may not be complete, but might as well be for the modern era, which in major league baseball begins with the 20th century.
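As a rough sketch of how tables like these might be analyzed, the Python below totals home runs per player from batting rows in CSV form. The column names (playerID, yearID, HR) and the sample rows are assumptions made for illustration, so check the headers in the files you actually download.

```python
import csv
import io
from collections import Counter


def home_runs_by_player(csv_file):
    """Sum the HR column per playerID from an open CSV file object."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        # Blank cells are treated as zero.
        totals[row["playerID"]] += int(row["HR"] or 0)
    return totals


# Made-up sample rows standing in for a real batting table.
sample = io.StringIO(
    "playerID,yearID,HR\n"
    "player01,1998,12\n"
    "player02,1998,30\n"
    "player01,1999,\n"
    "player01,2000,8\n"
)
totals = home_runs_by_player(sample)
print(totals.most_common(1))
```

To run it against a downloaded file, pass an open file object in place of the in-memory sample.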
Project Retrosheet was started to assemble play-by-play summaries of all major league games whenever possible, and it is now complete through 1974. If you happen to have access to a scorecard from an earlier game, check the “most wanted” list to see if you can fill in a hole. The Chadwick Baseball Bureau also maintains a GitHub repo for the data, if you prefer.
The Society for American Baseball Research maintains a list of other sources including offerings from commercial entities like FanGraphs, Baseball Reference, and Major League Baseball itself.
If you’re just looking for a particular data set, Google Dataset Search lets you search the entire web for data sets using keywords. The results can be filtered by license, data format, and the time since the last update. Some of the most intriguing data sets are also included in Google’s public data directory, which not only lists the sources but offers some interactive dashboards. The World Bank, for instance, charts fertility versus life expectancy and you can track how this changes over the years with a slider.
Amazon Web Services
AWS users who want data stored in S3 buckets can turn to the Registry of Open Data on AWS, or RODA. There’s wide variety in the thousands of data sets, but the highlights tend to come from sources with which AWS is openly collaborating, like the Space Telescope Science Institute (stars), NOAA (NEXRAD weather radar imagery), and Common Crawl (more than 25 billion web pages). There are several good examples to help you get started analyzing the data using, of course, AWS services like Lambda or Comprehend.
Microsoft also has a number of data sets on Azure. City planners can look for insight in the records from the New York City taxi board, which tracks all fares. Economists and traders can look at price records for commodities for insight on inflation and economic changes. All are ready to be analyzed by Microsoft’s machine learning tools.
Some of what we store on Facebook is private because we make it so. Some is shared with friends. Some content is completely open. Facebook supports research on the so-called “Facebook graph” with its Graph API. It’s not the same as downloading the entire data set, but it can be useful for some queries. Just remember that not everyone uses the same privacy settings, so you might not see every person or every post.
Yelp, the website known for reviews of restaurants, bars, and other public accommodations, shares a great deal of its information in a public data set that you can study. There are more than eight million reviews of more than 200,000 establishments just waiting for you or your AI to parse them. The reviews are a good source of training data for natural language processing and machine learning.
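A minimal sketch of parsing review data like this, assuming the reviews arrive as one JSON object per line with business_id, stars, and text fields. Those field names and the sample records are assumptions for illustration, not a statement of the data set's actual schema.

```python
import json
from collections import defaultdict


def average_stars(lines):
    """Return {business_id: mean star rating} from JSON-lines review records."""
    sums = defaultdict(lambda: [0.0, 0])  # business_id -> [star total, review count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        entry = sums[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {biz: total / count for biz, (total, count) in sums.items()}


# Made-up records standing in for real reviews.
sample_lines = [
    '{"business_id": "b1", "stars": 4, "text": "Great tacos."}',
    '{"business_id": "b1", "stars": 2, "text": "Slow service."}',
    '{"business_id": "b2", "stars": 5, "text": "Best coffee in town."}',
]
averages = average_stars(sample_lines)
print(averages)
```

Because the function reads one line at a time, an open file object can be passed in directly, which avoids loading millions of reviews into memory at once.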
Open Data Kit
Not all data reside in easily accessible databases with APIs. An enormous volume of information is embedded in web pages, and the data need to be pried out of them with some clever tools. This so-called web scraping is still a pretty good method, but it can have legal limitations. Some sites ban it in their terms of service, and others watch for too many requests from one user and then either cut off the user or slow down the responses.
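One way to stay on the polite side of those limits is to check a site's robots.txt rules and to space out requests. The sketch below uses Python's standard urllib.robotparser for the first part and a simple delay for the second. The rules, URLs, and the "demo-bot" user agent are made up for illustration, and robots.txt is separate from a site's terms of service, so this does not replace reading them.

```python
import time
import urllib.robotparser


def allowed_urls(robots_lines, user_agent, urls):
    """Return the URLs that the given robots.txt rules permit for user_agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)  # parse rules already in memory; no network call
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def throttled(items, delay_seconds):
    """Yield items one at a time, pausing between them to space out requests."""
    for index, item in enumerate(items):
        if index:
            time.sleep(delay_seconds)
        yield item


# Hypothetical rules and URLs, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
""".splitlines()

candidates = [
    "https://example.com/stats/2019.html",
    "https://example.com/private/raw.html",
]
permitted = allowed_urls(robots_txt, "demo-bot", candidates)
for url in throttled(permitted, delay_seconds=0.1):
    print("would fetch", url)
```

The loop only prints what it would fetch; a real scraper would issue the request at that point and choose a delay that suits the site.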
Tools like Puppeteer make it simpler to spin up one (or many!) headless instances of a web browser, download a web page, extract the right data, and do it again and again. There are now headless versions of most major browsers, thanks to the software testing community that needs to automate the testing process. Web scraping may not always be appropriate, but when it is, it can be the fastest way to get the data you need. Nothing is more open than the open web.
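The "extract the right data" step can be sketched without a browser at all. The example below is not Puppeteer: it uses Python's standard html.parser on a static HTML string, and the sample page is made up. Pages that build their content with JavaScript would still need a headless browser to render them first.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page content standing in for a downloaded web page.
page = """
<table>
  <tr><th>City</th><th>High (F)</th></tr>
  <tr><td>Boston</td><td>71</td></tr>
  <tr><td>Denver</td><td>84</td></tr>
</table>
"""
parser = TableCellParser()
parser.feed(page)
print(parser.rows)
```

Each inner list is one table row, so the result can be handed to csv.writer or loaded into a data frame for analysis.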