Brief Overall Description of the Dataset:
Yelp is a multinational corporation headquartered in San Francisco, California. It publishes crowd-sourced reviews of local businesses and also offers online reservation and food delivery services. Yelp averaged approximately 142 million monthly unique visitors in Q1 2015, and Yelpers have written over 77 million local reviews.
In addition to an API for developers, Yelp provides data in two forms: the Yelp academic dataset and the dataset challenge. The academic dataset contains information on Business (type, id, name, location, stars, review count, category, open, URL), Review (type, business id, user id, stars, text, date, votes), and User (type, id, name, review count, average stars, votes). This data is currently available for the 250 closest businesses for each of 30 universities, for students and academics to explore and research. The challenge dataset covers 10 select cities internationally and contains 1.6M reviews and 500K tips by 366K users for 61K businesses, along with rich attribute data (such as hours of operation, ambience, parking availability) for these businesses, social network information about the users, and aggregated check-ins over time for all these users.
Link: http://www.yelp.com
Date Inventory Completed: 6/9/2015
Screening
- Is the data collected opinion-based?
- Is the data collection recurring (must be collected at least annually)?
- Is there data available for 2013?
- Is the data collected at the property or housing unit level?
- Can we access the data by August 15th?
Purpose
What is the purpose of the organization collecting the data?
“Yelp was founded in 2004 to help people find great local businesses like dentists, hair stylists and mechanics.”
Why is it collected and how does the organization use it?
Yelp collects the data to provide insight on local businesses and amenities.
Who else uses the data?
Businesses, citizens, researchers
Who do they sell the data to?
No one; the datasets are available to researchers. Businesses have free access but limited ability to change the data.
Method
What is the data collection method?
Online reviews
What is the type of data collected?
Digital
If designed, who created the questions?
What is the raw source of the collected data (prior to any aggregation)?
Location, reviews
Description
What is the general topic of the data (1-2 words)?
Amenities and reviews
What are the earliest and latest dates for which data is available?
Not stated
Is data collected and available periodically?
Yes, continuously
How soon after a reference period ends can a data source be prepared and provided?
Instantaneously
Selectivity
What is the universe (e.g., population) that the data represents?
Amenities (stores, restaurants, businesses) in the US and select international cities, provided that an individual (owner or consumer) has created a page for the site.
Accessibility
How is the data accessed?
API, plus data download for the academic and challenge datasets. (Each file is composed of a single object type, with one JSON object per line.)
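As a rough illustration of the one-object-per-line layout described above, the following Python sketch reads such a file and averages review stars per business. The field names `business_id`, `user_id`, and `stars` are assumptions based on the Review fields listed in the overview, and the sample records are hypothetical, not taken from the actual dataset.

```python
import io
import json
from collections import defaultdict


def read_json_lines(handle):
    """Yield one dict per non-empty line of a one-JSON-object-per-line stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def mean_stars_by_business(reviews):
    """Return {business_id: average stars} for an iterable of review dicts."""
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [sum of stars, count]
    for review in reviews:
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {biz: total / count for biz, (total, count) in totals.items()}


# Hypothetical sample standing in for a downloaded review file.
SAMPLE = io.StringIO(
    '{"type": "review", "business_id": "b1", "user_id": "u1", "stars": 4}\n'
    '{"type": "review", "business_id": "b1", "user_id": "u2", "stars": 2}\n'
    '{"type": "review", "business_id": "b2", "user_id": "u1", "stars": 5}\n'
)

averages = mean_stars_by_business(read_json_lines(SAMPLE))
print(averages)
```

For a real download, the `io.StringIO` sample would be replaced by an opened file handle; reading line by line avoids loading the whole file into memory.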
Is it open data?
Partially
- Any legal, regulatory, or administrative restrictions on accessing the data source?
Yelp's official API is restrictive: it returns only snippets of the three most recent reviews for a business, and Yelp prohibits use of the API for data aggregation and analysis of returned reviews. https://www.yelp.com/developers/documentation/v2/business
The site cannot be scraped.
For the other two datasets, one needs an active Yelp account and access to the Yelp API, and must agree to the dataset access agreement.
Terms of Service: http://www.yelp.com/static?p=tos
Cost? - One-time, annual, or project-based payment?
None
Does this dataset appear to meet our needs for the Census study? No
Explanation:
The areas of interest are not covered, and the terms of service prohibit web scraping or using the API for research.