A look at the data available in Azure Machine Learning

Azure Machine Learning publishes a set of sample datasets (Saved Datasets)
that you can use right away.

Since they are available for experimenting, I put together a list of what is there.
(as of 2014/11/20)

Going through them showed that the number of rows and the number of columns differ greatly from one dataset to another.
I plan to play around with them soon.


The datasets are listed in the following format, one per line. Column names are separated by commas, because some of them contain spaces.

ID | Dataset (name) | Rows (number of records) | Columns (number of fields) | Column names (item1, item2, ...)
1 | Adult Census Income Binary Classification Dataset | 32561 | 15 | age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income
2 | Airport Codes Dataset | 365 | 4 | airport_id, city, state, name
3 | Automobile price data(Raw) | 205 | 26 | symboling, normalized-losses, make, fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, wheel-base, length, width, height, curb-weight, engine-type, num-of-cylinders, engine-size, fuel-system, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg, price
4 | Bike Rental UCI dataset | 17379 | 17 | instant, dteday, season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered, cnt
5 | Bill Gates RGB Image | 25600 | 5 | X, Y, R, G, B
6 | Blood donation data | 748 | 5 | Recency, Frequency, Monetary, Time, Class
7 | Book Reviews from Amazon | 10000 | 2 | Col1, Col2
8 | Breast Cancer data | 683 | 10 | Class, age, menopause, tumor-size, inv-nodes, node-caps, deg-malig, breast, breast-quad, irradiat
9 | Breast Cancer Features | 102294 | 118 | Col1
10 | Breast Cancer Info | 102294 | 12 | Col1
11 | CRM Appetency Labels Shared | 50000 | 1 | Col1
12 | CRM Churn Labels Shared | 50000 | 1 | Col1
13 | CRM Dataset Shared | 50000 | 230 | Var1
14 | CRM Upselling Labels Shared | 50000 | 1 | Col1
15 | Energy Efficiency Regression data | 768 | 10 | Relative Compactness, Surface Area, Wall Area, Roof Area, Overall Height, Orientation, Glazing Area, Glazing Area Distribution, Heating Load, Cooling Load
16 | Flight Delays Data | 2719418 | 14 | Year, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepDelay, DepDel15, CRSArrTime, ArrDelay, ArrDel15, Cancelled
17 | Flight on-time performance(Raw) | 504397 | 18 | Year, Quarter, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepTimeBlk, DepDelay, DepDel15, CRSArrTime, ArrTimeBlk, ArrDelay, ArrDel15, Cancelled, Diverted
18 | Forest fires data | 517 | 13 | X, Y, month, day, FFMC, DMC, DC, ISI, temp, RH, wind, rain, area
19 | German Credit Card UCI dataset | 1000 | 21 | Col1
20 | IMDB Movie Titles | 16614 | 2 | MovieID, MovieName
21 | Iris Two Class Data | 100 | 5 | Class, sepal-length, sepal-width, petal-length, petal-width
22 | Movie Ratings | 227472 | 4 | UserID, MovieID, Rating, Timestamp
23 | Movie Tweets | 170285 | 8 | Scraping Time, Tweet ID, User ID, Movie ID, Rating, Retweet Count, Favorite Count, Time Zone
24 | MPG data for various automobiles | 392 | 9 | MPG, Cyl, Displacement, Horsepower, Weight, Acceleration, Year, CountryCode, Model
25 | Named Entity Recognition Sample Articles | 2 | 1 | Col1
26 | Pima Indians Diabetes Binary Classification dataset | 768 | 9 | Number of times pregnant, Plasma glucose concentration a 2 hours in an oral glucose tolerance test, Diastolic blood pressure(mmHg), Triceps skin fold thickness(mm), 2-Hour serum insulin(muU/ml), Body mass index(weight in kg/(height in m)^2), Diabetes pedigree function, Age(years), Class variable(0 or 1)
27 | Restaurant customer data | 138 | 19 | userID, latitude, longitude, smoker, drink_level, dress_preference, ambience, transport, marital_status, hijos, birth_year, interest, personality, religion, activity, color, weight, budget, height
28 | Restaurant feature data | 130 | 21 | placeID, latitude, longitude, the_geom_meter, name, address, city, state, country, fax, zip, alcohol, smoking_area, dress_code, accessibility, price, url, Rambience, franchise, area, other_services
29 | Restaurant ratings | 1161 | 3 | userID, placeID, rating
30 | Sample Named Entity Recognition Articles | - | - | (cannot visualize)
31 | Steel Annealing multi-class dataset | 798 | 39 | family, product-type, steel, carbon, hardness, temper_rolling, condition, formability, strength, non-ageing, surface-finish, surface-quality, enamelability, bc, bf, bt, bw/me, bl, m, chrom, phos, cbond, marvi, exptl, ferro, corr, blue/bright/varn/clean, lustre, jurofm, s, p, shape, thick, width, len, oil, bore, packing, classes
32 | Telescope data | 19020 | 11 | fLength, fWidth, fSize, fConc, fConcl, fAsym, fM3Long, fM3Trans, fAlpha, fDist, Class
33 | Time series Dataset | 126 | 2 | time, N1725
34 | Weather Dataset | 406516 | 26 | AirportID, Year, Month, Day, Time, TimeZone, SkyCondition, Visibility, WeatherType, DryBulbFarenheit, DryBulbCelsius, WetBulbFarenheit, WetBulbCelsius, DewPointFarenheit, DewPointCelsius, RelativeHumidity, WindSpeed, WindDirection, ValueForWindCharacter, StationPressure, PressureTendency, PressureChange, SeaLevelPressure, RecodeType, HourlyPrecip, Altimeter
35 | Wikipedia SP 500 Dataset | 466 | 3 | Title, Category, Text
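
To make the spread in size a bit more concrete, here is a small Python sketch that takes a handful of the row and column counts from the list above and picks out the extremes. The `datasets` list and the `summarize` helper are my own names for illustration, and only six entries are copied in by hand, so treat it as a rough check and not a full analysis.

```python
# (name, rows, columns) copied by hand from six entries in the list above.
datasets = [
    ("Adult Census Income Binary Classification Dataset", 32561, 15),
    ("CRM Dataset Shared", 50000, 230),
    ("Flight Delays Data", 2719418, 14),
    ("Iris Two Class Data", 100, 5),
    ("Named Entity Recognition Sample Articles", 2, 1),
    ("Weather Dataset", 406516, 26),
]


def summarize(entries):
    """Return the entries with the fewest/most rows and fewest/most columns."""
    return {
        "fewest_rows": min(entries, key=lambda e: e[1]),
        "most_rows": max(entries, key=lambda e: e[1]),
        "fewest_columns": min(entries, key=lambda e: e[2]),
        "most_columns": max(entries, key=lambda e: e[2]),
    }


summary = summarize(datasets)
for label, (name, rows, cols) in summary.items():
    print(f"{label}: {name} ({rows} rows, {cols} columns)")
```

Even within this small sample, the row counts run from 2 to 2719418 and the column counts from 1 to 230.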


I couldn't figure out a good way to format the table...