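Azure Machine Learning publishes a set of sample datasets (Saved Datasets).
They are available to try out, so I put together a list of what is there.
(as of 2014/11/20)
Going through them showed that the number of rows and the number of columns differ enormously from one dataset to another.
I plan to play around with these soon.
They are listed below in the following format.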
ID | Dataset | Rows | Columns | Column names
---|---|---|---|---
1 | Adult Census Income Binary Classification Dataset | 32561 | 15 | age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income
2 | Airport Codes Dataset | 365 | 4 | airport_id, city, state, name
3 | Automobile price data(Raw) | 205 | 26 | symboling, normalized-losses, make, fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, wheel-base, length, width, height, curb-weight, engine-type, num-of-cylinders, engine-size, fuel-system, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg, price
4 | Bike Rental UCI dataset | 17379 | 17 | instant, dteday, season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, register, cnt
5 | Bill Gates RGB Image | 25600 | 5 | X, Y, R, G, B
6 | Blood donation data | 748 | 5 | Recency, Frequency, Monetary, Time, Class
7 | Book Reviews from Amazon | 10000 | 2 | Col1, Col2
8 | Breast Cancer data | 683 | 10 | Class, age, menopause, tumor-size, inv-nodes, node-caps, deg-malig, breast, breast-quad, irradiat
9 | Breast Cancer Features | 102294 | 118 | Col1, …
10 | Breast Cancer Info | 102294 | 12 | Col1, …
11 | CRM Appetency Labels Shared | 50000 | 1 | Col1
12 | CRM Churn Labels Shared | 50000 | 1 | Col1
13 | CRM Dataset Shared | 50000 | 230 | Var1, …
14 | CRM Upselling Labels Shared | 50000 | 1 | Col1
15 | Energy Efficiency Regression data | 768 | 10 | Relative Compactness, Surface Area, Wall Area, Roof Area, Overall Height, Orientation, Glazing Area, Glazing Area Distribution, Heating Load, Cooling Load
16 | Flight Delays Data | 2719418 | 14 | Year, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepDelay, DepDel15, CRSArrTime, ArrDelay, ArrDel15, Cancelled
17 | Flight on-time performance(Raw) | 504397 | 18 | Year, Quarter, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepTimeBlk, DepDelay, DepDel15, CRSArrTime, ArrTimeBlk, ArrDelay, ArrDel15, Cancelled, Diverted
18 | Forest fires data | 517 | 13 | X, Y, month, day, FFMC, DMC, DC, ISI, temp, RH, wind, rain, area
19 | German Credit Card UCI dataset | 1000 | 21 | Col1, …
20 | IMDB Movie Titles | 16614 | 2 | MovieID, MovieName
21 | Iris Two Class Data | 100 | 5 | Class, sepal-length, sepal-width, petal-length, petal-width
22 | Movie Ratings | 227472 | 4 | UserID, MovieID, Rating, Timestamp
23 | Movie Tweets | 170285 | 8 | Scraping Time, Tweet ID, User ID, Movie ID, Rating, Retweet Count, Favorite Count, Time Zone
24 | MPG data for various automobiles | 392 | 9 | MPG, Cyl, Displacement, Horsepower, Weight, Acceleration, Year, CountryCode, Model
25 | Named Entity Recognition Sample Articles | 2 | 1 | Col1
26 | Pima Indians Diabetes Binary Classification dataset | 768 | 9 | Number of times pregnant, Plasma glucose concentration a 2 hours in an oral glucose tolerance test, Diastolic blood pressure(mmHg), Triceps skin fold thickness(mm), 2-Hour serum insulin(muU/ml), Body mass index(weight in kg/(height in m)^2), Diabetes pedigree function, Age(years), Class variable(0 or 1)
27 | Restaurant customer data | 138 | 19 | userID, latitude, longitude, smoker, drink_level, dress_preference, ambience, transport, marital_status, hijos, birth_year, interest, personality, religion, activity, color, weight, budget, height
28 | Restaurant feature data | 130 | 21 | placeID, latitude, longitude, the_geom_meter, name, address, city, state, country, fax, zip, alcohol, smoking_area, dress_code, accessibility, price, url, Rambience, franchise, area, other_services
29 | Restaurant ratings | 1161 | 3 | userID, placeID, rating
30 | Sample Named Entity Recognition Articles | - | - | (cannot be visualized)
31 | Steel Annealing multi-class dataset | 798 | 39 | family, product-type, steel, carbon, hardness, temper_rolling, condition, formability, strength, non-ageing, surface-finish, surface-quality, enamelability, bc, bf, bt, bw/me, bl, m, chrom, phos, cbond, marvi, exptl, ferro, corr, blue/bright/varn/clean, lustre, jurofm, s, p, shape, thick, width, len, oil, bore, packing, classes
32 | Telescope data | 19020 | 11 | fLength, fWidth, fSize, fConc, fConcl, fAsym, fM3Long, fM3Trans, fAlpha, fDist, Class
33 | Time series Dataset | 126 | 2 | time, N1725
34 | Weather Dataset | 406516 | 26 | AirportID, Year, Month, Day, Time, TimeZone, SkyCondition, Visibility, WeatherType, DryBulbFarenheit, DryBulbCelsius, WetBulbFarenheit, WetBulbCelsius, DewPointFarenheit, DewPointCelsius, RelativeHumidity, WindSpeed, WindDirection, ValueForWindCharacter, StationPressure, PressureTendency, PressureChange, SeaLevelPressure, RecodeType, HourlyPrecip, Altimeter
35 | Wikipedia SP 500 Dataset | 466 | 3 | Title, Category, Text
I still haven't figured out a good way to write tables…
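
To put a number on how much the datasets differ in size, here is a minimal Python sketch. It uses a hand-picked subset of the rows from the table (name, row count, column count), not the full list, and the helper name `size_range` is my own, not anything from Azure Machine Learning.

```python
# (dataset name, rows, columns) copied from a few rows of the table above.
# This is a hand-picked subset for illustration, not the full list.
datasets = [
    ("Adult Census Income Binary Classification Dataset", 32561, 15),
    ("Airport Codes Dataset", 365, 4),
    ("CRM Dataset Shared", 50000, 230),
    ("Flight Delays Data", 2719418, 14),
    ("Iris Two Class Data", 100, 5),
    ("Named Entity Recognition Sample Articles", 2, 1),
    ("Weather Dataset", 406516, 26),
]


def size_range(records, index):
    """Return the (smallest, largest) records ordered by the field at `index`.

    `size_range` is a hypothetical helper written for this post.
    """
    ordered = sorted(records, key=lambda record: record[index])
    return ordered[0], ordered[-1]


fewest_rows, most_rows = size_range(datasets, 1)
fewest_cols, most_cols = size_range(datasets, 2)

print(f"rows:    {fewest_rows[1]} ({fewest_rows[0]}) to {most_rows[1]} ({most_rows[0]})")
print(f"columns: {fewest_cols[2]} ({fewest_cols[0]}) to {most_cols[2]} ({most_cols[0]})")
```

Even within this subset, the row counts run from 2 to 2719418 and the column counts from 1 to 230, so it is worth checking the size of a dataset before dropping it into an experiment.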