30/3/2019 — Introduction to WEKA: The Waikato Environment for Knowledge Analysis
The hype around all this AI stuff is getting severely overblown. But that being said, predictive algorithms will be an important tool for scientists in every field in the years to come.
Astronomy is no different.
Neural Networks have been used to classify pulsar candidates in this paper:
- V. Morello, E. D. Barr, M. Bailes, C. M. Flynn, E. F. Keane: “SPINN: a straightforward machine learning solution to the pulsar candidate selection problem”, 2014; arXiv:1406.3627. DOI: 10.1093/mnras/stu1188.
Astronomy datasets are notorious for putting the BIG in Big Data. Not only are the objects being studied separated from Earth by vast distances, but the signals they produce are also magnificently intense.
The paper states that as the data mounts up in the future, it will no longer be feasible to vet it with the typical 'human cogitator' (i.e. grad student) approach. That's where data mining and machine learning come in.
Twinkle twinkle little pulsar, but is that what you really are?
So I guess it is important for yours truly to get to grips with it. That's where WEKA comes in.
WEKA is an environment that can be used to analyze datasets using prebuilt data mining algorithms.
We can test the same dataset with different data mining algorithms to compare their performance. This is measured by the percentage of correctly classified instances and the time taken to build the model.
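WEKA reports both numbers in its classifier output panel, but the idea is simple enough to sketch outside WEKA. Below is a minimal, hypothetical harness in plain Python (the toy majority-class "classifier" and the dataset are made up here for illustration, not part of WEKA):

```python
import time

def evaluate(train_fn, predict_fn, train_set, test_set):
    """Return (percent_correct, build_time_s) for a classifier.

    train_fn builds a model from labelled instances;
    predict_fn maps (model, features) -> predicted class label.
    """
    start = time.perf_counter()
    model = train_fn(train_set)
    build_time = time.perf_counter() - start   # "time to construct" column
    correct = sum(1 for features, label in test_set
                  if predict_fn(model, features) == label)
    percent_correct = 100.0 * correct / len(test_set)   # "percent correct" column
    return percent_correct, build_time

# Toy baseline "classifier": always predict the training set's majority class.
def train_majority(train_set):
    labels = [label for _, label in train_set]
    return max(set(labels), key=labels.count)

def predict_majority(model, features):
    return model

data = [((1, 0), "yes"), ((0, 1), "no"), ((1, 1), "yes"), ((0, 0), "yes")]
pct, secs = evaluate(train_majority, predict_majority, data, data)
print(f"{pct:.4f}% correct, built in {secs:.4f}s")  # 75.0000% on this toy split
```

Note that WEKA's reported accuracies below come from its own evaluation (e.g. cross-validation on the loaded dataset), not from this toy train-equals-test setup.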
The following is a list of the classifiers / data mining algorithms / learners used to analyze the attributes/features and assign each instance to a class.
Data mining’s eclectic nature fostered this inconsistency in naming — the field encompasses contributions from statistics, artificial intelligence (machine learning), and database management; each field has chosen different names for the same concept.
- One-R (One Rule)
- Naive Bayes (probabilistic classifier with a strong independence assumption)
- IBK (k-nearest neighbours)
- J48 (unpruned C4.5 decision tree)
- Random Forest (random decision forests)
- MLP (multi-layer perceptron)
- SMO (support vector machine)
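One-R is the simplest learner in the list, so it makes a nice illustration of what a "rule" means here: for each attribute, map every observed value to its most frequent class, then keep the single attribute whose rule makes the fewest training errors. A from-scratch sketch in Python (the four "weather" instances are invented here for illustration, not the actual weather-nominal data):

```python
from collections import Counter, defaultdict

def one_r(instances):
    """One-R: pick the one attribute whose value -> majority-class rule
    makes the fewest errors on the training data.

    instances: list of (attribute_tuple, class_label).
    Returns (best_attribute_index, rule_dict).
    """
    n_attrs = len(instances[0][0])
    best = None  # (errors, attribute_index, rule)
    for a in range(n_attrs):
        # Count class frequencies for each value of attribute a.
        counts = defaultdict(Counter)
        for attrs, label in instances:
            counts[attrs[a]][label] += 1
        # Rule: each attribute value predicts its most frequent class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(1 for attrs, label in instances
                     if rule[attrs[a]] != label)
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best[1], best[2]

# Four made-up instances in the spirit of the weather-nominal dataset:
weather = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
]
attr, rule = one_r(weather)
print(attr, rule)  # attribute 0 (outlook) wins with zero training errors
```

Despite its simplicity, One-R is a useful baseline — as the table below shows, anything that can't beat it isn't learning much.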
| Data-set | Classifier | Percent correct (%) | Time to construct (s) |
|---|---|---|---|
| Weather-nominal | One-R | 42.8571 | 0.00 |
| | Naïve Bayes | 57.1429 | 0.00 |
| | IBK | 57.1429 | 0.00 |
| | J48 | 50.000 | 0.00 |
| | Random Forest | 71.4286 | 0.02 |
| | MLP | 71.4386 | 0.03 |
| | SMO | 64.2857 | 0.01 |
| Iris | One-R | 92.0000 | 0.01 |
| | Naïve Bayes | 96.0000 | 0.01 |
| | IBK | 95.3333 | 0.00 |
| | J48 | 96.0000 | 0.00 |
| | Random Forest | 95.3333 | 0.06 |
| | MLP | 97.3333 | 0.14 |
| | SMO | 96.0000 | 0.04 |
| Pima Diabetes | One-R | 65.1042 | 0.00 |
| | Naïve Bayes | 76.3021 | 0.005 |
| | IBK | 70.1823 | 0.00 |
| | J48 (with missing values) | 73.8281 | 0.02 |
| | J48 (remove corrupt instances) | 74.6228 | 0.01 |
| | J48 (padded corrupt instances) | 74.0885 | 0.01 |
| | Random Forest | 75.7813 | 0.18 |
| | MLP | 75.3906 | 0.40 |
| | SMO | 77.3438 | 0.03 |
| Soybean | One-R | 39.9707 | 0.01 |
| | Naïve Bayes | 92.9722 | 0.01 |
| | IBK | 91.215 | 0.00 |
| | J48 | 91.5081 | 0.04 |
| | Random Forest | 92.9722 | 0.19 |
| | MLP | 93.4114 | 20.00 |
| | SMO | 93.8507 | 0.49 |
Repeating the model building resulted in faster build times but no change in accuracy.
Support Vector Machines, it turns out, are not bad at all.