Saturday, 30 March 2019

Knowledge Discovery with WEKA

Introduction to WEKA, the Waikato Environment for Knowledge Analysis.

The hype for all this AI stuff is getting severely overblown. But that being said, predictive algorithms will be an important tool for scientists in every field in the years to come.

Astronomy is no different.

Neural Networks have been used to classify pulsar candidates in this paper:

  • V. Morello, E. D. Barr, M. Bailes, C. M. Flynn, E. F. Keane: “SPINN: a straightforward machine learning solution to the pulsar candidate selection problem”, 2014; arXiv:1406.3627. DOI: 10.1093/mnras/stu1188.

Astronomy datasets are notorious for putting the BIG in BIG Data. Not only are the objects being studied separated from Earth by vast distances, but the signals they produce are also magnificently intense.

The paper states that as the data mounts up in the future, it will no longer be feasible to vet it using the typical 'human cogitator' (i.e. grad student) approach. That's where data mining and machine learning come in.

Twinkle Twinkle little Pulsar, But is that what you really are?


So I guess it is important for yours truly to get to grips with it. That's where WEKA comes in.

WEKA is an environment that can be used to analyze datasets using prebuilt data mining algorithms.

We can test the same dataset with different data mining algorithms to measure their performance. This is done by recording the percentage of correctly classified instances and the time taken to build each model.
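WEKA handles this bookkeeping for you, but the idea can be sketched in plain Python: time the model build, then score predictions on held-out data. Here WEKA's trivial ZeroR baseline (always predict the majority class) stands in for a real learner, and the toy labels are made up for illustration.

```python
import time
from collections import Counter

def train_zero_r(labels):
    """ZeroR baseline: always predict the most common class in the training data."""
    return Counter(labels).most_common(1)[0][0]

# Made-up nominal class labels; ZeroR ignores the attributes entirely.
train_labels = ["yes", "yes", "no", "yes", "no", "yes", "no", "yes", "yes"]
test_labels  = ["yes", "no", "yes", "yes", "no"]

start = time.perf_counter()
majority = train_zero_r(train_labels)          # "build" the model
build_time = time.perf_counter() - start       # time to build model (s)

correct = sum(1 for actual in test_labels if majority == actual)
percent_correct = 100.0 * correct / len(test_labels)

print(f"Percent correct: {percent_correct:.4f}%")
print(f"Time to build model: {build_time:.4f} s")
```

Every figure in the results table below was obtained the same way, just with WEKA's far less trivial learners doing the predicting.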

The following are the classifiers (also called data mining algorithms or learners) that we use to analyze the attributes/features and assign each instance to a class.

Data mining's eclectic nature fostered this inconsistency in naming: the field encompasses contributions from statistics, artificial intelligence (machine learning), and database management, and each has chosen different names for the same concept.

One-R (One Rule)
Naive Bayes (probabilistic Classifier with strong independence assumption)
IBK (K Nearest Neighbour)
J48 (unpruned C4.5 decision tree)
Random Forest (Random Decision Forests)
MLP (Multi Layer Perceptron)
SMO (Support Vector Machines)
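One-R is simple enough to sketch from scratch. The version below is my own plain-Python sketch, not WEKA's code: for each attribute it builds a one-level rule mapping every attribute value to its majority class, then keeps the attribute whose rule makes the fewest errors on the training data.

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """One-R: pick the single attribute whose one-level rule is most accurate.

    instances: list of attribute tuples; labels: parallel list of class labels.
    Returns (best_attribute_index, {attribute_value: predicted_class}).
    """
    n_attrs = len(instances[0])
    best = None  # (errors, attribute_index, rule)
    for a in range(n_attrs):
        # Count class frequencies for each value of attribute a.
        counts = defaultdict(Counter)
        for x, y in zip(instances, labels):
            counts[x[a]][y] += 1
        # The rule: each attribute value predicts its majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(1 for x, y in zip(instances, labels) if rule[x[a]] != y)
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best[1], best[2]

# Tiny made-up nominal dataset: (outlook, windy) -> play?
X = [("sunny", "false"), ("sunny", "true"), ("rainy", "false"),
     ("rainy", "true"), ("overcast", "false"), ("overcast", "true")]
y = ["no", "no", "yes", "no", "yes", "yes"]

attr, rule = one_r(X, y)
print(f"Best attribute index: {attr}, rule: {rule}")
```

Despite (or because of) this simplicity, One-R is a useful baseline: if a fancy learner can't beat one rule on one attribute, the fancy learner isn't earning its keep.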

Data-set          Classifier                       Percent correct (%)   Time to construct (s)
Weather-nominal   One-R                            42.8571               0.00
                  Naïve Bayes                      57.1429               0.00
                  IBK                              57.1429               0.00
                  J48                              50.0000               0.00
                  Random Forest                    71.4286               0.02
                  MLP                              71.4286               0.03
                  SMO                              64.2857               0.01

Iris              One-R                            92.0000               0.01
                  Naïve Bayes                      96.0000               0.01
                  IBK                              95.3333               0.00
                  J48                              96.0000               0.00
                  Random Forest                    95.3333               0.06
                  MLP                              97.3333               0.14
                  SMO                              96.0000               0.04

Pima Diabetes     One-R                            65.1042               0.00
                  Naïve Bayes                      76.3021               0.005
                  IBK                              70.1823               0.00
                  J48 (with missing values)        73.8281               0.02
                  J48 (Remove Corrupt Instances)   74.6228               0.01
                  J48 (Padded Corrupt Instances)   74.0885               0.01
                  Random Forest                    75.7813               0.18
                  MLP                              75.3906               0.40
                  SMO                              77.3438               0.03

Soybean           One-R                            39.9707               0.01
                  Naïve Bayes                      92.9722               0.01
                  IBK                              91.2150               0.00
                  J48                              91.5081               0.04
                  Random Forest                    92.9722               0.19
                  MLP                              93.4114               20.00
                  SMO                              93.8507               0.49

Repeating the model building resulted in faster build times but no change in accuracy.

Support Vector Machines, it turns out, can be not bad.
