Saturday, 30 March 2019

Knowledge Discovery with WEKA

Introduction to WEKA, the Waikato Environment for Knowledge Analysis.

The hype for all this AI stuff is getting severely overblown. But that being said, predictive algorithms will be an important tool for scientists in every field in the years to come.

Astronomy is no different.

Neural Networks have been used to classify pulsar candidates in this paper:

  • V. Morello, E. D. Barr, M. Bailes, C. M. Flynn, E. F. Keane: “SPINN: a straightforward machine learning solution to the pulsar candidate selection problem”, 2014; arXiv:1406.3627. DOI: 10.1093/mnras/stu1188.

Astronomy datasets are notorious for putting the BIG in BIG Data. Not only are the objects being studied separated from Earth by vast distances, but the signals they produce are also magnificently intense.

The paper states that as the data mounts up in the future, it will no longer be feasible to vet it using the typical 'human cogitator' (i.e. grad student) approach. That's where data mining and machine learning come in.

Twinkle Twinkle little Pulsar, But is that what you really are?


So I guess it is important for yours truly to get to grips with it. That's where WEKA comes in.

WEKA is an environment that can be used to analyze datasets using prebuilt data mining algorithms.

We can test the same dataset with different data mining algorithms to measure their performance. This is done by recording the percentage of correctly classified instances and the time taken to build each model.
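WEKA handles this bookkeeping for you, but the idea can be sketched in plain Python: time the model build, then score predictions on held-out data. Here WEKA's trivial ZeroR baseline (always predict the majority class) stands in for a real learner, and the toy labels are made up for illustration.

```python
import time
from collections import Counter

def train_zero_r(labels):
    """ZeroR baseline: always predict the most common class in the training data."""
    return Counter(labels).most_common(1)[0][0]

# Made-up nominal class labels; ZeroR ignores the attributes entirely.
train_labels = ["yes", "yes", "no", "yes", "no", "yes", "no", "yes", "yes"]
test_labels  = ["yes", "no", "yes", "yes", "no"]

start = time.perf_counter()
majority = train_zero_r(train_labels)          # "build" the model
build_time = time.perf_counter() - start       # time to build model (s)

correct = sum(1 for actual in test_labels if majority == actual)
percent_correct = 100.0 * correct / len(test_labels)

print(f"Percent correct: {percent_correct:.4f}%")
print(f"Time to build model: {build_time:.4f} s")
```

Every figure in the results table below was obtained the same way, just with WEKA's far less trivial learners doing the predicting.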

The following are the classifiers (also called data mining algorithms or learners) that we use to analyze the attributes/features and assign each instance to a class.

Data mining's eclectic nature fostered this inconsistency in naming: the field encompasses contributions from statistics, artificial intelligence (machine learning), and database management, and each has chosen different names for the same concept.

One-R (One Rule)
Naive Bayes (probabilistic Classifier with strong independence assumption)
IBK (K Nearest Neighbour)
J48 (unpruned C4.5 decision tree)
Random Forest (Random Decision Forests)
MLP (Multi Layer Perceptron)
SMO (Support Vector Machines)
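One-R is simple enough to sketch from scratch. The version below is my own plain-Python sketch, not WEKA's code: for each attribute it builds a one-level rule mapping every attribute value to its majority class, then keeps the attribute whose rule makes the fewest errors on the training data.

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """One-R: pick the single attribute whose one-level rule is most accurate.

    instances: list of attribute tuples; labels: parallel list of class labels.
    Returns (best_attribute_index, {attribute_value: predicted_class}).
    """
    n_attrs = len(instances[0])
    best = None  # (errors, attribute_index, rule)
    for a in range(n_attrs):
        # Count class frequencies for each value of attribute a.
        counts = defaultdict(Counter)
        for x, y in zip(instances, labels):
            counts[x[a]][y] += 1
        # The rule: each attribute value predicts its majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(1 for x, y in zip(instances, labels) if rule[x[a]] != y)
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best[1], best[2]

# Tiny made-up nominal dataset: (outlook, windy) -> play?
X = [("sunny", "false"), ("sunny", "true"), ("rainy", "false"),
     ("rainy", "true"), ("overcast", "false"), ("overcast", "true")]
y = ["no", "no", "yes", "no", "yes", "yes"]

attr, rule = one_r(X, y)
print(f"Best attribute index: {attr}, rule: {rule}")
```

Despite (or because of) this simplicity, One-R is a useful baseline: if a fancy learner can't beat one rule on one attribute, the fancy learner isn't earning its keep.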

Data-set          Classifier                       Percent correct (%)   Time to construct (s)
Weather-nominal   One-R                            42.8571               0.00
                  Naïve Bayes                      57.1429               0.00
                  IBK                              57.1429               0.00
                  J48                              50.0000               0.00
                  Random Forest                    71.4286               0.02
                  MLP                              71.4286               0.03
                  SMO                              64.2857               0.01

Iris              One-R                            92.0000               0.01
                  Naïve Bayes                      96.0000               0.01
                  IBK                              95.3333               0.00
                  J48                              96.0000               0.00
                  Random Forest                    95.3333               0.06
                  MLP                              97.3333               0.14
                  SMO                              96.0000               0.04

Pima Diabetes     One-R                            65.1042               0.00
                  Naïve Bayes                      76.3021               0.005
                  IBK                              70.1823               0.00
                  J48 (with missing values)        73.8281               0.02
                  J48 (Remove Corrupt Instances)   74.6228               0.01
                  J48 (Padded Corrupt Instances)   74.0885               0.01
                  Random Forest                    75.7813               0.18
                  MLP                              75.3906               0.40
                  SMO                              77.3438               0.03

Soybean           One-R                            39.9707               0.01
                  Naïve Bayes                      92.9722               0.01
                  IBK                              91.2150               0.00
                  J48                              91.5081               0.04
                  Random Forest                    92.9722               0.19
                  MLP                              93.4114               20.00
                  SMO                              93.8507               0.49

Repeating the model building resulted in faster build times but no change in accuracy.

Support Vector Machines, it turns out, can be not bad.
