
Pima Indians Diabetes Database

The Pima Indians diabetes database, donated by Vincent Sigillito, is a collection of medical diagnostic reports for 768 examples from a population living near Phoenix, Arizona, USA. The paper dealing with this database [16] uses ADAP, an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. Its authors used 576 training instances and obtained a classification accuracy of 76% on the remaining 192 instances. Each example consists of 8 attribute values and one of two possible outcomes: whether the patient tested positive for diabetes (indicated by output one) or not (indicated by output two). The database now available in the repository has 512 examples in the training set and 256 examples in the test set. The attributes of these examples are:

Attribute                                            Type
---------------------------------------------------  ----------
Number of times pregnant                             continuous
Plasma glucose concentration                         continuous
Diastolic blood pressure (mm Hg)                     continuous
Triceps skin fold thickness (mm)                     continuous
2-hour serum insulin (mu U/ml)                       continuous
Body mass index (weight in kg / (height in m)^2)     continuous
Diabetes pedigree function                           continuous
Age (years)                                          continuous

We use this dataset to illustrate the effect of the topology (that is, the number of bins per attribute) on the generalization ability of the proposed network. With twelve-fold cross-validation and special pre-processing, the test result reported for this dataset is 77.7% using the LogDisc algorithm. Table II summarizes the generalization obtained with our network on the same dataset without any pre-processing. After the serial number, the table lists the number of bins used for each attribute, followed by the classification accuracy on the training and test sets.

Sl. No.  No. of bins for each attribute  Training data  Test data
-------  ------------------------------  -------------  ---------
1        5-5-5-5-14-5-5-5                82.42 %        72.66 %
2        5-5-5-5-5-30-5-5                82.42 %        72.66 %
3        5-5-5-5-14-30-5-5               84.57 %        75.00 %
4        8-5-5-5-14-30-5-5               83.79 %        76.17 %
5        8-5-5-5-14-30-5-6               84.77 %        76.95 %
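To make the topology notation concrete, the following is a minimal sketch of how a string such as 8-5-5-5-14-30-5-6 could be turned into per-attribute bin counts and used to discretize one example. The text does not say how bin boundaries are chosen, so equal-width bins over a given value range are an assumption here, and the helper names (parse_topology, bin_index, discretize) are hypothetical.

```python
from typing import List, Sequence, Tuple


def parse_topology(topology: str) -> List[int]:
    """Turn a string such as '8-5-5-5-14-30-5-6' into per-attribute bin counts."""
    return [int(part) for part in topology.split("-") if part.strip()]


def bin_index(value: float, low: float, high: float, n_bins: int) -> int:
    """Equal-width bin index in [0, n_bins - 1] (assumed scheme, not from the paper).

    Values outside [low, high] are clamped to the first or last bin.
    """
    if high <= low:
        return 0  # degenerate range: everything falls into a single bin
    position = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(position * n_bins)))


def discretize(
    sample: Sequence[float],
    ranges: Sequence[Tuple[float, float]],
    bins: Sequence[int],
) -> List[int]:
    """Map each continuous attribute value of one example to its bin index."""
    return [
        bin_index(value, low, high, n_bins)
        for value, (low, high), n_bins in zip(sample, ranges, bins)
    ]


if __name__ == "__main__":
    bins = parse_topology("8-5-5-5-14-30-5-6")
    # Illustrative value ranges and sample only; these are not taken from the data.
    ranges = [(0, 20), (0, 200), (0, 130), (0, 100), (0, 900), (0, 70), (0, 2.5), (20, 85)]
    sample = [6, 148, 72, 35, 0, 33.6, 0.627, 50]
    print(discretize(sample, ranges, bins))
```

With this reading, increasing the bin count of one attribute only refines the resolution at which that attribute is seen by the network; the other attributes are unaffected.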

As can be seen, the best topology among those tried is (8-5-5-5-14-30-5-6), giving a classification accuracy of 76.95% on the test set. It may also be noted that the optimization of the bin numbers appears to be additive. When attributes five and six use 14 and 30 bins together, the resulting test accuracy is 75.00%, about 2.3 percentage points above the 72.66% obtained when only one of them is changed. This suggests that the bin number of each attribute can be tuned separately, so the optimization of the network topology may be automated and run in parallel to build the best possible network for a task. Since naive Bayesian networks also support parallel computation over attribute values, this network is well suited to parallel architectures and high throughput. Future research shall explore the possibility of implementing the network on a Parallel Virtual Machine (PVM).
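The idea of tuning each attribute separately and in parallel can be sketched as follows. This is only an illustration under stated assumptions: it uses a thread pool rather than PVM, the function names (tune_attribute, tune_topology) are hypothetical, and toy_evaluate is a made-up stand-in for training the network and measuring its accuracy. The per-attribute search is only valid to the extent that the additive behaviour noted above actually holds.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Evaluator = Callable[[Sequence[int]], float]


def tune_attribute(
    index: int,
    baseline: Sequence[int],
    candidates: Sequence[int],
    evaluate: Evaluator,
) -> int:
    """Pick the best bin count for one attribute, holding the others at baseline."""

    def score(n_bins: int) -> float:
        trial = list(baseline)
        trial[index] = n_bins
        return evaluate(trial)

    return max(candidates, key=score)


def tune_topology(
    baseline: Sequence[int],
    candidates: Sequence[int],
    evaluate: Evaluator,
) -> List[int]:
    """Tune every attribute independently, one task per attribute."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(tune_attribute, i, baseline, candidates, evaluate)
            for i in range(len(baseline))
        ]
        return [future.result() for future in futures]


def toy_evaluate(bins: Sequence[int]) -> float:
    """Hypothetical, separable score used only to exercise the search.

    A real evaluator would train the network with these bin counts and
    return its accuracy on held-out data.
    """
    target = [8, 5, 5, 5, 14, 30, 5, 6]
    return -float(sum(abs(b - t) for b, t in zip(bins, target)))


if __name__ == "__main__":
    baseline = [5] * 8
    candidates = [5, 6, 8, 14, 30]
    print(tune_topology(baseline, candidates, toy_evaluate))
```

Because each task changes only one attribute, the tasks share no state and could equally be distributed across separate machines; the cost is one evaluation per attribute and candidate value instead of one per combination.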

Ninan Sajeeth Philip 2007-05-28