Data Mining
CHAPTER 01:
INTRODUCTION
Hepatitis caused by the hepatitis C virus (HCV) has become one of the main problems associated with emerging infections. HCV has high genetic variability, with six main types and a growing number of viral subtypes, each associated with a different response to treatment and a different natural course of the disease. The discovery of HCV allowed the development of diagnostic tests based on antibodies directed against recombinant viral peptides. These tests use immunoenzymatic techniques and are widely used to detect, and sometimes quantify, specific antibodies in body fluids. The window period between infection with HCV and the detection of specific antibodies varies from patient to patient. With current assays, seroconversion occurs on average seven to eight weeks after infection. Anti-HCV antibodies can persist throughout life, or decrease and gradually disappear after many years; they persist indefinitely in patients who develop chronic infection. Delayed seroconversion, or absence of seroconversion, may occur in immunosuppressed patients, in whom infection is confirmed by the persistence of HCV RNA (Yeh et al., 2011, pp. 447-448). The main tests are the enzyme immunoassay and the immunoblot. Enzyme immunoassay (serological diagnostic test): most laboratory tests use enzyme-linked immunosorbent assays (EIA), commonly referred to as ELISA. It is an inexpensive test with relatively good sensitivity and specificity.
Screening for HCV infection typically begins with the measurement of serum or plasma antibodies directed against viral proteins, or of viral antigens, which are captured in the wells of microtitre plates coated with the corresponding antigen or specific (usually monoclonal) antibody, respectively. Antigen-antibody complexes are revealed by a colorimetric enzymatic reaction, and the result is interpreted by comparing the absorbance readings with a defined cut-off value. Although the absorbance provides a quantitative result, the test is usually reported simply as positive or negative. Recently, some researchers have advocated dividing the optical density into three levels: high positive, low positive and negative. High positive samples can be designated as positive outright, since they are unlikely to be false positives; samples reported as low positive require confirmatory testing. However, there is a lapse during which no positive test is detected even though the individual is infected, which happens when the infection has been acquired recently. This period is known as the "window period" (Deng et al., 2003, pp. 959-960).
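The three-level interpretation described above amounts to comparing the absorbance reading against the cut-off. The sketch below illustrates the idea; the cut-off and the "high positive" threshold are hypothetical values chosen for illustration, since real assays define their own.

```python
# Sketch of the three-level interpretation of an EIA absorbance reading.
# The cut-off (1.0) and the high-positive ratio (3.0) are hypothetical
# illustration values, not thresholds from any real assay.

def interpret_eia(absorbance, cutoff=1.0, high_ratio=3.0):
    """Classify a sample from its absorbance-to-cut-off ratio."""
    ratio = absorbance / cutoff
    if ratio >= high_ratio:
        return "high positive"   # may be reported without confirmation
    if ratio >= 1.0:
        return "low positive"    # confirmatory testing usually required
    return "negative"

print(interpret_eia(3.5))  # high positive
print(interpret_eia(1.4))  # low positive
print(interpret_eia(0.6))  # negative
```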
The volume and variety of information stored in large digital databases and other sources have grown enormously in recent decades. Much of this information is historical; that is, it represents transactions or situations that have already occurred. Apart from serving as the organization's memory, historical information is used to explain the past, understand the present and predict the future. Most decisions of companies, organizations and institutions are based, to a large extent, on information about past experiences drawn from very diverse sources, and the data must be analyzed to obtain useful information. In many situations, the traditional method of converting data into knowledge consists of analysis and interpretation performed manually: a specialist in the field analyzes the data and produces a report or hypothesis that reflects its trends or patterns. For example, a group of doctors may analyze the evolution of infectious-contagious diseases in the population to determine the age range in which people are most frequently affected, and the competent health authority can use this knowledge to establish vaccination policies. This way of working is slow, expensive and highly subjective. In fact, manual analysis is impracticable in domains where the volume of data grows exponentially (Chang et al., 2011, pp. 5512-5513). Consequently, many important decisions are made not on the basis of the available data but following the user's own intuition, for lack of the necessary tools. Analytical tools with their origin in statistics have long been used to analyze data (Yasin, Jilani and Danish, 2011, pp. 5-6). While these tools are able to infer patterns from the data, they are somewhat cryptic for non-statisticians, generally do not scale to the volume of information that exists nowadays and, in addition, do not integrate well with information systems. In many contexts, such as business, medicine or science, what is interesting is the knowledge that can be inferred from the data and, moreover, the ability to use this knowledge. For example, it is possible to know statistically that 10% of the elderly have Alzheimer's. This can be useful, but it is surely much more useful to have a set of rules that, from the background, habits and other characteristics of an individual, tell us whether that patient will develop Alzheimer's or not. The problems and limitations of the existing tools have given rise to the need for a new generation of techniques to support the extraction of useful knowledge from the available information, which are grouped under the name of data mining. Data mining differs from other techniques in that it obtains not extensional information (data) but intensional information (knowledge); moreover, this knowledge is not a parameterization of a pre-established model or one intuited by the user, but a completely novel and original model extracted by the tool (Acuna and Rodriguez, 2004, pp. 639-647).
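The contrast between a descriptive statistic and an induced rule can be made concrete with a toy sketch. All records and attribute names below are fabricated for illustration; a real tool would induce rules over many attributes at once.

```python
# A toy contrast between a descriptive statistic ("10% of the elderly have
# Alzheimer's") and an induced rule. All records below are fabricated.
records = [
    {"age": 72, "smoker": True,  "alzheimer": True},
    {"age": 80, "smoker": True,  "alzheimer": True},
    {"age": 75, "smoker": False, "alzheimer": False},
    {"age": 68, "smoker": False, "alzheimer": False},
    {"age": 82, "smoker": True,  "alzheimer": True},
    {"age": 70, "smoker": False, "alzheimer": False},
]

# Descriptive statistic: overall prevalence in the sample.
prevalence = sum(r["alzheimer"] for r in records) / len(records)

# Induced rule: how well does a single attribute test predict the class?
def rule_accuracy(attr):
    """Accuracy of the rule 'IF <attr> THEN alzheimer' on these records."""
    hits = sum((r[attr] is True) == r["alzheimer"] for r in records)
    return hits / len(records)

print(f"prevalence: {prevalence:.0%}")
print(f"IF smoker THEN alzheimer -- accuracy {rule_accuracy('smoker'):.0%}")
```

The statistic summarizes the sample; the rule is a model that can be applied to a new individual, which is the distinction the paragraph draws.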
Infection with the hepatitis C virus (HCV) continues to be a frequent cause of chronic liver disease in individuals with chronic kidney disease (CKD) who receive long-term renal replacement therapy (RRT). In the United States, the seroprevalence of anti-HCV antibodies is approximately five times higher among patients on chronic hemodialysis than in the general population (7.8% vs. 1.6%). Although HCV infection is an established cause of glomerulonephritis (usually mediated by cryoglobulins), it rarely causes CKD that requires renal replacement therapy or kidney transplantation. In this population, HCV infection is often acquired during dialysis. The risk of infection varies according to the type of RRT, and the highest prevalence of hepatitis C occurs in patients treated with maintenance hemodialysis (Gorunescu, 2011).
Data mining can be applied to any type of information, the mining techniques differing for each type. Many types of data exist (integers, reals, dates, text strings, etc.), but from the point of view of the most usual data mining techniques it is only necessary to distinguish two types: numerical (integer or real) and categorical or discrete (taking values in a finite set of categories). Even considering only these two types, it should be clarified that not all techniques are capable of working with both. These data are contained in what is known as a database, which may be of a different nature depending on the type of information stored. Some types of databases are the following. Relational databases are the most widely used today as a source for data mining techniques. A relational database is a collection of relations (tables), where each table consists of a set of attributes and can hold a large number of tuples, records or rows. Each tuple represents an object, which is described through the values of its attributes and, in general, is characterized by a unique key that identifies it unambiguously among the rest. One of the main characteristics of relational databases is the existence of an associated schema; that is, the data must follow a structure and are, therefore, structured. Through a query (for example in SQL) we can combine into a single table the information from several tables that we require for each specific data mining task. Spatial databases contain information related to physical space in a broad sense (a city, a mountainous region, a cerebral atlas, etc.). These databases include geographic data, medical images, transportation networks, traffic information, etc., where spatial relationships are very relevant. Data mining on these databases makes it possible to find patterns in the data, such as the characteristics of houses in a mountainous area, or the planning of new public transport lines depending on the distance of the different areas from the existing lines (Li and Zhou, 2007, pp. 1088-1098).
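The step of combining several tables into the single table a mining task consumes can be sketched with an in-memory SQLite database. The table and column names are invented for illustration, not taken from any database used in this study.

```python
# Minimal sketch: joining two relational tables into the single table a
# data mining task would consume. Tables and columns are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patient (pid INTEGER PRIMARY KEY, age INTEGER, sex TEXT);
CREATE TABLE test_result (pid INTEGER, anti_hcv INTEGER);
INSERT INTO patient VALUES (1, 45, 'M'), (2, 62, 'F');
INSERT INTO test_result VALUES (1, 1), (2, 0);
""")

# One SQL query gathers the attributes needed for the mining step.
rows = con.execute("""
    SELECT p.pid, p.age, p.sex, t.anti_hcv
    FROM patient p JOIN test_result t ON p.pid = t.pid
""").fetchall()
for row in rows:
    print(row)
```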
Given the high prevalence of HCV infection in the HIV population, the determination of anti-HCV antibodies should be carried out systematically in all HIV-infected patients. Anti-HCV detection is performed with the usual techniques. In general terms, the confirmation of a positive result through immunoblot techniques (RIBA, LIA) is only necessary in patients from populations with low risk for hepatitis C.
Therefore, confirmation tests will not be necessary in the majority of HIV-infected patients presenting a positive anti-HCV serology, since they probably belong to a group at high risk of HCV infection (intravenous drug users, hemophiliacs, etc.) and exposure to HCV can be assumed. In patients suspected to be in the window period of HCV infection, or in cases of seronegative hepatitis with a high probability of HCV infection, direct detection of the virus is indicated, using the same techniques as in patients not coinfected with HIV, such as qualitative PCR (Kurosaki et al., 2012, pp. 607-608). The identification of genotypes and the quantification of HCV viremia do not differ, as regards techniques and indications, from those performed in non-coinfected patients, and are limited fundamentally to epidemiological studies or, in individual cases, to initiating or monitoring anti-HCV treatment. The degree of histological lesion and, specifically, the extent of fibrosis are considered predictive factors of the response to treatment. In the HIV-coinfected patient there is a greater degree of injury and fibrosis, which can condition the therapeutic response and, therefore, modify the indication for treatment. In addition, in the coinfected patient other circumstances that explain the liver injury may coexist (drugs, alcohol, opportunistic infections, etc.). For all these reasons, a pretreatment liver biopsy seems even more justified than in the non-coinfected patient. It can also be useful for making decisions in patients under anti-HCV treatment when it is poorly tolerated. In general, it is only contraindicated when the patient has a coagulopathy or when there is obvious evidence of terminal liver disease. In patients with normal transaminases, although histological signs of chronic hepatopathy may occasionally exist, the prognosis is in general good enough that, in principle, a liver biopsy is not recommended (Yeh et al., 2011, pp. 447-448).
Background
It is estimated that 3% of the world population is chronically infected with HCV, and more than one million new cases of infection are reported every year. In the United States (USA), 1.8% of the population is positive for antibodies against HCV. Of every four seropositive people, three have viremia, meaning that active HCV infection is present in 2.7 million people. In addition, it is estimated that 30,000 acute infections occur annually in this country. Studies based on North American populations indicate that 40% of chronic liver disease is related to HCV, causing between 8,000 and 10,000 deaths annually; without an effective intervention, deaths could triple in the next ten to twenty years. End-stage liver disease (cirrhosis and liver cancer) associated with HCV infection is the most frequent indication for liver transplantation among adults in the US and in Western Europe. In Latin America there are few studies on the seroprevalence of HCV in the general population. Some of the frequencies found are the following: in Venezuela in 1993, in 200 people, a seroprevalence of antibodies against HCV (anti-HCV) of 1.5%; in Brazil, in 1994 (n = 460) and 1995 (n = 800), an anti-HCV seroprevalence of 1.4% and 1.2%, respectively; in Mexico in 1991, in 450 healthy children, an anti-HCV seroprevalence of 0.9% (Deng et al., 2003, pp. 959-960).
Chronic infection with the hepatitis C virus (HCV) is a worldwide health problem that affects more than 170 million people, representing a prevalence of the order of 2.5% of the world's population. More than 53,700 deaths per year are directly attributable to HCV, although the WHO estimates that more than 308,000 deaths per year are probably due to liver cancer caused by HCV, together with a significant proportion of the 785,000 deaths due to cirrhosis. Taken together, these data suggest that HCV is responsible for approximately one million deaths per year.
HCV seroprevalence data in the world population show wide variation. Central and Eastern Asia, North Africa and the Middle East are estimated to have a high prevalence (> 3.5%); South and Southeast Asia, Sub-Saharan Africa, the Andean zone, Central and Southern Latin America, the Caribbean, Oceania, Australasia (Australia, New Zealand, New Guinea and the neighbouring islands of the Pacific) and Western, Central and Eastern Europe have a moderate prevalence (1.5%-3.5%); while Asia Pacific, Tropical Latin America and North America have a low prevalence (< 1.5%) (Chang et al., 2011, pp. 5512-5513). In the European Union, the prevalence of people infected with hepatitis C varies between countries; the highest prevalence, greater than 2%, occurs in the south of Europe (Italy, Romania and Spain). In the UK, according to Bruguera, in the review made in 2006, the prevalence of anti-HCV-positive people in the general population is between 1% and 2.6%, which would mean between 480,000 and 760,000 infected people. There are wide geographical differences, with the greatest impact concentrated in the most urbanized communities (2.5% and 2.6% in Madrid and Catalonia, respectively) and lower figures in the less urbanized ones (1.6% in Asturias). The distribution by age is also heterogeneous, showing a curve with two peaks indicative of different epidemiological patterns, which depend on the transmission mechanism most prevalent in each group (Acuna and Rodriguez, 2004, pp. 639-647). The peaks correspond to the age group between 30 and 45 years, whose infection is attributable to parenteral drug use, and to those over 65 years, attributable to the receipt of transfusions before 1990 or to the clinical use of non-sterilized syringes before 1975, when single-use material was introduced in the UK.
The differences by sex are more notable at ages between 25 and 45 years, where the prevalence in men is higher, perhaps because intravenous drug abuse is more frequent among them. The influence of immigration on the prevalence of hepatitis C in the UK is estimated to be potentially high, and depends on the origin of the immigrant population. The studies carried out on small samples agree with known international figures: Asians (between 11% and 15%) and sub-Saharan Africans (between 8% and 17%) record the highest figures, while those of North Africans are similar to the autochthonous population (1.9%) and those of Latin Americans lower (0.4%). In Galicia, seroprevalence data in blood donors, according to the study carried out by Eiras et al. from May 1999 to June 2001, show that 1.35‰ or 1‰ of donors, according to serology or RT-PCR RNA detection techniques respectively, have antibodies against hepatitis C. The incidence detected in that period was 2.87 per 100,000 persons. Hepatitis C incidence data present several limitations, since most acute HCV infections pass clinically unnoticed and are therefore not diagnosed, and, in the absence of indicators of recent infection, acute infections cannot be differentiated from chronic ones in a patient with positive HCV antibodies (Yasin, Jilani and Danish, 2011, pp. 5-6).
The data mining phase is the most characteristic phase of the knowledge extraction process, and its name is often used for the whole process. The objective of this phase is to produce new knowledge that the user can use, by building a model based on the data collected for this purpose. The model is a description of the patterns and relationships in the data that can be used to make predictions, to better understand the data or to explain past situations. This requires a series of decisions before starting the process:
·
Determine what type of mining task is most appropriate.
·
Choose the type of model.
·
Choose the mining algorithm that solves the task and obtains the type of model we are looking for.
The construction of the model is where the iterative character of the data mining process is best seen, since it will be necessary to explore alternative models until the one most useful for solving the problem is found.
Thus, once a model is obtained, and based on the results it produces, another model can be built using the same technique but other parameters, or perhaps using other techniques or tools (Gorunescu, 2011).
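The iterative loop described above — build a model, evaluate it, then retry with other parameters — can be sketched as follows. The hand-rolled k-nearest-neighbour classifier and the tiny data set are illustrative assumptions, not the algorithms or data used in this study; in practice a tool such as Weka would supply the algorithms.

```python
# Sketch of the iterative model-building loop: evaluate alternative
# parameterizations and keep the most useful one. Data are fabricated;
# each example is ((feature1, feature2), class_label).
train = [((1, 1), 0), ((1, 2), 0), ((5, 5), 1), ((6, 5), 1)]
test  = [((2, 1), 0), ((5, 6), 1)]

def knn_predict(x, k):
    """Majority class among the k training points nearest to x."""
    nearest = sorted(train,
                     key=lambda p: sum((a - b) ** 2
                                       for a, b in zip(p[0], x)))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Explore alternative parameterizations (here: k) and compare accuracy.
scores = {k: sum(knn_predict(x, k) == y for x, y in test) / len(test)
          for k in (1, 3)}
best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, accuracy = {scores[best_k]:.0%}")
```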
Ethical considerations
Ethical considerations form a major element of research. The researcher needs to promote the aims of the research by imparting authentic knowledge and truth and by preventing error. Furthermore, following ethical practice enables scholars to take a collaborative approach to their study with the assistance of their peers, mentors and other contributors. This requires values such as accountability, trust, mutual respect and fairness among all parties involved in a study, which in turn depends on the protection of the intellectual property rights of all contributors, established through the implementation of ethical considerations. Other ethical considerations in research refer to accountability towards the general public by protecting the human or animal subjects used in the study. Similarly, the appropriate use of public funds and the gaining of public support are also important.
1.2 Aims and Objectives
Neonates born to hepatitis C virus (HCV)-positive mothers are usually not screened for HCV. Unscreened children may act as active sources of community HCV transmission, and the factors contributing to vertical HCV transmission remain controversial and in need of clarification. We aimed to investigate the factors contributing to vertical HCV transmission in the setting with the highest HCV prevalence worldwide.
1.3 Research Questions
The objective of this paper is to predict potentially new HCV-human protein interactions using a data mining technique.
Methodology
This article presents an analysis of various data mining techniques that may be useful for medical analysts or physicians to accurately diagnose HCV infection. The main methodology of our work was the study of publications, journals and reviews in the field of computer science and engineering, with the research focused on newer publications. Data source: a total of 463 records with 16 health attributes (factors) were obtained from the healthcare worker (HCW) database for Egyptian healthcare workers at the National Hepatitis B Center in Egypt, the country with the highest worldwide prevalence of HCV. The records were divided into two data files: a training data file (602 records) and a test data file (257 records). To avoid distortion, the records for each set were randomly selected. The target classification models were chosen taking into account that the Naïve Bayes algorithm supports only categorical (discrete) attributes, while decision tree algorithms and neural networks support both categorical and continuous attributes. For consistency, only categorical attributes are used for all three models. All health attributes were therefore transformed from numeric to categorical data. The "HCV_PCR" attribute was identified as the predictable attribute, with value "1" for HCV-infected patients and "0" for uninfected patients. The "PID" attribute was used as a key; the rest are input attributes. Missing, inconsistent and duplicate data are assumed to have been resolved.
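The numeric-to-categorical transformation mentioned above can be sketched as simple binning. The cut points and labels below are hypothetical illustrations; the study does not state the bins it used.

```python
# Sketch of discretizing a numeric attribute into categories.
# Cut points and labels are hypothetical illustration values.

def discretize(value, cuts, labels):
    """Map a numeric value to the label of the first bin it falls into."""
    for cut, label in zip(cuts, labels):
        if value <= cut:
            return label
    return labels[-1]   # value above the last cut point

age_bins = ([30, 50], ["young", "middle", "older"])
print(discretize(24, *age_bins))  # young
print(discretize(41, *age_bins))  # middle
print(discretize(67, *age_bins))  # older
```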
1. DATA PREPROCESSING
1.1 Dataset
This study performs experiments on a hepatitis data set. The data set contains 155 instances distributed between two classes: DIE, with 32 instances, and LIVE, with 123 instances. There are 20 attributes, including the class attribute, and the data contain missing values. The main objective with this data set is to predict the presence or absence of the hepatitis virus. The data file was retrieved from the UCI Machine Learning Repository.
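Records in the UCI hepatitis file are comma-separated, with "?" marking a missing value and the class coded 1 (DIE) or 2 (LIVE). The sketch below parses two fabricated lines in that format, so it runs without downloading the real file.

```python
# Sketch of parsing records in the UCI hepatitis format. The two sample
# lines are fabricated in that format for illustration only.
import csv, io

sample = io.StringIO(
    "2,30,2,1,2,2,2,2,1,2,2,2,2,2,1.00,85,18,4.0,?,1\n"
    "1,50,1,1,2,1,2,2,2,2,1,2,2,2,0.90,135,42,3.5,?,1\n")

CLASS = {1: "DIE", 2: "LIVE"}
rows = []
for rec in csv.reader(sample):
    # Convert fields to numbers, keeping None where the value is "?".
    rows.append([None if v == "?" else float(v) for v in rec])

print(CLASS[int(rows[0][0])], "- instances parsed:", len(rows))
```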
1.2 Mining Models
Trained models were evaluated for accuracy and efficiency on the test data, and the models were validated using the lift chart and the classification matrix, as described in the following section. Model efficiency verification: the efficacy of the models was tested by two methods, the lift chart and the classification matrix. The goal was to find out which model gave the highest percentage of correct predictions for diagnosing HCV-infected patients. To determine whether there was enough information to learn patterns in response to the predictable attribute, the columns in the trained model were mapped to the columns in the test data file. In addition, "HCV_PCR" was selected as the model's predictable column, and the column state was set to predict HCV-infected patients (predicted value = 1) (Kurosaki et al., 2011, pp. 401-409). The X axis shows the percentage of the test data set used to compare the predictions, while the Y axis shows the percentage of predicted values for the specified state. The blue and red lines show the results for a random estimate and for the ideal model, respectively.
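The points of such a lift chart can be computed by sorting cases by the model's predicted score and plotting the cumulative share of true positives captured against the share of the data set processed. The scores and labels below are fabricated for illustration, not taken from the study's models.

```python
# Sketch of computing lift-chart points: (fraction of data set processed,
# fraction of positives captured). Scores/labels are fabricated.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]
scored.sort(key=lambda p: -p[0])          # best-scored cases first

total_pos = sum(label for _, label in scored)
captured, lift_points = 0, []
for i, (_, label) in enumerate(scored, start=1):
    captured += label
    lift_points.append((i / len(scored), captured / total_pos))

for frac, gain in lift_points:
    print(f"{frac:.0%} of data -> {gain:.0%} of positives")
```

A random-guess model would capture positives in proportion to the data processed (the 45-degree line), while the ideal model captures all positives after processing only the truly positive fraction of the data set.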
1.3 Extraction Hepatitis dataset
The hepatitis data set contains data on the screening of patients with hepatitis. First, the data set was pre-processed to make the mining process more efficient. In our paper, we used the Neural Connection and Weka tools to compare the classification accuracy of data mining algorithms on the hepatitis disease data file. The pre-processed data set, with missing values removed to improve classification performance, is then used. The selection in the tool describes the status of the data attributes present in the hepatitis data set.
Attribute information:
1. Class: DIE, LIVE
2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
3. SEX: male, female
4. STEROID: no, yes
5. ANTIVIRALS: no, yes
6. FATIGUE: no, yes
7. MALAISE: no, yes
8. ANOREXIA: no, yes
9. LIVER BIG: no, yes
10. LIVER FIRM: no, yes
11. SPLEEN PALPABLE: no, yes
12. SPIDERS: no, yes
13. ASCITES: no, yes
14. VARICES: no, yes
15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00 -- see the note below
16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
17. SGOT: 13, 100, 200, 300, 400, 500
18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
20. HISTOLOGY: no, yes
The BILIRUBIN attribute appears to be continuously valued. I checked this with the donor, Bojan Cestnik, who replied:
About the hepatitis database and the BILIRUBIN problem I would like to say the following: BILIRUBIN is a continuous attribute (the number of its "values" in the ASDOHEPA.DAT file is negative!); "values" are quoted because, when speaking about a continuous attribute, there is no such thing as a set of all possible values. However, they represent so-called "boundary" values, according to which the attribute can be discretized. At the same time, because the attribute is continuous, one can perform other tests, since the continuous information is preserved. I hope that these lines have at least roughly answered your question.
Missing attribute values (indicated by "?"):
Attribute number: number of missing values
1: 0
2: 0
3: 0
4: 1
5: 0
6: 1
7: 1
8: 1
9: 10
11: 5
12: 5
13: 5
14: 5
15: 6
16: 29
17: 4
18: 16
19: 67
20: 0
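The two usual treatments for the "?" entries tallied above are dropping incomplete records or imputing the attribute's most common value. The sketch below shows both on fabricated records; the attribute name mirrors PROTIME, the attribute with the most missing values (67), purely for illustration.

```python
# Sketch of handling missing values: drop incomplete records, or impute
# the attribute's most common value. Records are fabricated.
records = [
    {"age": 30, "protime": 61},
    {"age": 50, "protime": None},   # a "?" entry in the raw file
    {"age": 78, "protime": 61},
]

# Option 1: keep only complete records.
complete = [r for r in records if r["protime"] is not None]

# Option 2: mode imputation for the missing value.
values = [r["protime"] for r in complete]
mode = max(set(values), key=values.count)
imputed = [{**r,
            "protime": r["protime"] if r["protime"] is not None else mode}
           for r in records]

print(len(complete), imputed[1]["protime"])
```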
The top red line shows the ideal model, which captured 100% of the target population of HCV-infected patients using 8.53% of the test data set. The bottom blue line shows the random-guess line, which always lies at 45 degrees in the graph: if we randomly estimate the result for each case, 50% of the target population would be captured using 50% of the test data set (Narayan et al., 2002, pp. 5-13). All three model lines (violet, yellow and green) fall between the random-estimate and ideal lines, indicating that all three had enough information to learn patterns in response to the predictable state, except that the purple line, over a certain percentage range, moves around the random-guess line, meaning that the training data did not contain enough information to learn the target patterns there. Lift chart without a predictable value: the steps for creating this lift chart are similar to those above, except that the predictable column state is left empty, and no random-estimate series is included. The lift output diagram shows how well each model performed in predicting the correct number of predictable attributes. The X axis shows the percentage of the test data set used to compare predictions, while the Y axis shows the percentage of predictions that are correct. The blue, purple, green and red lines show the ideal model, the neural network, Naïve Bayes and decision trees, respectively (Elrazek et al., 2017, pp. 529-533). The chart shows the performance of the models in all possible states. The ideal line (blue) is at an angle of 45 degrees, indicating that 50% of the test data set is predicted correctly when 50% of the test data set is processed.
When the entire population is processed, the decision tree model is better than the other two, because it has the highest number of correct predictions (93%), followed by the neural network (88%) and Naïve Bayes (85%). If less than 50% of the population is processed, the lift lines for the decision tree and the neural network are always higher than for Naïve Bayes, showing that neural networks and decision trees are better at producing a high percentage of correct predictions than Naïve Bayes. Along part of the X axis, the lift lines for the neural network and Naïve Bayes overlap (Khairy et al., 2013, p. 13), showing that in that range both models are equally good at correct prediction. When more than 50% of the population is processed, decision trees and neural networks again appear better, because they provide a higher percentage of correct predictions than Naïve Bayes: the lift line for Naïve Bayes lies below those of the neural network and decision trees. Over certain population ranges decision trees seem better than the neural network, over others the neural network appears better than Naïve Bayes, and vice versa. Classification matrix: the classification matrix displays the frequency of correct and incorrect predictions, comparing the actual values in the test file with those of the trained model. In this example, a test set of 19 HCV-infected patients and 238 uninfected patients was used. Table 1 shows the results of the classification matrix for all three models; it reports the number of cases each algorithm predicts correctly and hence its performance (where "actual 1" denotes HCV-infected patients and "actual 0" uninfected patients) (Zayed et al., 2013, pp. 254-261). Decision trees appear to be the most effective model because they have the highest percentage of correct predictions (84.21%) for HCV-infected patients, followed by neural networks (57.89%) and Naïve Bayes (52.63%). Decision trees also appear to be the most effective at predicting uninfected patients (93.68%) compared with the other two models.
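A classification matrix of the kind reported in Table 1 is simply a tally of (actual, predicted) pairs, from which the per-class percentage of correct predictions follows. The labels below are fabricated (1 = HCV infected, 0 = uninfected), not the study's test data.

```python
# Sketch of a classification (confusion) matrix and per-class accuracy.
# Labels are fabricated: 1 = HCV infected, 0 = uninfected.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

matrix = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
for a, p in zip(actual, predicted):
    matrix[(a, p)] += 1

# Percentage of correct predictions within each actual class.
infected_acc   = matrix[(1, 1)] / (matrix[(1, 1)] + matrix[(1, 0)])
uninfected_acc = matrix[(0, 0)] / (matrix[(0, 0)] + matrix[(0, 1)])
print(f"infected: {infected_acc:.0%}, uninfected: {uninfected_acc:.0%}")
```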
1.4 Evaluation of Mining Goals
Four exploration goals, based on the HCV data set survey and the objectives of this research, were defined and evaluated on the basis of the trained models. The results show that all three models achieved the goals, indicating that they can be used to provide decision support to physicians for diagnosing patients and discovering the medical factors associated with HCV infection. The four goals are listed and discussed below. Goal 1: identify the most significant and the weakest health attributes related to the predictable status (HCV infection). The dependency viewers for the decision tree and Naïve Bayes models show the strongest and the weakest medical predictors; this viewer is particularly useful when there are many input attributes. Both models show that HCV_ELISA is the most important factor for HCV infection; other important factors include AST and Schisto (El-Serag et al., 2014, pp. 1249-1255). The decision tree model shows "gender" as the weakest factor, while for the Naïve Bayes model the weakest factor is "residence". Decision trees seem better than Naïve Bayes here, because they assign significance to multiple input attributes. Physicians can use this information to further analyze the strengths and weaknesses of the health attributes associated with HCV infection.
Goal 2: identify the impact of and relationships between health attributes in relation to the predictable state (HCV infection). The impact of and relationships between health attributes in relation to HCV infection are found only in the decision tree viewer. The highest probability of finding HCV-infected patients (86.67%) occurs in the relationship between the attribute nodes "HCV_ELISA = 1 and AST ne = 42 and Job = 11 and Schisto = 1 and Neadlestick = 0". With this information, physicians can perform a medical examination of these five attributes, instead of all attributes, on potential patients likely to be diagnosed with HCV infection, thereby reducing health costs, administrative costs and diagnostic time. A lower probability (60%) is found in the attribute "HCV_ELISA ne = 1". The relationships between attributes in HCV-infected patients are also given (Kurosaki et al., 2011, pp. 401-409). The results show that the attribute "HCV_ELISA ne = 1" has the highest impact (99.40%), while the lowest score (13.33%) is found in the attributes "HCV_ELISA = 1 and AST ne = 42 and Job = 11 and Schisto = 1 and Neadlestick = 0". Additional information, such as patient identification and health profiles based on selected nodes, can also be obtained through the drill-down function. Physicians can use the decision tree viewer to perform further analysis.
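Reading a rule such as "HCV_ELISA = 1 and Schisto = 1 ..." off a decision tree amounts to collecting the attribute tests along one root-to-leaf path. The tree below is a fabricated two-level stand-in, not the model trained in the study.

```python
# Sketch of extracting the rule (conjunction of conditions) along a
# decision tree path. The tree and its probabilities are fabricated.
tree = {
    "attr": "HCV_ELISA",
    "branches": {
        1: {"attr": "Schisto",
            "branches": {1: {"leaf": 0.87}, 0: {"leaf": 0.60}}},
        0: {"leaf": 0.01},
    },
}

def rule_for(case):
    """Follow the branch for each attribute value; collect the conditions."""
    node, conds = tree, []
    while "leaf" not in node:
        value = case[node["attr"]]
        conds.append(f"{node['attr']} = {value}")
        node = node["branches"][value]
    return " and ".join(conds), node["leaf"]

rule, prob = rule_for({"HCV_ELISA": 1, "Schisto": 1})
print(rule, "->", prob)
```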
Goal 3: identify the characteristics of HCV-infected and uninfected patients. HCV-infected patients show, with high probability, antibodies to hepatitis C in the blood (HCV_ELISA = 1), and 97.02% are not infected with HBV (HBsAg_ELISA = 0) (Narayan et al., 2002, pp. 5-13). Other important characteristics include a high probability of rural residence (residence = 2), male patients (gender = 1) and patients not infected with schistosomiasis (Schisto = 0). The characteristics of uninfected patients, with high probability, are HBsAg_ELISA = 0, meaning they are not infected with HBV, and no hepatitis C antibodies in the blood (HCV_ELISA = 0). These results can be analyzed further.
Goal 4: identify the attribute values that differentiate the nodes favoring and disfavoring the predictable conditions: (1) HCV-infected patients and (2) uninfected patients. This question can be answered by analyzing the attribute results of the Naïve Bayes and neural network models, whose viewers provide information on the impact of all attribute values in relation to the predictable state. The Naïve Bayes model shows that the most important attribute value favoring HCV-infected patients is "HCV_ELISA = 1"; other attribute values favoring HCV infection include "AST = 31.4", "AST = 22.7", etc., while values such as "HCV_ELISA = 0", "AST = 31" and "AST = 42" favor the uninfected state. The neural network model shows that the most important attribute value favoring HCV-infected patients is "AST = 34.7" (Elrazek et al., 2017, pp. 529-533); other values favoring HCV infection include ALT = 64.6, AST = 27.94 and Age = 54, while values such as Age = 28 and ALT = 47.14 favor the predictable state of uninfected patients.
1.5 Comparison of the Performance of Three Data Mining Techniques
Using three data sets of different sizes, we finally want to verify whether the size of the database affects the accuracy of the classification techniques used to predict HCV infection. The three classification techniques, Naïve Bayes, decision trees, and neural networks, were applied to three HCV databases of different sizes, and the accuracy and efficiency of each technique were then determined using the lift chart (Khairy et al., 2013, p. 13).
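The lift chart used for this comparison can be illustrated with a minimal pure-Python sketch: for a chosen fraction of cases ranked by predicted probability, lift is the positive rate in that top fraction divided by the overall positive rate. All scores and labels below are hypothetical, not taken from the study's data.

```python
def lift_at(scores, labels, fraction):
    """Lift in the top `fraction` of cases ranked by descending score."""
    ranked = [l for _, l in sorted(zip(scores, labels), reverse=True)]
    top = ranked[: max(1, int(len(ranked) * fraction))]
    rate_top = sum(top) / len(top)      # positive rate among top-ranked cases
    rate_all = sum(labels) / len(labels)  # baseline positive rate
    return rate_top / rate_all

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # model-predicted P(infected)
labels = [1, 1, 1, 0, 1, 0, 0, 0]                   # true infection status
print(lift_at(scores, labels, 0.25))  # 2.0: top quarter has twice the base rate
```

A full lift chart simply plots this value for fractions from 0 to 1; a model whose curve stays well above 1 concentrates infected patients near the top of the ranking.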
Decision trees are the most effective technique for detecting HCV infection across data sets of different sizes, as they have the highest percentage of correct predictions. Naïve Bayes accuracy increases as the data file size increases, and the size of the dataset affects the accuracy of the classification techniques used to detect HCV infection.
The C4.5 algorithm was introduced by Quinlan to induce classification models, also referred to as decision trees, from observed data. In the observed data file, each record has the same structure, consisting of a number of attribute/value pairs (Zayed et al., 2013, pp. 254-261). One of these attributes represents the category of the record. The problem is to build a decision tree that, on the basis of the non-category attributes, correctly predicts the value of the category attribute. The category attribute can take values such as {true, false}, {predicted, not predicted}, {success, failure}, or something similar; in any case, one of its values will mean failure. If there are n equally probable possible messages, the probability p of each is 1/n, and the information conveyed by a message is -log(p) = log(n). In general, given a probability distribution P = (p1, p2, ..., pn), the information conveyed by this distribution, also called the entropy of P, is: I(P) = -(p1*log(p1) + p2*log(p2) + ... + pn*log(pn)). If a set T of records is partitioned into disjoint exhaustive classes C1, C2, ..., Ck on the basis of the value of the categorical attribute, then the information needed to identify the class of an element of T is Info(T) = I(P), where P is the probability distribution of the partition (C1, C2, ..., Ck): P = (|C1|/|T|, |C2|/|T|, ..., |Ck|/|T|). If T is first partitioned on the basis of the value of a non-categorical attribute X into sets T1, T2, ..., Tn, then the information needed to identify the class of an element of T becomes the weighted average of the information needed to identify the class of an element of Ti, i.e. the weighted average of Info(Ti): Info(X,T) = Sum for i from 1 to n of (|Ti|/|T|) * Info(Ti). The quantity Gain(X,T) is defined as Gain(X,T) = Info(T) - Info(X,T). This is the difference between the information needed to identify an element of T before and after the value of attribute X has been obtained, i.e. the gain in information due to attribute X (El-Serag et al., 2014, pp. 1249-1255). We can therefore use this notion of gain to rank attributes and to build decision trees in which each node holds the attribute with the greatest gain among the attributes not yet considered on the path from the root. The intent of this ordering is twofold: (i) to create small decision trees, so that records can be identified after only a few questions, and (ii) to match a hoped-for minimality of the process of classifying the records. Therefore, the C4.5 algorithm can be used to identify the most important attribute, i.e. the key risk factor associated with HCV infection among IDUs in India. As mentioned above, the observed drug-user data set contains four hundred and forty records with values for nine different non-categorical attributes.
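The entropy and gain quantities above can be sketched in a short pure-Python illustration. The toy records and attribute names below are hypothetical, merely echoing the data set's naming style; they are not the study's data.

```python
import math

def entropy(labels):
    """Shannon entropy I(P) of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def info_gain(records, attr, label):
    """Gain(X,T) = Info(T) - Info(X,T) for attribute attr over records T."""
    total = entropy([r[label] for r in records])
    n = len(records)
    partitions = {}
    for r in records:                      # split T into T1..Tn by attr value
        partitions.setdefault(r[attr], []).append(r[label])
    weighted = sum(len(p) / n * entropy(p) for p in partitions.values())
    return total - weighted

# Hypothetical toy records mimicking the HCV data set's attribute style.
data = [
    {"Schisto": 0, "Gender": 1, "HCV": 1},
    {"Schisto": 0, "Gender": 2, "HCV": 1},
    {"Schisto": 1, "Gender": 1, "HCV": 0},
    {"Schisto": 1, "Gender": 2, "HCV": 0},
]
print(info_gain(data, "Schisto", "HCV"))  # 1.0: Schisto perfectly splits HCV here
print(info_gain(data, "Gender", "HCV"))   # 0.0: Gender carries no information
```

C4.5 places at each node the attribute with the greatest gain among those not yet used on the path from the root, which is exactly the ranking this function produces.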
2. Research Process
The process consisted of several phases. In the first stage, the data set was divided into two parts, training (80%) and testing (20%), to guarantee the accuracy of the experimental result and improve credibility (Kurosaki et al., 2011, pp. 401-409).
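The 80%/20% split can be sketched as follows. This is a minimal pure-Python version; the fixed seed is an assumption for reproducibility, and the count of 440 records mirrors the drug-user data set size mentioned above.

```python
import random

def train_test_split(records, train_frac=0.8, seed=42):
    """Shuffle and split records into training and testing subsets."""
    rng = random.Random(seed)          # fixed seed: reproducible split (assumed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

records = list(range(440))             # stand-in for the 440 drug-user records
train, test = train_test_split(records)
print(len(train), len(test))           # 352 88
```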
In the second stage, prior to evaluation, the data in the training database must be pre-processed. Pre-processing included data cleaning, i.e. ensuring that the data contains no missing values, noise (errors, outliers), or discrepancies (inconsistencies in the units used). Several approaches are available for this purpose (Narayan et al., 2002, pp. 5-13). In this study, missing values were replaced by the median value, as this method is commonly used by many researchers. After pre-processing, a complete data set was obtained and used for the experiments.
In the third stage, classification is used to assign data to predefined categories. The "class" in classification is the attribute or property of the data set in which users are most interested; in statistics it is defined as the dependent variable. A classification algorithm creates a classification model consisting of classification rules. In our study, classification can be used to determine the diagnosis and prognosis of hepatitis based on symptoms and medical conditions. This stage consisted of two steps: training and testing. The first step, training, is used to build the classification model by analyzing training data containing class labels (Elrazek et al., 2017, pp. 529-533). The second step, testing, examines the classifier on test data, measuring either its accuracy when the test data contain a class label or its ability to classify unknown prediction objects. In this paper we mainly deal with Naïve Bayes, Naïve Bayes Updatable, FT tree, KStar, J48, LMT, and neural networks.
In the fourth stage, we discussed and compared the percentage accuracy and statistical results among the algorithms (Khairy et al., 2013, p. 13).
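The median imputation used in the second stage can be sketched as follows. This is a minimal pure-Python version; the AST readings shown are hypothetical, with `None` standing for a missing value.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def impute_median(column):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    m = median(observed)
    return [m if v is None else v for v in column]

ast = [31.4, None, 22.7, 42.0, None, 34.7]   # hypothetical AST readings
print(impute_median(ast))                    # None entries replaced by the median
```

In practice the same function would be applied column by column, so each attribute's missing entries are filled from that attribute's own distribution.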
2.1 Other Method
The second technique used in this experiment is rough set theory, using Weka to analyze the hepatitis data set. The data was divided into two parts: training and testing. These data were then discretized to group attributes that have continuous values. Only a few discretization methods are available, namely Boolean reasoning, entropy, Naive, and Semi-naive. After the discretization process, rules are generated via reduction. Reducts are techniques that exclude unused attributes and create a minimal subset of attributes for the decision table, and it has been shown that the rough set technique is the best technique for analyzing hepatitis data, as it gave the highest percentage of precision (El-Serag et al., 2014, pp. 1249-1255). The best classification algorithm used with the rough set approach is Naïve Bayes, which is based on Bayes' rule.
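Bayes' rule as applied by a Naïve Bayes classifier over categorical attributes can be sketched in pure Python. This is a minimal illustration with add-one smoothing; the toy records below are hypothetical, not the study's data.

```python
from collections import defaultdict

def train_nb(records, label):
    """Estimate counts for P(class) and P(attr=value | class)."""
    class_counts = defaultdict(int)
    cond_counts = defaultdict(int)
    for r in records:
        c = r[label]
        class_counts[c] += 1
        for a, v in r.items():
            if a != label:
                cond_counts[(c, a, v)] += 1
    return class_counts, cond_counts

def predict_nb(model, sample, label_values):
    """Pick argmax_c P(c) * prod_a P(a=v | c), with add-one smoothing."""
    class_counts, cond_counts = model
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c in label_values:
        p = class_counts[c] / n                       # prior P(c)
        for a, v in sample.items():                   # naive independence
            p *= (cond_counts[(c, a, v)] + 1) / (class_counts[c] + 2)
        if p > best_p:
            best, best_p = c, p
    return best

data = [
    {"HCV_ELISA": 1, "Schisto": 0, "HCV": 1},
    {"HCV_ELISA": 1, "Schisto": 1, "HCV": 1},
    {"HCV_ELISA": 0, "Schisto": 0, "HCV": 0},
    {"HCV_ELISA": 0, "Schisto": 1, "HCV": 0},
]
model = train_nb(data, "HCV")
print(predict_nb(model, {"HCV_ELISA": 1, "Schisto": 0}, [0, 1]))  # 1
```

The "naive" assumption is the product over attributes: each attribute value is treated as conditionally independent given the class, which is what makes the model cheap to train on a decision table.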
Classifier performance was measured with 10-fold cross-validation and with the number of features chosen by the feature selection methods. The highest accuracy values were obtained with Naïve Bayes and the Decision Table classifier using the Gain Attribute Eval, One-R Attribute Eval, and Relief Attribute Eval methods. For example, the accuracy of Naïve Bayes and the Decision Table with these feature selection methods is 0.853, the highest value in Table II. In addition, the accuracy of Naïve Bayes with Consistency Subset Eval is also 0.853. Likewise, Naïve Bayes with One-R Attribute Eval and Relief Attribute Eval has the highest recall values.
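The 10-fold cross-validation protocol can be sketched as follows. This is a minimal pure-Python version; the 1-nearest-neighbour classifier and the synthetic data are only hypothetical stand-ins for the Weka classifiers and the hepatitis data discussed above.

```python
def kfold_accuracy(records, labels, classify, k=10):
    """Mean accuracy of `classify(train, record)` over k held-out folds."""
    n = len(records)
    accs = []
    for i in range(k):
        test_idx = set(range(i, n, k))          # every k-th record held out
        train = [(records[j], labels[j]) for j in range(n) if j not in test_idx]
        correct = sum(1 for j in test_idx
                      if classify(train, records[j]) == labels[j])
        accs.append(correct / len(test_idx))
    return sum(accs) / k

# Hypothetical 1-nearest-neighbour stand-in for the classifiers compared above.
def nn_classify(train, x):
    return min(train, key=lambda t: abs(t[0] - x))[1]

xs = [float(i) for i in range(40)]
ys = [0] * 20 + [1] * 20                        # labels separable on xs
print(kfold_accuracy(xs, ys, nn_classify, k=10))  # mean fold accuracy (0.975 here)
```

Each record is thus used exactly once for testing and k-1 times for training, which is why the resulting accuracy estimate is less sensitive to any single train/test split than the 80/20 holdout alone.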