Data Mining

CHAPTER 01: INTRODUCTION


Hepatitis caused by the hepatitis C virus (HCV) has become one of the main problems associated with emerging infections. HCV shows high genetic variability, with six main types and a growing number of viral subtypes associated with differing responses to treatment and with the natural evolution of the disease. The discovery of HCV allowed the development of diagnostic tests based on antibodies directed against recombinant viral peptides. These tests use immunoenzymatic techniques and are widely used to detect, and sometimes quantify, specific antibodies in body fluids. The window period between infection with HCV and the detection of specific antibodies varies from patient to patient. With current assays, seroconversion occurs on average seven to eight weeks after infection. Anti-HCV antibodies can persist throughout life, or decrease and gradually disappear after many years; they persist indefinitely in patients who develop chronic infection. Seroreversion and/or delayed seroconversion may occur in immunosuppressed patients, in whom the presence of infection is confirmed by the persistence of HCV RNA (Yeh et al., 2011, pp. 447-448). The main tests are the enzyme immunoassay and the immunoblot. Enzyme immunoassay (serological diagnostic test): most laboratory tests use enzyme-linked immunosorbent assays (ELISA, a form of enzyme immunoassay, EIA). It is a cheap test with relatively good sensitivity and specificity. Screening for HCV infection typically begins with the measurement of antibodies directed against viral proteins: serum or plasma antibodies (or viral antigens) are captured in the wells of microtitre plates coated with the corresponding antigen or specific antibody (usually monoclonal), respectively. Antigen-antibody complexes are revealed in a colorimetric enzymatic reaction, and the result is interpreted by comparing the absorbance readings with a defined cut-off value.
Although this absorbance provides a quantitative result, the test is usually reported simply as positive or negative. Recently, some researchers have advocated dividing the optical density into three levels: high positive, low positive, and negative. High-positive samples can be designated as positive with confidence, since they are unlikely to be false positives, whereas samples reported as low positive require confirmatory testing. However, there is a lapse during which no positive result is detected even though the individual is infected, which happens when the infection has been recently acquired; this period is known as the "window period" (Deng et al., 2003, pp. 959-960).
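The three-level interpretation of the optical density described above can be sketched as a small routine. This is a minimal illustration, not any specific assay's procedure; the cut-off value and the multiplier defining the high-positive band are assumed values for demonstration only.

```python
# Sketch: interpreting ELISA absorbance (optical density) readings against a
# cut-off value, using the three levels described in the text. The cut-off
# and the high-positive multiplier are illustrative assumptions.

def classify_elisa(od: float, cutoff: float, high_factor: float = 3.0) -> str:
    """Map an optical-density reading to negative / low positive / high positive."""
    if od < cutoff:
        return "negative"
    if od >= high_factor * cutoff:
        return "high positive"   # unlikely to be a false positive
    return "low positive"        # should trigger a confirmatory test

readings = [0.12, 0.45, 1.80]
print([classify_elisa(od, cutoff=0.35) for od in readings])
```

In practice the cut-off is defined per assay run from control wells, which is why a result near the cut-off (the "low positive" band) is the one that warrants confirmation.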
The volume and variety of information stored in large digital databases and other sources have grown enormously in recent decades. Much of this information is historical, that is, it represents transactions or situations that have already occurred. Apart from its role as the organization's memory, historical information is used to explain the past, understand the present, and predict future behaviour. Most decisions of companies, organizations, and institutions are based, to a large extent, on information about past experiences drawn from very diverse sources, so the data must be analyzed to obtain useful information. In many situations, the traditional method of converting data into knowledge consists of manual analysis and interpretation: a specialist in the field analyzes the data and produces a report or hypothesis that reflects its trends or patterns. For example, a group of doctors can analyze the evolution of infectious-contagious diseases in the population to determine the age range of the people most frequently affected; this knowledge can then be used by the competent health authority to establish vaccination policies. This way of working is slow, expensive, and highly subjective. In fact, manual analysis is impracticable in domains where the volume of data grows exponentially (Chang et al., 2011, pp. 5512-5513). Consequently, many important decisions are made not on the basis of the available data but by following the user's own intuition, for lack of the necessary tools. Analytical tools that have been used to analyze data do exist and have their origin in statistics (Yasin, Jilani and Danish, 2011, pp. 5-6).
While these tools are able to infer patterns from the data, the problem is that they are somewhat cryptic for non-statisticians, generally do not work well with the volume of information that exists nowadays, and, in addition, do not integrate well with information systems. In many contexts, such as business, medicine, or science, what is interesting is the knowledge that can be inferred from the data and, moreover, the ability to use this knowledge. For example, it is possible to know statistically that 10% of the elderly have Alzheimer's. This can be useful, but surely it is much more useful to have a set of rules that, from the background, habits, and other characteristics of the individual, tell us whether a patient will develop Alzheimer's or not. The problems and limitations caused by the lack of suitable tools have given rise to a new generation of techniques to support the extraction of useful knowledge from the available information, grouped under the name of data mining. Data mining differs from other techniques in that it does not obtain extensional information (data) but intensional information (knowledge); moreover, this knowledge is not a parameterization of a pre-established model or one intuited by the user, but a novel and original model extracted entirely by the tool (Acuna and Rodriguez, 2004, pp. 639-647).
Infection with the hepatitis C virus (HCV) continues to be a frequent cause of chronic liver disease in individuals with chronic kidney disease (CKD) who receive long-term renal replacement therapy (RRT). In the United States, the seroprevalence of anti-HCV antibodies is approximately five times higher among patients on chronic hemodialysis than in the general population (7.8% vs. 1.6%). Although HCV infection is an established cause of glomerulonephritis (usually mediated by cryoglobulins), it rarely causes CKD that requires renal replacement therapy or kidney transplantation. In this population, HCV infection is often acquired during dialysis. The risk of infection varies according to the type of RRT, and the highest prevalence of hepatitis C occurs in patients treated with maintenance hemodialysis (Gorunescu, 2011).
Data mining can be applied to any type of information, with different mining techniques for each. Many data types exist (integers, reals, dates, text strings, etc.), and from the point of view of the most usual data mining techniques it is only of interest to distinguish between two: numerical (integer or real) and categorical or discrete (taking values in a finite set of categories). Even considering only these two types, it should be clarified that not all techniques are capable of working with both. The data are contained in what is known as a database, which may be of a different nature depending on the type of information stored. Some types of databases are the following. Relational databases are the most used today as a source for data mining techniques. A relational database is a collection of relations (tables), where each table consists of a set of attributes and can have a large number of tuples, records, or rows. Each tuple represents an object, which is described through the values of its attributes, and is generally characterized by a unique key that identifies it unambiguously among the rest. One of the main characteristics of relational databases is the existence of an associated schema, that is, the data must follow a structure and are, therefore, structured. Through a query (for example in SQL) we can combine into a single table the information from the several tables that a specific data mining task requires. Spatial databases contain information related to physical space in a broad sense (a city, a mountainous region, a cerebral atlas, etc.). These databases include geographic data, medical images, transportation networks, traffic information, etc., where spatial relationships are very relevant.
Data mining on these databases makes it possible to find patterns among the data, such as the characteristics of the houses in a mountainous area, or the planning of new public transport lines depending on the distance of the different zones from the existing lines (Li and Zhou, 2007, pp. 1088-1098).
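The step of combining several relational tables into the single input table a mining task needs, as described above, can be sketched with Python's built-in sqlite3 module. The tables, columns, and values here are illustrative assumptions, not any real schema from this study.

```python
# Sketch: joining two relational tables (e.g. patient demographics and lab
# results) into one flat table, the usual input for a mining algorithm.
# Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (pid INTEGER PRIMARY KEY, age INTEGER, sex TEXT);
    CREATE TABLE lab (pid INTEGER, hcv_elisa INTEGER);
    INSERT INTO patient VALUES (1, 45, 'M'), (2, 31, 'F');
    INSERT INTO lab VALUES (1, 1), (2, 0);
""")
# One row per patient, demographics and lab result side by side.
rows = conn.execute("""
    SELECT p.pid, p.age, p.sex, l.hcv_elisa
    FROM patient p JOIN lab l ON p.pid = l.pid
    ORDER BY p.pid
""").fetchall()
print(rows)
```

The join exploits exactly the schema property the paragraph mentions: because each tuple carries a unique key (`pid`), records from different tables describing the same object can be matched unambiguously.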
Given the high prevalence of HCV infection in the HIV population, the determination of anti-HCV antibodies should be carried out systematically in all patients infected with HIV. The detection of anti-HCV is performed with the usual techniques. In general terms, the confirmation of a positive result through immunoblot techniques (RIBA, LIA) is only necessary in patients from populations at low risk for hepatitis C.
Therefore, confirmatory tests will not be necessary in the majority of HIV-infected patients presenting a positive anti-HCV serology, since they probably belong to a group at high risk of HCV infection (intravenous drug users, hemophiliacs, etc.) and exposure to HCV can be assumed with near certainty. In patients suspected of being in the window period of HCV infection, or in cases of seronegative hepatitis with a high probability of HCV infection, direct detection of the virus is indicated, using the same techniques as in patients not coinfected with HIV, such as qualitative PCR (Kurosaki et al., 2012, pp. 607-608). The identification of genotypes and the quantification of HCV viremia do not differ, as regards techniques and indications, from those performed in non-coinfected patients, being limited fundamentally to epidemiological studies or, in individual cases, to the decision to start or monitor anti-HCV treatment. The degree of histological lesion and, specifically, the extent of fibrosis are considered predictive factors of the response to treatment. In the HIV-coinfected patient there is a greater degree of injury and fibrosis, which can condition the therapeutic response and, therefore, modify the indication for treatment. In addition, in the coinfected patient, other circumstances that would justify the liver injury may coexist (drugs, alcohol, opportunistic infections, etc.). For all these reasons, a pretreatment liver biopsy seems even more justified than in the non-coinfected patient. It can also be useful for making decisions in patients under anti-HCV treatment when the treatment is poorly tolerated. In general, it is only contraindicated when the patient has a coagulopathy, or when there is obvious evidence of terminal liver disease.
In patients with normal transaminases, although occasionally histological signs of chronic hepatopathy may exist, the prognosis is in general good enough that, in principle, a liver biopsy is not recommended (Yeh et al., 2011, pp. 447-448).

 

Background

It is estimated that 3% of the world population is chronically infected with HCV and that more than one million new cases of infection are reported every year. In the United States (USA), 1.8% of the population is positive for antibodies against HCV. Of every four seropositive people, three have viremia, meaning that active HCV infection is present in 2.7 million people. In addition, an estimated 30,000 acute infections occur annually in this country. Studies based on North American populations indicate that 40% of chronic liver disease is related to HCV, causing between 8,000 and 10,000 deaths annually; without effective intervention, deaths could triple in the next ten to twenty years. End-stage liver disease (cirrhosis and liver cancer) associated with HCV infection is the most frequent indication for liver transplantation among adults in the US and in Western Europe. In Latin America there are few studies on HCV seroprevalence in the general population. Some of the frequencies found are the following: in Venezuela in 1993, in 200 people, a seroprevalence for antibodies against HCV (anti-HCV) of 1.5%; in Brazil, in 1994 (n = 460) and 1995 (n = 800), a seroprevalence for anti-HCV of 1.4% and 1.2%, respectively; and in Mexico in 1991, in 450 healthy children, a seroprevalence for anti-HCV of 0.9% (Deng et al., 2003, pp. 959-960).
Chronic infection with the hepatitis C virus (HCV) is a worldwide health problem that affects more than 170 million people, representing a prevalence on the order of 2.5% of the world's population. More than 53,700 deaths per year can be directly attributed to HCV, although the WHO estimates that more than 308,000 deaths per year are probably due to liver cancer caused by HCV, together with a significant proportion of the 785,000 deaths due to cirrhosis. Taken together, these data suggest that HCV is responsible for approximately one million deaths per year.
HCV seroprevalence data in the world population show wide variation: Central and Eastern Asia together with North Africa and the Middle East are estimated to have a high prevalence (> 3.5%); South and Southeast Asia, Sub-Saharan Africa, the Andean zone, Central and Southern Latin America, the Caribbean, Oceania, Australasia (Australia, New Zealand, New Guinea, and the neighbouring Pacific islands), and Western, Central, and Eastern Europe have a moderate prevalence (1.5%-3.5%); while Asia Pacific, Tropical Latin America, and North America have a low prevalence (< 1.5%) (Chang et al., 2011, pp. 5512-5513). In the European Union, the prevalence of people infected with hepatitis C varies between the countries that report it; the highest prevalence, greater than 2%, occurs in the south of Europe (Italy, Romania and Spain). In the UK, according to Bruguera, in the review made in 2006, the prevalence of anti-HCV-positive people in the general population is between 1% and 2.6%, which would mean between 480,000 and 760,000 infected people. There are wide geographical differences, with the greatest impact concentrated in the most urbanized communities (between 2.5% and 2.6% in Madrid and Catalonia, respectively) and lower in the less urbanized ones (1.6% in Asturias). The distribution by age is also heterogeneous, showing a curve with two peaks, indicative of different epidemiological patterns that depend on the transmission mechanism most prevalent in each group (Acuna and Rodriguez, 2004, pp. 639-647). The peaks correspond to the age group between 30 and 45 years, whose infection is attributable to parenteral drug use, and to those over 65 years, attributable to the receipt of transfusions before 1990 or to the clinical use of non-sterilized syringes before 1975, when single-use material was introduced.
The differences by sex are most notable at ages between 25 and 45 years, where the prevalence in men is higher, perhaps because intravenous drug abuse is more frequent among them. It is estimated that the influence of immigration on the prevalence of hepatitis C in the UK is potentially high, and depends on the origin of the immigrant population. The studies carried out on small samples agree with known international figures, so that Asians (between 11% and 15%) and sub-Saharans (between 8% and 17%) record the highest figures, while those of North Africans are similar to the autochthonous population (1.9%) and those of Latin Americans lower (0.4%). In Galicia, seroprevalence data in blood donors, according to the study carried out by Eiras et al. from May 1999 to June 2001, show that 1.35‰ of donors (by serology) or 1‰ (by RT-PCR RNA detection) have antibodies against hepatitis C. The incidence detected in that period was 2.87 per 100,000 persons. Incidence data for hepatitis C present several limitations, since most acute HCV infections pass clinically unnoticed and are therefore not diagnosed, and, in the absence of indicators of recent infection, acute infections cannot be differentiated from chronic ones in a patient with positive HCV antibodies (Yasin, Jilani and Danish, 2011, pp. 5-6).
The data mining phase is the most characteristic of the knowledge extraction process, and this phase is often used to name the whole process. The objective of this phase is to produce new knowledge that the user can use. This is done by building a model based on the data collected for the purpose. The model is a description of the patterns and relationships among the data that can be used to make predictions, to better understand the data, or to explain past situations. This requires a series of decisions before starting the process:
·        Determine what type of mining task is most appropriate.
·        Choose the type of model.
·        Choose the mining algorithm that solves the task and obtains the type of model we are looking for.
The construction of the model is where the iterative character of the data mining process is best seen, since it will be necessary to explore alternative models until finding the one most useful for solving the problem.
Thus, once a model is obtained, and depending on the results it yields, another model can be built using the same technique but different parameters, or perhaps using other techniques or tools (Gorunescu, 2011).
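The iterative explore-alternative-models step described above can be sketched with scikit-learn. The candidate algorithms mirror those used later in this work (Naïve Bayes, a decision tree, a neural network), but the data, parameters, and selection criterion here are illustrative assumptions, not the study's actual experiment.

```python
# Sketch: try several candidate models on the same data and keep the one
# with the best cross-validated accuracy, then iterate with new parameters
# or techniques if the result is unsatisfactory. Toy data for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
candidates = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "neural_net": MLPClassifier(max_iter=1000, random_state=0),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A real iteration would also vary each model's parameters (tree depth, network architecture, etc.), which is exactly the back-and-forth the text calls the iterative character of the process.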

Ethical considerations
Ethical considerations form a major element of research. The researcher needs to adhere to them to promote the aims of the research: imparting authentic knowledge, truth, and the prevention of error. Furthermore, following ethics enables scholars to take a collaborative approach to their study with the assistance of their peers, mentors, and other contributors. This requires values such as accountability, trust, mutual respect, and fairness among all the parties involved in a study, which in turn depends on the protection of the intellectual property rights of all the contributors, established through the implementation of ethical considerations. Other ethical considerations in research refer to accountability towards the general public by protecting the human or animal subjects used in the study. Similarly, the appropriate use of public funds and the gaining of public support are also important.

1.2 Aims and Objectives

Neonates born to hepatitis C virus (HCV)-positive mothers are usually not screened for HCV. Unscreened children may act as active sources of community HCV transmission, and the factors contributing to vertical HCV transmission remain controversial and in need of clarification.
We aimed to investigate the factors contributing to vertical HCV transmission in the setting with the highest HCV prevalence worldwide.

1.3 Research Questions

The objective of this paper is to predict potentially new HCV-human protein interactions using data mining techniques.






Methodology
This article presents an analysis of various data mining techniques that may be useful for medical analysts or physicians to accurately diagnose HCV infection. The main methodology of our work was the study of publications, journals, and reviews in the field of computer science and engineering, with the research focused on newer publications. Data source: a total of 463 records with 16 health attributes (factors) were obtained from the HCW database of Egyptian healthcare workers (HCW) at the National Hepatitis B Center in Egypt, where the worldwide prevalence of HCV is highest. The records were divided into two data files: a training data file (602 records) and a test data file (257 records). To avoid distortion, the records for each set were selected randomly. The target classification model was chosen bearing in mind that the Naïve Bayes algorithm supports only categorical (discrete) attributes, while both decision tree algorithms and neural networks support categorical and continuous attributes; for consistency, only categorical attributes were used for all three models. All health attributes were therefore transformed from numeric to categorical data. The "HCV_PCR" attribute was identified as the predictable attribute, taking the value "1" in HCV-infected patients and "0" in uninfected patients. The "PID" attribute was used as a key; the rest are input attributes. Missing, inconsistent, and duplicate data are assumed to have been resolved.
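The transformation of numeric health attributes into categorical ones, as described above, can be sketched with pandas. The attribute names, bin edges, and labels below are illustrative assumptions, not the study's actual coding scheme.

```python
# Sketch: converting numeric health attributes to categorical bins so that a
# single categorical-only representation can feed all three models.
# Bin edges and labels are illustrative, not the study's real scheme.
import pandas as pd

df = pd.DataFrame({"AGE": [23, 37, 54, 61], "ALT": [18.0, 47.1, 64.6, 88.0]})
df["AGE_cat"] = pd.cut(df["AGE"], bins=[0, 30, 45, 60, 120],
                       labels=["<=30", "31-45", "46-60", ">60"])
df["ALT_cat"] = pd.cut(df["ALT"], bins=[0, 40, 80, 1000],
                       labels=["normal", "elevated", "high"])
print(df[["AGE_cat", "ALT_cat"]])
```

`pd.cut` uses right-closed intervals by default, so a value exactly on a bin edge falls into the lower bin; the edges must be chosen with that in mind.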

1. DATA PREPROCESSING

1.1 Dataset

This study performs experiments on a hepatitis data set. The data set contains 155 instances distributed between two classes, DIE with 32 instances and LIVE with 123 instances. There are 20 attributes, including the class attribute, and a number of missing values. The main objective with this data set is to predict the presence or absence of the hepatitis virus. The data file was retrieved from the UCI Machine Learning Repository.
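Loading such a file can be sketched as follows, assuming the UCI distribution format in which fields are comma-separated and missing values are coded as "?". The two sample rows below are invented for illustration; in practice one would read the repository's `hepatitis.data` file instead of the inline string.

```python
# Sketch: reading the hepatitis data in UCI format, with "?" as the
# missing-value marker. The two rows here are invented sample data.
import io

import pandas as pd

columns = ["Class", "AGE", "SEX", "STEROID", "ANTIVIRALS", "FATIGUE",
           "MALAISE", "ANOREXIA", "LIVER_BIG", "LIVER_FIRM",
           "SPLEEN_PALPABLE", "SPIDERS", "ASCITES", "VARICES",
           "BILIRUBIN", "ALK_PHOSPHATE", "SGOT", "ALBUMIN",
           "PROTIME", "HISTOLOGY"]
sample = ("2,30,2,1,2,2,2,2,1,2,2,2,2,2,1.00,85,18,4.0,?,1\n"
          "1,50,1,1,2,1,2,2,1,2,1,2,2,2,1.50,135,42,3.5,?,1\n")
df = pd.read_csv(io.StringIO(sample), names=columns, na_values="?")
print(df.shape)
print(df["Class"].value_counts())
```

With the real file, `df.shape` would be `(155, 20)`, matching the instance and attribute counts given above.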

1.2 Mining Models

The trained models were evaluated for accuracy and efficiency on the test data, and validated using the lift chart and the classification matrix described in the following section. Verification of model efficiency: the efficacy of the models was tested by two methods, the lift chart and the classification matrix. The goal was to find out which model gave the highest percentage of correct predictions when diagnosing HCV-infected patients. To determine whether there was enough information to learn patterns in response to the predictable attribute, the columns in the trained model were mapped to the columns in the test data file. In addition, the predictable column "HCV_PCR" and the state identifying HCV-infected patients (predicted value = 1) were selected (Kurosaki et al., 2011, pp. 401-409). The X axis of the lift chart shows the percentage of the test data set used to compare the predictions, while the Y axis shows the percentage of values predicted for the specified state. The blue and red lines show the results for a random estimate and for an ideal model.
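The relationship between the two axes of the lift chart can be sketched numerically: rank the test cases by the model's predicted score for the target state and accumulate the share of true positives captured. The scores and labels below are illustrative, not the study's data.

```python
# Sketch: computing the points of a lift (cumulative gains) curve.
# X axis: fraction of the test set processed (best-scored cases first).
# Y axis: fraction of the target population (positives) captured so far.
import numpy as np

y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])          # 1 = target state
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

order = np.argsort(-scores)                      # best-scored cases first
captured = np.cumsum(y_true[order]) / y_true.sum()
population = np.arange(1, len(y_true) + 1) / len(y_true)
for frac, cap in zip(population, captured):
    print(f"{frac:.0%} of test set -> {cap:.0%} of positives captured")
```

A random model captures positives in proportion to the fraction processed (the 45-degree line in the text), while an ideal model captures 100% after processing only the fraction equal to the positive rate; any useful model falls between the two.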
1.3 Extraction Hepatitis dataset
The hepatitis data set contains data on the screening of patients with hepatitis. First, the data set was pre-processed to make the mining process more efficient. In our paper, we used the Neural Connection and Weka tools to compare the performance accuracy of data mining algorithms for the diagnosis of a hepatitis disease data file. The pre-processed data set was then cleaned of missing values to improve classification performance. The selection in the tool describes the status of the data attributes present in the hepatitis file.
Attribute information:
     1. Class: DIE, LIVE
     2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
     3. SEX: male, female
     4. STEROID: no, yes
     5. ANTIVIRALS: no, yes
     6. FATIGUE: no, yes
     7. MALAISE: no, yes
     8. ANOREXIA: no, yes
     9. LIVER BIG: no, yes
    10. LIVER FIRM: no, yes
    11. SPLEEN PALPABLE: no, yes
    12. SPIDERS: no, yes
    13. ASCITES: no, yes
    14. VARICES: no, yes
    15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
        -- see the note below
    16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
    17. SGOT: 13, 100, 200, 300, 400, 500
    18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
    19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
    20. HISTOLOGY: no, yes

    The BILIRUBIN attribute appears to be continuously valued. I checked this with the donor, Bojan Cestnik, who replied:

      About the hepatitis database and BILIRUBIN problem I would like to say the following: BILIRUBIN is a continuous attribute (the number of its "values" in the ASDOHEPA.DAT file is negative!); the "values" are quoted because when speaking about a continuous attribute there is no such thing as all possible values. However, they represent so-called "boundary" values; according to these "boundary" values the attribute can be discretized. At the same time, because the continuous information is preserved, one can perform other tests on the continuous attribute. I hope that these lines have at least roughly answered your question.
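Discretizing a continuous attribute at such "boundary" values can be sketched as follows, using the BILIRUBIN boundaries listed above; each reading is mapped to the index of the interval it falls into.

```python
# Sketch: discretizing the continuous BILIRUBIN attribute at the "boundary"
# values from the dataset documentation (0.39, 0.80, 1.20, 2.00, 3.00, 4.00).
import bisect

boundaries = [0.39, 0.80, 1.20, 2.00, 3.00, 4.00]

def discretize(value: float) -> int:
    """Return the index of the interval the value falls into (0..len(boundaries))."""
    return bisect.bisect_left(boundaries, value)

# e.g. 0.30 -> interval 0 (below 0.39), 0.90 -> interval 2 (0.80-1.20), ...
print([discretize(v) for v in (0.30, 0.90, 2.50, 5.00)])
```

Because the continuous readings are kept alongside the interval indices, one can still run tests that need the raw values, exactly as the donor's note suggests.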

Missing Attribute Values (indicated by "?"):
     Attribute Number:    Number of Missing Values:
                    1:    0
                    2:    0
                    3:    0
                    4:    1
                    5:    0
                    6:    1
                    7:    1
                    8:    1
                    9:    10
                   10:    11
                   11:    5
                   12:    5
                   13:    5
                   14:    5
                   15:    6
                   16:    29
                   17:    4
                   18:    16
                   19:    67
                   20:    0
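One simple way to handle the "?" missing values tallied above (an assumption on our part; the study states only that missing data were resolved) is mode imputation for categorical attributes and median imputation for numeric ones:

```python
# Sketch: imputing missing values before classification. Median for numeric
# attributes, mode for categorical ones. The column values are illustrative;
# in the full data set PROTIME (attribute 19) has 67 missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PROTIME": [62.0, np.nan, 54.0, np.nan, 70.0],
    "STEROID": ["yes", "no", None, "yes", "yes"],
})
df["PROTIME"] = df["PROTIME"].fillna(df["PROTIME"].median())
df["STEROID"] = df["STEROID"].fillna(df["STEROID"].mode()[0])
print(df)
```

Dropping incomplete rows is the other common option, but with 67 of 155 PROTIME values missing it would discard too much of this particular data set.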
The top red line shows the ideal model: it captured 100% of the target population of HCV-infected patients using 8.53% of the test data set. The bottom blue line shows the random line, which always runs at 45 degrees in the graph; it indicates that if we randomly guessed the result for each case, 50% of the target population would be captured using 50% of the test data set (Narayan et al., 2002, pp. 5-13). All three model lines (violet, yellow, and green) fall between the random-estimate and ideal lines, indicating that all three had enough information to learn patterns in response to the predictable state, except that over a certain range the purple line hovered around the random line, meaning that there the training data did not have enough information to learn the target patterns. Lift chart without a predictable value: the steps for creating this lift chart are similar to those above, except that the predictable column state is left empty and no random-estimate series is included. It shows how well each model performed at predicting the correct value of the predictable attribute. The X axis shows the percentage of the test data set used to compare predictions, while the Y axis shows the percentage of predictions that are correct. The blue, purple, green, and red lines show the ideal model, Neural Network, Naïve Bayes, and Decision Trees, respectively (Elrazek et al., 2017, pp. 529-533). The chart shows the performance of the models in all possible states. The ideal line (blue) is at an angle of 45 degrees, indicating that 50% of the test data set is predicted correctly when 50% of the test data set is processed.
When the entire population is processed, the decision tree model is better than the other two, because it has the highest number of correct predictions (93%), followed by Neural Network (88%) and Naïve Bayes (85%). If less than 50% of the population is processed, the lift lines for the decision tree and the neural network are always higher than the one for Naïve Bayes, showing that neural networks and decision trees are better at producing a high percentage of correct predictions. Along part of the X axis, the lift lines for Neural Network and Naïve Bayes overlap, showing that there both models are equally good at correct prediction (Khairy et al., 2013, p. 13). When more than 50% of the population is processed, decision trees and neural networks again appear better, because they provide a higher percentage of correct predictions than Naïve Bayes, whose lift line lies below those of the neural network and decision trees. Over certain population ranges, decision trees appear better than the neural network and the neural network better than Naïve Bayes, and vice versa. Classification matrix: the classification matrix displays the frequency of correct and incorrect predictions, comparing the actual values in the test file with those predicted by the trained model. In this example, the test set included 19 HCV-infected patients and 238 uninfected patients. Table 1 shows the results of the classification matrix for all three models. It represents the number of cases each algorithm predicted correctly and its performance, where "1 actual" stands for HCV-infected patients and "0 actual" for uninfected patients (Zayed et al., 2013, pp. 254-261). Decision trees appear to be the most effective, with the highest percentage of correct predictions (84.21%) in HCV-infected patients, followed by neural networks (57.89%) and Naïve Bayes (52.63%).
Decision trees also appear to be the most effective at predicting uninfected patients (93.68%) compared with the other two models.
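A classification matrix like the one summarized in Table 1, and the per-class percentages quoted above, can be computed as follows. The labels here are a small illustrative sample, not the study's 257-record test set.

```python
# Sketch: building a classification (confusion) matrix and reading off the
# per-class percentages of correct predictions. Labels are illustrative.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]      # 1 = HCV infected, 0 = uninfected
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]                          # row "1 actual"
fp, tn = cm[1]                          # row "0 actual"
print(f"infected correctly predicted:   {tp / (tp + fn):.1%}")
print(f"uninfected correctly predicted: {tn / (tn + fp):.1%}")
```

Reporting the two classes separately matters here because the test set is heavily imbalanced (19 infected vs. 238 uninfected), so overall accuracy alone would hide a model that misses most infected patients.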

1.4 Evaluation of Mining Goals

Four exploration goals were defined, based on the HCV data set survey and the objectives of this research, and were evaluated on the basis of the trained models. The results show that all three models achieved the goals, indicating that they can be used to provide decision support to physicians for diagnosing patients and discovering medical factors associated with HCV infection. The four goals are listed and discussed below. Goal 1: Identify significant implications and links in the health attributes related to the predictable state, HCV infection. The dependency viewers of the decision tree and Naïve Bayes models show the strongest and weakest medical predictors; the viewer is particularly useful when there are many input attributes. Both models show that HCV_ELISA is the most important factor in HCV infection. Other important factors include AST and Schisto (El-Serag et al., 2014, pp. 1249-1255). The decision tree model identifies "gender" as the weakest factor, while for the Naïve Bayes model the weakest factor is residence. Decision trees seem better than Naïve Bayes here, because they give meaning to multiple input attributes. Physicians can use this information to further analyze the strengths and weaknesses of the health attributes associated with HCV infection.
Goal 2: Identify the impact of, and relationships between, health attributes in relation to the predictable state, HCV infection. This impact and these relationships can only be seen in the decision tree viewer. The highest probability (86.67%) of finding HCV-infected patients occurs in the relationship between the attributes (nodes) "HCV ELISA = 1 and AST ne = 42 and Job = 11 and Schisto = 1 and Neadlestick = 0". With this information, physicians can perform a medical examination of these five attributes, instead of all attributes, on potential patients who are likely to be diagnosed with HCV infection, thereby reducing health costs, administrative costs, and diagnostic time. A lower probability (60%) is found in the attribute "HCV ELISA ne = 1". The relationships between attributes in HCV-infected patients are also given (Kurosaki et al., 2011, pp. 401-409). The results show that the attribute "HCV ELISA ne = 1" has the highest impact (99.40%). The lowest score (13.33%) is found in the attributes "HCV ELISA = 1 and AST ne = 42 and Job = 11 and Schisto = 1 and Neadlestick = 0". Additional information, such as patient identification and health profiles based on selected nodes, can also be obtained through the drill-through function. Physicians can use the decision tree viewer to perform further analysis.
Goal 3 concerns the characteristics of patients in each predictable state. HCV-infected patients show, with high probability, antibodies to hepatitis C in the blood (HCV_ELISA = 1), and 97.02% are not infected with HBV (HBsAg_ELISA = 0) (Narayan et al., 2002, pp. 5-13). Other important characteristics include a high probability of rural residence (residence = 2), male gender (gender = 1), no schistosomiasis infection (Schisto = 0), etc. The characteristics of uninfected patients with high probability include HBsAg_ELISA = 0, meaning they are not infected with HBV, and no hepatitis C antibodies in the blood (HCV_ELISA = 0). These results can be analyzed further.
Goal 4: Identify attribute values that differentiate the nodes favoring and disfavoring the predictable states: (1) HCV-infected patients and (2) uninfected patients. This question can be answered by analyzing the attribute results of the Naïve Bayes and Neural Network models. The viewer provides information on the impact of all attribute values related to the predictable state. The Naïve Bayes model shows that the most important attribute value favoring HCV-infected patients is "HCV_ELISA = 1". Other attribute values that favor HCV infection include "AST = 31.4", "AST = 22.7", etc. Attribute values such as "HCV_ELISA = 0", "AST = 31", "AST = 42", etc. favor the uninfected state. The Neural Network model shows that the most important attribute value favoring HCV-infected patients is "AST = 34.7" (Elrazek et al., 2017, pp. 529-533). Other attribute values that favor HCV infection include ALT = 64.6, AST = 27.94, Age = 54, etc. Attribute values such as Age = 28, ALT = 47.14, etc. favor the predictable state of uninfected patients.

1.5 Comparison of the Performance of Three Data Mining Techniques

Finally, using three data sets of different sizes, we want to determine whether the size of the database affects the accuracy of the classification techniques used to predict HCV infection. The three classification techniques, namely Naïve Bayes, decision trees, and neural networks, were applied to three HCV databases of different sizes, and the accuracy and efficiency of each technique were then determined using the lift chart (Khairy et al., 2013, p. 13).
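A lift chart compares the infection rate among the patients a model scores highest against the baseline rate in the whole population. As a minimal sketch on invented scores (not the study's results), the lift of the top-scored slice can be computed like this:

```python
# Hypothetical scored patients: (model score, truly infected?). NOT study data.
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0), (0.05, 0)]

baseline = sum(y for _, y in scored) / len(scored)  # overall infection rate
top = sorted(scored, reverse=True)[:3]              # top 30% by model score
hit_rate = sum(y for _, y in top) / len(top)        # infection rate in that slice
lift = hit_rate / baseline                          # > 1 means the model helps
print(round(lift, 3))
```

A model whose top-scored slice has a lift well above 1 concentrates infected patients there, which is what the comparison between the three techniques measures.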
Decision trees are the most effective technique for detecting HCV infection across data sets of different sizes, as they have the highest percentage of correct predictions. The accuracy of Naïve Bayes increases as the data file grows, and the size of the data set affects the accuracy of the classification techniques used to detect HCV infection. The C4.5 algorithm was introduced by Quinlan to induce classification models, also referred to as decision trees, from observed data. In the supervised data file, each record has the same data structure, consisting of a number of attribute/value pairs (Zayed et al., 2013, pp. 254-261). One of these attributes represents the category of the record. The problem is to build a decision tree that, on the basis of the non-categorical attributes, correctly predicts the value of the category attribute. The category attribute takes values such as {true, false}, {predicted, not predicted}, {success, failure}, or the like; in any case, each record takes exactly one of these values. If there are n equally probable possible messages, the probability p of each is 1/n, and the information conveyed by a message is -log(p) = log(n). In general, given a probability distribution P = (p1, p2, .., pn), the information conveyed by this distribution, also called the entropy of P, is: I(P) = -(p1*log(p1) + p2*log(p2) + .. + pn*log(pn)).
If a set T of records is partitioned into disjoint, exhaustive classes C1, C2, .., Ck on the basis of the value of the categorical attribute, then the information needed to identify the class of an element of T is Info(T) = I(P), where P is the probability distribution of the partition (C1, C2, .., Ck): P = (|C1|/|T|, |C2|/|T|, .., |Ck|/|T|). If T is first partitioned on the basis of the value of a non-categorical attribute X into sets T1, T2, .., Tn, then the information needed to identify the class of an element of T becomes the weighted average of the information needed to identify the class of an element of Ti, i.e., the weighted average of Info(Ti): Info(X,T) = Sum for i from 1 to n of |Ti|/|T| * Info(Ti). The quantity Gain(X,T) is defined as Gain(X,T) = Info(T) - Info(X,T). This is the difference between the information needed to identify an element of T and the information needed to identify an element of T after the value of attribute X has been obtained, i.e., the gain in information due to attribute X (El-Serag et al., 2014, pp. 1249-1255). We can use this notion of gain to rank attributes and to build decision trees in which each node holds the attribute with the greatest gain among the attributes not yet considered on the path from the root. The intent of this arrangement is twofold: (i) to create small decision trees so that records can be classified after only a few questions, and (ii) to approximate the minimal decision process for the records considered. Therefore, the C4.5 algorithm can be used to identify the most important attribute, i.e., the key risk factor associated with HCV infection among IDUs in India. As mentioned above, the observed drug-user data set contains four hundred and forty records with values for nine different non-categorical attributes.
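The entropy and gain formulas above translate directly into code. The following is a self-contained sketch on an invented four-record table (the attribute names echo the text but the values are made up): entropy(labels) is I(P) over the class distribution, and info_gain implements Gain(X,T) = Info(T) - Info(X,T).

```python
from math import log2
from collections import Counter

def entropy(labels):
    """I(P) = -sum(p_i * log2(p_i)) over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, label_index=-1):
    """Gain(X, T) = Info(T) - Info(X, T), where Info(X, T) is the weighted
    average entropy of the partitions of `rows` induced by attribute X."""
    labels = [r[label_index] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[label_index])
    info_x_t = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - info_x_t

# Hypothetical rows: (HCV_ELISA, Schisto, infected)
rows = [(1, 1, "yes"), (1, 0, "yes"), (0, 0, "no"), (0, 1, "no")]
print(info_gain(rows, 0))  # HCV_ELISA splits the classes perfectly -> gain 1.0
print(info_gain(rows, 1))  # Schisto carries no class information -> gain 0.0
```

C4.5 would therefore place HCV_ELISA at the root of the tree, since it has the greatest gain among the attributes not yet considered.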

2. Research Process

The process consisted of several stages. Stage 1, data set division: the data set was divided into two parts, training (80%) and testing (20%), to guarantee the accuracy of the experimental results and improve their credibility (Kurosaki et al., 2011, pp. 401-409).
Stage 2, pre-processing: before evaluation, the data in the training database must be pre-processed. Pre-processing includes data cleaning, i.e., ensuring that the data contains no missing values, noise (errors, outliers), or inconsistencies (e.g., inconsistent units). Several approaches are available for this purpose (Narayan et al., 2002, pp. 5-13). In this study, missing values were replaced by the median value, as this method is commonly used by many researchers. After pre-processing, a complete data set was obtained and used for the experiments. Stage 3, classification: at this stage, classification is used to assign data to predefined categories. The "class" in classification is the attribute or property of the data set in which users are most interested; it corresponds to the dependent variable in statistics. A classification algorithm builds a classification model consisting of classification rules. In our study, classification can be used to determine the diagnosis and prognosis of hepatitis based on symptoms and medical conditions. This stage consists of two steps: training and testing. The first step, training, builds the classification model by analyzing training data containing class labels (Elrazek et al., 2017, pp. 529-533). The second step, testing, evaluates the classifier on test data, measuring the accuracy with which it assigns class labels or its ability to classify unknown objects. In this paper, we mainly deal with Naïve Bayes, Naïve Bayes Updatable, FT tree, KStar, J48, LMT, and neural networks. Stage 4, comparison of accuracy and statistical results: at this stage, we discussed and compared the percentage accuracy and statistical results among the algorithms (Khairy et al., 2013, p. 13).
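Stages 1 and 2 above (the 80/20 split and median imputation of missing values) can be sketched with the standard library alone. The AST readings below are invented placeholders, not the study's data:

```python
import random
from statistics import median

# Hypothetical AST readings with missing values (None); NOT the study's data.
ast = [31.4, None, 22.7, 42.0, None, 34.7, 27.94]

# Stage 2: replace each missing value with the median of the observed values.
med = median(v for v in ast if v is not None)
ast_clean = [v if v is not None else med for v in ast]

# Stage 1: shuffle, then split 80% training / 20% testing.
random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(ast_clean)
cut = int(0.8 * len(ast_clean))
train, test = ast_clean[:cut], ast_clean[cut:]
print(len(train), len(test))  # 5 training and 2 test records out of 7
```

Median imputation is preferred over mean imputation here because laboratory values such as AST are typically skewed, and the median is robust to the extreme readings.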

2.1 Other Method

The second technique used in this experiment is rough set theory, using Weka to analyze the hepatitis data set. The data was divided into two parts, training and testing. The data was then discretized to group continuous attribute values. Only a few discretization methods are available, namely Boolean reasoning, entropy, Naive, and Semi-naive. After the discretization process, rules are generated using reducts. Reducts are techniques that exclude unused attributes and create a minimal subset of attributes for the decision table. It has been shown that the rough set technique is the best technique for analyzing hepatitis data, as it gave the highest percentage of accuracy (El-Serag et al., 2014, pp. 1249-1255). The best classification algorithm used with the rough set is Naïve Bayes, which is based on Bayes' rule.
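To make the discretization step concrete, here is a deliberately simple equal-width binning sketch on invented ALT values; the rough-set methods named in the text (Boolean reasoning, entropy-based, Naive, Semi-naive) choose cut points more cleverly, but the goal is the same: map each continuous value to a small set of discrete bins.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index 0..n_bins-1 by cutting the
    observed range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    out = []
    for v in values:
        b = int((v - lo) / width)
        out.append(min(b, n_bins - 1))  # clamp the maximum onto the last bin
    return out

alt = [47.14, 64.6, 31.0, 42.0, 22.7]  # hypothetical ALT readings
print(equal_width_bins(alt, 3))        # each reading becomes a bin label 0-2
```

Once every attribute is discrete, a reduct computation only has to reason about a small number of attribute/bin combinations when searching for the minimal subset of attributes.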
Ranking the algorithms by performance metrics under 10-fold cross-validation and by the number of features chosen by the feature selection methods, the highest accuracy values were found with Naïve Bayes and the decision table using the Gain Attribute Eval, the One-R Attribute Eval, and the Relief Attribute Eval. For example, the accuracy of Naïve Bayes and the decision table with these feature selection methods is 0.853, the highest value in Table II. In addition, the accuracy of Naïve Bayes with the Consistency Subset Eval is also 0.853. Likewise, Naïve Bayes with the One-R Attribute Eval and the Relief Attribute Eval achieves the highest values.
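The 10-fold cross-validation used for these accuracy figures partitions the records into ten folds, using each fold once as the test set while the other nine train the model. A minimal index-level sketch (fold assignment by simple striding; Weka additionally stratifies by class):

```python
def k_fold_indices(n, k=10):
    """Split record indices 0..n-1 into k folds; each fold serves once as
    the test set while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, test_fold in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, test_fold))
    return splits

splits = k_fold_indices(20, k=10)
print(len(splits))                           # 10 train/test splits
print(len(splits[0][0]), len(splits[0][1]))  # 18 training, 2 test indices
```

The reported accuracy (e.g., 0.853) is then the average of the ten per-fold accuracies, which reduces the variance that a single 80/20 split would exhibit.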




