1. IntroductionDiscovering regularities in multivariate data sets or databases is one of the main goals of exploratory data analysis and pattern recognition methods [1], [2]. Discovering trends in temporal databases is a particularly interesting problem with many important applications. An adequate standardization is usually needed before applying data analysis tools. In a standardized clinical data representation, patients are represented in the form of feature vectors with the same number of numerical components (features) or as points in a multidimensional feature space. A particular pattern in clinical data is represented as a distinct set of feature vectors or as a special cloud of points in a feature space. The regression analysis plays a prominent role in data exploration [3]. The regression model may describe a dependence of one feature on a selected set of other features. The ranked regression method can also serve a similar purpose [4], [5], [6]. The ranked models are particularly useful when values of the dependent feature cannot be measured precisely or directly and additional information about feature vectors is available only in the form of ranked relations within selected pairs of these vectors. Such ranked relations can be treated as a priori knowledge about linear sequential patterns hidden in data. In this context, inducing the linear ranked model from the ranked pairs can be treated as a pattern recognition problem. The induced ranked model can also be used for prognosis or decision support purposes. The method of inducing linear ranked models from a set of feature vectors and ranked relations within selected pairs of these vectors was proposed in the previous papers [4], [5]. This method is based on the minimization of convex and piecewise-linear (CPL) criterion functions. Properties of this approach in the context of modeling a causal sequence of liver diseases are analysed in the presented paper. Feature vectors from hepathological database of the system Hepar and additional medical knowledge in the form of a causal sequence of liver diseases were used in designing ranked linear transformation [7]. 2. Feature vectors and oriented dipolesWe are taking into consideration a data set C built from m feature vectors xj=[xj1,...xjn]T which were numbered in a fixed manner | (1) |
The vectors xj belong to the n-dimensional feature space F[n] (xj F[n]). The component (feature) xji of the vector xj is a numerical result of the i-th examination (i=1,...,n) of a given patient or event Oj (j=1,...,m). The feature vectors xj are of a mixed type if they represent different types of diagnostic measurements (xi {0,1}) or (xi R)). Let the symbol " " means the relation "follows" which is fulfilled within ranked pairs {xj, Xj'} (j<j') of the feature vectors xj and xj' with the indices (j, j') from some set : | (2) |
The relation xj xj' between the feature vectors xj and xj' means that the vector xj' follows the vector xj in some sequence. This relation should be determined on the basis of some additional information about set of pairs (not necessary all) of the vectors xj. For example, medical doctors can compare two patients with the same disease and decide that one of them is in a more serious condition. As another example, it can be stated that one of two students is more talented than another one. In the paper we analyse the problem of designing such transformations of the feature vectors xj on the (ranked) line y=wTx which preserve the relation " " (2) as precisely as possible  | (3) |
where w=[w1,...,wn]T is the weight vector. The family of the relations (2) defines the sequential pattern S(x) of the vectors xj in the feature space F[n] (xj F[n]). Definition 1: The sequential pattern S(x) is linear in the feature space F[n] if and only if there exists such n-dimensional weight vector w (w Rn) that the below implication holds: | (4) |
where J is a set of indices (j, j') of ranked pairs of vectors {xj, xj'} (j<j'). The procedure of the linear patterns discovering and the ranked line designing can be based on the concept of positively and negatively oriented dipoles {xj, xj'} (j<j') [4], [5]. Definition 2: The ranked pair {xj, xj'} (j<j') of the feature vectors xj and xj' constitutes the positively oriented dipole {xj, xj'} ((j,j' J+), if and only if xj xj'. | (5) |
Definition 3: The ranked pair {xj, xj'} (j<j') of the feature vectors xj and xj' constitutes the negatively oriented dipole {xj, xj'} ((j,j' J-), if and only if xj' xj. | (6) |
Definition 4: The line y(w)=wTx (3) is completely ranked in accordance with the dipoles {xj, xj'} (j<j') orientations if and only if  | (7) |
where J+ and J- are sets of indices (j, j') of the positively and negatively oriented dipoles {xj, xj'} (j<j'), and J+ J-=J, J+ J-=Ø. 3. Designing ranked models through minimization of a CPL criterion functionLet us introduce the positive set R+ and the negative set R- of the differential vectors rjj'=(xj'-xj) on the basis of the sets of indices J+ (6) and J- (7).  | (8) |
We examine the possibility of separating the sets R+ and R- by the hyperplane H(w) passing through the origin 0 of the feature space. | (9) |
Definition 5: The sets R+ and R- (8) are separable by some hyperplane H(w) (9) if and only if the below inequalities hold  | (10) |
If all the above inequalities are fulfilled for some vector w, then the hyperplane H(w) (9) separates the sets R+ and R- (8). Lemma 1: The linear transformation y(w)=wTx (3) is completely ranked (7) in accordance with dipoles' {xj, xj'} orientations if and only if the hyperplane H(w) (9) separates (10) the sets R+ and R- (8). Proof: If the hyperplane H(w) (9) separates the sets R+ and R- (8), then the ranked relations (7) hold on the line y(w)=wTx (3). On the other hand, the fulfilling of the ranked relations (7) guarantees that the inequalities (10) hold. Designing a separating hyperplane H(w) (9) could be carried out through the minimization of the convex and piecewise linear (CPL) criterion function similar to the perceptron criterion function Φ(w) [2]. For this purpose let us introduce the positive φjj'+(w) and the negative φjj'-(w) penalty functions:  | (11) |
| (12) |
The criterion function Φ(w) is the weighted sum of the penalty functions φjj'+(w)and φjj'-(w):  | (13) |
where γjj' (γjj'>0) is a positive parameter (price) related to the dipole {xj, xj'} (j<j'). Φ(w) (13) is the convex and piecewise linear (CPL) function as the sum of the penalty functions φjj'+(w) and φjj'-(w) of the same kind. The basis exchange algorithms, similar to the linear programming, allow to find the minimum of such function efficiently, even in the case of large, multidimensional data sets R+ and R- (8) [7]:  | (14)
|
The optimal parameter vector w* and the minimal value Φ* of the criterion function Φ(w) (13) can be applied to a variety of data ranking problems. In particular, the vector w* defining the best ranked line y(w)=wTx (3) can be found this way. The minimal value Φ* (14) of the criterion function Φ(w) (13) can be used to measure the linearity of the sequential patterns S(x) (Def. 1) in a given feature space F[n]. Lemma 2: The minimal value Φ* (14) of the criterion function Φ(w) (13) is equal to zero if and only if the sequential pattern S(x) (Def. 1) is linear. Proof: If there exists such a vector w* that the ranking of the points yj(w*) on the line (3) is fully consistent (7) with the relations " ", then the sets R+ and R- (8) can be separated (10) by the hyperplane H(w*) (9). In this case, the minimal value Φ* of the perceptron criterion function Φ(w) (13) is equal to zero, as it results from the pattern recognition theory [1]. On the other hand, if the minimal value Φ* (14) of the criterion function Φ(w) (13) is equal to zero in the point w*, then the values φjj'+(w) and φjj'-(w) of all the penalty functions (11) and (12) have to be equal to zero. This means that the sets R+ and R- (8) can be separated (10) by the hyperplane H(w*) (9). As a result, the ranking of the points yj(w*) on the line (3) is fully consistent (7) with the relations (4) and (5). 4. Causal sequence of learning setsLet us assume that a clinical database contains descriptions of m patients Oj(k) (j=1,...m) labeled in accordance with their clinical diagnosis ωk(k=1,...K). Each patient Oj(k) is represented by n-dimensional feature vector xj(k). The feature vector xj(k) represents the j-th patient Oj(k) linked to the k-th disease ωk. The learning set Ck contains mk labeled feature vectors xj(k) that are linked to the k-th disease (class) ωk. | (15) |
where Ik is the set of indices j of mk feature vectors xj(k) labeled to the class ωk. We assume that the learning set Ck have been formed in a learning causal sequence:  | (16) |
where symbol "Ck-1→Ck" means that "disease ωk appears after ωk-1" or "disease ωk-1 is a cause of ωk". The consistent indexing of the sets Ck and the diseases ωk has been used in the sequence (16). This means that: | (17) |
The causal relation "Ck→Ck'" (17) between learning sets Ck and Ck' can be used for determining the causal ranked relation " " (2) between feature vectors xj(k) (xj(k) Ck) and xj' (k') (xj'(k) Ck) (15):  | (18) |
or | (19) |
Let us remark that there is no ranked relation " " (2) between feature vectors xj(k) and xj'(k) from the same set . We can assume that the indices j of the feature vectors xj(k) are consistent with the learning sets Ck (15). This means that the set C1 contains m1 first feature vectors xj(k), the set C2 contains m2 next vectors xj(k), and so on. As a consequence, the following relation of consistent indexing holds: | (20) |
Lemma 3: In the case of consistent indexing (20), the linear transformation y(w)=wTx (3) is completely ranked (7) if and only if the set R+ (8) of the differential vectors rjj'=(xj'-Xj) is situated on the positive side of some hyperplane H(w) (9):  | (21) |
Proof: The relations (19) and (20) guarantee that all dipoles {xj, xj'} are positively oriented (Def. 2) and the negative set R-(8) is empty (R-=ø). As a result, the vector w defines such hyperplane H(w) which linearly separates (10) the sets R+ and R-(8). This means that the assumptions of the Lemma 1 are fulfilled. Let us assume that the set R+ (8) contains all positively oriented dipoles {xj(1), xj'(2)} (j<j') (5), that can be generated from two learning sets C1 and C2 (15) in accordance with the relation C1→C2 (16) and consistent indexing (20), | (22) |
The set R+ (22) is complete if it contains all positively oriented dipoles {xj(1), xj'(2)} (5) that can be created from two learning sets C1 and C2 (15) with the relation C1→C2 (16). Definition 6: Two learning sets C1 and C2 (15) are linearly separable if and only if the below inequalities hold  | (23) |
where θ (θ R1) is a threshold. The above parameters (w,θ) define the hyperplane H(w,θ) in the feature space, where:  | (24) | It is possible to relate the linear separability of two learning sets C1 and C2 (15) with the complete ranking of the linear transformation y(w)=wTx (3). Theorem 1: The linear transformation y(w)=wTx (3) is completely ranked (21) in accordance with the complete set R+ (22) if and only if there exists such threshold θ' that the hyperplane H(w,θ') (24) separates two learning sets C1 and C2 (15). Proof: If the line y(w)=wTx (3) is fully ranked (7), then  | (25) |
Let us define the positive θ+(w) and the negative θ-(w) thresholds on the line y(w)=wTx:  | (26) |
 | (27) |
The below inequality results from the relation (25) | (28) |
The threshold θ' can be defined as follows  | (29) | It can be directly verified that the hyperplane H(w,θ') (24) separates the learning sets C1 and C2 (15). On the other hand, the hyperplane H(w,θ') (24) that separates sets C1 and C2 (15), determines a linear transformation y(w)=wTx (3) which is completely ranked (21). As it results from the definition of the linear separabilty (23), each element rjj'= xj'(2) - xj(1) of the set R+(22) fulfills the relation (21). 5. Causal sequence of liver diseasesThe database of the system Hepar contains descriptions of patients with variety of chronic liver diseases ωk (k=1,...K) [7]. The feature vectors x in the database of Hepar are the mixed, qualitative-quantitative type. They contain both symptoms and signs (xi {0,1}) as well as the numerical results of laboratory tests (xi R). About 200 different features xi describe one patient case in this system. For the purpose of these computations, each patient has been described by the feature vector xj(k) composed of 62 features xi chosen as a standard by medical doctors. The following K=7 groups of patients Ck (15) have been extracted from the Hepar database: | C1. Non hepatitis patients | - 16 patients | | | C2. Hepatitis acuta | - 8 patients | | | C3. Hepatitis persistens | - 44 patients | | | C4. Hepatitis chronica activa | - 95 patients | | | C5. Cirrhosis hepatitis compensata | - 38 patients | (30) | | C6. Cirrhosis decompensata | - 60 patients | | | C7. Carcinoma hepatis | - 11 patients | | | | ----------- Total: 272 patients | |
The data sets Ck (30) have been formed as the causal learning sequence (16) in accordance with medical knowledge. The ranked relation " " (2) between feature vectors xj(k) (xj(k) Ck) and xj'(k') (xj'(k') Ck') (15) has been defined (18) on the basis of the causal sequence (16). This ranked relation allowed to define both oriented dipoles {xj, xj'} (5) (6) as well as the positive set R+ and the negative set R- (8) of the differential vectors rjj'(xj'-xj). The sets R+ and R- have been used in the definition of the convex and piecewise linear (CPL) criterion function Φ(w) (13). The optimal parameter vector w* (14) constituting the minimum of the function Φ(w) (13) defines the ranked linear model (3) that can be used for prognosis purposes: | (31) |
The solution of the feature selection problem allows to determine the most important features xi influencing the future of a given patient x0 and to neglect the unimportant features xi. The feature selection problem can also be based on the minimization of the convex and piecewise linear (CPL) criterion function Φ(w) (13) [7]. The linear model (31) fulfills the ranked relation (4) for a great part of feature vectors xj: | (32) |
As a result, the causal sequence (16) of the learning sets Ck (30) is preserved in a great part by the ranked model (31). In accordance with the equation (31), each learning set Ck (30) is transformed in the set Ck' of the points yj(k) on the ranked line:  | (33) |
The sets Ck' can be characterized by mean values μk and variances σk2, where | (34) |
and
 | (35) |
The results of computations based on the model (31) of data sets Ck (30) are summarized in the below Table 1:
Table 1 The mean values μk and variances σk2 of the sets Ck' (33). Data sets Ck' (33) | Number of patients mk | Mean value μk | Variance σk2 (σk) | C1' | 16 | -1.02 | 0.46 (0.68) | C2' | 8 | -0.58 | 0.57 (0.76) | C3' | 44 | 0.12 | 1.1 (1.05) | C4' | 95 | 0.89 | 1.46 (1.21) | C5' | 38 | 2.11 | 2 (1.41) | C6' | 60 | 3.02 | 2.2 (1.48) | C7' | 11 | 3.78 | 0.62 (0.79) |
Let us consider an additional linear scaling y'=αy+β of the model y=(w*)Tx (31) in order to improve the interpretability of its prognostic applications. | (36) | where α and β are the scaling parameters. We can remark that the ranked implications (32) do not depend on the linear scaling (36) of the model. This means that  | (37) |
The parameters α and β have been fixed through minimization of the sum (α,β) of the differences |k-α(w*)Txj(k)+β| for all the sets Ck (30) and all the feature vector xj(k). | (38) |
where Ik is the set of indices j of the feature vectors xj(k) from the set Ck (30). Let us remark that (α,β) is the convex and piecewise linear (CPL) function. The basis exchange algorithms also allow to find efficiently the parameters α* and β* constituting the minimum of the function (α,β). Some results of the scaled model evaluation are shown in the Table 2 and on the Fig. 1. Table 2 The mean values μk' and variances σk'2 of the sets Ck' (33) obtained from the ranked model (31) after scaling (36) with the optimal parameters α* and β*. Data sets Ck' (33) | Number of patients mk | Mean value µk' | Variance σk2 (σk) | C1' | 16 | 1.41 | 0.64 (0.8) | C2' | 8 | 1.93 | 0.79 (0.89) | C3' | 44 | 2.75 | 1.51 (1.23) | C4' | 95 | 3.65 | 1.99 (1.41) | C5' | 38 | 5.08 | 2.74 (1.65) | C6' | 60 | 6.14 | 3.02 (1.74) | C7' | 11 | 7.03 | 0.85 (0.92) |

Figure 1 Graphical presentation the mean values μk' and variancesσk'2 of the sets Ck' (33) obtained from the ranked model (31) after scaling (36) with the optimal parameters α* and β*.
The linear ranked model y=α*(w*)Tx+β* can be used in the diagnosis support of a new patient x0. The location of the point y0=α*(w*)Tx0+β* on the ranked line (30) constitutes a valuable characteristic of the patient and his perspectives. In the case the scaled model (Fig. 1), we can expect that the point y0 representing a new patient x0 with the k-th disease ωk will be situated near the index k. 6. Concluding remarksThe linear ranked models can be induced from data sets Ck (15) on the basis of additional medical knowledge in the form of the causal sequence (16) of diseases ωk (k=1,...,K). The ranked relation " " (2) between feature vectors xj(k) from different learning sets Ck and Ck' has been defined (18) on the basis of the causal sequence (16). This ranked relation allowed to define both the oriented dipoles {xj, xj'} (5) (6) as well as the positive set R+ and the negative set R- (8) of the differential vectors . The sets R+ and R- (8) have been used in the definition of the convex and piecewise linear (CPL) criterion function Φ(w) (13). The optimal parameter vector w* (14), which is the minimum point of the function Φ(w) (13) defines the ranked linear model (31) that can be used for the purpose of prognosis. The prognostic model (31) can be improved through linear scaling (36). An example of the CPL criterion function (α,β) for choosing the scaling parameters α and β is provided by the equation (38). The feature selection problem allows to determine the most important features xi influencing significantly the future of a given patient and to neglect unimportant features. The feature selection problem can be solved through the minimization of a modified CPL criterion function Φ(w) (13) [6], [7]. The ranked model of liver diseases (31) could be applied in screening procedures in the search for potentially ill patients eligible for further investigations and therapy. The ranked model (31) can also be specified for risk prognosis for individual patients. This work was partially supported by the KBN grant 3T11F01130, by the grant 16/St/2007 from the Institute of Biocybernetics and Biomedical Engineering PAS, and by the grant W/II/1/2007 from the Białystok University of Technology. Bibliography | [1] | Duda O. R., Hart P. E., Stork D. G.: Pattern Classification, J. Wiley, New York, 2001. | | [2] | Fukunaga K.: Introduction to Statistical Pattern Recognition, Academic Press 1972. | | [3] | Johnson R. A., Wichern D. W.: Applied Multivariate Statistical Analysis, Prentice-Hall Inc., Englewood Cliffs, New York, 1991. | | [4] | Bobrowski L., Łukaszuk T.: Ranked Linear Modeling in Survival Analysis, pp. 61-67 in: Lecture Notes of the ICB Seminars: Statistics and Clinical Practice, ed. by L. Bobrowski, J. Doroszewski, N. Victor, IBIB PAN, Warsaw, 2005. | | [5] | Bobrowski L.: Ranked Modelling with Feature Selection Based on the CPL Criterion Functions, in: Machine Learning and Data Mining in Pattern Recognition, eds. P. Perner et al., Lecture Notes in Computer Science vol. 3587, Springer Verlag, Berlin, 2005. | | [6] | Bobrowski L.: Eksploracja danych oparta na wypukłych i odcinkowo-liniowych funkcjach kryterialnych (Data mining based on convex and piecewise linear (CPL) criterion functions) (in Polish), Białystok Technical University, 2005. | | [7] | Bobrowski L., Wasyluk H.: Diagnosis Support rules of the Hepar system, pp. 1309-1313 in: MEDINFO 2001, eds: V. L. Petel, R. Rogers, R. Haux, IOS Press, Amsterdam, 2001. | | [8] | Bobrowski L.: Design of Piecewise Linear Classifiers from Formal Neurons by Some Basis Exchange Technique, Pattern Recognition, 24(9), pp. 863-870, 1991. |
|