Description of the data
Descriptive statistics for the CACM1 collection are given in Table 2 on page 13. Vectors for the seven concept types show considerable variation, ranging from zero to four-digit numbers. All show relatively high values for CV (the coefficient of variation, from 235.0 to 674.4) and all are highly positively skewed (from 4.22 to 17.3). This is due to the high values that resulted from the inner product calculations when concept types in the query obtained a good match with a document. Kurtosis measures were all large (from 8.8 to 381.5) because of the peakedness caused by the high proportion of low or zero values in most concept types that were produced when there was a poor match.
As seen in Table 3 on page 14, the variables in the ISI1 collection were similar. The numbers range from 0 to 1188.1 and had high skewness (from 3.4 to 5.8) and high kurtosis (from 17.1 to 45.4). Their CVs were somewhat lower (163.8 to 356.7). Although linear regression and the F test are robust, these data are rather variable and quite different from the kind of examples usually shown in textbooks on regression analysis. Some of the variation is due to the extremes in values, but much of the variability is due to the sparseness of the QID · DID array. Evidence of the sparseness is seen in Table 2 on page 13 and Table 3 on page 14 under length, the column that gives the number of non-zero values for each vector. In fact, five of the seven concept types for CACM1 have non-zero values in from 8.8% (AUT) to 17.9% (BBC) of the data records. In the ISI1 collection, one of the three subvectors has only 13% of its values non-zero (AUT). An attempt to compensate for the sparseness is discussed under "Linear regressions on the CACM1 collection" on page 19. To try to minimize the effect of the high variability in the data, a natural log transformation was used on all variables and is reported in Table 4 on page 15 and Table 5 on page 16 for the CACM1 and ISI1 collections respectively. As can be seen in these two tables, the transformation does make the distributions considerably more symmetrical and reduces extremes among the various measures.
In the CACM1 collection, skewness (0 is normal) was reduced to more acceptable levels (-.009 to 2.53), kurtosis (0 is normal) was reduced similarly (-.44 to 10.2), and CV (100 is normal) was tighter (52.3 to 265.0). In the ISI1 collection, skewness dropped (-.37 to 2.6), kurtosis declined (-1.4 to 5.1), and CV was lowered (39.5 to 270.2). Thus, in both collections, the distributions of the concept type variables are more evenly matched and closer to normal than for the raw data.
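The effect of the log transformation on these dispersion measures can be sketched with synthetic data (not the actual CACM1 or ISI1 vectors). Since many concept-type values are zero, the sketch assumes an ln(x + 1) transform; the sparsity rate and score distribution below are invented stand-ins for one sparse subvector:

```python
import numpy as np
from scipy import stats

def describe(x):
    """Return (CV, skewness, kurtosis) in the units used in the text:
    CV as a percentage, kurtosis as excess kurtosis (0 = normal)."""
    x = np.asarray(x, dtype=float)
    cv = 100.0 * x.std(ddof=1) / x.mean()
    return cv, stats.skew(x), stats.kurtosis(x)

# Synthetic stand-in for one sparse concept-type subvector: mostly zeros,
# with a long positive tail mimicking the inner-product match scores.
rng = np.random.default_rng(0)
raw = np.where(rng.random(5000) < 0.15, rng.lognormal(3.0, 1.0, 5000), 0.0)

print("raw:", describe(raw))
print("log:", describe(np.log1p(raw)))  # ln(x + 1) handles the zero entries
```

The transformed vector keeps the same zero/non-zero pattern but pulls in the extreme values, which is what drives the drop in skewness and kurtosis reported above.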
Histograms of raw data and log data further illustrate the effects of the log transformation. Histograms in Figure 1 on page 17 show the impact of the log transformation on TRM in the ISI1 collection (sample data). The large value of 4.8 for skewness in the raw data is shown by the long positive tail. However, the log transformed data show no tail in either direction, as skewness has been reduced to -.154. Not all of the histograms show such dramatic improvement as those seen in Figure 1 on page 17, but all of the concept types do show improved distributions. Figure 2 on page 18, which shows raw and log transformed CRC from the CACM1 collection (all records), is an example of a variable with only modest improvement.
Linear regressions on the CACM1 collection
SAS procedure GLM was used to run full models with all concept types as independent variables and ranked relevance judgment as the dependent variable. These runs were performed with the raw data and the log transformed data and are summarized in Table 6 on page 22. However, the proportion of nonrelevant to relevant documents was too high, more than 9 to 1. This problem of unbalanced groups was partly responsible for the low coefficient of determination (RSQ) for the raw data (.387) and for the log data (.396). To improve the proportion of relevant versus nonrelevant records, most of the nonrelevant documents were randomly discarded, leaving a data set with equal proportions of relevant and nonrelevant records (766 total records). This modestly improved the RSQ of the raw data to .445 and considerably improved the RSQ of the log transformed data to .627, as can also be seen in Table 6 on page 22. The sparseness of many of the concept type subvectors is also probably contributing to the relatively low RSQ (see "Description of the data" on page 11), but nothing could be done about that.
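A minimal sketch of this procedure, a full-model least-squares fit followed by random discarding of nonrelevant records to balance the groups, is shown below on synthetic data (the scores, relevance rule, and roughly 9:1 imbalance are invented stand-ins, not the CACM1 data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-in: seven concept-type scores per query-document pair and
# a binary relevance judgment, with nonrelevant records heavily dominating.
n = 4000
X = rng.lognormal(1.0, 1.0, (n, 7)) * (rng.random((n, 7)) < 0.3)  # sparse scores
y = (X[:, :2].sum(axis=1) + rng.normal(0, 3, n) > 8).astype(float)  # relevance

def rsq(X, y):
    """Coefficient of determination for a full linear model."""
    return LinearRegression().fit(X, y).score(X, y)

print("all records RSQ:", round(rsq(X, y), 3))

# Balance the groups by randomly discarding nonrelevant records until they
# match the relevant records one-to-one.
rel = np.flatnonzero(y == 1)
non = rng.choice(np.flatnonzero(y == 0), size=len(rel), replace=False)
keep = np.concatenate([rel, non])
print("balanced RSQ:", round(rsq(X[keep], y[keep]), 3))
```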
Similar runs were made using binary relevance as the dependent variable. The results of these regressions are given in Table 7 on page 23. The same kind of improvement in RSQ that was seen in Table 6 on page 22 was found by discarding most of the nonrelevant records and balancing the relative number of relevant versus nonrelevant documents. However, the improvement between raw data and log data is a little greater, by approximately 1% to 4%. Furthermore, the binary relevance data with the log transformed independent variables gave a better RSQ than the ranked relevance (.6659 versus .6274).
A plot of predicted scores versus residuals for the best log sample model with binary relevance data is displayed in Figure 3 on page 24 and shows a fair degree of closeness for relevant documents (1.0) and considerable spread for nonrelevant documents. A similar plot for ranked relevance data is shown in Figure 4 on page 25, but here the relevant values are divided into values from 1.0 to 4.0.
Again, the relevant documents show less spread than the nonrelevant.
The concept type variables for each regression run were ranked by their Type III Sum of Squares (SAS, 1985), which gives the sum of squares for each variable independently of its order in the regression model. From the rankings, some two and four variable models were chosen and run using the same two dependent variables for all records and for the sample set of records. The coefficients, RSQs, and rankings are also provided in Table 7 on page 23 and Table 6 on page 22 for binary and ranked relevance data respectively.
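The ranking step can be illustrated with Type III sums of squares on a small synthetic regression; the variable names follow the text, but the data, coefficients, and resulting ranking are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 500

# Hypothetical concept-type scores; in this sketch TRM and LNK are built
# to drive the relevance score, mirroring the ranking reported in the text.
df = pd.DataFrame({
    "TRM": rng.lognormal(1, 1, n),
    "LNK": rng.lognormal(1, 1, n),
    "AUT": rng.lognormal(1, 1, n),
})
df["rel"] = 0.6 * df["TRM"] + 0.3 * df["LNK"] + rng.normal(0, 1, n)

model = smf.ols("rel ~ TRM + LNK + AUT", data=df).fit()
# Type III SS assesses each variable independently of its order in the model.
table = anova_lm(model, typ=3).drop(index=["Intercept", "Residual"])
print(table["sum_sq"].sort_values(ascending=False))
```

Sorting the `sum_sq` column gives the variable ranking used to pick the reduced two and four variable models.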
The two variable model (TRM and LNK) using raw data and all records gave 86% of the RSQ of the seven variable model for both dependent variables. For sample raw data, 83% and 86% of the seven variable RSQ was obtained. For log transformed data, all record regressions with the same independent and dependent variables gave 79% and 81% of the original seven variable RSQ (ranked and binary relevance respectively). However, the sample of log data gave 98% of the seven variable RSQ for both ranked and binary relevance data. In fact, the two variable model with log transformed independent variables and binary relevance data gave a higher RSQ than any of the seven variable models. The four variable model (AUT, CRC, TRM, and LNK) gave modest improvement, but clearly most of the variance is accounted for by TRM and LNK.
All possible two-way interactions were tested (using procedure GLM in SAS) on ranked relevance and binary relevance data. Several were found to be significant at the .05 level. Table 8 on page 26 shows interactions for binary and ranked relevances for log sample data, the best models. For example, there is a significant interaction between AUT and TRM using the reduced sample log data for ranked relevance. This makes sense, as authors may write more than one article in the same subject area using common terms and few articles in other areas with different terms. The amount of variance explained by this interaction was relatively small, less than 3% of that of TRM. However, this interaction accounts for more variance than the three lowest ranking concept types, which did not add much predictive ability to the seven variable model either (see Table 6 on page 22). Other interactions seemed reasonable but also did not account for much variance. In fact, all of the interactions together only raised RSQ from .6274 to .6432. Additionally, some interactions had negative coefficients, which indicated inverse relationships with the other variables. Furthermore, as SMART is not currently programmed to use interactions, they were not used in subsequent regression runs. A subset (based on the largest two-way interactions and the concept types) of the possible three-way interactions was tested, with none being significant at the .05 level.
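Testing a two-way interaction amounts to adding a product term to the model and checking its significance; the sketch below uses invented data, and the size of the built-in AUT:TRM effect is illustrative only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({"TRM": rng.lognormal(0, 1, n), "AUT": rng.lognormal(0, 1, n)})
# Build in a small AUT-by-TRM interaction, echoing the effect reported for
# the log sample data (coefficient chosen arbitrarily for the sketch).
df["rel"] = df["TRM"] + 0.2 * df["AUT"] * df["TRM"] + rng.normal(0, 1, n)

base = smf.ols("rel ~ TRM + AUT", data=df).fit()
inter = smf.ols("rel ~ TRM + AUT + TRM:AUT", data=df).fit()  # add product term

print("RSQ without interaction:", round(base.rsquared, 4))
print("RSQ with interaction:   ", round(inter.rsquared, 4))
print("interaction p-value:    ", round(inter.pvalues["TRM:AUT"], 4))
```

As in the text, a significant interaction can raise RSQ only modestly; comparing the two fitted models' `rsquared` values shows how much variance the product term actually adds.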
Linear regressions on the ISI1 collection
Regressions on the ISI1 collection followed the same general pattern as those for the CACM1 collection, but there were only three concept types (TRM, AUT, and COC) and only binary relevance data. Full models on raw data gave RSQs of .2356 and .2812 for all records and a sample respectively (see Table 9 on page 28). Full models of log transformed data gave RSQs of .3162 (all records) and .481 (sample). A plot of predicted scores versus residuals for the best log sample model is displayed in Figure 5 on page 29 and shows a fair degree of closeness for relevant documents (1.0) and considerable spread for nonrelevant documents. As shown in Table 9 on page 28, the variables TRM and COC were ranked one and two by Type III Sums of Squares. Thus, they were used in two variable model runs, which had RSQs of 96% to 99.5% of the full model RSQs.
All possible two-way interactions were tested and all were significant at the .05 level, but they accounted for little variance (see Table 10 on page 30, which gives interactions for the best model, log sample data). For example, the RSQ for the log sample regression improved from .481 to .4978. However, the coefficient for AUT became negative, and one of the interactions (TRM with LNK) was also negative. As for the CACM1 collection, coefficients for the interaction terms were not used with SMART.
Additional regression techniques
Other regression techniques, including all possible regressions (using procedure REG in SAS) and logistic regression (using procedure LOGIST in SAS), were tried. In these runs different coefficients were obtained, but the same four predictor variables were most important in the models (TRM, AUT, LNK, and CRC). They did not give improved RSQs; rather, they were much worse, with RSQs of approximately half those of the models described earlier. Additionally, the logistic regression program gave an intercept, which was not desirable for use in SMART.
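The intercept issue can be illustrated by fitting logistic regression with and without an intercept term; this is a sketch on synthetic scores, not PROC LOGIST output, and the variable roles are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 600
X = rng.lognormal(0, 1, (n, 4))  # hypothetical TRM, AUT, LNK, CRC scores
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 2, n) > 2.5).astype(int)

# A default fit includes an intercept; a matching function that is a pure
# weighted sum of subvector similarities has no place for that constant,
# so the no-intercept fit yields coefficients usable directly as weights.
with_b0 = LogisticRegression().fit(X, y)
no_b0 = LogisticRegression(fit_intercept=False).fit(X, y)

print("intercept model:", with_b0.intercept_, with_b0.coef_.round(3))
print("no-intercept coefficients:", no_b0.coef_.round(3))
```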
Use of regression coefficients in SMART for CACM1 and ISI1 collections
Table 11 on page 34, Table 12 on page 35, and Table 13 on page 36 show the average precision values for base runs and precision values for the coefficient runs for the CACM1 and ISI1 collections. Base runs are retrievals that use the concept type values but give equal weights to the various concept types. In the coefficient runs, the various concept type values are weighted by the regression coefficients.
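A coefficient run can be sketched as a weighted sum over the per-concept-type similarities; the similarity and coefficient values below are purely illustrative, not taken from the tables:

```python
import numpy as np

def score(query_doc_sims, weights=None):
    """Overall query-document similarity as a weighted sum of the
    per-concept-type similarities (a base run uses equal weights)."""
    sims = np.asarray(query_doc_sims, dtype=float)
    w = np.ones_like(sims) if weights is None else np.asarray(weights, dtype=float)
    return float(np.dot(w, sims))

# Hypothetical similarities for one query-document pair on TRM, AUT, LNK
sims = [0.62, 0.10, 0.35]
coefs = [1.8, 0.4, 1.1]  # illustrative regression coefficients, not real values

print("base run score:       ", round(score(sims), 3))        # equal weights
print("coefficient run score:", round(score(sims, coefs), 3))  # weighted
```

Ranking documents by the weighted score instead of the equal-weight score is what distinguishes a coefficient run from a base run.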
Some minor improvement was found in most runs. For example, in the ISI1 collection, precision with coefficients from the sample log regressions showed a 2.9% improvement for the two concept type run and a .7% improvement for the three concept type run. Also for ISI1 runs, both raw data and log data in the two and three concept type runs showed very small improvement (.4% to 1.1%), while coefficients from a sample of raw data were worse (.4% to 1.2%). The best precision was obtained using coefficients developed from log data with all records included.
1.1 Probabilistic retrieval
1.2 Information retrieval in SMART
1.3 Research goals
2.1 Division of the collections
2.2 Description of the collections and vectors
2.3 Descriptive analysis of the data
2.4 Obtaining regression coefficients for use in SMART
2.5 Testing the usefulness of the coefficients as weights in SMART
2.6 Threshold techniques
3.1 Description of the data
3.2 Linear regressions on the CACM1 collection
3.3 Linear regressions on the ISI1 collection
3.4 Additional regression techniques
3.5 Use of regression coefficients in SMART for CACM1 and ISI1 collections
3.6 Use of regression coefficients in SMART for CACM2 and ISI2 collections
3.7 Use of threshold techniques
4.1 Usefulness of concept types as shown by linear regressions
4.2 Ranked relevances versus binary relevances
4.3 Thresholds as aids in regression
4.4 Improvement of retrieval using coefficients
4.5 Conclusions and implications for further research
Regression Analysis of Extended Vectors to Obtain Coefficients for Use in Probabilistic Information Retrieval Systems