A Day in the Life of Web Searching: An Exploratory Study



COVER SHEET

Ozmutlu, Seda and Spink, Amanda and Ozmutlu, Huseyin C. (2004) A Day In The Life Of Web Searching: An Exploratory Study. Information Processing and Management: an International Journal 40(2): pp. 319-345.

Copyright 2004 Elsevier
Accessed from: https://eprints.qut.edu.au/secure/00004666/01/IPM-DailyWeb.pdf


A DAY IN THE LIFE OF WEB SEARCHING: AN EXPLORATORY STUDY

Seda Ozmutlu Department of Industrial Engineering Uludag University Gorukle Kampusu, Bursa, 16059, Turkey Tel: (90-224) 442-8176 Fax: (90-224) 442-8021 E-mail: [email protected]

Amanda Spink* School of Information 510 IS Building, 135 N. Bellefield Avenue Pittsburgh PA 15260 Tel: (412) 624-5230 Fax: (412) 624-5231 Email: [email protected]

Huseyin C. Ozmutlu Department of Industrial Engineering Uludag University Gorukle Kampusu, Bursa, 16059, Turkey Tel: (90-224) 442-8176 Fax: (90-224) 442-8021 E-mail: [email protected]

* To whom all correspondence should be addressed.


ABSTRACT
Understanding Web searching behavior is important in developing more successful and cost-efficient Web search engines. We provide results from a comparative time-based Web study of US-based Excite and Norwegian-based Fast Web search logs, exploring variations in user searching related to changes in time of the day. Findings suggest: (1) fluctuations in Web user behavior over the day, (2) user investigations of query results are much longer, and submission of queries and number of users are much higher in the mornings, and (3) some query characteristics, including terms per query and query reformulation, remain steady throughout the day. Implications and further research are discussed.

INTRODUCTION
Search engines are one of the most frequently used tools to retrieve information from the Web. This paper reports findings from a comparative time-based analysis of search queries submitted by US-based Excite and Norwegian-based Fast Web users. A time-based analysis, which investigates patterns of Web user behavior with respect to different hours of the day, can provide valuable [...]

[...] "_", "~", "\", "/", "". This step was performed by replacing all occurrences of non-letter characters with a space. Eliminating the character "." produced an interesting finding: the terms "com", "www" and "http" rank among the top six of the ten most common terms. The high occurrence of these terms cannot be detected unless each "." is replaced with a space. Web page searches are usually submitted in the form "http://www.######.com". Since the middle part of Web addresses varies significantly, each search query for a Web page would be counted as a unique term unless the search is for exactly the same Web page.

The second step of the term analysis involved determining the pure search terms. After elimination of the characters, the list of most frequently used terms included words such as "http", "com", "and", "or", etc. Such terms do not reflect any information about the content of the search query; hence they have been eliminated from the study. Following these guidelines, the results of the top terms analysis are presented below.

Excite Top Terms Analysis
This portion of the analysis presents the top term analysis for the Excite dataset. The ten most frequently used pure terms in the Excite dataset with respect to hours of the day are reported in Table 8. [Place Table 8 Here]
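The two cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names and the sample queries are hypothetical, the stop list contains only the terms the text names, and digits are kept alongside letters (an assumption, made because the term tables list entries such as "mp3" and "2000").

```python
import re
from collections import Counter

# Assumed stop list: only the non-content terms named in the text.
# The full list used in the study is not given.
NON_CONTENT_TERMS = {"http", "www", "com", "and", "or"}


def split_terms(query):
    """Step 1: replace punctuation (including '.', '/', '_') with spaces, then split.

    Assumption: digits are kept together with letters.
    """
    return re.sub(r"[\W_]", " ", query).split()


def top_pure_terms(queries, n=10):
    """Step 2: drop non-content terms and count the remaining 'pure' terms."""
    counts = Counter(
        term
        for query in queries
        for term in split_terms(query)
        if term.lower() not in NON_CONTENT_TERMS
    )
    return counts.most_common(n)


# Hypothetical queries, for illustration only.
sample = ["http://www.example.com", "free music", "free mp3 and music"]
print(split_terms(sample[0]))
print(top_pure_terms(sample, n=2))
```

Without the first step, "http://www.example.com" would be counted as one rarely repeated term, which is why "com", "www" and "http" only surface once the separators are replaced.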

The analysis of the most frequently used Excite terms did not yield any specific results. We have not detected any trend in the content of the Web searches with respect to hours of the day.

Fast Top Terms Analysis
This portion of the analysis presents the top term analysis for the Fast dataset. Since the Fast dataset spans 24 hours, it enables an analysis of terms used during both the day and the night hours. The ten most frequently used pure terms in the Fast dataset with respect to hours of the day are reported in Tables 9 and 10. [Place Table 9 Here] [Place Table 10 Here]

There seems to be a slight increase in the use of sexual terms during the night hours compared to the day hours. There are a few sexual terms among the top ten pure terms during night and early morning hours, whereas there are almost no sexual terms among the top ten terms between 9:00 AM and 3:00 PM. During work hours, sexual terms are replaced by other terms, such as business- or computer-related terms. Consequently, the topics of Web searches may change with the time of the day.

Type of Excite Queries
Another analysis relates to the type of Excite queries, i.e. whether a query is unique, has been modified from the previous query, or requests the next page of results for the preceding query. The numbers of unique, next page and modified queries and their percentages of the total queries are given in Table 11. The percentages are charted in Figure 13. [Place Table 11 Here] [Place Figure 13 Here]

Unique queries form about 50% of the queries submitted in an hour, while next page queries and modified queries form about 35% and 15%, respectively. The percentages of unique, next page and modified queries do not change significantly as the day progresses. The range of the hourly percentage is 45.7%-49.1% for unique queries, 30.7%-40.5% for next page queries and 12.2%-20.1% for modified queries. Hence, as the day progresses, the quality of the queries stays the same not only in terms of the number of terms per query, but also in terms of the type of queries.
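As a small worked example of how the hourly percentages follow from the counts, the sketch below divides each query-type count by the hourly total. The counts are copied from Table 11; the function and variable names are hypothetical.

```python
# (unique, next page, modified) query counts per hour, copied from Table 11.
TABLE_11_COUNTS = {
    "9:00-10:00": (321, 275, 83),
    "10:00-11:00": (236, 186, 64),
    "11:00-12:00": (213, 157, 67),
    "12:00-13:00": (180, 116, 71),
    "13:00-14:00": (176, 110, 72),
    "14:00-15:00": (156, 128, 49),
    "15:00-16:00": (139, 123, 42),
    "16:00-17:00": (124, 72, 28),
}


def query_type_shares(unique, next_page, modified):
    """Return each query type's percentage of the hour's total, to one decimal."""
    total = unique + next_page + modified
    return tuple(round(100 * count / total, 1) for count in (unique, next_page, modified))


for hour, counts in TABLE_11_COUNTS.items():
    print(hour, query_type_shares(*counts))
```

The three counts for each hour add up to the hourly query arrivals reported in Table 1 (for example 321 + 275 + 83 = 679 for 9:00-10:00).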

Markov Analysis of Excite Queries
This portion of the study analyzes hourly changes in the Markov matrix for the queries of the user sessions in the Excite data log. First, the frequencies of transitions from one type of query to another within a session are investigated, followed by the limiting probabilities for the different types (states) of queries, i.e. the unique query, next page and modified query states. Tables 12 and 13 show the hourly frequency matrices for transitions between the unique, next page and modified query states (and from these states to the end state). [Place Table 12 Here] [Place Table 13 Here]

The number of transitions from each type of query to each other type is counted, such as from unique queries to unique queries, from unique queries to modified queries, etc. The rows of each portion of the table show the previous query state of consecutive queries, whereas the columns show the subsequent query state. In the same row and column format, Tables 12 and 13 also show the initial ratios of the transitions from one state to another. The initial ratio for transitions from state i to state j is calculated by dividing the number of transitions from state i to state j by the total number of queries originating from state i.

There is not much difference in the transition of Excite queries originating from states P and M to other states with respect to hours of the day. Hence, no matter what the hour is, if a user is already in a next page query or has submitted a modified query, the probability of going to other states stays more or less the same. Regardless of the hour of the day, about 10%-15% of next page queries are followed by unique queries, about 50%-60% by more next page queries and about 5%-10% by modified queries. Similarly for unique queries, regardless of the hour of the day, 10%-15% are followed by more unique queries, 25%-30% by next page queries and 10%-15% by modified queries. The hour of the day causes a slight change in the transitions originating from the modified query state. However, the change does not follow a specific pattern, so the slight change may be random. Once a user views the next page, he or she will continue looking at the following pages of results with over 50% probability, and will end the session with approximately 25% probability. In other words, once users begin looking at the next pages, with over 75% probability they will not modify the query or submit a new query. In addition, users generally submit unique queries before ending the session.

The second portion of the Markov analysis concerns the limiting probabilities for the unique, next page and modified query states (for this portion of the study the end state is ignored so that the limiting probabilities of the other three states U, M and P can be calculated). The limiting probabilities show the proportion of time that the Markov chain spends in the respective query states over an extended period of time; in other words, they give the long-run proportion of queries of each type. The long-term ratios also provide information on the popularity of the query types over the long term. The limiting probabilities (π_j's, j ≥ 0) for unique, next page and modified queries are calculated from the initial ratios (P_ij's, the probabilities of going to state j given that the system is in state i (i, j = 1, 2, 3), which are given in Tables 12 and 13) as follows (where j = 1 for the unique query state, j = 2 for the next page state, and j = 3 for the modified query state):

\[
\pi_j = \sum_{i=0}^{\infty} \pi_i P_{ij}, \quad j \ge 0, \qquad \sum_{j=0}^{\infty} \pi_j = 1
\]
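The calculation of initial ratios and limiting probabilities can be sketched as follows for the 9:00-10:00 hour, using the transition counts among U, P and M from Table 12. The text does not say how the end state was removed, so this sketch assumes each row is renormalised over the three remaining states; its output is therefore illustrative and need not match Table 14 exactly. All names are hypothetical.

```python
STATES = ("U", "P", "M")

# Transition counts among U, P and M for 9:00-10:00 (Table 12), END column left out.
COUNTS_9_10 = {
    "U": {"U": 93, "P": 96, "M": 41},
    "P": {"U": 37, "P": 156, "M": 17},
    "M": {"U": 17, "P": 23, "M": 25},
}


def transition_matrix(counts):
    """Divide each count by its row total so that every row sums to 1.

    Assumption: with END ignored, rows are renormalised over U, P and M only.
    """
    matrix = []
    for i in STATES:
        row_total = sum(counts[i][j] for j in STATES)
        matrix.append([counts[i][j] / row_total for j in STATES])
    return matrix


def limiting_probabilities(matrix, iterations=1000):
    """Apply pi_j = sum_i pi_i * P_ij repeatedly, starting from a uniform pi."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


P = transition_matrix(COUNTS_9_10)
pi = limiting_probabilities(P)
print({state: round(value, 4) for state, value in zip(STATES, pi)})
```

Repeated multiplication is used here instead of solving the linear system directly because it needs no external library; for a three-state chain with all-positive entries it settles quickly.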


The hourly limiting probabilities for unique queries, next page queries and modified queries are given in Table 14 and Figure 14. [Place Table 14 Here] [Place Figure 14 Here]

The long-term ratio of Excite unique queries has a declining trend as the day progresses. This finding is consistent with the findings of the previous sections of the results. Web users tend to spend less effort on Web searching as the day progresses. Instead of making the effort to formulate new queries, Web users tend to continue the already submitted queries through next page queries or submit slightly modified queries.

Paired t-tests have been performed in order to compare the sets of long-term ratios for the different types of queries between hours. For example, the difference between the long-term ratios for unique, next page and modified queries for 9:00 AM-10:00 AM and the same set of long-term ratios for 10:00 AM-11:00 AM is tested for statistical significance. The testing is performed through conventional paired t-testing and has been repeated for all possible pairs of hours of the day. Separate paired t-tests have been applied for the pairs 9:00-10:00 and 10:00-11:00, 9:00-10:00 and 11:00-12:00, ..., and 15:00-16:00 and 16:00-17:00. The t values for the paired t-tests are given in Table 15.
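A conventional paired t statistic for one pair of hours can be sketched as below, using the long-term ratios for 9:00-10:00 and 10:00-11:00 copied from Table 14. This is a minimal illustration with hypothetical names, not the authors' procedure. Note that each hour's three ratios sum to one, so the paired differences sum to roughly zero and the statistic is expected to be very small; with three pairs there are two degrees of freedom, for which the two-sided 95% critical value is about 4.303.

```python
import math

# Long-term ratios (unique, next page, modified), copied from Table 14.
RATIOS_9_10 = (0.253447, 0.592505, 0.154048)
RATIOS_10_11 = (0.241728, 0.555489, 0.202783)


def paired_t(sample_a, sample_b):
    """Paired t statistic: mean difference divided by its standard error."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)


t = paired_t(RATIOS_9_10, RATIOS_10_11)
print(f"t = {t:.3e} with {len(RATIOS_9_10) - 1} degrees of freedom")
```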

[Place Table 15 Here]

Table 15 shows that the t values for paired t-testing of the long-term ratios of query types among hours are very low. At the 95% confidence level, none of the t values demonstrates statistical significance. Hence, it can be concluded that the difference between the sets of limiting probabilities for unique, next page and modified queries from any hour of the day to any other hour is statistically insignificant. However, as mentioned previously, unlike next page and modified queries, the limiting probability of unique queries seems to decline as the day progresses.

The top 20 Web user session types might provide deeper insight into the Markov analysis of query types and are provided in Table 16. [Place Table 16 Here]

Clearly, Excite sessions with a single query (type "U") are the most common type in all hours, and session types UU, UP and UM (sessions with two queries) are the next most common. The distribution of other session types differs with respect to the hours of the day. However, the number of sessions with more than three queries is insignificant compared to the number of sessions with one or two queries. Therefore, hourly changes in the types of sessions with three or more queries are not a significant finding. Within the top four Excite session types, the ratios of UU and UM decline relative to session type UP. In other words, users are more reluctant to submit new queries or modify their original queries as the day progresses.
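A session type such as "UPP" is simply the sequence of query-type codes within one session. The tally behind a table like Table 16 can be sketched as follows; the sessions below are made up for illustration and the function names are hypothetical.

```python
from collections import Counter


def session_type(query_types):
    """Join one session's query-type codes, e.g. ['U', 'P', 'P'] becomes 'UPP'."""
    return "".join(query_types)


def top_session_types(sessions, n=20):
    """Count how often each session type occurs and return the n most common."""
    return Counter(session_type(s) for s in sessions).most_common(n)


# Hypothetical sessions (U: unique, P: next page, M: modified).
sessions = [["U"], ["U", "P"], ["U"], ["U", "M"], ["U", "P"], ["U"]]
print(top_session_types(sessions, n=3))
```

Running the same tally separately on the sessions of each hour would give one column of session-type counts per hour.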

DISCUSSION
Our findings suggest that Web user behavior might fluctuate from the beginning to the end of the day. There are sharp changes in some characteristics of Web querying based on the time of the day. More search queries are submitted to Excite and Fast during the morning hours than during the afternoon or evening hours. Session and query durations also decrease as the day progresses from the morning hours.

While the technical characteristics of the Web search queries are affected as the day progresses, the quality of Web search queries may remain almost the same from the beginning of the day to the end of the day. For both the Excite and Fast datasets, the terms per query analysis did not reveal any notable changes in the quality of Web searches within different time frames. There is also no significant difference in the most frequently used terms during the day for the Excite dataset, whereas sexual terms seem to be more prevalent in the evening hours for the Fast dataset. In addition, in the Excite dataset, query reformulation and the ratios of unique, modified and next page queries remained steady through different hours of the day. A summary of the changes in the Excite and Fast datasets from morning hours to afternoon and evening hours can be seen in Table 17. [Place Table 17 here]

In addition, the Markov analysis of queries with respect to hours did not yield any specific trend between the hours of the day. The behavior of Web users in terms of transitions from one type of query to another in consecutive queries within a session seems to stay the same, as do the long-term ratios of unique, modified and next page queries. This finding further supports the observation that, while the technical characteristics of Web search queries, such as query and session arrivals and durations, are affected as the day progresses, the quality of the search sessions stays the same both in terms of search terms and in terms of the structure of the queries.

Overall, Web searching needs were at their highest levels during the morning hours and decreased into the afternoon and evening hours. There is a strong indication that other characteristics of Web search might also vary based on the time of the day. Analysis of the remaining characteristics of Web user inquiry sessions is proposed as an extension of this research. We believe that time-sensitive Web search engine development will provide more intelligent and cost-efficient systems by redesigning the allocation of their resources and their user interfaces.


CONCLUSION
Our research suggests changes in user Web searching based on the time of the day. Overall, some characteristics of Web search queries, such as arrival and durations of sessions and queries, are at their highest during the morning hours and decrease later into the day. Other characteristics, such as the quality of queries in terms of number of terms per query and reformulation of queries, stay the same throughout the day. Our findings and the analysis of further data sets can be useful to Web search engines in reconstructing their search structure and reallocating their resources with respect to different time frames.

ACKNOWLEDGMENT
The authors thank Doug Cutting, Jack Xu and Soo Young Rieh from Excite@Home and Per G. Auran from Fast.com for providing the Web query data sets.

REFERENCES
Bhatia, S. K., & Deogun, J. S. (1998). Conceptual clustering in information retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 28(3), 427-436.
Bilinkis, I., & Mikelsons, A. (1992). Randomized signal processing. Prentice Hall, New York.
Brewington, B. E., & Cybenko, G. (2000). How dynamic is the Web? Proceedings of the 9th World Wide Web Conference, May 2000, Amsterdam, Netherlands.
Ling, C. X., & Li, C. (1998). Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 73-79.
Mann, N. R., Schafer, R. E., & Singpurwalla, N. D. (1974). Methods for statistical analysis of reliability and life data. John Wiley & Sons, New York.
Ozmutlu, H. C., & Spink, A. (2001). Time-based analysis of search data logs. Proceedings of Internet Computing'01 Conference on Internet Computing, June 25-28, 2001, Vol. 1, pp. 41-46.
Ozmutlu, H. C., Spink, A., & Ozmutlu, S. (2002). Analysis of large data logs: An application of Poisson sampling to Excite Web queries. Information Processing and Management, 38(3), 473-490.
Silverstein, C., Henzinger, M., Marais, H., & Morris, M. (1999). Analysis of a very large Web search engine query log. ACM SIGIR Forum, 33(3).
Spink, A., Bateman, J., & Jansen, B. J. (1999). Searching heterogeneous collections on the Web: A survey of Excite users. Internet Research: Electronic Networking Applications and Policy, 9(2), 117-128.
Spink, A., Jansen, B. J., & Ozmutlu, H. C. (2000). Use of query reformulation and relevance feedback by Web users. Internet Research: Electronic Networking Applications and Policy, 10(4), 317-328.
Spink, A., Jansen, B. J., Wolfram, D., & Saracevic, T. (2002). From e-sex to e-commerce: Web search changes. IEEE Computer, 35(3), 133-135.
Spink, A., Wolfram, D., Jansen, M. B. J., & Saracevic, T. (2001). Searching the Web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226-234.
Wolff, R. W. (1982). Poisson arrivals see time averages. Operations Research, 30(2), 223-231.

Table 1. Mean queries per user session, number of session arrivals and number of query arrivals with respect to hours of the day - Excite Query Set

Hour of the Day   Mean Queries Per Session   Number of Hourly Session Arrivals   Number of Hourly Query Arrivals
9:00-10:00        3.9                        174                                 679
10:00-11:00       3.2                        150                                 486
11:00-12:00       3.06                       143                                 437
12:00-13:00       2.6                        141                                 367
13:00-14:00       2.7                        130                                 358
14:00-15:00       2.7                        120                                 333
15:00-16:00       2.9                        105                                 304
16:00-17:00       2.2                        101                                 224

Figure 1. Mean queries per user session with respect to hours of the day - Excite Query Set. [Axes: Hours (x); No. of queries per session (y)]

Figure 2. Session and query arrivals with respect to hours of the day - Excite Query Set. [Axes: Hours (x); No. of Sessions and No. of Queries (y). Series: session arrivals, query arrivals]

Table 2: The mean queries per user session, total number of session arrivals and total number of query arrivals with respect to hours of a day - Fast Query Set

Hour of the Day   Mean Queries per Session   Number of Hourly Session Arrivals   Number of Hourly Query Arrivals
12:00-1:00        8.9                        33                                  295
1:00-2:00         11.1                       32                                  358
2:00-3:00         9.7                        25                                  244
3:00-4:00         6.6                        18                                  119
4:00-5:00         11.4                       14                                  160
5:00-6:00         9.8                        12                                  118
6:00-7:00         12.2                       13                                  159
7:00-8:00         12.2                       35                                  430
8:00-9:00         11.08                      65                                  720
9:00-10:00        10.5                       56                                  589
10:00-11:00       13.8                       57                                  789
11:00-12:00       11.7                       46                                  539
12:00-13:00       9.3                        53                                  495
13:00-14:00       10.6                       48                                  513
14:00-15:00       10.1                       56                                  570
15:00-16:00       9.3                        55                                  516
16:00-17:00       10.6                       47                                  500
17:00-18:00       10.7                       39                                  392
18:00-19:00       10.6                       56                                  596
19:00-20:00       9.9                        46                                  459
20:00-21:00       11.5                       44                                  507
21:00-22:00       8.6                        48                                  416
22:00-23:00       8.07                       40                                  323
23:00-24:00       7.6                        26                                  200

Figure 3: Query arrivals with respect to hours of a day - Fast Query Set. [Axes: Hour (x); Number of queries (y)]

Figure 4: Session arrivals with respect to hours of a day - Fast Query Set. [Axes: Hour (x); Number of sessions (y)]

Figure 5: Mean queries per user session with respect to hours of a day - Fast Query Set. [Axes: Hour (x); Mean number of queries per session (y)]

Table 3. Mean session durations with respect to hours of the day (in seconds) - Excite Query Set.

Hour of the Day   Mean Session Duration
9:00-10:00        3988.6
10:00-11:00       3487.5
11:00-12:00       2167.7
12:00-13:00       1914.7
13:00-14:00       1317.2
14:00-15:00       1183.6
15:00-16:00       756.2
16:00-17:00       328.4

Figure 6. Mean session durations with respect to hours of the day (in seconds) - Excite Query Set. [Axes: Hour (x); Mean duration per session (y)]

Table 4. Mean query durations with respect to hours of the day (in seconds) - Excite Query Set.

Hour of the Day   Mean Duration Per Query
9:00-10:00        979.2
10:00-11:00       975.6
11:00-12:00       676.7
12:00-13:00       635.4
13:00-14:00       433.3
14:00-15:00       344.5
15:00-16:00       228
16:00-17:00       138.8

Figure 7. Mean query durations with respect to hours of the day (in seconds) - Excite Query Set. [Axes: Hours (x); Mean duration per query (y)]

Table 5: The mean session and query durations with respect to hours of the day - Fast Query Set.

Hour of the Day   Mean Duration Per Query (seconds)   Mean Duration Per Session (seconds)
12:00-1:00        709.3                               5994.8
1:00-2:00         1588.4                              17261.4
2:00-3:00         1628.8                              15509.09
3:00-4:00         1214.4                              7215.4
4:00-5:00         1181.1                              12317.4
5:00-6:00         1475.8                              14222.1
6:00-7:00         1343.7                              15091.8
7:00-8:00         1347.5                              16633.4
8:00-9:00         1300.3                              13636.09
9:00-10:00        1304.8                              12879.3
10:00-11:00       944.8                               12574.6
11:00-12:00       577.7                               6473.7
12:00-13:00       839.9                               7424.8
13:00-14:00       565.2                               5973.1
14:00-15:00       714.5                               6929.6
15:00-16:00       595.8                               5722.8
16:00-17:00       427.3                               4501.6
17:00-18:00       694.02                              6999.6
18:00-19:00       554.1                               5400.8
19:00-20:00       365.1                               3427.3
20:00-21:00       277.08                              2651.9
21:00-22:00       456.9                               3577.6
22:00-23:00       476.3                               3456.7
23:00-24:00       691.7                               5335.8

Figure 8: The mean query durations with respect to hours of the day - Fast Query Set. [Axes: Hour (x); Mean duration per query (y)]

Figure 9: Mean session durations with respect to hours of the day - Fast Query Set. [Axes: Hour (x); Mean duration per session (y)]

Table 6. Mean terms per query and mean changes in terms used in consecutive queries with respect to hours of the day - Excite Query Set.

Hour of the Day   Mean Terms Per Query   Mean Changes in Terms Per Query in Consecutive Queries
9:00-10:00        2.3                    0.4
10:00-11:00       2.4                    0.8
11:00-12:00       2.5                    0.5
12:00-13:00       2.6                    0.5
13:00-14:00       2.5                    0.5
14:00-15:00       2.4                    0.4
15:00-16:00       2.5                    0.4
16:00-17:00       2.2                    0.3

Figure 10. Mean terms per query with respect to hours of the day - Excite Query Set. [Axes: Hour (x); Mean number of terms per query (y)]

Figure 11. Mean changes in the number of terms in consecutive Excite queries with respect to hours of the day - Excite Query Set. [Axes: Hour (x); Mean number of terms per query (y)]

Table 7: Mean terms per query with respect to hours of the day - Fast Query Set.

Hour of the Day   Mean Terms Per Query
12:00-1:00        2.3
1:00-2:00         2.5
2:00-3:00         2.6
3:00-4:00         2.8
4:00-5:00         2.5
5:00-6:00         3.1
6:00-7:00         1.8
7:00-8:00         2.4
8:00-9:00         2.4
9:00-10:00        2.4
10:00-11:00       2.6
11:00-12:00       2.5
12:00-13:00       2.6
13:00-14:00       2.3
14:00-15:00       2.7
15:00-16:00       2.4
16:00-17:00       2.3
17:00-18:00       2.4
18:00-19:00       2.6
19:00-20:00       2.4
20:00-21:00       2.4
21:00-22:00       2.1
22:00-23:00       2.5
23:00-24:00       2.6

Figure 12: Mean terms per query with respect to hours of the day - Fast Query Set. [Axes: Hour (x); Terms per query (y)]


Table 8. List of the most frequently used ten terms with respect to hours of the day Excite Query Set.

9:00-10:00 Search Term Number botanical shampoo hotel mpeg pokemon City Free illegal photos schoolgirls

45 45 16 15 13 12 12 10 10 10

Hours of the Day 10:00-11:00 11:00-12:00 Search Search Term Number Term Number Free Web Maids Music Christmas

galleries themes black bugs fetish

14 14 13 13 12 11 11 10 10 10

braindumps windows cna banking risks sex pc anywhere demo free

21 21 18 16 13 12 11 10 10 9

12:00-13:00 Search Term Number international

site clubs free Stoughton travel Middle school web Canon

16 14 13 13 13 13 12 12 11 10

13:00-14:00 Search Term Number

14:00-15:00 Search Term Number

15:00-16:00 Search Term Number

16:00-17:00 Search Term Number

Free New

16 13

Pictures Rape

34 33

sex millenium

23 14

15 10

Pics bands Steel Beach jones Lee Myrtle

13 10 10 9 9 9 9

Banners CRAFTS Angels Free Quartet String Angel

19 12 11 10 10 10 9

14 12 10 10 10 10 10

SC

9

Clipart

9

pc poems FREE playstation Rat Terrier toro snowthrow ers

free nude Mortal kombat home dams downey usa built DCCC

9

last

5

9 7 6 6 6 5 5


Table 9. List of the most frequently used ten pure terms with respect to hours of the day Fast Query Set.

24:00-1:00 Search Term

Number

black pantyhose ardilla clipper cars used girls fat crazy mixed-up

26 25 19 18 17 17 14 12 11 11

4:00-5:00 Search Term

Number

Hours of the Day 1:00-2:00 2:00-3:00 Search Search Term Number Term Number

3:00-4:00 Search Term Number

animal in testing bear gay the hedland port condoms on

channel country music free interacial job nude of according gospel

20 20 20 16 16 16 15 15 14 14

7:00-8:00 Search Term Number

banners powerdvd Maw taurus testnews wasserstrah lpumpen bilder nikki nova tattoo

lolitas download h 2 capi

21 18 15 12 12

mp3 linux rotor gas oil

12 12 12 11 11

visual i sex directory gallery

14 9 9 6 6

celebrity wanted photos gis nude

12 12 10 9 7

nude parent good programz so

6 6 4 4 4

Number

freud lucian pictures webvirgins isles photographs

25 25 23 21 19 19

scilly takraf asphalt in

19 19 17 17

17 17 17 10 9 8 8 8 7 7

6:00-7:00 Search Term Number

29 22 17 17 12

Search Term

28 12 11 9 8 8 7 6 6 6

5:00-6:00 Search Term Number

preteen girl is what cattleya

8:00-9:00

berlin managerial nose of plastic surgery lln airport pengantar the

27 15 11 11 9 9 7 7 7 7

9:00-10:00 Search Term Number

10:00-11:00 Search Term Number

11:00-12:00 Search Term Number

mp3 halford and of pl n immobili en gql treatment sally

24 24 21 19 17 17

download uk full version appz shemale

45 35 34 32 28 24

uk free business leads medway acdsee

31 28 18 18 18 16

17 15 14 14

nantwich r corporate governance

23 20 19 19

donny osmond 800 201xp

16 16 15 15


Table 10. List of the most frequently used ten pure terms with respect to hours of the day - Fast Query Set. 12:00-13:00 Search Term Number

Search Term

2000 windows server mp3 appz printing a lba mode of

film free umbrella conveyor omen the folding in modular dnv

29 28 27 26 21 19 18 17 17 17

16:00-17:00 Search Term Number

Search Term

index manual fence electric kiedis anthony designer archive noble barnes

mcmahon shane car mp3 lolita print pro thumb agua care

24 22 22 22 20 20 18 18 16 16

34 24 23 22 21 21 19 15 15 13

17:00-18:00

20:00-21:00 Search Term Number

Search Term

ciego del la noche terror madrid a for half lords

girls webcam banks outer pizza recipe bicicletas trigger bitters copperplate

40 40 40 40 40 18 17 17 17 17

Hours of the Day 14:00-15:00 Search Number Term Number

13:00-14:00

Number 17 17 16 16 15 14 14 14 13 13

21:00-22:00 Number 51 21 19 19 19 19 17 17 16 16

university editor registry Tips Carr Cory Of barron S 3d

35 29 29 29 20 20 18 17 17 14

15:00-16:00 Search Term Number de compiler ayuntamiento sexshare and yahoo glockenspiel norma stitz ibo

34 28 27 23 20 20 19 19 19 17

18:00-19:00 Search Term Number

19:00-20:00 Search Term Number

and free de garters pictures transsexual resume uniform c lavoratori

sex ficken free schule the fax internet der loudspeaker via

51 31 20 18 17 17 15 15 14 14

27 23 23 23 23 20 20 19 18 18

22:00-23:00 Search Term Number

23:00-12:00 Search Term Number

and of eltron sex adult cards greeting seventeen club luis

blade runner battery intelligent powerbook public 3 decq driver gratification

24 23 21 20 18 16 16 16 15 15

13 13 10 10 10 10 9 9 9 9

Table 11. Number of unique, modified and next page queries and their percentages to total number queries with respect to hours of the day - Excite Query Set.

                  Unique Queries    Next Page Queries   Modified Queries
Hour of the Day   Number   %        Number   %          Number   %
9:00-10:00        321      47%      275      40%        83       12%
10:00-11:00       236      48%      186      38%        64       13%
11:00-12:00       213      48%      157      35%        67       15%
12:00-13:00       180      49%      116      31%        71       19%
13:00-14:00       176      49%      110      30%        72       20%
14:00-15:00       156      46%      128      38%        49       14%
15:00-16:00       139      45%      123      40%        42       13%
16:00-17:00       124      55%      72       32%        28       12%

Figure 13. Percentage of unique, next page and modified queries to total number queries with respect to hours of the day - Excite Query Set. [Axes: Hour (x); Percentage of queries (y). Series: Unique, Next Page, Modified]

Table 12. The frequency matrix for transition from one type of query to another and the initial ratios for the transitions (U: Unique queries, P: Next Page queries, M: Modified Queries) - Excite Query Set.

9:00-10:00
          Number of Query Transitions (to)   Initial Ratios of Queries (to)
From      U      P      M      END           U          P          M          END
U         93     96     41     91            0.28972    0.299065   0.127726   0.283489
P         37     156    17     65            0.134545   0.567273   0.061818   0.236364
M         17     23     25     18            0.204819   0.277108   0.301205   0.216867
END       0      0      0      1             0          0          0          1

10:00-11:00
          Number of Query Transitions (to)   Initial Ratios of Queries (to)
From      U      P      M      END           U          P          M          END
U         46     82     27     81            0.194915   0.347458   0.114407   0.34322
P         27     91     20     48            0.145161   0.489247   0.107527   0.258065
M         13     13     17     21            0.203125   0.203125   0.265625   0.328125
END       0      0      0      1             0          0          0          1

11:00-12:00
          Number of Query Transitions (to)   Initial Ratios of Queries (to)
From      U      P      M      END           U          P          M          END
U         31     63     34     85            0.14554    0.295775   0.159624   0.399061
P         30     75     17     35            0.191083   0.477707   0.10828    0.22293
M         9      19     16     23            0.134328   0.283582   0.238806   0.343284
END       0      0      0      1             0          0          0          1

12:00-13:00
          Number of Query Transitions (to)   Initial Ratios of Queries (to)
From      U      P      M      END           U          P          M          END
U         23     45     32     80            0.127778   0.25       0.177778   0.444444
P         11     51     13     41            0.094828   0.439655   0.112069   0.353448
M         5      20     26     20            0.070423   0.28169    0.366197   0.28169
END       0      0      0      1             0          0          0          1

13:00-14:00
          Number of Query Transitions (to)   Initial Ratios of Queries (to)
From      U      P      M      END           U          P          M          END
U         29     47     29     71            0.164773   0.267045   0.164773   0.403409
P         13     48     15     34            0.118182   0.436364   0.136364   0.309091
M         4      15     28     25            0.055556   0.208333   0.388889   0.347222
END       0      0      0      1             0          0          0          1

53

Table 13. The frequency matrix for transition from one type of query to another and the initial ratios for the transitions (U: Unique queries, P: Next Page queries, M: Modified Queries) - Excite Query Set.

14:00 – 15:00
       Number of Query Transitions    Initial Ratios of Queries
from   to U   to P   to M   END       to U       to P       to M       END
U      18     46     20     72        0.115385   0.294872   0.128205   0.461538
P      11     73     14     30        0.085938   0.570313   0.109375   0.234375
M      7      9      15     18        0.142857   0.183673   0.306122   0.367347
END    0      0      0      1         0          0          0          1

15:00 – 16:00
       Number of Query Transitions    Initial Ratios of Queries
from   to U   to P   to M   END       to U       to P       to M       END
U      19     46     19     55        0.136691   0.330935   0.136691   0.395683
P      13     70     6      34        0.105691   0.569106   0.04878    0.276423
M      2      7      17     16        0.047619   0.166667   0.404762   0.380952
END    0      0      0      1         0          0          0          1

16:00 – 17:00
       Number of Query Transitions    Initial Ratios of Queries
from   to U   to P   to M   END       to U       to P       to M       END
U      14     34     16     60        0.112903   0.274194   0.129032   0.483871
P      3      30     7      32        0.041667   0.416667   0.097222   0.444444
M      6      8      5      9         0.214286   0.285714   0.178571   0.321429
END    0      0      0      1         0          0          0          1
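The relation between the two halves of Table 13 can be checked with a few lines of code. The sketch below assumes that each initial ratio is simply the transition count divided by its row total, with END included in the total; the function name `row_ratios` is ours and does not come from the paper. The counts are the 14:00 – 15:00 block.

```python
def row_ratios(counts):
    """Divide each count by its row total; a row with zero total stays all-zero."""
    ratios = []
    for row in counts:
        total = sum(row)
        ratios.append([c / total if total else 0.0 for c in row])
    return ratios


# 14:00 - 15:00 counts from Table 13; columns are to-U, to-P, to-M, END.
counts_14_15 = [
    [18, 46, 20, 72],  # from U
    [11, 73, 14, 30],  # from P
    [7, 9, 15, 18],    # from M
]

ratios_14_15 = row_ratios(counts_14_15)
for label, row in zip("UPM", ratios_14_15):
    print(label, [round(r, 6) for r in row])
```

Each printed row can be compared against the ratio columns of the 14:00 – 15:00 block; every row of ratios sums to one because END is part of the row total.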


Table 14. Hourly long-term ratios for unique queries, next page queries and modified queries – Excite Query Set.

                  Types of Queries
Hour of the Day   Unique (πU)   Next Page (πP)   Modified (πM)
9:00-10:00        0.253447      0.592505         0.154048
10:00-11:00       0.241728      0.555489         0.202783
11:00-12:00       0.236007      0.545937         0.218055
12:00-13:00       0.144447      0.562393         0.293159
13:00-14:00       0.157657      0.493626         0.348717
14:00-15:00       0.155253      0.605526         0.239221
15:00-16:00       0.142427      0.639886         0.217687
16:00-17:00       0.144913      0.651248         0.203838
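The 14:00-15:00 row of Table 14 can be reproduced from the Table 13 counts under one assumption that is not stated in this part of the paper: the END column is dropped, each row is renormalised over U, P and M only, and the long-term ratios are the stationary distribution of the resulting three-state Markov chain. The sketch below is a minimal illustration of that reading, using plain power iteration; the function name `long_term_ratios` and the iteration count are ours.

```python
def long_term_ratios(counts, iterations=500):
    """Stationary distribution of the U/P/M chain, found by power iteration.

    Assumption: the END column (index 3) is ignored and each row is
    renormalised over the U, P and M columns only.
    """
    matrix = []
    for row in counts:
        upm = row[:3]
        total = sum(upm)
        matrix.append([c / total for c in upm])

    pi = [1.0 / 3.0] * 3  # start from the uniform distribution
    for _ in range(iterations):
        # One step of pi <- pi * matrix (row vector times matrix).
        pi = [sum(pi[i] * matrix[i][j] for i in range(3)) for j in range(3)]
    return pi


# 14:00 - 15:00 counts from Table 13; columns are to-U, to-P, to-M, END.
counts_14_15 = [
    [18, 46, 20, 72],  # from U
    [11, 73, 14, 30],  # from P
    [7, 9, 15, 18],    # from M
]

pi_14_15 = long_term_ratios(counts_14_15)
print([round(p, 6) for p in pi_14_15])
```

Under this assumption the result agrees with the published 14:00-15:00 values (0.155253, 0.605526, 0.239221) to about five decimal places. Power iteration is used only to keep the sketch dependency-free; solving the linear system directly would give the same distribution.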


Figure 14. Hourly long-term ratios for unique queries, next page queries and modified queries - Excite Query Set. [Line chart: x-axis Hour; y-axis Long-term ratios, 0 to 0.7; series U, P and M.]


Table 15. Paired t-testing results for comparing the long-term ratios of different query types with respect to hours - Excite Query Set.

Row hours: 9:00-10:00, 10:00-11:00, 11:00-12:00, 12:00-13:00, 13:00-14:00, 14:00-15:00, 15:00-16:00

Column hour(s)                          Entries
10:00-11:00                             2.791 06
11:00-12:00, 12:00-13:00, 13:00-14:00   5.712 -06; 1.949 -06; -1.97 -06; 1.529 -05; 1.32 -06; -3.59 -06; -9.493 07; -5.79 -06; -9.16 -06
14:00-15:00                             -2.58 06; -4.8 -06; -7.79 06; -9.81 06; 8.5 -07
15:00-16:00                             2.393 -08; -1.301 06; -3.467 06; -3.2 -06; 2.4 -06; 7.986 -06
16:00-17:00                             3.42 406; 2.069 06; -5.25 08; 8.425 07; 4.313 06; 1.349 05; 2.501 05


Table 16. Top twenty session types with respect to hours of the day (U: Unique queries, P: Next Page queries, M: Modified Queries) - Excite Query Set.

                  Hours of the Day
Type of Session   9:00-10:00   10:00-11:00   11:00-12:00   12:00-13:00   13:00-14:00   14:00-15:00   15:00-16:00   16:00-17:00
U                 50           56            58            66            55            58            45            49
UP                28           17            14            15            12            15            13            18
UU                12           12            7             10            8             4             5             2
UM                6            7             10            10            10            8             5             4
UPP               6            3             2             5             5             1             3             3
UUU               5            1             2             1             1             1             2             3
UUP               4            2             1             0             1             0             3             0
UMM               3            2             2             1             3             0             5             1
UPPP              2            4             1             2             1             5             1             1
UMU               2            1             2             0             0             1             0             0
UUUU              2            1             0             0             1             0             0             0
UPPPP             1            3             0             0             1             1             0             0
UPU               1            1             5             2             1             1             3             1
UPM               1            1             1             1             2             2             1             2
UUM               1            1             1             1             0             0             1             0
UPPUM             1            1             1             0             0             0             0             0
UPPMP             1            1             0             0             0             0             0             0
UUPPP             1            1             0             0             0             0             0             0
UPUU              1            0             2             1             0             1             0             0
UPPPPPP           1            0             0             0             1             1             0             0


Table 17. Summary of changes in percentage in various query and session characteristics from morning hours to later in the day, i.e. afternoon, evening and night hours - Excite and Fast query sets.

Query or session characteristic   Excite Query Set                          Fast Query Set
Arrival of queries                67% decrease from morning to afternoon    75% decrease from morning to evening and night
Arrival of sessions               42% decrease from morning to afternoon    50% decrease from morning to evening and night
Duration of queries               86% decrease from morning to afternoon    70% decrease from morning to evening and night
Duration of sessions              92% decrease from morning to afternoon    70% decrease from morning to evening and night
Mean queries per session          43% decrease from morning to afternoon    No significant changes
Mean terms per query              No significant changes                    No significant changes
Most frequently used terms        No significant changes                    Sexual terms among top terms in evening hours; versus no sexual terms among top terms in morning hours


Poisson sampling can be applied in two cases: continuous time sampling and discrete time sampling. For continuous time sampling, selecting the next sample point is comparatively easy. The random time to the next sample is generated from an exponential distribution with parameter λ (inter-arrival time of the next sample x ~ Exp(λ)). The random number generator for the exponential distribution can be derived from its cumulative distribution function (cdf), given in Equation 1.

$$
F(x) = \int_{-\infty}^{x} f(t)\,dt =
\begin{cases}
1 - e^{-\lambda x}, & x \ge 0 \\
0, & x < 0
\end{cases}
\tag{1}
$$

(2)
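Inverting Equation 1 for a uniform random number u gives x = -ln(1 - u) / λ, which is the standard inverse-transform generator for the exponential distribution. The sketch below illustrates continuous time Poisson sampling built on that generator. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the seed and the parameter values are ours, and the published generator may be written in an equivalent form (for example with u in place of 1 - u).

```python
import math
import random


def next_sample_gap(lam, rng):
    """Draw an exponential gap by inverting F(x) = 1 - exp(-lam * x)."""
    u = rng.random()  # u lies in [0, 1), so 1 - u lies in (0, 1]
    return -math.log(1.0 - u) / lam


def continuous_sample_times(lam, horizon, seed=0):
    """Return sample times in (0, horizon] separated by Exp(lam) gaps."""
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        t += next_sample_gap(lam, rng)
        if t > horizon:
            return times
        times.append(t)


# Hypothetical parameters: rate 2 samples per time unit over 1000 time units.
sample_times = continuous_sample_times(lam=2.0, horizon=1000.0)
mean_gap = sample_times[-1] / len(sample_times)
print(len(sample_times), round(mean_gap, 3))
```

With rate λ the mean gap between samples should be close to 1/λ, which gives a quick sanity check on the generator.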

The other case of Poisson sampling, discrete time sampling, is used when the stochastic process under observation has discrete arrivals. For discrete stochastic arrival processes, sampling is done by randomly generating a number u ~ Uniform(0,1) and then finding the corresponding n, the number of arrivals to skip before the next sample, using a Poisson process with parameter λ > 0, {N(t), t ≥ 0}. Note that it is the inter-arrival times of the samples that are distributed according to the Poisson process, not the inter-arrival times of the process from which the samples are taken. The probability mass function of the Poisson process is given in Equation 3.

$$
P(n = k) = \frac{\lambda^{k} \exp(-\lambda)}{k!}, \qquad \lambda > 0, \quad k = 0, 1, \ldots
\tag{3}
$$

However, the analytical inverse of Equation 3 is not available. Therefore, the following algorithm is used to generate the Poisson variate n (Mann, et al, 1974):

Step 1: Set j = 0 and y_j = u_0, where u_j ~ Uniform(0,1), j = 0, 1, ...
Step 2: If y_j ≤ exp(-λ), return n = j and terminate.
Step 3: Set j = j + 1 and y_j = u_j y_{j-1}. Go to Step 2.

As in the continuous sampling case, after each sample another random n is generated using the algorithm above.

The Excite and FAST Web query sessions arrive according to a discrete stochastic process. Although no study is available on the type of stochastic process that Web query sessions follow, the sampling strategy is not affected, owing to the fourth property of Poisson sampling. The data used in this study have a time stamp for each query entry; however, the time stamps are not fine-grained enough to determine the stochastic arrival process. The smallest time unit of the time stamps was one second and, on average, there were 31.8 arrivals in each second. One can argue that if the sampling time units are set in seconds, the arrival process can be considered continuous in time, so that continuous time sampling becomes applicable. However, this question is not addressed in this study. To be on the safe side, we apply discrete time Poisson sampling in the analysis of the data set.
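Steps 1 to 3 can be sketched in a few lines of code. The first function below follows the three steps directly. The second shows one possible way to use the variate for discrete time sampling, and it rests on assumptions of ours: n arrivals are skipped before the first sample and again after every sample, and the input is an ordered list of arrivals. The function names, the seed and the parameter values are hypothetical.

```python
import math
import random


def poisson_variate(lam, rng):
    """Multiply uniforms until the running product is at most exp(-lam)."""
    threshold = math.exp(-lam)
    j = 0
    y = rng.random()      # Step 1: y_0 = u_0
    while y > threshold:  # Step 2: terminate once y_j <= exp(-lam)
        j += 1            # Step 3: y_j = u_j * y_{j-1}
        y *= rng.random()
    return j


def poisson_sample(arrivals, lam, seed=0):
    """Pick arrivals, skipping n ~ Poisson(lam) arrivals before each pick.

    Assumption: a fresh n is drawn before the first pick and after every
    pick; the source does not spell out this detail.
    """
    rng = random.Random(seed)
    picked = []
    i = poisson_variate(lam, rng)
    while i < len(arrivals):
        picked.append(arrivals[i])
        i += 1 + poisson_variate(lam, rng)
    return picked


# Hypothetical usage: sample from 100 session identifiers with lam = 4.
sessions = list(range(100))
print(poisson_sample(sessions, lam=4.0))
```

With this scheme roughly one arrival in every λ + 1 is selected on average, so λ controls the sampling fraction.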
