Using Query Reformulation to Compare Learning Behaviors in Web Search Engines

September 6, 2019


M. Tibau, S. W. M. Siqueira, B. Pereira Nunes, T. Nurmikko-Fuller and R. F. Manrique, “Using Query Reformulation to Compare Learning Behaviors in Web Search Engines,” 2019 IEEE 19th International Conference on Advanced Learning Technologies (ICALT), Maceió, Brazil, 2019, pp. 219-223.
doi: 10.1109/ICALT.2019.00054
keywords: {query reformulation;query states;searching as learning;Web search engine;exploratory search;knowledge-intensive process},
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8820932&isnumber=8820810

Abstract: Web search engines have gained importance as tools capable of connecting informal and self-learning with formal learning by aiding individuals in retrieving relevant information through the formulation and modification of their queries. Understanding the differences between query states and their transitions becomes increasingly important, as doing so makes it possible to optimize search engines’ results according to educational uses and needs. This paper introduces the ESKiP Taxonomy of Query States, a classification framework validated in an experiment involving two different query log datasets. It enables the comparison between the behaviors of users in search of knowledge (learners) and users performing transactional or factual searches in Web search engines.

 


Exploratory Search as a Knowledge-intensive Process

May 3, 2019


Abstract: This paper presents an exploratory search model capable of assisting the visualization of search patterns and clarifying best practices associated with users’ decision-making process, with implications in areas related to information retrieval, human-computer interaction, Web searching and educational technology. The Exploratory Search Knowledge-Intensive Process model considers tasks and search activities as part of a chain of actions that help clarify the reasons why a subject is searched. It also supports the visualization of how the information retrieved is used to define decision criteria about which data is worth extracting, to draw inferences, and to create a shortcut to understanding.

Marcelo Tibau, Sean W. M. Siqueira, Fernanda Baião and Bernardo Pereira Nunes. 2018. Exploratory Search as a Knowledge-intensive Process. Euro American Conference on Telematics and Information Systems (EATIS ’18), November 12–15, 2018, Fortaleza, Brazil, 8 pages. https://doi.org/10.1145/3293614.3293618

A Correlation Index Between Two Different Text and Web Resource Classification Systems

May 3, 2019


Abstract: Classifying content on the Web has been a common subject of research, since the amount of available data on the Web, especially in text format, grows every day. This paper proposes a correlation index to measure how close a classification system based on Wikipedia categorization is to a service provided by IBM Watson that has the same purpose: text and resource classification on the Web.

Almeida, Rubia; Siqueira, Sean W.M.; Tibau, Marcelo; Queiroz, Jackson. A Correlation Index Between Two Different Text and Web Resource Classification Systems. EATIS ’18 Proceedings of the Euro American Conference on Telematics and Information Systems. Article No. 43. Fortaleza, Brazil — November 12 – 15, 2018. ACM New York, NY, USA ©2018 https://www.doi.org/10.1145/3293614.3293652

Investigating users’ decision-making process while searching online and their shortcuts towards understanding

September 9, 2018

 


Abstract: This paper presents how we apply the Exploratory Search KiP model, a model capable of assisting the visualization of search patterns and identifying best practices associated with users’ decision-making processes, to log analysis, and how it helps understand the process and decisions taken while carrying out a search. This study aims to model searches performed through web search tools and educational resources portals, to enable a conceptual framework to support and improve the processes of learning through searches. Applying the model to log analysis, we are able: (1) to see how the information retrieved is used to define decision criteria about which data are worth extracting; (2) to draw inferences and shortcuts to support understanding; (3) to observe how the search intention is modified during search activities; and (4) to analyze how the purpose that drives the search turns into real actions.

Tibau, M.; Siqueira, S.W.M.; Nunes, B. P.; Bortoluzzi, M.; Marenzi, I.; Kemkes, P. Investigating users’ decision-making process while searching online and their shortcuts towards understanding. In: 17th International Conference on Web-based Learning, 2018, Chiang Mai. Advances in Web-Based Learning (ICWL 2018), 17th International Conference, 2018.

DOI: doi.org/10.1007/978-3-319-96565-9_6

Modeling exploratory search as a knowledge-intensive process

September 9, 2018


Abstract: Searching as Learning and Information Seeking require exploratory search to be modeled for supporting learning. The present paper introduces a model of exploratory search that was applied to web searching in language teacher education, which promoted its evolution and validation, and enabled a visualization of search patterns and the learning process. This model was able to help clarify best practices associated with users’ decision-making process regarding suitable and unsuitable information and to capture the relevance of context variables, personal skills and expertise that users utilize as filters for the search.

Tibau, M., Siqueira, S.W.M., Nunes, B.P., Marenzi, I., Bortoluzzi, M.: Modeling exploratory search as a knowledge-intensive process. In: 2018 Proceedings of the 18th IEEE International Conference on Advanced Learning Technologies (ICALT 2018), Mumbai. IEEE, New York (2018).

DOI: 10.1109/ICALT.2018.00015

 

A summarization of Rio de Janeiro’s 2018 summer

April 13, 2018

This summarization is an adaptation of Edward Tufte’s illustration shown in his book “The Visual Display of Quantitative Information”. The original illustration comes from The New York Times (Fig 1).

tuftesOriginal

Fig 1: Edward Tufte’s illustration of New York City’s 2003 weather

Mine was created using the R packages dplyr and tidyr to preprocess and summarize a dataset collected from the Average Daily Temperature Archive website provided by the University of Dayton.
The chart itself (Fig 2) was created using the ggplot2 package. Temperatures are in Fahrenheit.

tufteWeatherChartRio

Fig 2: Adaptation from the original Tufte’s chart to Rio de Janeiro’s weather (summer 2018)

In my adaptation, the light brown series represents the average temperatures (max and min) from 1995 to 2017, while the dark brown series represents the mean temperature for each day along with a 95% confidence interval.
Analyzing the chart, it is possible to see that from January 1 to March 20, 2018, Rio de Janeiro had 34 days that were the hottest since 1995 and 1 day that was the coldest. The period shown corresponds to the Southern Hemisphere summer.
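The count of record days can be illustrated with a minimal sketch. It is written in Python with the standard library only, not with the R packages used for the chart, and the records, the tuple layout and the function name `record_days` are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (year, month, day, average temperature in Fahrenheit).
records = [
    (1995, 1, 1, 80.1), (2003, 1, 1, 84.0), (2017, 1, 1, 82.5), (2018, 1, 1, 85.2),
    (1995, 1, 2, 79.0), (2003, 1, 2, 86.3), (2017, 1, 2, 81.0), (2018, 1, 2, 83.9),
    (1995, 1, 3, 78.4), (2003, 1, 3, 80.2), (2017, 1, 3, 79.9), (2018, 1, 3, 77.0),
]

def record_days(records, target_year):
    """Return the calendar days on which target_year set a new high or low."""
    history = defaultdict(list)   # (month, day) -> temperatures from earlier years
    current = {}                  # (month, day) -> temperature in target_year
    for year, month, day, temp in records:
        if year == target_year:
            current[(month, day)] = temp
        elif year < target_year:
            history[(month, day)].append(temp)

    hottest, coldest = [], []
    for key, temp in sorted(current.items()):
        past = history.get(key)
        if not past:
            continue  # no baseline to compare against
        if temp > max(past):
            hottest.append(key)
        elif temp < min(past):
            coldest.append(key)
    return hottest, coldest

hottest, coldest = record_days(records, 2018)
print(len(hottest), "record-high days;", len(coldest), "record-low days")
```

Each day of the target year is compared only with the same calendar day in earlier years, which is the comparison the chart makes visually.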

Exploratory analysis and regression model helping fight Zika

March 2, 2017


Alerta Zika! was a collaborative event to explore the potential of data and technology to improve responses to the Zika virus (more information here). The Inter-American Development Bank organized it with the support of several partners, including Rio de Janeiro City Hall and some major universities based in the city. On December 2 and 3, 2016, about 10 registered teams explored epidemiological, environmental and social factors to understand and explain the progress of the disease. It was a day and a half of hard work added to the efforts to fight Zika in Rio de Janeiro. We gained access to a dataset with all the cases of Zika, Dengue and Chikungunya registered in the city of Rio de Janeiro during 2015 and 2016. To get to know the data, our team started to ‘play’ with the dataset and check the variables. In doing so, we became curious about Zika’s evolution pattern and its role in the outbreak of the early months of 2016.

Our hypothesis was that the disease propagation pattern and its correlations with time, city areas and weather could be used as an indicator of where and when the disease spreads, helping city officials decide the best ways to allocate resources. We then set as our goal the creation of a map of Rio de Janeiro showing the historical evolution of Zika over time and temperature. During several conversations with representatives from the municipal health department, we wondered whether a social development indicator could provide insights about the spreading pattern. We decided to include HDI (Human Development Index), known in Brazil as IDH, as a social parameter.

We then defined as our target variables the coordinates (latitude and longitude), the dates on which the cases occurred, temperature over the seasons, and Rio’s social development indicators for income, education level and longevity. Our preliminary tasks were to create a grid covering the Rio de Janeiro city map and a data frame aggregating the variables taken from the different datasets. Our first goal in the exploratory analysis was to explore the shape of the distributions. The grid helped us check where the cases were located, exploring areas of about 400 meters, which is the mosquito range, and cluster the patients’ cases into broader areas. It also allowed us to check how far the disease spread throughout the city and to identify the areas where most of the cases took place.
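One way to picture the grid is to bin each case coordinate into a square cell of about 400 meters and count cases per cell. The sketch below is a simplified Python illustration, not the code used at the event: the coordinates, the reference latitude and the flat-earth conversion from degrees to meters are all assumptions.

```python
import math
from collections import Counter

CELL_METERS = 400.0             # cell size taken from the mosquito range in the text
METERS_PER_DEG_LAT = 111_320.0  # approximate length of one degree of latitude

def grid_cell(lat, lon, ref_lat=-22.9):
    """Map a coordinate to (row, col) indices of a roughly 400 m grid.

    ref_lat is an assumed reference latitude for Rio de Janeiro, used to
    scale degrees of longitude into meters.
    """
    meters_per_deg_lon = METERS_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    row = math.floor(lat * METERS_PER_DEG_LAT / CELL_METERS)
    col = math.floor(lon * meters_per_deg_lon / CELL_METERS)
    return row, col

# Hypothetical case coordinates (latitude, longitude).
cases = [
    (-22.9020, -43.2100),
    (-22.9021, -43.2101),   # about 15 m from the first case
    (-22.9500, -43.4000),   # several kilometers away
]

# Count cases per cell; the busiest cells are candidate focus areas.
counts = Counter(grid_cell(lat, lon) for lat, lon in cases)
for cell, n in counts.most_common():
    print(cell, n)
```

Neighboring cells can then be merged to cluster cases into broader areas, at the cost of losing the 400 meter resolution.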

Performing a time-series analysis, we were able to identify a correlation between temperature and the number of cases. At this point, it is valuable to understand the mosquito life cycle. Aedes aegypti flourishes at temperatures between 23 and 28 degrees Celsius (about 73 to 82 degrees Fahrenheit). A few degrees below or above this threshold does not necessarily kill the mosquito, but makes the environment less favorable to its development, slowing it down. From the egg to the inoculation of the virus in an individual, there is a period of 20 to 25 days, so the previous month’s mosquito is responsible for the current month’s patient. As can be seen in the plots below, which compare the disease cases across the city by month with the daily temperature variation of the previous month, the outbreak during March and April (plots 3 and 4) follows the ideal conditions observed throughout February and March, when the temperature stayed within the 23 to 28 degrees Celsius threshold on most days. The red circles correspond to the areas with the majority of cases.

Plot 1     

img1.jpg

Plot 2

img2.jpg

Plot 3

img3.jpg

Plot 4

img4.jpg

Plot 5

img5.jpg

Plot 6

img6.jpg

Plot 7

img7.jpg

This led us to our first meaningful insight: the temperature from the previous month seems to affect the number of cases in the current month.
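The one-month lag can be checked numerically by pairing each month’s case count with the number of favorable days in the month before it. The following Python sketch uses invented monthly figures and hypothetical helper names (`favorable_days`, `pearson`); it only illustrates the comparison and does not reproduce the original analysis.

```python
import math

LOW_C, HIGH_C = 23.0, 28.0  # favorable range quoted in the text

def favorable_days(daily_temps_c):
    """Count the days whose temperature falls inside the favorable range."""
    return sum(1 for t in daily_temps_c if LOW_C <= t <= HIGH_C)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# How one month's count would be obtained from daily readings (hypothetical values).
print(favorable_days([22.0, 23.0, 25.5, 28.0, 29.1]))

# Hypothetical monthly data in calendar order: favorable days and reported cases.
favorable = [12, 25, 27, 10, 4]
cases = [150, 300, 900, 1000, 350]

# Pair each month's cases with the favorable days of the month before it.
lagged_x = favorable[:-1]
lagged_y = cases[1:]
print("same month:    ", round(pearson(favorable, cases), 2))
print("previous month:", round(pearson(lagged_x, lagged_y), 2))
```

With real data, a clearly higher coefficient for the lagged pairing than for the same-month pairing would support the insight; a correlation on a handful of months is weak evidence on its own.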

As we shifted our attention to the social indicator data at hand, we identified a curious behavior: some critical areas during the outbreak shared a low IDH coefficient. The plot below provides visual support. The size of the orange circles relates to the level of IDH: the smaller the circle, the lower the IDH, and vice versa.

Plot 8

img8.jpg

The highlighted areas on the plots above correspond to Maré (a neighborhood), the far-north zone and the far-west zone of the city. These areas have lower social indicators compared with other parts of the municipality.

Income alone does not seem to be a social influence affecting the outbreak, as the plot below shows by comparing the south zone, the wealthiest part of the city, with the cases in March and April. There is, however, a peculiarity to consider: this particular area has a huge economic disparity. The most exclusive addresses are within walking distance of some favelas (slums usually located on the hills around the area), where IDH values are similar to those in the previous plot.

Plot 9

img9.jpg

From this observation, we drew a second meaningful insight: the social indicators (IDH) seem to act as an influence in the areas with the highest number of cases during the outbreak.

As we went further in our analysis, another curious behavior caught our attention. Even as the temperature moved away from the 23 to 28 degrees Celsius threshold, some areas kept appearing among those with the most cases (as can be seen in the comparisons below, from May to July).

Plot 10

img10.jpg

 

What these areas have in common is that they are all surrounded by woods and forests. This common factor provided the third meaningful insight we delivered: some recurrent disease focus areas seem to develop around or close to wooded and forested areas.

Exploratory analysis is usually a good start for predictive modeling because it helps to understand the datasets a little further and to summarize their main characteristics. Exploring the data and formulating hypotheses that could lead to new data collection and experiments is a major part of extracting usable information from data, suggesting hypotheses, and supporting the selection of appropriate statistical tools and techniques. Our main goal at the data expedition was to take a first step toward understanding past behavior, in order to prepare the ‘seeds’ for a future ‘crop’. Our third-place award was a source of pride for us and seems to indicate that this goal was accomplished.

After the Data Expedition

We continued exploring the data and aggregating other variables. Our goal was to build a predictive model that could add to the initial exploratory analysis. The new variables were population per neighborhood and rainfall. We also added more temperature data for the final months of 2016 and early January 2017.

The first choice was a simple linear regression using population per neighborhood to predict the number of cases. Below are some code chunks in R and statistical readings (we intend to show more information in a markdown file, a type of file in which text, code and plots can be shown together).

The model:

img1.jpg

A quick view of the dataset:

img2.jpg

“bairro” stands for neighborhood; “casos_zika” for Zika cases; and “populacao” for population.
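For readers who want to try the idea, here is a minimal least-squares sketch in Python (not the R code used in the post). The rows are invented; only the column names `bairro`, `casos_zika` and `populacao` come from the dataset described above, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
# Hypothetical rows in the shape described above: (bairro, casos_zika, populacao).
rows = [
    ("A", 12, 10_000),
    ("B", 22, 20_000),
    ("C", 35, 30_000),
    ("D", 39, 40_000),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

populacao = [row[2] for row in rows]
casos_zika = [row[1] for row in rows]

intercept, slope = fit_line(populacao, casos_zika)
print("cases per 10,000 inhabitants:", round(slope * 10_000, 2))
print("R squared:", round(r_squared(populacao, casos_zika, intercept, slope), 3))
```

The slope reads as the expected number of additional cases per additional inhabitant, which is why it is scaled to 10,000 inhabitants when printed.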

Some statistical readings from the RStudio console:

img3.jpg

Diagnostic Plots

In plot 1 (Residuals vs. Fitted), part of the residuals are spread equally around a horizontal line, but there are also outliers. In plot 2 (Normal Q-Q), the residuals seem to be normally distributed, at least to some extent.

Plot 1

plot1.png

 

Plot 2

plot2.png

 

In plot 3 (Scale-Location), complementing plot 1, some residuals are spread equally along the range of predictors, showing some homoscedasticity. Plot 4 (Residuals vs. Leverage) identified observations #120 and #23 as influential.

Plot 3

plot3.png

 

Plot 4

plot5.png
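The leverage axis of the Residuals vs. Leverage plot can be computed by hand for a single-predictor model. The Python sketch below uses invented population values and a common rule-of-thumb cutoff; it covers leverage only, not Cook’s distance, so it is an illustration of the idea rather than a reproduction of the diagnostic plot.

```python
def leverages(xs):
    """Leverage h_i = 1/n + (x_i - mean)^2 / sum((x_j - mean)^2) for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

# Hypothetical populations; the last neighborhood is far larger than the rest.
populacao = [10_000, 12_000, 15_000, 18_000, 200_000]

h = leverages(populacao)
n_params = 2                                 # intercept and slope
threshold = 2 * n_params / len(populacao)    # common rule of thumb, an assumption here
flagged = [i for i, value in enumerate(h) if value > threshold]
print("high-leverage observations (0-based index):", flagged)
```

An observation with high leverage is influential only if its residual is also large, which is what the diagnostic plot shows by combining both axes.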

Based on the hypothesis that the mosquito has a faster cycle when the temperature stays within the threshold of 23 to 28 degrees Celsius, we tried to check whether rainfall also helps its proliferation. We then tried to identify the relationship between these two variables and the number of Zika cases. Our second choice was to use a multiple regression model to meet this goal. This analysis was performed in Python.

The chart below shows that the months in which temperatures most often fell within the threshold are those between December 2015 and April 2016. We could also see that trend occurring again in December 2016 and early January 2017.

output_2_0.png

 

Green and red lines: temperature threshold

Pink and grey lines: min and max temperature

The next sequence of plots shows that there is a similar positive trend between the curves for the number of cases per week, the temperature and the rainfall. The analysis was based on the neighborhoods of Campo Grande (1), Santa Cruz (2), and Guaratiba (3), which were severely affected during the 2016 outbreak.

Blue line: cases per week

Red line: temperature under the threshold

Yellow line: rainfall

(1)

1.jpg

(2)

2.jpg

(3)

3.jpg

We decided to use the multiple regression model to build a predictive application. We tested the model through a series of plots comparing the actual data with the predictions of the fitted model on a test dataset.
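A multiple regression of this kind can be sketched with the normal equations. The example below is plain Python with invented weekly data generated from a known rule, so the fit can be checked; the choice of features (days inside the temperature threshold and rainfall in millimeters) and the helper names `solve`, `fit` and `predict` are assumptions, and the libraries used in the original analysis are not stated in the post.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(features, targets):
    """Least squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    x = [[1.0] + list(row) for row in features]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Apply the fitted coefficients to one row of features."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Hypothetical weekly data: (days inside the threshold, rainfall in mm) -> cases.
train_x = [(1, 10.0), (3, 5.0), (5, 40.0), (7, 20.0), (2, 30.0), (6, 0.0)]
train_y = [20 + 30 * d + 2 * r for d, r in train_x]   # generated from a known rule
test_x = [(4, 15.0), (0, 25.0)]
test_y = [20 + 30 * d + 2 * r for d, r in test_x]

coefs = fit(train_x, train_y)
for row, actual in zip(test_x, test_y):
    print(row, "actual:", actual, "predicted:", round(predict(coefs, row), 1))
```

Keeping the test rows out of the fitting step is what makes the comparison between actual and predicted values meaningful.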

Testing values Vs. Predicted values for Rio de Janeiro

output_16_0.png

Green dots: testing values

Gray dots: predicted values

Real Cases Vs. Predicted Cases for Rio de Janeiro

download.png

Green lines: Real Cases

Gray lines: Predicted Cases

Real Cases Vs. Predicted Cases Comparison for Rio de Janeiro – December 2015 & 2016

download (1).png

Green: Real Cases

Gray: Predicted Cases

In this particular case (above plot), we did not have the actual number of Zika cases for December 2016, so we only show the predicted number of cases.

Analysis per neighborhood: Campo Grande.

Statistical Readings from Jupyter Notebook console:

s1.jpg

Green line: real cases

Gray line: predicted cases

output_21_5.png

Analysis per neighborhood: Santa Cruz.

Statistical Readings

s2.jpg

Green line: real cases

Gray line: predicted cases

output_21_11.png

Analysis per neighborhood: Guaratiba.

Statistical Readings

s3.jpg

Green lines: testing values

Gray lines: predicted values

output_21_17.png

We created a prototype to apply the model. It’s a website with information about the number of Zika cases per month and charts showing the actual and predicted cases per neighborhood.

For those who would like to check it out, it’s available here.