Uni Duisburg-Essen

Information on courses at the University of Duisburg-Essen

Thesis topics

In general, the topics listed below are suitable for both Bachelor’s and Master’s theses. However, the requirements differ between the two levels, e.g., regarding the depth of the literature review, the sophistication of the solution provided, and the discussion.

While the topics in this document are described in English, theses can be written in either English or German.

To make yourself familiar with the research from our group, please see https://searchstudies.org. When applying for a topic, I expect you to have already looked at the relevant literature and work from our group. Many of the topics listed below are related to projects, especially RAT (see https://searchstudies.org/research/rat/).

As a general introduction to the topic of search engines, please see my book “Understanding Search Engines” / “Suchmaschinen verstehen”.

Interested in a topic from this list? These are the next steps

If you are interested in one of the topics below, please send me a short expression of interest (max. 1 page) detailing why you are interested in that topic in particular, and what courses you have taken that provided you with the knowledge needed to work on the chosen topic. Please send your e-mail to dirk.lewandowski@uni-due.org. I will get back to you within one week.

If your expression of interest is accepted, you are required to write a short exposé (4 pages) detailing the outline of the thesis and providing a first literature review. I will provide you with a checklist detailing what should be included in the exposé.

Topics

[reserved] Measuring the readability of web documents

One way of classifying documents and selecting those appropriate for a particular user group is to rate them according to their readability. To this end, various readability formulas have been developed (e.g., Flesch Reading Ease). In this thesis, you will develop a tool that computes selected readability scores for web documents.
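
As an illustration, here is a minimal sketch of the English Flesch Reading Ease score (the German adaptation by Amstad uses different coefficients); the syllable counter is a crude vowel-group heuristic rather than a dictionary-based one, so treat the output as approximate.

```python
# Minimal sketch of Flesch Reading Ease (English formula); the syllable
# counter is a rough vowel-group heuristic, not a dictionary lookup.
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels; floor at one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(1, len(words))
    # FRE = 206.835 - 1.015 * (words/sentence) - 84.6 * (syllables/word)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

print(flesch_reading_ease("The cat sat on the mat. It was happy."))  # high score = easy text
```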

Work to be done (preliminary list):

  • Identify appropriate readability formulas for English and German (additional languages optional)
  • Develop software that extracts the main component (i.e., the main text) of an HTML document and analyses its readability
  • Compute readability measures for individual documents and for sets of documents (e.g., the top 10 results from a search engine)
  • Use test data (which will be provided) to evaluate the main-text extraction and the differences in readability scores across the different formulas

This topic is related to the RAT project. If the work is completed successfully and the student so wishes, the developed software can be integrated into the RAT software toolkit.

[reserved] Sentiment analysis for search result snippets

The aim of sentiment analysis is to determine the tone (e.g., positive vs. negative) of a document. In search, sentiment analysis can be used to provide a result set that covers different sentiments toward the topic searched for. In the evaluation of commercial search engines, it is interesting to know how the top results are mixed in terms of sentiment, and what overall tone the descriptions (“snippets”) on a search engine result page have. This is important because, most of the time, users only consider the first few results on a result page.
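
As an illustration of an algorithm suited to short texts, here is a minimal sketch using VADER, a lexicon-based sentiment model designed for social-media-style snippets (pip install vaderSentiment); it is one candidate among those to be reviewed, not a prescribed choice.

```python
# Minimal sketch: scoring snippet texts with the lexicon-based VADER model.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
snippets = [
    "Great new phone with an outstanding camera",
    "The update breaks everything and support is useless",
]
for snippet in snippets:
    # polarity_scores returns neg/neu/pos proportions plus a compound score in [-1, 1]
    scores = analyzer.polarity_scores(snippet)
    print(f"{scores['compound']:+.2f}  {snippet}")
```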

Work to be done (preliminary list):

  • Provide an overview of sentiment analysis algorithms appropriate for short texts (such as tweets or search result snippets).
  • Develop software that extracts snippets (title, description, URL) from Google’s search engine result pages and applies one or more selected sentiment analysis algorithms, either to individual snippets or to all snippets found up to a certain result position (e.g., 10).
  • Compute statistics over all results for a particular query, over results for a number of queries, and over result positions. Provide a visualization of these statistics.

This topic is related to the RAT project. If the work is completed successfully and the student so wishes, the developed software can be integrated into the RAT software toolkit.

[reserved] A classifier for web documents in the health domain

Web documents can be classified for various purposes, e.g., finding websites covering different topics or distinguishing different types of websites (encyclopedia, news, blog, etc.). We are interested in classifying documents in the health domain. The test set comprises 4,855 manually labelled websites; the classes are clinic, medical office, information (commercial), information (non-commercial), journalistic, information (public authority), online shop, insurance company, pharmacy, video (e.g., YouTube), social media, and Google service.

In this thesis, a classifier is to be trained that classifies newly incoming documents according to the taxonomy of the test set.
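
As a baseline illustration (not a prescription of the final learning approach), here is a minimal sketch using a TF-IDF representation and logistic regression in scikit-learn; the load_corpus() helper standing in for the provided data is hypothetical.

```python
# Minimal sketch of a TF-IDF + logistic regression baseline with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical loader: returns the page texts and their manual class labels.
texts, labels = load_corpus()

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
clf = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # per-class precision/recall/F1
```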

Work to be done (preliminary list):

  • Provide a literature review of website classification and learning approaches.
  • Develop a classifier trained on a part of the test data provided.
  • Thoroughly evaluate the classifier on the data provided.
  • Evaluate the classifier with newly incoming unlabeled data.
  • Implement the classifier so that a list of websites with their properties can be uploaded, and the classification outcome downloaded (data structure will be provided).

This topic is related to the RAT project. If the work is completed successfully and the student so wishes, the developed software can be integrated into the RAT software toolkit.

Required: The student should have successfully passed the Information Mining exam.

[reserved] A forum scraper for RAT

In the RAT project, we scrape different types of web documents. Copies of these documents are stored in a database. However, these copies capture the whole webpage, are static, and are not updated. In this thesis, software is to be developed that regularly visits forum pages of German news media and collects new comments, based on a list of articles (i.e., URLs from these websites). The resulting software can be used to support studies in which forum comments are to be monitored.
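
To make the intended polling behaviour concrete, here is a minimal sketch; the CSS selector, comment-ID attribute, article_urls list, store() function, and hourly schedule are hypothetical placeholders, as every news portal requires its own extraction rules.

```python
# Minimal polling-loop sketch; the CSS selector and comment-ID attribute are
# hypothetical placeholders, as every news portal needs its own extraction rules
# (and robots.txt / terms of service must be respected).
import time

import requests
from bs4 import BeautifulSoup

seen_ids: set[str] = set()  # in the real tool, this lookup would hit the database

def collect_new_comments(article_url: str) -> list[dict]:
    html = requests.get(article_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    new_comments = []
    for node in soup.select("div.comment"):       # hypothetical selector
        comment_id = node.get("data-comment-id")  # hypothetical attribute
        if comment_id and comment_id not in seen_ids:
            seen_ids.add(comment_id)
            new_comments.append({"id": comment_id, "text": node.get_text(strip=True)})
    return new_comments

while True:  # naive scheduler for illustration; a real tool would use cron or similar
    for url in article_urls:  # the provided list of article URLs
        store(collect_new_comments(url))  # hypothetical database write
    time.sleep(3600)  # revisit hourly
```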

Work to be done (preliminary list):

  • Review the literature on studying forum comments and on extracting forum comments.
  • Develop software that, given a list of URLs of articles from major German news portals (like Zeit.de, Spiegel.de, faz.net), regularly visits these URLs, extracts new comments, and stores them in a database. Results should be downloadable in a structured format.
  • Evaluate the software.

This topic is related to the RAT project. If the work is completed successfully and the student so wishes, the developed software can be integrated into the RAT software toolkit.

App review scraper for RAT

In the RAT project, we scrape different types of web documents. Copies of these documents are stored in a database. However, these copies capture the whole webpage, are static, and are not updated. In this thesis, software is to be developed that scrapes reviews from app stores (such as Apple’s App Store and Google’s Play Store). The software should regularly scrape the given URLs for new reviews and add them to the database. The resulting software will allow researchers to systematically compare reviews and extract relevant information from them.
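
As one possible starting point for Apple’s store, here is a minimal sketch using the public iTunes RSS feed for customer reviews; the feed’s JSON layout is assumed from its current form and may change, and Google’s Play Store has no comparable feed, so it would need a different mechanism (e.g., a dedicated scraping library).

```python
# Minimal sketch using Apple's public RSS feed for customer reviews; the JSON
# layout shown is assumed from the feed's current form and may change.
import requests

def fetch_app_store_reviews(app_id: str, country: str = "de") -> list[dict]:
    url = (f"https://itunes.apple.com/{country}/rss/"
           f"customerreviews/id={app_id}/sortBy=mostRecent/json")
    entries = requests.get(url, timeout=10).json().get("feed", {}).get("entry", [])
    return [
        {
            "review_id": entry["id"]["label"],
            "rating": entry["im:rating"]["label"],
            "title": entry["title"]["label"],
            "text": entry["content"]["label"],
        }
        for entry in entries
    ]
```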

Work to be done (preliminary list):

  • Review the literature on studying user reviews and on extracting reviews.
  • Develop software that, given a list of app URLs from major app stores, regularly visits these URLs, extracts new reviews, and stores them in a database. Results should be downloadable in a structured format.
  • Evaluate the software.

This topic is related to the RAT project. If the work is completed successfully and the student so wishes, the developed software can be integrated into the RAT software toolkit.

An investigation of p-hacking in interactive information retrieval

P-hacking is the malicious practice of tuning the results of studies that use hypothesis testing so that they reach a certain significance level and the study becomes “publishable”. While this practice has been researched in psychology, there are no studies investigating sub-disciplines of computer science such as interactive information retrieval (IIR). In this thesis, the phenomenon of p-hacking is to be investigated in that area.

Work to be done (preliminary list):

  • Provide a literature review of the problem of p-hacking and studies done so far.
  • Describe methods to detect p-hacking in published studies (see the sketch after this list for one such method).
  • Build a corpus of hypothesis testing studies from interactive information retrieval.
  • Test results from this corpus for the occurrence of p-hacking, using standard methods.
  • Discuss the results and make suggestions (based on the literature).
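
One standard detection method is the caliper test (Gerber & Malhotra), which compares how many reported test results fall just below versus just above the significance threshold; the sketch below is a simplified version applied directly to p-values, with made-up illustration data.

```python
# Minimal sketch of a (simplified) caliper test on reported p-values: under the
# null of no p-hacking, p-values just below and just above .05 should be about
# equally frequent. The p-value list is made-up illustration data.
from scipy.stats import binomtest

reported_p = [0.049, 0.047, 0.051, 0.044, 0.048, 0.046, 0.052, 0.049, 0.045, 0.043]
caliper = 0.0075  # half-width of the window around the .05 threshold
below = sum(1 for p in reported_p if 0.05 - caliper <= p < 0.05)  # "just significant"
above = sum(1 for p in reported_p if 0.05 <= p < 0.05 + caliper)  # "just not significant"

# One-sided binomial test: is there an excess of "just significant" values?
result = binomtest(below, below + above, p=0.5, alternative="greater")
print(f"{below} just below vs. {above} just above .05, p = {result.pvalue:.3f}")
```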

Required: KOMEDIA student

[reserved] Scraping the additional information box from Google’s search engine result pages

On its result pages, Google provides additional information for each result when a user clicks on the three dots shown next to the result’s title. An info box shows relevant information on that result, including information on the source and whether the connection to that website is secure. Furthermore, when a user clicks on “cache”, the document is shown in the version in which Google last indexed it. The cache page thus shows a copy of that page, but also the indexing date. The task in this thesis is to develop software that, given a Google search engine result page, extracts the information from the info box as well as the date the page was last indexed. The extracted information can be used to systematically evaluate the properties of results shown at the top positions of Google, and also to measure the freshness of Google’s database of web pages.
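
As an illustration of the extraction step, here is a minimal sketch that parses the indexing date from a cache page; the banner wording (“as it appeared on … GMT”) is an assumption based on the page’s current form and may change without notice.

```python
# Minimal sketch of parsing the indexing date from a Google cache page. The
# banner wording ("as it appeared on ... GMT") is an assumption based on the
# page's current form and may change without notice.
import re
from datetime import datetime

def parse_cache_date(cache_html: str) -> datetime | None:
    match = re.search(r"as it appeared on (\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2}) GMT",
                      cache_html)
    if match is None:
        return None  # banner not found or wording changed
    return datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
```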

Work to be done (preliminary list):

  • Review the literature on extracting information from search engine result pages, with special emphasis on the freshness of search engine databases and methods for measuring freshness.
  • Develop software to extract information from the info boxes and cache pages, given a query. The software should allow for downloading the results (including page URL, result position, information from the infobox (structured), and date last indexed).
  • Evaluate the success of extracting the information based on a test set of queries (will be provided).

This topic is related to the RAT project. If the work is completed successfully and the student so wishes, the developed software can be integrated into the RAT software toolkit.

Extending the classification of SEO indicators

Search engine optimization (SEO), i.e., the optimization of one’s own content in order to be listed preferentially by commercial search engines such as Google, has an enormous influence on the results displayed by search engines at the top ranks. In previous work within the SEO Effect project, a list of indicators for automatically assessing the presence of SEO on websites was compiled and verified in empirical studies.

In this thesis, further indicators that have already been identified are to be added to the model and tested for their suitability. Main steps:

  • Identify suitable indicators.
  • Develop a system that recognizes these indicators on HTML pages (see the sketch after this list).
  • Evaluate the system using real search result data. (The data for the evaluation will be provided.)
  • Statistically analyse and interpret the results.
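
As an illustration of the recognition step, here is a minimal sketch that checks raw HTML for a few common on-page indicators; the indicator set shown is illustrative and not the verified list from the SEO Effect project.

```python
# Minimal sketch checking raw HTML for a few common on-page SEO indicators;
# the indicator set is illustrative, not the SEO Effect project's verified list.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def seo_indicators(html: str) -> dict[str, bool]:
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title_tag": soup.title is not None and bool(soup.title.get_text(strip=True)),
        "meta_description": soup.select_one('meta[name="description"]') is not None,
        "canonical_link": soup.select_one('link[rel="canonical"]') is not None,
        "open_graph_tags": soup.select_one('meta[property^="og:"]') is not None,
        "h1_present": soup.find("h1") is not None,
    }
```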

If the classifier is completed successfully and the student so wishes, it can be integrated as an analysis component into the Result Assessment Tool (RAT).

Does Google prefer websites with AdSense advertising?

With AdSense, Google offers website operators the opportunity to monetise their content via advertising that matches it: text ads are generated on the basis of the content of the web page on which they appear. This form of advertising complements Google’s successful text ads, which are displayed in response to a search query in the search engine.

Since Google earns money with every clicked AdSense ad, it would be commercially understandable if documents carrying such ads were displayed preferentially in the search engine ranking. This thesis will investigate this assumption. For this purpose:

  • A system will be developed that recognises whether and in what form AdSense advertising is integrated into any given website (see the sketch after this list).
  • On the basis of a set of search queries and the associated top results, it will be determined at which result positions documents with AdSense advertising can be found. (Appropriate software for querying the search engines and collecting the results will be provided.)
  • The results will be statistically evaluated and interpreted. The interpretation should address, in particular, the question of causality vs. correlation.
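
As an illustration of the recognition step, here is a minimal sketch that checks a page for the usual AdSense integration patterns, namely the googlesyndication.com ad script and the adsbygoogle ad slots; a production detector would need to cover further variants.

```python
# Minimal sketch checking a page for the usual AdSense integration patterns:
# the googlesyndication.com ad script and <ins class="adsbygoogle"> ad slots.
# A production detector would need to cover further variants (e.g., iframes).
import requests
from bs4 import BeautifulSoup

def has_adsense(url: str) -> bool:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    ad_script = soup.find("script", src=lambda s: s and "googlesyndication.com" in s)
    ad_slot = soup.find("ins", class_="adsbygoogle")
    return ad_script is not None or ad_slot is not None
```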

If the work is completed successfully and the student so wishes, the software can be integrated as an analysis component into the Result Assessment Tool (RAT).

Replication of Heinström's study on fast surfers, broad scanners and deep divers

Jannica Heinström published an influential study in 2005 in which she grouped people into fast surfers, broad scanners and deep divers based on their information-seeking behaviour. This division is based on a survey in which the five-factor model (Big Five) was used in addition to scales on information behaviour.

In this thesis, Heinström's study will be critically evaluated and empirically replicated. So far, the applicability of the results to other user groups, the validity of the findings over time and the role of situational versus personal factors have not been sufficiently investigated.

[reserved] Commerciality of web documents

On the web, countless websites compete with each other for the users’ attention. With regard to commercial search engines, this raises the question of the extent to which commercial offers are shown preferentially at the top result positions, thus displacing other, non-commercial or less commercial offers.

In this thesis, a classifier will be developed and evaluated that classifies individual web documents according to their commerciality. For this purpose, different ways of classifying documents by commerciality will first be compared. For the classification scheme selected (or developed from this comparison), a classifier is to be trained on a manually classified set of documents. This classifier will be evaluated in the last step.

If the classifier is completed successfully and the student so wishes, it can be integrated as an analysis component into the Result Assessment Tool (RAT).