Thesis Topics

Johann Mitloehner, Institute for Data, Process and Knowledge Management, 2024

Our institute has a procedure for assigning topics and supervisors; see dpkm/write-a-thesis for details. With me as supervisor, the thesis will be in one of the following areas:

Text Mining and Machine Learning

Text mining aims to turn written natural language into structured data that allow for various types of analysis which are hard or impossible on the text itself; machine learning aims to automate this process using a variety of adaptive methods, such as artificial neural nets which learn from training data. Typical goals of text mining are classification, sentiment detection, and other types of information extraction, e.g. named entity recognition (identifying people, places, and organizations) and relation extraction (e.g. the locations of organizations).
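As a minimal sketch of what "turning text into structured data" can look like, the following example assigns a sentiment label to short reviews by counting words from two small word lists. The word lists, the function names and the example reviews are made up for this illustration; a thesis project would use a published lexicon or a trained model instead.

```python
import re
from collections import Counter

# Hypothetical toy lexicons, chosen only for this illustration.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "broken"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def analyze(text):
    """Turn one free-text review into a structured record."""
    counts = Counter(tokenize(text))
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return {
        "tokens": sum(counts.values()),
        "positive": pos,
        "negative": neg,
        "label": label,
    }


# Made-up example reviews.
reviews = [
    "Great camera, I love the excellent battery life.",
    "The screen arrived broken and support was terrible.",
    "It is a phone.",
]
records = [analyze(r) for r in reviews]
for record in records:
    print(record)
```

Each review becomes a small record with counts and a label, which can then be aggregated, filtered or plotted like any other structured data.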

Connectionist methods, and deep learning in particular, have attracted much attention and achieved considerable success recently; these methods tend to work well on large training datasets, which require ample computing power. Our institute has recently acquired high-performance GPUs which are available for student use in thesis projects. It is highly recommended to use a framework such as PyTorch or TensorFlow/Keras for developing your deep learning application; the changes required to go from CPU to GPU computing will be minimal. This means that you can start developing on your PC or notebook, or on the Jupyter notebook server of the department, with a small subset of the training data; when you later move to the GPU server, the additional performance will make larger datasets feasible.

On text mining e.g.: Minqing Hu, Bing Liu: Mining and summarizing customer reviews. KDD '04: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177, ACM, 2004

For a more recent work and overview e.g.: Percha B. Modern Clinical Text Mining: A Guide and Review. Annu Rev Biomed Data Sci. 2021 Jul 20;4:165-187. doi: 10.1146/annurev-biodatasci-030421-030931. Epub 2021 May 26. PMID: 34465177.

Datasets can be found e.g. at Hugging Face and Kaggle.

Visualizing Data in Virtual and Augmented Reality

How can AR and VR be used to improve the exploration of data? Developing new methods for exploring and analyzing data in virtual and augmented reality presents many opportunities and challenges, both in terms of software development and design inspiration. There are various hardware options, ranging from Google Cardboard to more sophisticated and expensive devices such as Rift, Quest, and many others. Taking part in this challenge demands programming skills as well as creativity. The student will develop a basic VR or AR application for exploring a specific type of (open) data. The use of a platform-independent kit such as A-Frame is essential, as the application will be compared in a small user study to its non-VR version in order to identify advantages and disadvantages of the visualization method implemented. Details will be discussed with the supervisor.

Some References:

The aim of the work

To investigate your research question, you will use programming (most likely in Python) and basic statistical knowledge to implement the approach you have described and to evaluate the results.
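To illustrate the kind of basic statistics meant here, the following sketch evaluates a binary classifier on a small test set: it computes accuracy, precision and recall, and a 95% Wilson score interval for the accuracy. The labels and predictions are invented for the example, and the choice of the Wilson interval is an assumption of this sketch, not a requirement; which measures are appropriate depends on your research question.

```python
import math

# Hypothetical gold labels and predictions for ten test items (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]


def confusion(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


tp, tn, fp, fn = confusion(y_true, y_pred)
n = len(y_true)
accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
low, high = wilson_interval(tp + tn, n)

print(f"accuracy  = {accuracy:.2f} (95% interval {low:.2f} to {high:.2f})")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
```

With only ten test items the interval is wide, which is exactly the kind of observation that belongs in the evaluation section of a thesis.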

You are not expected to achieve a particular performance on your dataset or arrive at some desired result. Instead, you answer your research question, whatever that answer turns out to be. Science deals with what we know. What we want or believe belongs to politics and religion, which have no place in scientific work.

The thesis should deal with a problem in a scientific way - no new contributions to science have to be made but the mastery of the scientific working method should be shown, esp.

Structure of the work

The work should be structured roughly like this:

The word "I" is not commonly used in scientific texts; rather, the passive voice or some other formulation is used, such as "In this work .. is explored". If absolutely necessary you can use "the author" instead of "I".

The purpose of science is to expand knowledge and provide verifiable explanations about the world. Without access to your code, other people cannot verify or extend your work: create a free public GitHub repository (or use some similar publicly accessible site) for your work and provide a link in the thesis. Documentation is also part of software development and must not be neglected.

The goal must be reproducible research, which is easier to achieve using non-proprietary and generally available open and free resources rather than commercial software or services (such as software encumbered by non-free licenses or data available only with a paid subscription).

Difference between bachelor and master: For a bachelor thesis, scope and complexity requirements are lower; this also results in different numbers of pages (not including cover, table of contents, references, appendices):

LaTeX and BibTeX

The thesis is written in LaTeX, which is the standard in many scientific fields for good reason. Cross-references, tables and graphics are no problem even in very large documents. LaTeX provides basic citation support, but managing references in a consistent format of your choice is much easier with BibTeX.

On the web you will find a large number of short introductions and sample documents for LaTeX and BibTeX; in case you have not used LaTeX before: the moderate initial effort is worthwhile. You can install LaTeX on your own PC, but it is probably easier to use it on overleaf.com. Single-user access is free, all conceivable packages including BibTeX are installed, and it works in every web browser. The institute provides a template (section Bachelor Thesis / Formatting); create a project in Overleaf, upload the files from the template folder, and click compile.

English is the language of choice. Pay attention to reasonably correct grammar; it doesn't have to be Shakespeare, but we want to understand what you are saying. Use the spell check: the dotted red lines in Overleaf.

Plagiarism and Copyright

Plagiarism: the University uses software to automatically check for plagiarism when you submit your work for grading. The guidelines for plagiarism contain the following definition: "A work is considered to be plagiarized if texts, contents, or ideas produced by someone else are being passed off as the author's own work."

In addition to the guidelines, take note of the following to avoid problems in this area:

In a thesis you tell your story in your own words. Do not risk a plagiarism charge; even when discovered many years later the consequences for your career can be disastrous.

Copyright protects original works of authorship fixed in any tangible medium of expression. This is roughly the US definition; the Austrian one is similar (tangible expression of independent creative achievement). The requirements for originality tend to be moderate. Copyright exists automatically; a note such as (c) or © or some registration is not necessary (but tends to support a case).

Copyright infringement and plagiarism are two different things. Providing citations for content you are copying does not avoid liability for copyright infringement. In academic publications the subject of copyright is often treated lightly, especially when it comes to using images from other sources. It is recommended that you do not rely on fair use; only use content from other sources when you have explicit permission, e.g. when the copyright owner has explicitly allowed certain types of use by providing a statement to that effect. You can always illustrate something by creating your own diagram (distinctly different from other sources).

AI

The use of AI-based software for text generation such as ChatGPT is not permitted.

Procedure

  1. Assignment of topic and supervisor as per institute procedure, see dpkm/write-a-thesis.
  2. Once you have been assigned to me as supervisor: email me your title and abstract.
  3. Feedback on title and abstract, by email, in person, or by telco. Thesis will be entered into the BACH/LPIS system.
  4. Roughly one month later: send me an update of your current work, so I can give you more feedback as you go.
  5. Hopefully not much more than 4-6 months later: upload your completed work to the Learn system for plagiarism check.
  6. Detailed feedback on this version (allow for a week or two).
  7. Update your work as suggested in my comments and submit it again (exactly the same Learn upload procedure).
  8. Grade is entered in the BACH/LPIS system.

At any time in between, feel free to contact me for more feedback and hints, especially when you are stuck with technical or programming challenges. Even in distance mode we can have a telco screen sharing session to solve problems.

See also www.wu.ac.at/infobiz/topics/how-to-write-a-thesis

Johann Mitloehner, 2021-24