I did my PhD at Monash University under the supervision of Ingrid Zukerman and Fabian Bohnert. I started in March 2009 and submitted my thesis in August 2012. Excluding time spent on conference trips and a three-month internship at Google, it took about three years of work to complete the PhD, which is not too bad for a 100% research program (no coursework was required at the time).

People often ask me how to become a data scientist. The PhD was my way of doing that, though it was entirely unplanned. In fact, I didn’t even want to do a PhD. My original plan was to come to Australia, do a master’s degree, and see if I liked it here. Ingrid convinced me to do a PhD, because “the time difference to a master isn’t huge”. I don’t regret listening to her. I had the opportunity to work on interesting problems, travel, and generally have fun. The PhD has even made me more employable due to the boom in data-driven work, which wasn’t something I was aiming for. All I was hoping to achieve was to be qualified to work on more interesting stuff than vanilla software engineering, which was my focus prior to the PhD.

Broadly speaking, the topics of the PhD were in the areas of user modelling and natural language processing. I’m planning to eventually document the journey and the work done through a series of posts. [1] The idea is to give a behind-the-scenes overview of the work that went into publishing the papers, as there are many lessons that may be useful to both PhD students and software engineers who wish to become data scientists. In addition, this website gets much more exposure than my papers ever did, so I hope that using this platform to explain the papers in friendly language will enable a wider audience to build on my PhD work.

The title of my thesis is “Text Mining and Rating Prediction with Topical User Models”. The short, human-friendly abstract is:

This thesis develops novel statistical methods to infer implicit information from online user-generated texts. These methods analyse texts to identify and characterise users, detect their sentiments, and predict their preferences for items such as films. The inferred information may be harnessed for improved personalisation of online user experience.

The main publications that resulted from my PhD work are as follows. Links to posts about these publications will be added in the future. Please subscribe to get notified when this happens.

  • Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert, “Authorship Attribution with Topic Models”. In Computational Linguistics 40(2):269–310, 2014. PDF
    In a sentence: Essentially a condensed version of my thesis
  • Yanir Seroussi, “Text Mining and Rating Prediction with Topical User Models”. PhD thesis, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia, 2012. PDF
    In a sentence: The thesis, as described above, which was awarded the Mollie Holman medal for the best thesis in the Faculty of IT in 2012
  • Yanir Seroussi, Fabian Bohnert and Ingrid Zukerman, “Authorship attribution with author-aware topic models”. In ACL 2012, pages 264–269, Jeju, Republic of Korea, 2012. PDF
    In a sentence: An authorship attribution model that combines latent Dirichlet allocation and the author-topic model
  • Yanir Seroussi, Russell Smyth and Ingrid Zukerman, “Ghosts from the High Court’s past: Evidence from computational linguistics for Dixon ghosting for McTiernan and Rich”. In University of New South Wales Law Journal, 34(3):984–1005, 2011. PDF | Dataset
    In a sentence: A law journal paper that explores the extent to which Australian High Court justice Owen Dixon ghost-wrote judgements for Edward McTiernan and George Rich
  • Yanir Seroussi, Ingrid Zukerman and Fabian Bohnert, “Authorship attribution with latent Dirichlet allocation”. In CoNLL 2011, pages 181–189, Portland, OR, USA, 2011. PDF | Judgement dataset | IMDB62 dataset
    In a sentence: Applying latent Dirichlet allocation to the authorship attribution problem
  • Yanir Seroussi, Fabian Bohnert and Ingrid Zukerman, “Personalised rating prediction for new users using latent factor models”. In HT 2011, pages 47–56, Eindhoven, The Netherlands, 2011. PDF | Dataset
    In a sentence: Extensions to the basic matrix factorisation approach to recommender systems to handle scenarios with new users who have little data associated with them
  • Yanir Seroussi, Ingrid Zukerman and Fabian Bohnert, “Collaborative inference of sentiments from texts”. In UMAP 2010, pages 195–206, Waikoloa, HI, USA, 2010. PDF | Dataset | Blog post
    In a sentence: An application of a model based on neighbour-based collaborative filtering to a variant of the sentiment analysis problem where the authors are known

  1. July 2023 update: Just noticed this plan while tidying up the website. The series of posts never got off the ground. As it’s been eight years, I think it’s safe to say it’s not going to happen.