Photo by Emily Morter on Unsplash

How to Generate Hypotheses in the Pedagogical Domain

Tetsuya Hirata
Feb 4, 2019


Deriving a general way to generate and refine hypotheses about learning by examining a Moodle dataset.

Educational Data Mining (EDM)

Description of EDM

Statistical and machine learning techniques are generally used to analyze information generated in pedagogical settings, including blended learning contexts. EDM is the extraction of information from educational records, usually online logs and exam results, and the drawing of inferences from it. The results are useful for improving teaching practices and testing hypotheses about learning, and they can inform the implementation of software that allows such predictive analytics to be performed routinely. According to Papamitsiou & Economides’s (2014) systematic review of empirical evidence about EDM in practice, the data analysis methods in use are classification, clustering, regression, text mining, association mining, social network analysis, discovery with models, and visualization. Depending on the literature, some of these are categorized as machine learning and others as statistics. To make it easier to judge which techniques suit the analysis of students’ online learning behaviors, this article divides machine learning into unsupervised and supervised learning, and defines as statistics the other analysis methods that fall under neither, such as the t-test, F-test, and ANOVA.

Educational Data Mining Procedures

Few previous studies describe how researchers pre-processed their data and what software they used, even though such a description is crucial for educators who need to carry out pre-processing themselves. This matters because the log data stored in Moodle has a limited set of variables, and the results can differ considerably depending on how the data is pre-processed. In the previous literature, data mining procedures are separated into a pre-processing phase, which includes transforming the data, and an analysis phase, which is the data mining phase itself (Pechenizkiy et al., 2008).

Casany Guerrero et al. (2012) define 1) data pre-processing, 2) data processing, and 3) data analysis as follows:

1) Data pre-processing includes selecting and cleaning the data. During this step, some log data are calculated and aggregated.

2) Data processing includes transforming the data into a format that fits the research questions.

3) Data analysis includes applying data mining algorithms, obtaining and interpreting the results, and generating reports.

As an example, this flow can be adapted to an actual pre-processing flow for a Moodle dataset, as shown in the figure below.

Figure 1 Educational data mining processes in the present study
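
The three phases described by Casany Guerrero et al. (2012) can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline used in the study: the field names (student, event, component), the sample rows, and the function names are all assumptions made up for the example, and the "analysis" step is a simple ranking standing in for a real data mining algorithm.

```python
from collections import Counter

# Hypothetical event log rows. Field names and values are illustrative
# and are not taken from the actual Moodle export.
RAW_EVENTS = [
    {"student": "s1", "event": "Course module viewed", "component": "Quiz"},
    {"student": "s1", "event": "Course module viewed", "component": "Forum"},
    {"student": "s2", "event": "Course module viewed", "component": "Quiz"},
    {"student": "", "event": "Course module viewed", "component": "Quiz"},  # no student
    {"student": "s2", "event": "Quiz attempt submitted", "component": "Quiz"},
    {"student": "s1", "event": "Quiz attempt submitted", "component": "Quiz"},
]


def preprocess(rows):
    """Phase 1: select and clean rows, then aggregate counts per student and component."""
    cleaned = [row for row in rows if row["student"]]
    return Counter((row["student"], row["component"]) for row in cleaned)


def process(counts):
    """Phase 2: reshape the aggregates into one feature row per student."""
    features = {}
    for (student, component), n in counts.items():
        features.setdefault(student, {})[component] = n
    return features


def analyze(features, component):
    """Phase 3: stand-in for a data mining step; ranks students by activity count."""
    return sorted(features, key=lambda s: features[s].get(component, 0), reverse=True)


counts = preprocess(RAW_EVENTS)
features = process(counts)
ranking = analyze(features, "Forum")
print(features)
print(ranking)
```

Keeping each phase in its own function makes it easier to report exactly how the data was pre-processed, which is the detail that is often missing from published studies.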

Data description and preparation

Learning Contexts in the Moodle Module

The dataset used in this work was gathered from a Moodle 3.1 course taken by 64 undergraduate and postgraduate students at a university. The module blends online lectures and activities with face-to-face teaching that follows them and offers more intensive, creative sessions of tutorial-based problem solving and learning in small groups, along with extensive laboratory practicals. Most of the work is done on Moodle, such as watching lectures and answering multiple-choice or short-answer questions. The module consists of 17 sessions and two foundation exams, called SAQ and MCQ, which consist of 14 short-answer questions and 30 multiple-choice questions, respectively.

The online learning part of this module is organized into 17 different topics. Each topic consists of a learning outcome page, a pre-learning section, a storyline section, and a post-learning section. Depending on the topic and the section, these sections include some or all of the following: SCORM content, quizzes, forums, hot questions, and system page events (see Figure 2). Quiz scores do not count toward the grade, with the exception of the foundation exams (SAQ and MCQ), which are graded at the end of the session. Participation in the quizzes does, however, contribute to a student’s overall online participation grade.

Figure 2 Online learning structure of the module
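
The topic structure described above can be modeled roughly as nested records. The sketch below is an assumption about how one might represent it in code, not how Moodle stores it: the class names, the lowercase labels, and the build_module helper are all hypothetical.

```python
from dataclasses import dataclass, field

# Activity types and topic parts named in the text, written as hypothetical labels.
ACTIVITY_TYPES = {"scorm", "quiz", "forum", "hot question", "system page"}
SECTION_NAMES = ("learning outcome page", "pre-learning", "storyline", "post-learning")


@dataclass
class Section:
    name: str
    activities: set = field(default_factory=set)

    def __post_init__(self):
        # A section may hold some or all activity types, but nothing else.
        self.activities = set(self.activities)
        unknown = self.activities - ACTIVITY_TYPES
        if unknown:
            raise ValueError(f"unknown activity types: {sorted(unknown)}")


@dataclass
class Topic:
    number: int
    sections: list


def build_module(n_topics=17):
    """Create empty topics, each with the four parts described in the text."""
    return [
        Topic(number=i, sections=[Section(name) for name in SECTION_NAMES])
        for i in range(1, n_topics + 1)
    ]


module = build_module()
module[0].sections[1].activities.update({"scorm", "quiz"})
print(len(module), [s.name for s in module[0].sections])
```

Validating the activity types up front is a design choice for the sketch: it catches typos in labels before they turn into features that silently never match any log row.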

The log data that can be collected from the Moodle database

Moodle stores log data for every click students make, whether for navigation or while working at their own pace. The log data were anonymized by a third party, who pre-processed them by generating false names from the Internet and randomly assigning those names to student Moodle IDs. Two kinds of log data were extracted from Moodle’s built-in log-viewing system and saved as CSV (comma-separated values) files: 1) the event log data (see Table 1.1), which cover learning contexts and activities and contain 118,970 records, and 2) the score log data (see Table 1.2), which cover the SAQ and MCQ exams and contain 64 records. These data are stored as time series.
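
The anonymization step described above, randomly assigning false names to student IDs, might look something like the following. This is a sketch under stated assumptions rather than the procedure the third party actually used: the function names, the student_id field, and the sample names are hypothetical.

```python
import random


def pseudonymize(student_ids, fake_names, seed=None):
    """Randomly assign one distinct fake name to each distinct student ID."""
    unique_ids = sorted(set(student_ids))
    pool = sorted(set(fake_names))  # drop duplicates so no two students share a name
    if len(pool) < len(unique_ids):
        raise ValueError("need at least one distinct fake name per student")
    rng = random.Random(seed)
    chosen = rng.sample(pool, len(unique_ids))
    return dict(zip(unique_ids, chosen))


def anonymize_rows(rows, mapping, id_field="student_id"):
    """Return copies of the rows with the ID field replaced by the fake name."""
    return [{**row, id_field: mapping[row[id_field]]} for row in rows]


# Hypothetical sample data; real rows would come from the exported CSV files.
rows = [
    {"student_id": "101", "event": "Course module viewed"},
    {"student_id": "102", "event": "Quiz attempt submitted"},
    {"student_id": "101", "event": "Quiz attempt submitted"},
]
names = ["Alice Example", "Bob Example", "Carol Example"]

mapping = pseudonymize([row["student_id"] for row in rows], names, seed=0)
anonymized = anonymize_rows(rows, mapping)
print(anonymized)
```

Because the same mapping is applied to every row, the same mapping can also be applied to the score log, so a student’s events and exam scores can still be joined after the real IDs are gone.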

The learning behaviors on Moodle that can be described from the log data are listed in Table 1.3.

The way to generate a hypothesis based on the features from educational log data

Looking at the Moodle dataset, the skeleton of the features can be summarized as shown in the figure below.

Created by Jesse Tetsuya

Based on this skeleton, a research question can be narrowed down to concrete hypotheses. For example, given the research question ‘What indicators can predict exam outcomes?’, researchers could generate more specific questions such as ‘How many times do students click the new comment button in the discussion function of the Math module?’, ‘At what rate do online learners click the file upload button in the forum function when instructed to by the teacher?’, or ‘How long do high-scoring students watch the philosophy lecture videos?’.
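
To make this concrete, a question such as "how many times does each student click a given button?" can be turned into a per-student count and then compared with exam scores. The sketch below assumes a simple event log and a score table; the event name "Post created", the student names, and the scores are invented for illustration, and Pearson correlation is just one possible way to test such a hypothesis.

```python
from collections import Counter
from math import sqrt

# Hypothetical event log and exam scores; all names and values are made up.
EVENTS = [
    {"student": "Ann", "event": "Post created"},
    {"student": "Ann", "event": "Post created"},
    {"student": "Ann", "event": "Post created"},
    {"student": "Bob", "event": "Post created"},
    {"student": "Bob", "event": "Post created"},
    {"student": "Cy", "event": "Post created"},
    {"student": "Cy", "event": "Course module viewed"},
]
SCORES = {"Ann": 80, "Bob": 60, "Cy": 40}


def count_events(events, event_name):
    """Feature: number of times each student triggered the given event."""
    return Counter(e["student"] for e in events if e["event"] == event_name)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


posts = count_events(EVENTS, "Post created")
students = sorted(SCORES)
r = pearson([posts.get(s, 0) for s in students], [SCORES[s] for s in students])
print(dict(posts), round(r, 3))
```

With real data, the same pattern applies to each candidate feature: define the count or duration, compute it per student, and check how it relates to the SAQ or MCQ score before investing in a heavier model.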

The skeleton of each feature in the black box in the figure above can be generalized to any online learning platform, but the details depend on the data structure of the platform.


Casany Guerrero, M.J., Alier Forment, M., Galanis, N., Mayol Sarroca, E. & Piguillem Poch, J. (2012). Analyzing Moodle/LMS logs to measure mobile access. In UBICOMM 2012: The Sixth International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies. pp. 35–40. Available at:

Papamitsiou, Zacharoula & Economides, Anastasios. (2014). Learning Analytics and Educational Data Mining in Practice: A Systematic Literature Review of Empirical Evidence. Educational Technology & Society. 17. 49–64.

Pechenizkiy, M., Puuronen, S. & Tsymbal, A. (2008). Does Relevance Matter to Data Mining Research? Data Mining: Foundations and Practice, 118, pp. 251–275.


