Generating Synthetic Data

This tutorial describes how to generate synthetic data that can subsequently be used in learning tasks. Specifically, this tutorial describes how to generate data for the model that is described in the BLN learning tutorial.

In the following, we will create a Python generator script that can be used with the BLN learning tool. The script will rely on the datagen Python module which is part of the toolbox.

The database we will create consists of a set of objects (with associated attributes) and relations between them. Because attributes and relations share a unified representation, the resulting databases can be translated into the various formats required for training statistical relational models.

The Python generator script for the training database can be found here:

  probcog/examples/grades-ext/gen.py

Table of Contents

  Creating the World
  Courses, Attributes
  Scientists, Relations
  Students, Remaining Relations
  Constructing the Complete Database
  The Full Generator Script

Creating the World

We first need to create a world which contains all the objects and relations of our model. The world of the university model will consist of courses, scientists and students. Having imported the datagen library, the first step is to create the world:

from datagen import *

world = World()

Courses, Attributes

For each type of object, we create a class derived from datagen.Object in the generator, where the constructor adds the instance to the world. For instance, the Course class is defined as follows:

class Course(Object):
	def __init__(self, world, dep, spec, prof):
		Object.__init__(self, "course")
		self["difficulty"] = AttrDist({"easy": 0.7, "hard": 0.3}).generate()		
		world.addObject(self)
		# member variables for use within the generator
		self.department = dep
		self.specialization = spec
		self.prof = prof

The call to the constructor of Object defines the type of the object as it is known to the database/world (in this case, "course").

The constructor above also sets an attribute of the course object, difficulty, using a value pulled from an attribute distribution. Logically, for an object o, the equivalent attribute assignments

  o[attrName] = value
  o.setAttr(attrName, value)

create an atom attrName(o, value) in the database.
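To make the mapping from attribute assignments to atoms concrete, here is a minimal sketch. The class SketchObject and its atoms method are hypothetical stand-ins written for this illustration; they are not part of datagen and only mimic the behavior described above, assuming that both assignment forms store the same information.

```python
class SketchObject:
    """Hypothetical stand-in for datagen's Object (not the real class)."""

    def __init__(self, name):
        self.name = name
        self.attrs = {}

    def setAttr(self, attrName, value):
        self.attrs[attrName] = value

    def __setitem__(self, attrName, value):
        # o[attrName] = value behaves exactly like o.setAttr(attrName, value)
        self.setAttr(attrName, value)

    def atoms(self):
        # one atom attrName(o, value) per stored attribute
        return ["%s(%s, %s)" % (a, self.name, v)
                for a, v in sorted(self.attrs.items())]


c1 = SketchObject("Course1")
c1["difficulty"] = "easy"
c2 = SketchObject("Course2")
c2.setAttr("difficulty", "hard")
print(c1.atoms() + c2.atoms())
# ['difficulty(Course1, easy)', 'difficulty(Course2, hard)']
```

Either form of assignment ends up as the same kind of atom, which is why the tutorial uses them interchangeably.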

In our case, the difficulty of the course is drawn at random using the AttrDist class of the datagen module. The constructor

  AttrDist(dist)

creates a distribution object, where dist is a dictionary representing a distribution. We can sample a value from the distribution using the generate method, which will thus return either easy or hard. The distributions need not be normalized.
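The following sketch illustrates what sampling from an unnormalized dictionary distribution can look like. It uses only Python's standard random module; the function name sample_unnormalized is hypothetical, and the code is not taken from AttrDist, whose internals are not shown in this tutorial.

```python
import random


def sample_unnormalized(dist, rng=random):
    """Draw one key from a dict of non-negative weights that need not sum to 1."""
    values = list(dist.keys())
    weights = list(dist.values())
    # random.choices normalizes the weights internally
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so that runs are repeatable
draws = [sample_unnormalized({"easy": 7, "hard": 3}, rng) for _ in range(1000)]
print(draws.count("easy"), draws.count("hard"))
```

The weights 7 and 3 describe the same distribution as 0.7 and 0.3, which is the sense in which a distribution need not be normalized.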

Finally, for use within the generator script, the Course constructor also defines three member variables reflecting the department in which the course is offered, the specialization field and the professor teaching it.

Scientists, Relations

In addition to the Course class, the generator script defines the Scientist class, another subclass of the Object class from the datagen module. The Scientist class stores the department dep and the specialization spec of the scientist as well as the scientist's type (i.e. whether the scientist is a professor or an assistant). The attributes dep, spec and type are used only within the generator script and are not included as predicates in the training database, as they do not appear in the model description. (The type of the scientist does affect the model indirectly: the predicate teaches applies only to professors, while the likes and advises relations apply to both assistants and professors.)

class Scientist(Object):
	def __init__(self, world, type, dep, spec):
		Object.__init__(self, "scientist")
		self.type = type
		self.department = dep
		self.specialization = spec
		self.linkto("likes", self) # every person likes him- or herself
		
		# create the courses taught by professors
		if self.type == "professor":
			numCourses = randint(1,3)
			for i in range(numCourses):
				course = Course(world, self.department, self.specialization, self)
				self.linkto("teaches", course)
				
		world.addObject(self)

The Scientist constructor creates an object of type scientist and adds it to the world.

This class also involves the use of relations: Relations can be added using the linkto method. We link an object o in the world to a number of other objects as follows:

  o.linkto(linkName, other, *moreothers)

The linkto method of the object o essentially creates a predicate named linkName, which is true for the tuple (o, other, *moreothers).

The linkto method returns an object which can be used to set attributes of the relation:

  linkObject = s.linkto(linkname,other,*moreothers) 
  linkObject.setAttr(attrName,value)

The first line creates a Boolean predicate whereas the second defines a function; the atoms linkname(s,other,*moreothers) = True and attrName(s,other,*moreothers) = value result.
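As a rough illustration of how a link and its attributes turn into atoms, consider the sketch below. SketchLink and the free function linkto are hypothetical stand-ins written for this example, not datagen's implementation; objects are represented by plain strings, and the gotGraded and grade names are borrowed from the student example later in this tutorial.

```python
class SketchLink:
    """Hypothetical stand-in for the object returned by linkto."""

    def __init__(self, name, args):
        self.name = name
        self.args = args
        self.attrs = {}

    def setAttr(self, attrName, value):
        self.attrs[attrName] = value

    def atoms(self):
        argList = ", ".join(self.args)
        # the link itself is a Boolean predicate ...
        result = ["%s(%s) = True" % (self.name, argList)]
        # ... and every attribute of the link is a function over the same tuple
        for attrName, value in sorted(self.attrs.items()):
            result.append("%s(%s) = %s" % (attrName, argList, value))
        return result


def linkto(subject, linkName, other, *moreothers):
    return SketchLink(linkName, (subject, other) + moreothers)


link = linkto("Student1", "gotGraded", "Course1")
link.setAttr("grade", "A")
print(link.atoms())
# ['gotGraded(Student1, Course1) = True', 'grade(Student1, Course1) = A']
```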

The example above contains two applications of linkto. The first establishes reflexivity of the likes relation, the second connects a professor to the courses he/she teaches.

self.linkto("likes", self)
self.linkto("teaches", course)

Note that the constructor of Scientist also constructs all the Course objects that are associated with the respective scientist.

Because the creation of scientists is not dependent on other objects, we create them in the main program: We define a number of departments and specialization fields and then construct scientists for each.

departments = {
    "Philosophy": ["Metaphysics", "Ethics", "Epistemology"],
    "Computer Science" : ["Artificial Intelligence", "Theoretical CS", "Databases"], 
    "Maths": ["Statistics", "Numerics", "Arithmetics"]
}
for dep, specs in departments.items():
    for spec in specs:
        numProfs = randint(1,3)
        numAdvisors = 4-numProfs
        for i in range(numProfs):
            Scientist(world, 'professor', dep, spec)
        for i in range(numAdvisors):
            Scientist(world, 'assistant', dep, spec)
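The staffing plan can be checked independently of datagen. The sketch below replaces the Scientist objects with plain tuples (an assumption made only for this illustration) and confirms that every specialization receives exactly four scientists, at least one of whom is a professor.

```python
from random import randint

departments = {
    "Philosophy": ["Metaphysics", "Ethics", "Epistemology"],
    "Computer Science": ["Artificial Intelligence", "Theoretical CS", "Databases"],
    "Maths": ["Statistics", "Numerics", "Arithmetics"],
}

staff = []
for dep, specs in departments.items():
    for spec in specs:
        numProfs = randint(1, 3)       # between one and three professors
        numAssistants = 4 - numProfs   # the remaining posts go to assistants
        staff += [("professor", dep, spec)] * numProfs
        staff += [("assistant", dep, spec)] * numAssistants

print(len(staff))  # 9 specializations x 4 scientists = 36
```

Note that the inner loop must run over the list of specializations of a department, not over the department name itself, which is a string.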

After all scientists and courses are created, the relations likes and similar can be defined completely: We iterate over all scientists and link each of them to a random number of scientists via the likes relation.

scientists = world.getContainer("scientist")
for s in scientists:
    for s2 in scientists.sampleSet(1, len(scientists)):
        s.linkto("likes", s2)

To randomly choose scientists, we use the sampleSet method of the ObjectContainer class. Whenever we add an object to the world, we implicitly add it to a container whose name corresponds to the type of the object. We can use these containers to conveniently access objects of particular types.
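The calls in this tutorial suggest that sampleSet(min, max) returns a random selection of distinct objects whose size lies between min and max. Under that assumption, which is inferred from usage and not taken from the datagen source, the behavior can be sketched with the standard library as follows. The function name sample_set is hypothetical.

```python
import random


def sample_set(items, min_size, max_size, rng=random):
    """Return a random selection of distinct items, sized between the bounds."""
    upper = min(max_size, len(items))  # cannot draw more than there are
    lower = min(min_size, upper)
    size = rng.randint(lower, upper)
    return rng.sample(list(items), size)


scientists = ["s%d" % i for i in range(12)]
rng = random.Random(1)  # fixed seed so that runs are repeatable
liked = sample_set(scientists, 1, len(scientists), rng)
print(len(liked))
```

Passing the same value for both bounds, as the full script does when assigning advisors, fixes the size of the selection.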

Next, we create the course similarity relation. We simply treat all courses that share the same department and specialization as similar to each other. Note that we will not use the similar predicate for learning, so we could omit the linkto calls and leave the relation out of the training database. However, the generator still needs to know which courses are similar in order to define the takesSimilarCourse predicate, because we want it to influence the grades of students.

courses = world.getContainer("course")
for i, course in enumerate(courses):
    for course2 in courses[i+1:]:
        if course.department == course2.department and course.specialization == course2.specialization:
            course.linkto("similar", course2)
            course2.linkto("similar", course)
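The pairwise comparison above can also be written with itertools.combinations, which visits every unordered pair exactly once. The sketch below represents courses as plain tuples with made-up names (c1 to c4), an assumption made for illustration only, and records the relation in both directions so that it is symmetric.

```python
from itertools import combinations

# (name, department, specialization); the names are hypothetical
courses = [
    ("c1", "Maths", "Statistics"),
    ("c2", "Maths", "Statistics"),
    ("c3", "Maths", "Numerics"),
    ("c4", "Philosophy", "Ethics"),
]

similar = set()
for (a, depA, specA), (b, depB, specB) in combinations(courses, 2):
    if depA == depB and specA == specB:
        similar.add((a, b))
        similar.add((b, a))

print(sorted(similar))
# [('c1', 'c2'), ('c2', 'c1')]
```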

Students, Remaining Relations

Finally, we implement the class Student(Object) to add students to the world. As with the difficulty of courses, we use AttrDist to determine the intelligence of the student:

  self["intelligence"] = AttrDist({"weak": 0.2, "average": 0.6, "smart": 0.2}).generate()

Based on the student's intelligence, we determine a base distribution over the grades the student can achieve. We then randomly choose how many advisors the student has and use the Chooser class to draw the student's advisors from the scientist container:

  numAdvisors = AttrDist({0: 0.3, 1: 0.5, 2: 0.2}).generate()
  adviserChoser = Chooser(world,"scientists",0)
  for i in range(numAdvisors):
    adviserChoser.choose().advise(self)

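The advise(student) method of the Scientist class creates the advises(advisor, student) link and adds the advisor to the student's advisor list. (The full generator script below takes a shorter route: it draws the advisors with sampleSet and calls linkto("advises", ...) directly.)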

The grades for the individual courses are set in setGrades(). Because the function grade(student, course) does not have a Boolean domain, we need to set an attribute on the link object as described above. To do so, we first create a predicate gotGraded that links the student to every course, and then add the attribute grade to each resulting link to obtain the desired function. As a consequence, the training database also contains gotGraded atoms, which are true even for courses the student did not take; for those courses the grade is None:

  gotGraded(Student1,Course1) = True
  grade(Student1,Course1) = None

However, we will not include the gotGraded predicate in the model. For each course the student did take, the grade distribution is altered according to the difficulty of the course, whether the teacher of the course likes at least one of the student's advisors, and whether the student takes a similar course. The grade is then sampled from the adjusted distribution using AttrDist, and the grade attribute is set accordingly.
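The adjustment step can be sketched without datagen as shown below. The shift values are the ones used in the full generator script; the function name adjusted_grade_dist, the copy of the base distribution and the clamping of negative weights to zero are assumptions of this sketch, and random.choices stands in for AttrDist.

```python
import random


def adjusted_grade_dist(base, hard, likesAdvisor, similarCourse):
    dist = dict(base)  # copy, so the student's base distribution is untouched
    shifts = []
    if hard:
        shifts.append({"A": -0.1, "B": -0.05, "D": 0.10, "F": 0.05})
    if likesAdvisor:
        shifts.append({"A": 0.25, "B": 0.1, "D": -0.05, "F": -0.05})
    if similarCourse:
        shifts.append({"A": 0.1, "B": 0.05, "D": -0.05, "F": -0.05})
    for shift in shifts:
        for grade, delta in shift.items():
            dist[grade] += delta
    # clamp at zero: an assumption of this sketch, not taken from the script
    return {g: max(0.0, w) for g, w in dist.items()}


base = {"A": 0.4, "B": 0.07, "C": 0.02, "D": 0.00, "F": 0.00}
dist = adjusted_grade_dist(base, hard=False, likesAdvisor=True, similarCourse=False)
grades, weights = zip(*dist.items())
grade = random.choices(grades, weights=weights, k=1)[0]
print(grade)
```

Working on a copy keeps the adjustments for one course from carrying over to the next, and clamping avoids negative weights for students whose base distribution already assigns zero weight to D or F.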

Constructing the Complete Database

To obtain a database from the created world, we use the getDatabase method of the world object. The database generation tool expects a variable called db to exist in the global scope.

  db = world.getDatabase()

The Full Generator Script

import sys
from datagen import *
from random import choice, randint, shuffle
        
similarCourse = True

class Course(Object):
    def __init__(self, world, dep, spec, prof):
        Object.__init__(self, "course")
        
        self.department = dep
        self.specialization = spec
        self.prof = prof
        
        self["difficulty"] = AttrDist({"easy": 0.7, "hard": 0.3}).generate()
        
        world.addObject(self)
        
class Scientist(Object):
    def __init__(self, world, type, dep, spec):
        Object.__init__(self, "scientist")
        self.type = type
        self.department = dep
        self.specialization = spec
        self.linkto("likes", self) # every person likes him- or herself
        
        # create the courses taught by professors
        if self.type == "professor":
            numCourses = randint(1,3)
            for i in range(numCourses):
                course = Course(world, self.department, self.specialization, self)
                self.linkto("teaches", course)
                
        world.addObject(self)
        
class Student(Object):
    def __init__(self, world):
        Object.__init__(self, "student")
        
        self.world = world      
        
        self["intelligence"] = AttrDist({"weak": 0.2, "average": 0.6, "smart": 0.2}).generate()
        
        self.coursesTaken = []
        self.initGradeDists = {
            "weak": {"A": 0.1, "B": 0.15, "C": 0.16, "D": 0.045, "F": 0.07}, 
            "average": {"A": 0.145, "B": 0.23, "C": 0.10, "D": 0.02, "F": 0.005},
            "smart": {"A": 0.4, "B": 0.07, "C": 0.02, "D": 0.00, "F": 0.00} }
        self.gradeDist = self.initGradeDists[self["intelligence"]]
        
        # assign advisors
        numAdvisors = AttrDist({0: 0.3, 1: 0.5, 2: 0.2}).generate()
        self.advisors = world.getContainer("scientist").sampleSet(numAdvisors, numAdvisors) 
        for advisor in self.advisors:
            advisor.linkto("advises", self)
            
        self.takeCourses()
        self.setGrades()
        
        world.addObject(self)
        
    def takeCourses(self):
        for course in self.world.getContainer("course").sampleSet(3, 6):
            self.coursesTaken.append(course)
            self.linkto("takes",course)
        # create "takesSimilarCourse" predicate
        for i in range(len(self.coursesTaken)):
            course = self.coursesTaken[i]   
            for course2 in self.coursesTaken[i+1:]:
                if course.department == course2.department and course.specialization == course2.specialization:
                    self.linkto("takesSimilarCourse",course)
                    self.linkto("takesSimilarCourse",course2)
        
    def setGrades(self):
        notTaken = [c for c in self.world.getContainer("course") if c not in self.coursesTaken]
        for c in notTaken:
            gradeLink = self.linkto("gotGraded",c)
            gradeLink["grade"] = "None"
        for c in self.coursesTaken:
            localGradeDist = dict(self.gradeDist)  # copy, so adjustments do not carry over to the next course
            likesAdvisor = False
            similarCourse = False
            for advisor in self.advisors:
                if advisor in c.prof.getPartners("likes"):
                    likesAdvisor = True
            for c2 in self.coursesTaken:
                if c2 in c.getPartners("similar"):
                    similarCourse = True
            if c["difficulty"] == "hard":
                localGradeDist["A"] -= 0.1
                localGradeDist["B"] -= 0.05
                localGradeDist["D"] += 0.10
                localGradeDist["F"] += 0.05
            if likesAdvisor:
                localGradeDist["A"] += 0.25
                localGradeDist["B"] += 0.1
                localGradeDist["D"] -= 0.05
                localGradeDist["F"] -= 0.05
            if similarCourse:
                localGradeDist["A"] += 0.1
                localGradeDist["B"] += 0.05
                localGradeDist["D"] -= 0.05
                localGradeDist["F"] -= 0.05
            gradeLink = self.linkto("gotGraded",c)
            gradeLink["grade"] = AttrDist(localGradeDist).generate()        

        
numStudents = 80

world = World()

# generate professors and assistants
departments = {
    "Philosophy": ["Metaphysics", "Ethics", "Epistemology"],
    "Computer Science" : ["Artificial Intelligence", "Theoretical CS", "Databases"], 
    "Maths": ["Statistics", "Numerics", "Arithmetics"]
}
for dep, specs in departments.items():
    for spec in specs:
        numProfs = randint(1,3)
        numAdvisors = 4-numProfs
        for i in range(numProfs):
            Scientist(world, 'professor', dep, spec)
        for i in range(numAdvisors):
            Scientist(world, 'assistant', dep, spec)

# create the "likes" relation
scientists = world.getContainer("scientist")
for s in scientists:
    for s2 in scientists.sampleSet(1, len(scientists)):
        s.linkto("likes", s2)

# create course similarities
courses = world.getContainer("course")
for i, course in enumerate(courses):
    for course2 in courses[i+1:]:
        if course.department == course2.department and course.specialization == course2.specialization:
            course.linkto("similar", course2)
            course2.linkto("similar", course)

# create students
for i in range(numStudents):
    student = Student(world)
    
# create the teacherOfLikesAdvisorOf predicate
courses = world.getContainer("course")
numCourses = len(courses)
for i in range(numCourses):
    prof = courses[i].prof
    for likedProf in prof.getPartners("likes"):
        for stud in likedProf.getPartners("advises"):
            courses[i].linkto("teacherOfLikesAdvisorOf",stud)
    
# create the database
db = world.getDatabase()
