Generating Synthetic Data
This tutorial describes how to generate synthetic data that can subsequently be used in learning tasks. Specifically, this tutorial describes how to generate data for the model that is described in the BLN learning tutorial.
In the following, we will create a Python generator script that can be used with the BLN learning tool. The script will rely on the datagen Python module which is part of the toolbox.
A database as we will create it consists of a set of objects (with associated attributes) and relations between them. From a unified representation for attributes and relations, databases thus created can be translated into various formats required for the training of statistical relational models.
The Python generator file for the training database can be found here:
probcog/examples/grades-ext/gen.py
We first need to create a world which contains all the objects and relations of our model. The world of the university model will consist of courses, scientists and students. Having imported the datagen library, the first step is to create the world:
from datagen import *
world = World()
For each type of object, we create a class derived from datagen.Object in the generator, where the constructor adds the instance to the world. For instance, the Course class is defined as follows:
class Course(Object):
    def __init__(self, world, dep, spec, prof):
        Object.__init__(self, "course")
        self["difficulty"] = AttrDist({"easy": 0.7, "hard": 0.3}).generate()
        world.addObject(self)
        # member variables for use within the generator
        self.department = dep
        self.specialization = spec
        self.prof = prof
The constructor above also sets an attribute of the course object, difficulty, using a value pulled from an attribute distribution. Logically, for an object o, the equivalent attribute assignments
o[attrName] = value
o.setAttr(attrName, value)
create an atom attrName(o, value) in the database.
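To make the mapping from attribute assignments to atoms concrete, here is a minimal stand-in sketch. The class `ToyObject` and its `atoms` method are hypothetical and not part of datagen. They only mimic the idea that both spellings of the assignment yield the same atom.

```python
class ToyObject:
    """Hypothetical stand-in for a datagen object (not the real class)."""

    def __init__(self, name):
        self.name = name
        self.attrs = {}

    def setAttr(self, attrName, value):
        self.attrs[attrName] = value

    def __setitem__(self, attrName, value):
        # o[attrName] = value is just another spelling of o.setAttr(attrName, value)
        self.setAttr(attrName, value)

    def atoms(self):
        # one atom attrName(o, value) per attribute
        return ["%s(%s, %s)" % (a, self.name, v) for a, v in sorted(self.attrs.items())]


c1 = ToyObject("Course1")
c1["difficulty"] = "hard"

c2 = ToyObject("Course2")
c2.setAttr("difficulty", "hard")

print(c1.atoms())
print(c2.atoms())
```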
In our case, the difficulty of the course is decided randomly using the AttrDist class of the datagen module. The constructor
AttrDist(dist)
creates a distribution object, where dist is a dictionary representing a distribution. We can sample a value from the distribution using the generate method, which will thus return either easy or hard. The distributions need not be normalized.
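The following sketch illustrates what sampling from an unnormalized distribution dictionary can look like. It is not the datagen implementation: the helper `sample_from_weights` is a hypothetical name, and the sketch relies on the standard library's `random.choices`, which accepts relative weights.

```python
import random


def sample_from_weights(dist, rng=random):
    """Draw one key of dist with probability proportional to its weight."""
    values = list(dist.keys())
    weights = list(dist.values())
    # random.choices works with relative weights, so no normalization is needed
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(0)
# the weights 7 and 3 describe the same distribution as 0.7 and 0.3
samples = [sample_from_weights({"easy": 7, "hard": 3}, rng) for _ in range(1000)]
fractionEasy = samples.count("easy") / len(samples)
print(fractionEasy)
```

With enough samples, the fraction of easy values should come out close to 0.7.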
Finally, for use within the generator script, the Course constructor also defines three member variables reflecting the department in which the course is offered, the specialization field and the professor teaching it.
In addition to the Course class, the generator script defines Scientist, another subclass of the Object class from the datagen module. The Scientist class stores the department dep and the specialization spec of the scientist as well as the type of the scientist (i.e. whether the scientist is a professor or an assistant). The attributes dep, spec and type are only used in the generator script and are not included as predicates in the training database because they are not used in the model description. (Strictly speaking, the type of the scientist does enter the model description, but only indirectly: the predicate teaches is only applicable to professors, while the likes and advises relations apply to both assistants and professors.)
class Scientist(Object):
    def __init__(self, world, type, dep, spec):
        Object.__init__(self, "scientist")
        self.type = type
        self.department = dep
        self.specialization = spec
        self.linkto("likes", self) # every person likes him- or herself
        # create the courses taught by professors
        if self.type == "professor":
            numCourses = randint(1,3)
            for i in range(numCourses):
                course = Course(world, self.department, self.specialization, self)
                self.linkto("teaches", course)
        world.addObject(self)
The Scientist constructor creates an object of type scientist and adds it to the world.
This class also involves the use of relations: Relations can be added using the linkto method. We link an object o in the world to a number of other objects as follows:
o.linkto(linkName, other, *moreothers)
The linkto method of the object o essentially creates a predicate named linkName, which is true for the tuple (o, other, *moreothers).
The linkto method returns an object which can be used to set attributes of the relation:
linkObject = s.linkto(linkname, other, *moreothers)
linkObject.setAttr(attrName, value)
The first line creates a Boolean predicate whereas the second defines a function; the atoms linkname(s,other,*moreothers) = True and attrName(s,other,*moreothers) = value result.
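As an illustration of how a link and its attribute turn into two atoms, consider the toy sketch below. `ToyLink` and the free function `linkto` are hypothetical stand-ins, objects are represented by plain strings, and the output format is chosen for readability only. None of this reflects the real datagen classes.

```python
class ToyLink:
    """Hypothetical stand-in for the object returned by linkto."""

    def __init__(self, name, args):
        self.name = name
        self.args = args  # names of the linked objects, in order
        self.attrs = {}

    def setAttr(self, attrName, value):
        self.attrs[attrName] = value

    def atoms(self):
        argList = ",".join(self.args)
        # the link itself is a Boolean predicate
        result = ["%s(%s) = True" % (self.name, argList)]
        # each attribute of the link becomes a function over the same tuple
        for attrName, value in sorted(self.attrs.items()):
            result.append("%s(%s) = %s" % (attrName, argList, value))
        return result


def linkto(linkName, obj, other, *moreothers):
    # toy version: the objects are plain strings
    return ToyLink(linkName, (obj, other) + moreothers)


link = linkto("gotGraded", "Student1", "Course1")
link.setAttr("grade", "A")
print(link.atoms())
```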
The example above contains two applications of linkto. The first establishes reflexivity of the likes relation, the second connects a professor to the courses he/she teaches.
self.linkto("likes", self)
self.linkto("teaches",course)
Note that the constructor of Scientist also constructs all the Course objects that are associated with the respective scientist.
Because the creation of scientists is not dependent on other objects, we create them in the main program: We define a number of departments and specialization fields and then construct scientists for each.
departments = {
    "Philosophy": ["Metaphysics", "Ethics", "Epistemology"],
    "Computer Science" : ["Artificial Intelligence", "Theoretical CS", "Databases"],
    "Maths": ["Statistics", "Numerics", "Arithmetics"]
}
for dep in departments:
    # iterate over the specializations of the department, not over the characters of its name
    for spec in departments[dep]:
        numProfs = randint(1,3) # randint is imported from the random module
        numAdvisors = 4-numProfs
        for i in range(numProfs):
            Scientist(world, 'professor', dep, spec)
        for i in range(numAdvisors):
            Scientist(world, 'assistant', dep, spec)
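As a quick sanity check of the loop structure, the sketch below counts how many scientists the nested loops create. It uses a plain list in place of the datagen world. The `created` list, the tuple layout and the name `numAssistants` are for illustration only.

```python
from random import randint

departments = {
    "Philosophy": ["Metaphysics", "Ethics", "Epistemology"],
    "Computer Science": ["Artificial Intelligence", "Theoretical CS", "Databases"],
    "Maths": ["Statistics", "Numerics", "Arithmetics"],
}

created = []  # stand-in for the world: (type, department, specialization) tuples
for dep, specs in departments.items():
    for spec in specs:
        numProfs = randint(1, 3)
        numAssistants = 4 - numProfs
        created += [("professor", dep, spec)] * numProfs
        created += [("assistant", dep, spec)] * numAssistants

# 9 specializations with 4 scientists each
print(len(created))
```

Since every specialization receives exactly four scientists, of which at least one is a professor, every specialization is guaranteed to offer at least one course.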
After all scientists and courses are created, the relations likes and similar can be defined completely: We iterate over all scientists and link each of them to a random number of scientists via the likes relation.
scientists = world.getContainer("scientist")
for s in scientists:
    for s2 in scientists.sampleSet(1, len(scientists)):
        s.linkto("likes", s2)
To randomly choose scientists, we use the sampleSet method of the ObjectContainer class. Whenever we add an object to the world, we implicitly add it to a container whose name corresponds to the type of the object. We can use these containers to conveniently access objects of particular types.
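The sketch below shows one plausible reading of sampleSet(min, max): pick a random subset whose size lies between the two bounds. This is an assumption inferred from how the method is called in this tutorial, not a description of the actual ObjectContainer code. The function name `sample_set` is hypothetical.

```python
import random


def sample_set(items, minItems, maxItems, rng=random):
    """Return a random subset with between minItems and maxItems elements (assumed semantics)."""
    items = list(items)
    maxItems = min(maxItems, len(items))  # cannot draw more elements than exist
    minItems = min(minItems, maxItems)
    k = rng.randint(minItems, maxItems)
    # sample without replacement, so no element is returned twice
    return rng.sample(items, k)


rng = random.Random(1)
scientists = ["Scientist%d" % i for i in range(5)]
liked = sample_set(scientists, 1, len(scientists), rng)
pair = sample_set(scientists, 2, 2, rng)
print(liked)
print(pair)
```

Passing the same value for both bounds, as in the second call, yields a subset of exactly that size.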
Next, we create the course similarity relation. To determine the course similarities, we simply make all courses that share the same department and specialization similar to each other. Note that we will not use the similar predicate for learning, so we could omit the linkto calls and leave the relation out of the training database. However, we still need to determine the similarity of courses in the generator in order to define the takesSimilarCourse predicate, because we want it to influence the grades of students.
courses = world.getContainer("course")
for i, course in enumerate(courses):
    for course2 in courses[i+1:]:
        if course.department == course2.department and course.specialization == course2.specialization:
            course.linkto("similar", course2)
            course2.linkto("similar", course)
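The same symmetric relation can also be computed by grouping courses first, which avoids comparing courses from different groups. The sketch below works on plain tuples instead of datagen objects. `ToyCourse` and the example courses are made up for illustration.

```python
from collections import defaultdict, namedtuple
from itertools import combinations

ToyCourse = namedtuple("ToyCourse", ["name", "department", "specialization"])

courses = [
    ToyCourse("C1", "Maths", "Statistics"),
    ToyCourse("C2", "Maths", "Statistics"),
    ToyCourse("C3", "Maths", "Numerics"),
    ToyCourse("C4", "Philosophy", "Ethics"),
]

# group the courses by (department, specialization)
groups = defaultdict(list)
for course in courses:
    groups[(course.department, course.specialization)].append(course)

# every unordered pair within a group is similar in both directions
similar = set()
for members in groups.values():
    for a, b in combinations(members, 2):
        similar.add((a.name, b.name))
        similar.add((b.name, a.name))

print(sorted(similar))
```

Only C1 and C2 share both department and specialization, so they form the only similar pair, recorded once in each direction.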
Finally, we implement the class Student(Object) to add students to the world. As we did for the difficulty of courses, we use AttrDist to determine the intelligence of the student:
self["intelligence"] = AttrDist({"weak": 0.2, "average": 0.6, "smart": 0.2}).generate()
According to the intelligence of the student, we determine a basic distribution of the grades the student can achieve. We randomly choose how many advisors the student has and use a Chooser to draw the student's advisors from the scientist container:
numAdvisors = AttrDist({0: 0.3, 1: 0.5, 2: 0.2}).generate()
adviserChoser = Chooser(world,"scientists",0)
for i in range(numAdvisors):
    adviserChoser.choose().advise(self)
The advise(student) function of the implemented scientist creates the advises(advisor,student) link and adds the advisor to the student's advisor list.
The grades for the individual courses are set in setGrades(). Since the grade(student,course) function does not have a Boolean domain, we need to set an attribute on the link object as described above. To do that, we first create a predicate gotGraded and link the student to all courses. We then add the attribute grade to the created link to obtain the desired function. As a consequence, the training database also contains gotGraded atoms for courses the student didn't take, with the grade set to None:
gotGraded(Student1,Course1) = True
grade(Student1,Course1) = None
However, we will not include the gotGraded predicate in the model. For each course the student did take, the grade distribution is altered according to the difficulty of the course, whether the teacher of the course likes at least one of the student's advisors, and whether the student takes a similar course. The grade is then sampled from the altered distribution using AttrDist, and the grade attribute is set to the sampled value.
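The adjustment step can be sketched in isolation as follows. The shift values and the base distribution are the ones used in the generator script. Everything else is an assumption made for this sketch: the function name `adjusted_grade_dist`, the copy of the base distribution, the clamping of negative weights to zero, and the use of `random.choices` in place of AttrDist.

```python
import random

# shifts applied to the grade weights, taken from the generator script
HARD = {"A": -0.10, "B": -0.05, "D": 0.10, "F": 0.05}
LIKES_ADVISOR = {"A": 0.25, "B": 0.10, "D": -0.05, "F": -0.05}
SIMILAR_COURSE = {"A": 0.10, "B": 0.05, "D": -0.05, "F": -0.05}


def adjusted_grade_dist(base, hard=False, likesAdvisor=False, similarCourse=False):
    dist = dict(base)  # copy, so the student's base distribution stays untouched
    for active, shifts in ((hard, HARD), (likesAdvisor, LIKES_ADVISOR), (similarCourse, SIMILAR_COURSE)):
        if active:
            for grade, delta in shifts.items():
                dist[grade] += delta
    # assumption of this sketch: negative weights are clamped to zero
    return {grade: max(weight, 0.0) for grade, weight in dist.items()}


base = {"A": 0.4, "B": 0.07, "C": 0.02, "D": 0.00, "F": 0.00}
dist = adjusted_grade_dist(base, likesAdvisor=True)
grade = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(dist, grade)
```

Working on a copy keeps the adjustments for one course from leaking into the next, and the clamping guards against weights such as D dropping below zero for smart students.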
To obtain a database from the created world, we use the getDatabase method of the world object. The database generation tool expects a variable called db to exist in the global scope.
db = world.getDatabase()
The complete generator script reads as follows:
import sys
from datagen import *
from random import choice, randint, shuffle
similarCourse = True
class Course(Object):
    def __init__(self, world, dep, spec, prof):
        Object.__init__(self, "course")
        self.department = dep
        self.specialization = spec
        self.prof = prof
        self["difficulty"] = AttrDist({"easy": 0.7, "hard": 0.3}).generate()
        world.addObject(self)
class Scientist(Object):
    def __init__(self, world, type, dep, spec):
        Object.__init__(self, "scientist")
        self.type = type
        self.department = dep
        self.specialization = spec
        self.linkto("likes", self) # every person likes him- or herself
        # create the courses taught by professors
        if self.type == "professor":
            numCourses = randint(1,3)
            for i in range(numCourses):
                course = Course(world, self.department, self.specialization, self)
                self.linkto("teaches", course)
        world.addObject(self)
class Student(Object):
    def __init__(self,world):
        Object.__init__(self, "student")
        self.world = world
        self["intelligence"] = AttrDist({"weak": 0.2, "average": 0.6, "smart": 0.2}).generate()
        self.coursesTaken = []
        self.initGradeDists = {
            "weak": {"A": 0.1, "B": 0.15, "C": 0.16, "D": 0.045, "F": 0.07},
            "average": {"A": 0.145, "B": 0.23, "C": 0.10, "D": 0.02, "F": 0.005},
            "smart": {"A": 0.4, "B": 0.07, "C": 0.02, "D": 0.00, "F": 0.00} }
        self.gradeDist = self.initGradeDists[self["intelligence"]]
        # assign advisors
        numAdvisors = AttrDist({0: 0.3, 1: 0.5, 2: 0.2}).generate()
        self.advisors = world.getContainer("scientist").sampleSet(numAdvisors, numAdvisors)
        for advisor in self.advisors:
            advisor.linkto("advises", self)
        self.takeCourses()
        self.setGrades()
        world.addObject(self)
    def takeCourses(self):
        for course in self.world.getContainer("course").sampleSet(3, 6):
            self.coursesTaken.append(course)
            self.linkto("takes",course)
        # create "takesSimilarCourse" predicate
        for i in range(len(self.coursesTaken)):
            course = self.coursesTaken[i]
            for course2 in self.coursesTaken[i+1:]:
                if course.department == course2.department and course.specialization == course2.specialization:
                    self.linkto("takesSimilarCourse",course)
                    self.linkto("takesSimilarCourse",course2)
    def setGrades(self):
        # courses not taken still get a gotGraded link, with the grade set to "None"
        notTaken = filter(lambda x: x not in self.coursesTaken, self.world.getContainer("course"))
        for c in notTaken:
            gradeLink = self.linkto("gotGraded",c)
            gradeLink["grade"] = "None"
        for c in self.coursesTaken:
            # work on a copy so that adjustments for one course do not carry over to the next
            localGradeDist = dict(self.gradeDist)
            likesAdvisor = False
            similarCourse = False
            for advisor in self.advisors:
                if advisor in c.prof.getPartners("likes"):
                    likesAdvisor = True
            for c2 in self.coursesTaken:
                if c2 in c.getPartners("similar"):
                    similarCourse = True
            if c["difficulty"] == "hard":
                localGradeDist["A"] -= 0.1
                localGradeDist["B"] -= 0.05
                localGradeDist["D"] += 0.10
                localGradeDist["F"] += 0.05
            if likesAdvisor:
                localGradeDist["A"] += 0.25
                localGradeDist["B"] += 0.1
                localGradeDist["D"] -= 0.05
                localGradeDist["F"] -= 0.05
            if similarCourse:
                localGradeDist["A"] += 0.1
                localGradeDist["B"] += 0.05
                localGradeDist["D"] -= 0.05
                localGradeDist["F"] -= 0.05
            gradeLink = self.linkto("gotGraded",c)
            gradeLink["grade"] = AttrDist(localGradeDist).generate()
numStudents = 80
world = World()
# generate professors and assistants
departments = {
    "Philosophy": ["Metaphysics", "Ethics", "Epistemology"],
    "Computer Science" : ["Artificial Intelligence", "Theoretical CS", "Databases"],
    "Maths": ["Statistics", "Numerics", "Arithmetics"]
}
for dep in departments:
    # iterate over the specializations of the department, not over the characters of its name
    for spec in departments[dep]:
        numProfs = randint(1,3) # randint is imported from random above
        numAdvisors = 4-numProfs
        for i in range(numProfs):
            Scientist(world, 'professor', dep, spec)
        for i in range(numAdvisors):
            Scientist(world, 'assistant', dep, spec)
# create the "likes" relation
scientists = world.getContainer("scientist")
for s in scientists:
    for s2 in scientists.sampleSet(1, len(scientists)):
        s.linkto("likes", s2)
# create course similarities
courses = world.getContainer("course")
for i, course in enumerate(courses):
    for course2 in courses[i+1:]:
        if course.department == course2.department and course.specialization == course2.specialization:
            course.linkto("similar", course2)
            course2.linkto("similar", course)
# create students
for i in range(numStudents):
    student = Student(world)
# create the teacherOfLikesAdvisorOf predicate
courses = world.getContainer("course")
numCourses = len(courses)
for i in range(numCourses):
    prof = courses[i].prof
    for likedProf in prof.getPartners("likes"):
        for stud in likedProf.getPartners("advises"):
            courses[i].linkto("teacherOfLikesAdvisorOf",stud)
# create the database
db = world.getDatabase()