Mishra, Varun and Kaul, Manohar
(2018)
Document Simplicial Complex.
Masters thesis, Indian Institute of Technology Hyderabad.
Abstract
A $k$-simplex is a $k$-dimensional geometric structure, namely the convex hull
of $k+1$ points. Given $k+1$ points $x_0, \dots, x_k \in \mathbb{R}^k$ which are
affinely independent, the set
$$C = \left\{\, a_0 x_0 + \dots + a_k x_k \;\middle|\; \sum_{i=0}^{k} a_i = 1 \text{ and } a_i \ge 0 \text{ for all } i \,\right\}$$
is defined as the $k$-simplex determined by them. The simplex is a basic building
block in abstract topology. A collection of simplices (also called simplexes) that
satisfies certain conditions is called a geometric simplicial complex, which helps to
analyze a geometric structure on a larger scale. An abstract simplicial complex is a
purely combinatorial description of the geometric notion of a simplicial complex,
consisting of a family of nonempty finite sets closed under the operation of taking
nonempty subsets.
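The closure condition in the definition of an abstract simplicial complex can be made concrete with a small sketch. The function names `downward_closure` and `is_abstract_simplicial_complex` and the toy vertex labels are illustrative assumptions, not notation from the thesis.

```python
from itertools import combinations


def downward_closure(simplices):
    """Return every nonempty subset (face) of each given vertex set."""
    family = set()
    for simplex in simplices:
        vertices = sorted(set(simplex))
        for size in range(1, len(vertices) + 1):
            for face in combinations(vertices, size):
                family.add(frozenset(face))
    return family


def is_abstract_simplicial_complex(family):
    """Check that a family of nonempty finite sets is closed under
    taking nonempty subsets."""
    family = {frozenset(s) for s in family}
    if any(len(s) == 0 for s in family):
        return False
    return all(
        frozenset(face) in family
        for s in family
        for size in range(1, len(s))
        for face in combinations(s, size)
    )


# A single 2-simplex (triangle) on hypothetical vertices a, b, c,
# together with all of its faces.
triangle = downward_closure([("a", "b", "c")])
```

Here `triangle` contains the three vertices, the three edges, and the triangle itself, whereas a family holding only the edge `{"a", "b"}` fails the check because its vertices are missing.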
A text document can be viewed as a geometric structure in topology. A document
is defined as a collection of words, where each word belongs to a vocabulary and
carries a certain meaning. An $n$-gram is a contiguous sequence of $n$ items from a
given sample of text. Using $n$-grams to define simplices, we can construct an
abstract simplicial complex from every text document. In this model, each simplex
captures local structure or behavior, while the document simplicial complex, which
is the collection of all $(n-1)$-simplices, captures the global behavior of the
document. We study this in the setting of a bag of documents, i.e. the universal
set of documents.
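As a rough illustration of this construction, the sketch below turns each $n$-gram into a simplex on its distinct words and adds all of its faces. The whitespace tokenization, the lowercasing, the default `n=3`, and the function names are assumptions made for the example; the abstract does not specify these details.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Contiguous length-n windows over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_complex(text, n=3):
    """Abstract simplicial complex whose maximal simplices are the
    vertex sets of the document's n-grams (assumed construction)."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenizer
    family = set()
    for gram in ngrams(tokens, n):
        vertices = sorted(set(gram))  # repeated words collapse to one vertex
        for size in range(1, len(vertices) + 1):
            for face in combinations(vertices, size):
                family.add(frozenset(face))
    return family


doc = "after clearing high school one joins college"
K = document_complex(doc, n=3)
```

Words that co-occur within one trigram, such as "high" and "school", span an edge of `K`, while distant words such as "after" and "college" do not; this is the sense in which individual simplices record local structure.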
The aim of this thesis is to understand the abstract structure admitted by text
documents in order to find similar documents more accurately within a given family
of text documents. In our discussion, we view a document as a geometric entity
and use this representation to speed up querying: given a query document, one can
find the semantically similar documents more efficiently, in terms of both time and
similarity. For example, given the set of documents {1. "After clearing high school
one joins college", 2. "College can be joined only after passing high school",
3. "High school and college must be attended by everyone"}, documents 1 and 2 are
more semantically similar than 1 and 3 or 2 and 3.
After a brief look at abstract topology, we study the topological structure and
behavior of text documents. A novel representation of documents is given in this
thesis. Using this new structure, we represent each text document as a geometric
entity which can then be analyzed using topological tools. Using the Earth Mover's
distance and the Hausdorff distance, we give a new formulation for retrieving
semantically similar documents for a given query. To represent documents as a
mathematical structure in some $\mathbb{R}^k$, we use the Word2Vec model to find a
vector representation of each word in a text document.
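To make the Hausdorff distance concrete, here is a minimal sketch that compares two documents represented as finite point sets. The two-dimensional coordinates are invented stand-ins for Word2Vec word vectors, and the function names are hypothetical; the thesis's actual formulation, including its use of the Earth Mover's distance, is not reproduced here.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Toy stand-ins for word vectors (assumed values, not real embeddings).
doc_a = [(0.0, 0.0), (1.0, 0.0)]
doc_b = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]

distance = hausdorff(doc_a, doc_b)
```

Every point of `doc_a` also appears in `doc_b`, so the directed distance from `doc_a` to `doc_b` is zero; the symmetric distance is driven entirely by the extra point in `doc_b`, which is why both directions must be taken into account.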
