Document Simplicial Complex

Mishra, Varun and Kaul, Manohar (2018) Document Simplicial Complex. Masters thesis, Indian Institute of Technology Hyderabad.

Thesis_Mtech_CS_4076.pdf - Published Version


Abstract

A k-simplex is defined as a k-dimensional geometric structure that is the convex hull of k+1 points. Given k+1 affinely independent points $x_0, \dots, x_k \in \mathbb{R}^k$, the set

$$C = \left\{ a_0 x_0 + \dots + a_k x_k \;\middle|\; \sum_{i=0}^{k} a_i = 1 \text{ and } a_i \ge 0 \text{ for all } i \right\}$$

is defined as the k-simplex determined by them. A simplex is a very basic building block in abstract topology. A collection of simplexes (or simplices) satisfying certain conditions is called a geometric simplicial complex, which helps to analyze a geometric structure at a larger scale. An abstract simplicial complex is a purely combinatorial description of the geometric notion of a simplicial complex, consisting of a family of non-empty finite sets closed under the operation of taking non-empty subsets.

A text document can be visualized as a geometric structure in topology. A document is defined as a collection of words, where each word is considered to be part of a vocabulary and to have a certain meaning. An n-gram is a contiguous sequence of n items from a given sample of text. Using the n-gram concept to define a simplex, we can construct an abstract simplicial complex out of every text document. In this model, every simplex captures local structure or behavior, while a document simplicial complex, which is the collection of all (n-1)-simplices, captures the global behavior of the document. We study this under the assumption that we have a bag of documents, i.e. the universal set of documents.

The aim of this thesis is to understand the abstract structure admitted by text documents in order to find similar documents more accurately within a given family of text documents. In our discussion, we visualize a document as a geometric entity and use this representation to speed up querying: given a query document, one can find the semantically similar documents more efficiently, in terms of both time and similarity.
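The n-gram construction described above can be illustrated with a minimal sketch. It assumes that each n-gram contributes the set of its distinct words as one simplex and that the complex is obtained by adding every non-empty subset (face) of that set. The abstract does not spell out the thesis's exact construction, so the function name `document_simplicial_complex`, the whitespace tokenization, and the choice n = 3 are illustrative assumptions only.

```python
from itertools import combinations


def document_simplicial_complex(text, n=3):
    """Build an abstract simplicial complex from the n-grams of a text.

    Assumption (not taken from the thesis): the distinct words of each
    n-gram form one simplex, and all of its non-empty subsets are added
    so that the family is closed under taking non-empty subsets.
    """
    words = text.lower().split()  # naive whitespace tokenization
    simplices = set()
    for i in range(len(words) - n + 1):
        vertices = sorted(set(words[i:i + n]))
        # Add every non-empty face of the simplex spanned by this n-gram.
        for size in range(1, len(vertices) + 1):
            for face in combinations(vertices, size):
                simplices.add(frozenset(face))
    return simplices


doc = "after clearing high school one joins college"
asc = document_simplicial_complex(doc, n=3)

# Words that share an n-gram span a simplex; words that never do, do not.
print(frozenset({"after", "clearing", "high"}) in asc)
print(frozenset({"after", "college"}) in asc)
```

Storing faces as `frozenset` objects keeps the structure purely combinatorial, in line with the definition of an abstract simplicial complex; no coordinates are needed at this stage.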
For example, given the set of documents {1. "after clearing high school one joins college", 2. "College can be joined only after passing high school", and 3. "High school and college must be attended by everyone"}, documents 1 and 2 are more semantically similar than 1 and 3 or 2 and 3.

After a brief glance at abstract topology, we study the topological structure and behavior of text documents. A novel representation of documents is given in this thesis. Using this new structure, we represent each document as a geometric entity that can be further analyzed using topological tools. Using the Earth Mover's distance and the Hausdorff distance, we give a new formulation for fetching semantically similar documents for a given query. To represent documents as a mathematical structure in some $\mathbb{R}^k$, we use the Word2Vec model to find a vector representation of each word in a text document.
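As a hedged illustration of one of the distances named above, the following sketch computes the standard symmetric Hausdorff distance between two finite point sets. How the thesis combines the Hausdorff and Earth Mover's distances over simplices is not stated in the abstract, so this is not the thesis's formulation. The two-dimensional vectors below are made-up toy values standing in for word embeddings, not Word2Vec output.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between finite point sets A and B."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Toy stand-ins for the word vectors of two documents (assumed values).
doc_a = [(0.0, 0.0), (1.0, 0.0)]
doc_b = [(0.0, 0.0), (3.0, 0.0)]

h = hausdorff_distance(doc_a, doc_b)
print(h)
```

Because the distance is a maximum over nearest-neighbor distances, a single word vector far from every vector of the other document dominates the result, which is one reason it is natural to pair it with a transport-based measure such as the Earth Mover's distance.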

IITH Creators: Kaul, Manohar (ORCiD: UNSPECIFIED)
Item Type: Thesis (Masters)
Uncontrolled Keywords: Simplex, ASC, EMD
Subjects: Computer science
Divisions: Department of Computer Science & Engineering
Depositing User: Team Library
Date Deposited: 26 Jun 2018 04:57
Last Modified: 26 Jun 2018 04:57
URI: http://raiith.iith.ac.in/id/eprint/4076
