Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
WNDB(5WN)		    WordNettm File Formats		     WNDB(5WN)

NAME
       index.noun,  data.noun, index.verb, data.verb, index.adj, data.adj, in-
       dex.adv,	data.adv - WordNet database files

       noun.exc, verb.exc. adj.exc adv.exc - morphology	exception lists

       sentidx.vrb, sents.vrb -	files used by search code to display sentences
       illustrating the	use of some specific verbs

DESCRIPTION
       For each	syntactic category, two	files are needed to represent the con-
       tents of	the WordNet database - index.pos and data.pos,	where  pos  is
       noun,  verb,  adj  and  adv.  The other auxiliary files are used	by the
       WordNet library's searching functions and are needed to run the various
       WordNet browsers.

       Each index file is an alphabetized list of all the words	found in Word-
       Net in the corresponding	part of	speech.	 On each line,	following  the
       word,  is  a list of byte offsets (synset_offsets) in the corresponding
       data file, one for each synset containing the word.  Words in the index
       file are	in lower case only, regardless of how they were	entered	in the
       lexicographer files.  This folds	various	 orthographic  representations
       of  the word into one line enabling database searches to	be case	insen-
       sitive.	See wninput(5WN) for a detailed	description of the  lexicogra-
       pher files

       A data file for a syntactic category contains information corresponding
       to the synsets that were	specified in the lexicographer files, with re-
       lational	pointers resolved to synset_offsets.  Each line	corresponds to
       a synset.  Pointers are followed	and hierarchies	 traversed  by	moving
       from one	synset to another via the synset_offsets.

       The  exception  list files, pos.exc, are	used to	help the morphological
       processor find base forms from irregular	inflections.

       The files sentidx.vrb and sents.vrb contain sentences illustrating  the
       use  of	specific  senses  of  some verbs.  These files are used	by the
       searching software in response to a request for verb  sentence  frames.
       Generic	sentence frames	are displayed when an illustrative sentence is
       not present.

       The various database files are in ASCII formats that are	easily read by
       both humans and machines.  All fields, unless otherwise noted, are sep-
       arated by one space character, and all lines are	terminated by  a  new-
       line  character.	 Fields	enclosed in italicized square brackets may not
       be present.

       See wngloss(7WN)	for a glossary of WordNet terminology and a discussion
       of the database's content and logical organization.

   Index File Format
       Each  index  file  begins with several lines containing a copyright no-
       tice, version number and	license	agreement.  These lines	all begin with
       two spaces and the line number so they do not interfere with the	binary
       search algorithm	that is	used to	look up	entries	in  the	 index	files.
       All  other  lines  are  in the following	format.	 In the	field descrip-
       tions, number always refers to a	decimal	integer	unless	otherwise  de-
       fined.

       lemma  pos  synset_cnt  p_cnt  [ptr_symbol...]  sense_cnt  tagsense_cnt	 synset_offset	[synset_offset...]

       lemma	      lower  case ASCII	text of	word or	collocation.  Colloca-
		      tions are	formed by joining individual words with	an un-
		      derscore (_) character.

       pos	      Syntactic	 category: n for noun files, v for verb	files,
		      a	for adjective files, r for adverb files.

       All remaining fields are	with respect to	senses of lemma	in pos.

       synset_cnt     Number of	synsets	that lemma is in.  This	is the	number
		      of  senses of the	word in	WordNet. See Sense Numbers be-
		      low for a	discussion of how sense	numbers	 are  assigned
		      and the order of synset_offsets in the index files.

       p_cnt	      Number  of  different  pointers  that  lemma  has	in all
		      synsets containing it.

       ptr_symbol     A	space separated	 list  of  p_cnt  different  types  of
		      pointers	that  lemma  has in all	synsets	containing it.
		      See wninput(5WN) for a list of pointer_symbols.  If  all
		      senses  of lemma have no pointers, this field is omitted
		      and p_cnt	is 0.

       sense_cnt      Same as sense_cnt	above.	This  is  redundant,  but  the
		      field was	preserved for compatibility reasons.

       tagsense_cnt   Number  of  senses of lemma that are ranked according to
		      their frequency of occurrence  in	 semantic  concordance
		      texts.

       synset_offset  Byte  offset  in	data.pos  file	of a synset containing
		      lemma.  Each synset_offset in the	list corresponds to  a
		      different	 sense	of lemma in WordNet.  synset_offset is
		      an 8 digit, zero-filled decimal integer that can be used
		      with fseek(3) to read a synset from the data file.  When
		      passed to	read_synset(3WN) along with the	syntactic cat-
		      egory,  a	data structure containing the parsed synset is
		      returned.

   Data	File Format
       Each data file begins with several lines	containing a copyright notice,
       version	number	and license agreement.	These lines all	begin with two
       spaces and the line number.  All	other lines are	in the following  for-
       mat.  Integer fields are	of fixed length, and are zero-filled.

       synset_offset  lex_filenum  ss_type  w_cnt  word	 lex_id	 [word	lex_id...]  p_cnt  [ptr...]  [frames...]  |  gloss

       synset_offset  Current  byte  offset  in	 the  file represented as an 8
		      digit decimal integer.

       lex_filenum    Two digit	decimal	integer	corresponding to the  lexicog-
		      rapher  file  name  containing  the  synset.   See  lex-
		      names(5WN) for the list of filenames  and	 their	corre-
		      sponding numbers.

       ss_type	      One character code indicating the	synset type:

		      n	   NOUN
		      v	   VERB
		      a	   ADJECTIVE
		      s	   ADJECTIVE SATELLITE
		      r	   ADVERB

       w_cnt	      Two  digit  hexadecimal integer indicating the number of
		      words in the synset.

       word	      ASCII form of a word as entered in  the  synset  by  the
		      lexicographer,  with spaces replaced by underscore char-
		      acters (_).  The text of the word	is case	sensitive,  in
		      contrast	to  its	 form  in  the corresponding index.pos
		      file, that contains only lower-case forms.  In data.adj,
		      a	 word  is  followed  by	 a syntactic marker if one was
		      specified	in the lexicographer file.  A syntactic	marker
		      is  appended,  in	parentheses, onto word without any in-
		      tervening	spaces.	 See wninput(5WN) for a	 list  of  the
		      syntactic	markers	for adjectives.

       lex_id	      One  digit  hexadecimal integer that, when appended onto
		      lemma, uniquely identifies a sense within	 a  lexicogra-
		      pher file.  lex_id numbers usually start with 0, and are
		      incremented as additional	senses of the word  are	 added
		      to  the same file, although there	is no requirement that
		      the numbers be consecutive or begin with 0.  Note	that a
		      value  of	0 is the default, and therefore	is not present
		      in lexicographer files.

       p_cnt	      Three digit decimal integer  indicating  the  number  of
		      pointers from this synset	to other synsets.  If p_cnt is
		      000 the synset has no pointers.

       ptr	      A	pointer	from this synset to another.  ptr  is  of  the
		      form:

		      pointer_symbol  synset_offset  pos  source/target

		      where  synset_offset  is	the  byte offset of the	target
		      synset in	the data file corresponding to pos.

		      The source/target	field distinguishes lexical and	seman-
		      tic  pointers.   It is a four byte field,	containing two
		      two-digit	hexadecimal integers.  The  first  two	digits
		      indicates	 the  word  number  in	the  current  (source)
		      synset, the last two digits indicate the word number  in
		      the   target   synset.   A  value	 of  0000  means  that
		      pointer_symbol represents	a  semantic  relation  between
		      the  current (source) synset and the target synset indi-
		      cated by synset_offset.

		      A	 lexical  relation  between  two  words	 in  different
		      synsets  is represented by non-zero values in the	source
		      and target word numbers.	The first and last  two	 bytes
		      of  this	field  indicate	the word numbers in the	source
		      and target synsets, respectively,	between	which the  re-
		      lation  holds.   Word  numbers  are assigned to the word
		      fields in	a synset, from left to right,  beginning  with
		      1.

		      See  wninput(5WN)	for a list of pointer_symbols, and se-
		      mantic and lexical pointer classifications.

       frames	      In data.verb only, a list	of  numbers  corresponding  to
		      the  generic  verb  sentence  frames  for	 words	in the
		      synset.  frames is of the	form:

		      f_cnt   +	  f_num	 w_num	[ +   f_num  w_num...]

		      where f_cnt a two	digit decimal integer  indicating  the
		      number  of  generic  frames listed, f_num	is a two digit
		      decimal integer frame number, and	w_num is a  two	 digit
		      hexadecimal  integer  indicating	the word in the	synset
		      that the frame applies to.  As with  pointers,  if  this
		      number  is 00, f_num applies to all words	in the synset.
		      If non-zero, it is applicable only  to  the  word	 indi-
		      cated.   Word  numbers  are  assigned  as	 described for
		      pointers.	 Each f_num  w_num pair	is preceded  by	 a  +.
		      See  wninput(5WN)	 for  the text of the generic sentence
		      frames.

       gloss	      Each synset contains a gloss.  A gloss is	represented as
		      a	 vertical bar (|), followed by a text string that con-
		      tinues until the end of the line.	 The gloss may contain
		      a	definition, one	or more	example	sentences, or both.

   Sense Numbers
       Senses  in  WordNet are generally ordered from most to least frequently
       used, with the most common sense	numbered 1.  Frequency of use  is  de-
       termined	by the number of times a sense is tagged in the	various	seman-
       tic concordance texts.  Senses that are not semantically	tagged	follow
       the  ordered  senses.  The tagsense_cnt field for each entry in the in-
       dex.pos files indicates how many	of the senses in the  list  have  been
       tagged.

       The  cntlist(5WN)  file	provided with the database lists the number of
       times each sense	is tagged in the semantic concordances.	 The data from
       cntlist	is  used by grind(1WN) to order	the senses of each word.  When
       the index.pos files are generated, the  synset_offsets  are  output  in
       sense  number  order,  with sense 1 first in the	list.  Senses with the
       same number of semantic tags are	assigned unique	but consecutive	 sense
       numbers.	 The WordNet OVERVIEW search displays all senses of the	speci-
       fied word, in all syntactic categories,	and  indicates	which  of  the
       senses are represented in the semantically tagged texts.

   Exception List File Format
       Exception  lists	are alphabetized lists of inflected forms of words and
       their base forms.  The first field of each line is an  inflected	 form,
       followed	 by  a	space  separated list of one or	more base forms	of the
       word.  There is one exception list file for each	syntactic category.

       Note that the noun and verb exception lists were	 automatically	gener-
       ated  from  a  machine-readable dictionary, and contain many words that
       are not in WordNet.  Also, for many of the inflected forms, base	 forms
       could  be  easily  derived  using the standard rules of detachment pro-
       grammed into Morphy (See	morph(7WN)).  These anomalies are  allowed  to
       remain in the exception list files, as they do no harm.

   Verb	Example	Sentences
       For  some  verb	senses,	 example sentences illustrating	the use	of the
       verb sense can be displayed.  Each line of the  file  sentidx.vrb  con-
       tains a sense_key followed by a space and a comma separated list	of ex-
       ample sentence template numbers,	in decimal.  The file sents.vrb	 lists
       all  of the example sentence templates.	Each line begins with the tem-
       plate number followed by	a space.  The rest of the line is the text  of
       a  template example sentence, with %s used as a placeholder in the text
       for the verb.   Both  files  are	 sorted	 alphabetically	 so  that  the
       sense_key and template sentence number can be used as indices, via bin-
       srch(3WN), into the appropriate file.

       When a request for FRAMES is made, the WordNet search  code  looks  for
       the sense in sentidx.vrb.  If found, the	sentence template(s) listed is
       retrieved from sents.vrb, and the %s is replaced	with the verb.	If the
       sense  is not found, the	applicable generic sentence frame(s) listed in
       frames is displayed.

NOTES
       Information in the data.pos and index.pos files represents all  of  the
       word senses and synsets in the WordNet database.	 The word, lex_id, and
       lex_filenum fields together uniquely identify each word sense in	 Word-
       Net.   These  can  be  encoded  in  a  sense_key	 as  described in sen-
       seidx(5WN).  Each synset	in the database	can be uniquely	identified  by
       combining  the synset_offset for	the synset with	a code for the syntac-
       tic category (since it is possible for synsets  in  different  data.pos
       files to	have the same synset_offset).

       The  WordNet  system provide both command line and window-based browser
       interfaces to the database.  Both interfaces utilize a  common  library
       of search and morphology	code.  The source code for the library and in-
       terfaces	is included in the WordNet package.  See wnintro(3WN)  for  an
       overview	of the WordNet source code.

ENVIRONMENT VARIABLES (UNIX)
       WNHOME		   Base	 directory  for	 WordNet.  Default is /usr/lo-
			   cal/WordNet-3.0.

       WNSEARCHDIR	   Directory in	which the WordNet  database  has  been
			   installed.  Default is WNHOME/dict.

REGISTRY (WINDOWS)
       HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
			   Base	 directory  for	 WordNet.   Default is C:\Pro-
			   gram	Files\WordNet\3.0.

FILES
       index.pos	   database index files

       data.pos		   database data files

       *.vrb		   files of sentences illustrating the use of verbs

       pos.exc		   morphology exception	lists

SEE ALSO
       grind(1WN),  wn(1WN),  wnb(1WN),	 wnintro(3WN),	 binsrch(3WN),	 wnin-
       tro(5WN),  cntlist(5WN),	 lexnames(5WN),	 senseidx(5WN),	 wninput(5WN),
       morphy(7WN), wngloss(7WN), wngroups(7WN), wnstats(7WN).

WordNet	3.0			   Dec 2006			     WNDB(5WN)

NAME | DESCRIPTION | NOTES | ENVIRONMENT VARIABLES (UNIX) | REGISTRY (WINDOWS) | FILES | SEE ALSO

Want to link to this manual page? Use this URL:
<https://www.freebsd.org/cgi/man.cgi?query=wndb&sektion=5wn&manpath=FreeBSD+13.0-RELEASE+and+Ports>

home | help