Biology has become a large-scale comparative science. Significant advances in computational and sequencing techniques have produced massive amounts of data that need to be deciphered, organized, and correctly annotated. Existing manual curation methods are often laborious, time-consuming, and error-prone. Furthermore, current automated techniques are either too computationally expensive for rapid curation or may produce misleading results that undermine accurate curation. New automated techniques are needed to alleviate the burden of manual curation without sacrificing accuracy.
This dissertation introduces novel methods and approaches for the organization, classification, and subsequent annotation of sequences from published and unpublished microbial genomic resources. The presented tools and resources perform these tasks in a fraction of the time required by existing methods, while still achieving high levels of precision (92%--100%) and recall (89%--99%). To demonstrate these tools on publicly available data, I attempt to identify genes that are uniquely associated with the DNA segregation parA gene family. The tools and resources established here are steps toward a general comparative genomic framework that enables ready access to all available microbial sequence data while integrating them with current genetic knowledge bases.