Analysis of genome variation in CEs of meningococcal isolates

Two main conclusions were drawn from this analysis: the same variable CEs were present in multiple isolates from different time points, which suggests that they are under selection, and some SNPs were located within the conserved functional patterns of CEs. In the 40 isolates, 9 out of 39 variable CEs were variable in multiple isolates, each affecting around 20% or more of the population. Similarly, in the 25 pairs of isolates, 12 out of 137 variable CEs were variable in multiple isolates belonging to the same or different CCs, each affecting around 10% or more of the 25 pairs. Interestingly, some CEs showed variation within promoters or IHF patterns or were inserted inside genes, and were associated with genes coding mainly for metabolic enzymes or outer membrane transporters in both the 40 isolates and the 25 pairs of isolates. This variation may change the level of transcription of these genes, and the altered expression may help N. meningitidis to resist the immune system. There were 6 out of 39 and 15 out of 137 variable CEs that showed variation within the conserved functional patterns in the 40 isolates and the 25 pairs of isolates, respectively. CEs may therefore also act as a source of variation in N. meningitidis.


Introduction
N. meningitidis has a high abundance of different types of repeats. These repeats have many functions, some of which may aid survival of N. meningitidis in the host. The repeat types include SSR, CEs, NIME and several types of repetitive extragenic palindromes (REP2, 3, 4, 5) (Bentley et al., 2007). CEs have a major role in controlling the expression of different genes through sequence motifs that include the Snyder promoter, the Black promoter and a binding site for integration host factor (IHF) (Siddique et al., 2011; Snyder et al., 2009). In addition, they can terminate transcription when inserted inside genes, or can act as targets for intergenic recombination (Snyder et al., 2009). It has also been reported that CEs may control the expression of some small non-coding RNAs (Siddique et al., 2011). CEs have an inverted repeat of 26 bp at both ends. CEs are distributed randomly across the non-coding RNA (Roberts et al., 2016), but some researchers have suggested that they are mainly found in the IGRs of virulence, metabolic and transporter genes (Liu et al., 2002; Snyder et al., 2009). Some studies have shown that they can transfer between species by horizontal gene transfer, which makes them a source of variation (Buisine et al., 2002). Eight types of CEs have been reported. The classification of CEs depends on two aspects. Firstly, a complete CE is around 153 to 157 bp long and contains two promoters and an IHF site, whereas a partial CE is around 104-108 bp long and contains only the two promoters, because a deletion of 50 bp in the middle of the sequence covers the IHF site. This first aspect divides CEs into alpha or beta with a complete sequence, or alpha prime and beta prime with a partial sequence. Secondly, CEs contain three regions: the left arm, the middle and the right arm. The alpha and beta sequences are conserved patterns located within the left arm or the right arm of CEs. These conserved patterns can be combined to give the eight types of CEs (Siddique et al., 2011).
The general aim was to analyse the variation in CEs within two sets of isolates from persistent carriage. Because CEs are a source of variation between different strains, it is necessary to analyse variation in CEs within our strains in their host, which represents evolution over a short time scale (a period of months).

Materials and Methods
Description of Isolates
The first group of next generation sequencing (NGS) data consists of 40 meningococcal isolates collected from one asymptomatic carrier (V59), with ten isolates collected at each of four time points: 0, 1, 3 and 6 months. The latter three time points represent at least 1-6 months of host persistence; the first time point was in November 2008, and the later three were in December 2008, February 2009 and May 2009 (see Table 1). The second group consists of pairs of isolates from 25 carriers, representing 2/3 to 5/6 months of carriage (see Table 2). These two sets of isolates are a subset of the isolates taken from a carriage study performed on 190 students from Nottingham University between November 2008 and May 2009 (Bidmos et al., 2011). The samples were collected by taking a nasopharyngeal swab and spreading it on one half of a chocolate GC selective agar plate. To obtain multiple single colonies, the inoculum was spread with a loop over the other half of the selective agar plate. The plate was incubated at 37 °C to allow growth of N. meningitidis colonies. Single colonies were restreaked onto chocolate agar plates (Oxoid) and incubated overnight under the same conditions (Bidmos et al., 2011).

Description of the type and source of NGS data for the isolates
Genomic DNA was extracted from one isolate per carrier per time point using the DNeasy purification kit (Qiagen) and stored at 4 °C for subsequent analysis. In collaboration with Prof. Martin Maiden (University of Oxford), genome sequence data were generated on an Illumina HiSeq and assembled using Velvet (version 1.1) (Zerbino, 2010).
Assembled sequence data were loaded into the pubMLST.org/neisseria database, powered by the BIGSdb genomics platform (Bayliss, Harrison and Maiden, unpublished data). Prokka (Seemann, 2014) was used for further annotation of representative genomes (performed by Dr. M. Blades, BBASH, University of Leicester).

Perl scripts for extracting and manipulating CEs in the intergenic DNA sequence
CEs vary in length. Complete CEs span 153 to 157 bp with a 26 bp inverted repeat at both ends. Partial CEs span 104-108 bp and have a deletion of around 50 bp in the middle.
Alternatively, partial CEs span 60-62 bp and have a large deletion in the middle, with a 26 bp inverted repeat at one side (Snyder et al., 2009). CEs were searched for within the variable IGRs. The fasta_format.pl script (see previous Appendix 1) was used to concatenate the sequences of all variable IGRs in alignment format for each isolate. The extr_gen_isolate.pl script (see previous Appendix 2) was then used to extract the variable IGRs of each isolate from this alignment file and save them as separate multifasta files.
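The length classes described at the start of this section can be sketched as a simple lookup. This is a minimal illustration in Python, not the Perl scripts used in the study; the function name and the class labels are hypothetical, and only the length ranges come from the text.

```python
# Length ranges (bp) are taken from the text; the labels and the
# function name are illustrative assumptions.
CE_LENGTH_CLASSES = [
    ("complete", 153, 157),
    ("partial, ~50 bp internal deletion", 104, 108),
    ("partial, large internal deletion", 60, 62),
]


def classify_ce_by_length(length_bp):
    """Return the length class of a CE, or 'unclassified' if none fits."""
    for label, shortest, longest in CE_LENGTH_CLASSES:
        if shortest <= length_bp <= longest:
            return label
    return "unclassified"


for length in (155, 106, 61, 120):
    print(length, classify_ce_by_length(length))
```

Length alone does not separate the alpha and beta forms, which depend on the conserved patterns in the left and right arms.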
To detect the CEs, the Significant_CE.pl script (Appendix 3) was designed to run a BLAST search with a CE template (complete CE, 157 bp) as the query against the multifasta file of every gene and IGR sequence of each isolate (the output of the extr_gen_isolate.pl script). The script reports the significance values of matching sequences (E-value < 1e-20, identity > 90%, alignment length > 50 bp) and the start and end points of the matches within each IGR. The variation_location.pl script (Appendix 4) was used to detect and report the positions of varied nucleotides in all variable IGRs for each pair of isolates. Finally, the positions of varied nucleotides in IGRs that showed significant matches to the template CE were checked further using the locate_CE.pl script (Appendix 5). The input files for the locate_CE.pl script were the outputs of the Significant_CE.pl and variation_location.pl scripts. If a varied nucleotide detected by the variation_location.pl script was located between the start and end points of a CE, the variation was reported as being due to the presence of CE patterns. These variable CEs were then examined for the presence of inverted repeats, IHF patterns, Black and Snyder promoters and (AC or AT) ending sites.
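The filtering and overlap steps described above can be sketched in Python as follows. This is not the authors' Perl code. It assumes that the BLAST results are available in the standard 12-column tabular layout (BLAST+ output format 6), that the E-value threshold is an upper bound (hits with an E-value below 1e-20 are kept), and that variant positions are 1-based coordinates within each IGR. All function names and sequence identifiers are hypothetical.

```python
def parse_significant_hits(blast_lines, max_evalue=1e-20,
                           min_identity=90.0, min_length=50):
    """Keep hits that pass the thresholds; return (igr_id, start, end)."""
    hits = []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        igr_id = fields[1]
        identity = float(fields[2])
        length = int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if evalue < max_evalue and identity > min_identity and length > min_length:
            # Reverse-strand hits have sstart > send, so order the ends.
            hits.append((igr_id, min(sstart, send), max(sstart, send)))
    return hits


def variants_inside_ce(hits, variant_positions):
    """Report variants that fall between the start and end of a CE hit."""
    inside = []
    for igr_id, start, end in hits:
        for pos in variant_positions.get(igr_id, []):
            if start <= pos <= end:
                inside.append((igr_id, pos, start, end))
    return inside


# Hypothetical input: one passing hit on the reverse strand, two failing hits.
example_blast = [
    "CE_template\tIGR_A\t95.5\t150\t6\t0\t1\t150\t349\t200\t1e-60\t250",
    "CE_template\tIGR_B\t85.0\t150\t22\t0\t1\t150\t10\t159\t1e-30\t150",
    "CE_template\tIGR_C\t97.0\t40\t1\t0\t1\t40\t5\t44\t1e-25\t70",
]
example_variants = {"IGR_A": [150, 210, 349, 400], "IGR_B": [20]}

ce_hits = parse_significant_hits(example_blast)
print(variants_inside_ce(ce_hits, example_variants))
```

Treating the interval as inclusive at both ends means that a variant at the first or last aligned base is counted as lying within the CE; whether the original script does the same is not stated.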

Results and Discussion
Analysing the dynamics of variation in CE of IGRs of a pool of 40 isolates from four different time points of one carrier
The CEs were identified in the IGRs of the 40 carriage isolates using the series of scripts set out previously. The majority of variation was located within different repeat tracts, especially CEs, which accounted for 71 intergenic loci. Some of the variation in CEs in the 40 isolates was not real but was due to poor assembly of the isolate sequences. This spurious variation was identified through a BLAST search between the variable CE patterns in the old assembly (using Velvet with single and paired-end data) and the new assembly (using SPAdes with paired-end data only). Where there was no identity between the old and new assemblies at the position of variation, the variation was confirmed as spurious. There were 32 loci out of 71 (45%) that carried CEs with poor assembly.
In the first instance, most variation in CEs was found in partial CEs: 14 out of 21 (66.7%), 13 out of 22 (59%) and 13 out of 25 (52%) in the first, second and third time periods, respectively. Conversely, in the fourth time period 15 out of 28 (53.6%) variable CEs were complete CEs. Variation in the CEs increased with time point, with the fourth time period having a higher number of variable CEs in IGRs than the other time periods. In addition, variation affected 3 out of 8 types of CEs in the first time period but 5 out of 8 types in the fourth time period (Fig. 1). The Poisson mean test (P < 0.001) revealed a significant difference in the number of variable CEs between the first and fourth time points.
Table 3: List of genes carrying IGRs with variation located in promoters and IHF of CE, and genes carrying variable CEs within their sequence, in the 40 isolates from one carrier. The variation in these CEs may have an important role in persistence of N. meningitidis for months.
Blue color: complete alpha-alpha, green color: complete alpha-beta, light green color: complete beta-beta, purple color: complete beta-alpha, red color: partial alpha-alpha´, orange color: partial beta-alpha´ and light blue color: partial beta-beta´. The number of CEs in IGRs of isolates was 197, 201, 203 and 200, while the number of variable CEs in IGRs was 21, 22, 25 and 28 at the first, second, third and fourth time points, respectively.
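The Poisson mean test cited above compares two event counts. The source does not say which form of the test was used, so the following Python sketch shows only one common form, the exact conditional test: under the null hypothesis of equal rates, the first count given the total follows a binomial distribution. The function name, the exposure parameters and the example counts are illustrative assumptions, not values from this study, and the sketch is not intended to reproduce the reported P values.

```python
from math import comb


def poisson_two_sample_p(k1, k2, exposure1=1.0, exposure2=1.0):
    """Two-sided exact conditional test for equal Poisson rates.

    k1, k2: observed event counts in the two groups.
    exposure1, exposure2: relative sampling effort (for example, the
    number of isolates); equal by default.
    """
    n = k1 + k2
    # Under H0, k1 given n is Binomial(n, p0).
    p0 = exposure1 / (exposure1 + exposure2)
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = pmf[k1]
    # Sum the probabilities of all outcomes no more likely than the observed
    # one; the small tolerance guards against floating-point ties.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts, for illustration only.
print(poisson_two_sample_p(5, 20))
print(poisson_two_sample_p(5, 20, exposure1=10, exposure2=40))
```

The exposure arguments allow groups of different size to be compared directly on raw counts, which is an alternative to normalizing the counts before testing.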
Although there were no significant trends for increasing variation in CEs with time point, it may be that selection was acting only on specific CEs, because the same variable CE patterns were found in multiple isolates across time periods. The same variable CEs affected 20% or more of the population, with 9 out of 39 CEs being variable in multiple isolates (Fig. 2).
Panels A, B, C and D represent the CEs having variation in one, two, three and four time points, respectively. The number of isolates is 10 for each time point. The CEs having variation in eight or more isolates affected 20% or more of the population (40 isolates).
The location of variation within CEs is crucial, as it may fall within the conserved functional patterns of the CE. Variation was found within the -35 position of the first promoter (Snyder promoter) in 3 out of 39 variable CEs (NEIS0474, NEIS1629, NEIS2006) and within the -35 position of the second promoter (Black promoter) in 1 out of 39 (NEIS1630). Changes at these positions may lead to a change in gene expression. These genes are mainly transporter genes, and their CEs were variable in multiple isolates across time points. Interestingly, variation was also found within the IHF pattern of the CE in 2 out of 39 (NEIS0237 and NEIS0008), which may change gene expression by changing the strength of binding of transcriptional binding factors. These genes are metabolism genes, and their CEs were variable in multiple isolates across time points. However, the distance between the CEs with variation in promoter regions or the IHF binding site and the start codon of the nearby gene may indicate that some variable CEs are not involved in changing the expression of their genes. Practical work therefore has to be carried out to confirm their role in affecting the level of transcription. A high frequency of SNPs was seen at other positions within the CE patterns, which may indicate that changes at these positions affect gene expression. These positions are two nucleotides after the -10 pattern of the Snyder promoter (T to C), two nucleotides after the -35 pattern of the Snyder promoter (G to A), two nucleotides before the -10 pattern of the Black promoter (A to G) and two nucleotides before the -35 pattern of the Black promoter (C to T) (Appendix 6, 7 Figures). These figures show that the overall variation in CEs in the IGRs was due to SNPs rather than indels. CEs were conserved in the IGRs of four different isolates, each selected arbitrarily from a particular time point (isolates: 20879, 20896, 20905 and 20911). Some IGRs that carried CEs were missing, particularly those at the ends of contigs, because the sequence was incomplete at seven loci (NEIS0060, NEIS1354, NEIS1457, NEIS1943, NEIS1944, NEIS1945 and NEIS1963). Therefore, there was no evidence of movement of CEs between different positions. CEs can move into genes in the same way as transposon-like elements and can terminate transcription of a gene by forming a loop-like structure. Variable CEs inserted within genic regions were detected in two genes, NEIS1702 and NEIS0500 (Table 3). Both genes encode hypothetical proteins, which may not have a role in persistence of N. meningitidis. However, the CEs inserted within NEIS1702 and NEIS0500 were found in all 40 isolates across the different time periods; therefore, there was no movement of CEs between time points among the 40 isolates, which represent at least 1-6 months of host persistence.

Analysing the dynamics of variation in CE of IGRs of the pairs of isolates from 25 carriers representing 2/3 to 5/6 months of carriage
The CEs were identified in the IGRs of each of the 25 pairs of isolates. There were 174 variable CEs in the combined analysis of the 25 pairs. However, some of this variation was due to poor assembly of the isolate sequences. Again, spurious variation was detected by a BLAST search of the variable CE patterns, as a lack of identity between the old and new assemblies at the position of variation. There were 37 loci out of 174 (21.3%) that carried CEs with poor assembly. The types of variable CEs were analysed for each CC for two different periods of carriage (2/3 and 5/6 months). In the first period, two to three months, the variable CEs were mainly complete CEs of the alpha-alpha type in all CCs. Variation in partial CEs of the alpha-alpha' type was found in CC-167 and CC-60, while no variable CEs were observed in CC-1157-32-269. In the second period, five to six months, the variable CEs were again mainly complete CEs of the alpha-alpha type in all CCs. Variable partial CEs of the alpha-alpha' type were found in CC-174 and CC-60 (Fig. 3).
Panel A: variation within different CCs for three months of carriage; panel B: variation within different CCs for six months of carriage.
After normalization for the number of isolates in each period, the number of variable CEs was 20.25 and 27.1 for the 2/3 and 5/6 month periods of carriage, respectively. The difference between the two periods was significant (P = 0.008, Poisson mean test). Variable CEs that are shared by multiple isolates among the 25 pairs may provide a greater advantage for the persistence of N. meningitidis than other CEs. The same variable CEs affected 10% or more of the 25 pairs of isolates in 12 out of 137 CEs. However, the location of variable nucleotides within the conserved functional patterns of the CE (promoters and IHF binding site) may give stronger evidence that CEs contribute to controlling gene expression. Variation was found within the -35 and -10 positions of the first promoter (Snyder promoter) in 3 out of 137 variable CEs and within the -35 and -10 positions of the second promoter (Black promoter) in 9 out of 137. Variation was also found within the IHF binding site in 3 out of 137. One variable CE was inserted in the NEIS0202 gene; however, there was again no evidence of movement of CEs within the 25 pairs of isolates over a short time scale (a period of months). Although more than 60% of the variable CEs with variation in the conserved functional patterns (two promoters and IHF binding site) are located upstream of genes encoding hypothetical proteins or tail to tail between adjacent genes (Table 4), the variation within these CEs may be involved in physiological activities of N. meningitidis, mostly in metabolism. Moreover, variation was also detected one nucleotide before or after the conserved functional patterns in ten variable CEs. Furthermore, a high frequency of variation was seen at several other positions. These positions are two nucleotides before the -10 pattern of the Black promoter (A to G), six and seven nucleotides before the -35 pattern of the Black promoter (C to T and A to G), and twelve and twenty-six nucleotides before the -35 pattern of the Black promoter (G to T and T to C). The variation at all these positions may indicate that changes in variable CEs are correlated with changes in gene expression levels (Appendix 8, 9 Figures). Again, practical work is needed to confirm these results. The overall dynamics of variation in the CEs of the 25 pairs of isolates showed that variation was caused mostly by a single SNP, in 106 out of 137 (77.4%), with other types of variation occurring rarely. Finally, across all 25 pairs of isolates there were only two variations within NIME and one within REP2, in the intergenic region belonging to NEIS0510.
Table 4: List of genes carrying IGRs with variation located in promoters and IHF of CE, and genes carrying variable CEs within their sequence, in the 25 pairs of isolates.
The variation in different parts of CEs and the Snyder promoter within the 25 pairs of isolates. Green color: -10 and -35 patterns of the Snyder promoter, IHF binding site and R (highly variable position detected by Siddique et al., 2011). Red color: variable nucleotides in IGRs having variation within the -10, -35 and IHF patterns of the CE. Yellow color: varied nucleotides in other positions of CEs. Pink color: varied nucleotides one nucleotide before or after the IHF binding site. The number of isolates is the number that have a varied nucleotide out of the 25 pairs. The number of isolates in blue color: same varied CEs in more than 10% of the population.