I think this should be a common problem in computer science.
I have data like this:
List<Map<Integer,List<String>>> Qdata = new ArrayList<Map<Integer,List<String>>>();
List<Map<Integer,List<String>>> Tdata = new ArrayList<Map<Integer,List<String>>>();
List<String> Qg1 = Arrays.asList("C", "A", "PC", "R");
List<String> Qg2 = Arrays.asList("DQ", "EQ", "KC", "AC");
List<String> Qg3 = Arrays.asList("KQ", "AT");
List<String> Qg4 = Arrays.asList("KQ", "AT", "DQ", "KC","AC","KQ", "AT", "KC","AC","KQ", "AT", "DQ", "KC","AC");
List<String> Qg5 = Arrays.asList("KQ", "AT", "DQ", "KC","AC");
List<String> Qg6 = Arrays.asList("KQ", "AT", "DQ", "KC","AC");
List<String> Qg7 = Arrays.asList("AC","KQ", "AT","AT", "DQ", "KC","AC");
Map<Integer,List<String>> Qmap = new HashMap<Integer, List<String>>();
Qmap.put(1, Qg1);
Qmap.put(2, Qg2);
Qmap.put(3, Qg3);
Qmap.put(4, Qg4);
Qmap.put(5, Qg5);
Qmap.put(6, Qg6);
Qmap.put(7, Qg7);
List<String> Tg1 = Arrays.asList("C", "A", "PC", "?");
List<String> Tg2 = Arrays.asList("KQ", "AT","DQ", "EQ", "KC", "AC");
List<String> Tg3 = Arrays.asList("AT", "DQ", "KC","AC");
List<String> Tg4 = Arrays.asList("KQ", "AT", "DQ", "KC","AC","KQ", "AT", "KC","AC");
List<String> Tg5 = Arrays.asList("KQ", "AT", "DQ", "KC","AC");
List<String> Tg6 = Arrays.asList("KQ", "AT", "DQ", "KC","AC");
List<String> Tg7 = Arrays.asList("AT","AT", "DQ", "KC","AC");
List<String> Tg8 = Arrays.asList("AC");
List<String> Tg9 = Arrays.asList("ACL","AC","C","A","PC");
Map<Integer,List<String>> Tmap = new HashMap<Integer, List<String>>();
Tmap.put(1, Tg1);
Tmap.put(2, Tg2);
Tmap.put(3, Tg3);
Tmap.put(4, Tg4);
Tmap.put(5, Tg5);
Tmap.put(6, Tg6);
Tmap.put(7, Tg7);
Tmap.put(8, Tg8);
Tmap.put(9, Tg9);
Qdata.add(Qmap);
Tdata.add(Tmap);
I want to match Qdata with Tdata. The tricky part is that, if you look at the data:
Qg3+Qg2 forms Tg2
Tg4+Tg3 forms Qg4, with "KQ" missing between Tg4 and Tg3
Tg8+Tg9 forms Qg7
The rest is pretty straightforward. I don't know how to deal with this tricky part.
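To make the "forms" relation concrete, here is a minimal sketch that joins two groups in order and compares the result with a third. The class name `ConcatCheck` and the helper `concat` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConcatCheck {
    // Joins two groups in the given order into a new list (hypothetical helper).
    static List<String> concat(List<String> first, List<String> second) {
        List<String> joined = new ArrayList<>(first);
        joined.addAll(second);
        return joined;
    }

    public static void main(String[] args) {
        List<String> qg2 = Arrays.asList("DQ", "EQ", "KC", "AC");
        List<String> qg3 = Arrays.asList("KQ", "AT");
        List<String> tg2 = Arrays.asList("KQ", "AT", "DQ", "EQ", "KC", "AC");

        // Qg3 followed by Qg2 is exactly Tg2.
        System.out.println(concat(qg3, qg2).equals(tg2)); // prints true

        // The order matters: Qg2 followed by Qg3 is a different sequence.
        System.out.println(concat(qg2, qg3).equals(tg2)); // prints false
    }
}
```

An exact `equals` check only covers the clean case; the Tg4+Tg3 case, where one word is missing, needs a tolerant comparison instead.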
I used a map to store the data because it is preferable for the algorithm to find matches at the same position in Tdata and Qdata, for example:
Qg5 has a complete match with Tg5, and Qg6 has a complete match with Tg6.
The final ideal output that I expect in this case is:
Qg1 matches with Tg1 with a wildcard ("?") (some penalty for the wildcard)
Qg3+Qg2 matches Tg2
Qg4 matches Tg4+Tg3 (some penalty for the missing word)
Qg5 has a complete match with Tg5
Qg6 has a complete match with Tg6
Qg7 matches with Tg8+Tg9 (penalty)
and penalty for extra Tg9.
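The penalties in the list above could be expressed as costs in a Needleman-Wunsch-style global alignment over words. The sketch below is only an illustration under assumptions: the class name `PenaltyAligner` and the values of `GAP`, `MISMATCH` and `WILDCARD` are hypothetical and would need tuning.

```java
import java.util.Arrays;
import java.util.List;

class PenaltyAligner {
    static final int GAP = 2;      // hypothetical penalty for a missing or extra word
    static final int MISMATCH = 3; // hypothetical penalty for two different words
    static final int WILDCARD = 1; // hypothetical penalty for matching against "?"

    // Cost of aligning one word of Q with one word of T.
    static int substitution(String q, String t) {
        if (q.equals(t)) return 0;
        if (q.equals("?") || t.equals("?")) return WILDCARD;
        return MISMATCH;
    }

    // Minimum total penalty to align q with t (0 means a complete match).
    static int cost(List<String> q, List<String> t) {
        int n = q.size(), m = t.size();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) dp[i][0] = i * GAP;
        for (int j = 1; j <= m; j++) dp[0][j] = j * GAP;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = dp[i - 1][j - 1] + substitution(q.get(i - 1), t.get(j - 1));
                int skipQ = dp[i - 1][j] + GAP; // word of q has no partner in t
                int skipT = dp[i][j - 1] + GAP; // word of t has no partner in q
                dp[i][j] = Math.min(sub, Math.min(skipQ, skipT));
            }
        }
        return dp[n][m];
    }

    public static void main(String[] args) {
        List<String> qg1 = Arrays.asList("C", "A", "PC", "R");
        List<String> tg1 = Arrays.asList("C", "A", "PC", "?");
        System.out.println(cost(qg1, tg1)); // prints 1 (one wildcard)

        List<String> qg5 = Arrays.asList("KQ", "AT", "DQ", "KC", "AC");
        List<String> tg5 = Arrays.asList("KQ", "AT", "DQ", "KC", "AC");
        System.out.println(cost(qg5, tg5)); // prints 0 (complete match)
    }
}
```

With these assumed values, aligning Qg4 against Tg4 followed by Tg3 costs one `GAP`, for the single missing "KQ".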
I have already tried the longest common subsequence and Needleman-Wunsch algorithms. They are good at aligning with gaps, but I don't know how to combine two parts and match them as in the tricky part I mentioned, or how to teach the algorithm when to combine parts before matching and when not to.
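One possible way to frame the "when to combine" decision is to treat every single group and every ordered pair of groups as a candidate, charge merged candidates a small extra penalty, and keep the cheapest. The sketch below is not a full solution, only an illustration under assumptions: `MergeSearch`, `bestCandidate` and the values of `GAP`, `MISMATCH` and `MERGE` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeSearch {
    static final int GAP = 2;      // hypothetical penalty for a missing or extra word
    static final int MISMATCH = 3; // hypothetical penalty for two different words
    static final int MERGE = 1;    // hypothetical penalty for joining two groups

    // Global alignment cost between two word lists (0 means identical).
    static int cost(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) dp[i][0] = i * GAP;
        for (int j = 1; j <= b.size(); j++) dp[0][j] = j * GAP;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int sub = dp[i - 1][j - 1]
                        + (a.get(i - 1).equals(b.get(j - 1)) ? 0 : MISMATCH);
                dp[i][j] = Math.min(sub, Math.min(dp[i - 1][j], dp[i][j - 1]) + GAP);
            }
        }
        return dp[a.size()][b.size()];
    }

    // Returns the 0-based index of the cheapest single group, or the two
    // indices (in join order) of the cheapest merged pair.
    static int[] bestCandidate(List<String> target, List<List<String>> groups) {
        int[] best = new int[0];
        int bestCost = Integer.MAX_VALUE;
        for (int i = 0; i < groups.size(); i++) {
            int single = cost(target, groups.get(i));
            if (single < bestCost) { bestCost = single; best = new int[] {i}; }
            for (int j = 0; j < groups.size(); j++) {
                if (i == j) continue;
                List<String> merged = new ArrayList<>(groups.get(i));
                merged.addAll(groups.get(j));
                int pair = cost(target, merged) + MERGE;
                if (pair < bestCost) { bestCost = pair; best = new int[] {i, j}; }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<String>> q = Arrays.asList(
                Arrays.asList("C", "A", "PC", "R"),    // Qg1, index 0
                Arrays.asList("DQ", "EQ", "KC", "AC"), // Qg2, index 1
                Arrays.asList("KQ", "AT"));            // Qg3, index 2
        List<String> tg2 = Arrays.asList("KQ", "AT", "DQ", "EQ", "KC", "AC");
        System.out.println(Arrays.toString(bestCandidate(tg2, q))); // prints [2, 1]
    }
}
```

Choosing the best candidate for each group independently can use the same group twice. In the data above, Tg4 followed by Tg5 also reproduces Qg4 exactly, although Tg5 is wanted for Qg5, so the per-group choice would probably have to be combined with some global assignment step.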
Sorry for my bad English.
I'm currently coding in Java.
Any suggestions would really help keep me going.
Thanks in advance.