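A free Java implementation of near-human TTS based on Microsoft's service

Microsoft's TTS speech synthesis sounds close to a real human voice, and the results are very good. This repository implements a free TTS example in Java, based on Microsoft's official demo.

Analysis of Microsoft's official demo

Microsoft's speech synthesis demo uses a WebSocket connection. One synthesis sends three requests to the server over the WebSocket, and the server returns the MP3 audio in several response messages.

Format of the three requests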
First request:

```
Path: speech.config
X-RequestId: 81C8781545B84F5394A3949B28716251
X-Timestamp: 2023-03-05T02:44:26.068Z
Content-Type: application/json

{"context":{"system":{"name":"SpeechSDK","version":"1.19.0","build":"JavaScript","lang":"JavaScript"},"os":{"platform":"Browser/Linux x86_64","name":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36","version":"5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}}}
```

The first request tells the server the client's SDK version, browser version, and operating system. Its path is speech.config.
Second request:

```
Path: synthesis.context
X-RequestId: 81C8781545B84F5394A3949B28716251
X-Timestamp: 2023-03-05T02:44:26.069Z
Content-Type: application/json

{"synthesis":{"audio":{"metadataOptions":{"bookmarkEnabled":false,"sentenceBoundaryEnabled":false,"visemeEnabled":false,"wordBoundaryEnabled":false},"outputFormat":"audio-24khz-96kbitrate-mono-mp3"},"language":{"autoDetection":false}}}
```

The second request tells the server the format metadata of the audio to be synthesized.
Third request:

```
Path: ssml
X-RequestId: 81C8781545B84F5394A3949B28716251
X-Timestamp: 2023-03-05T02:44:26.070Z
Content-Type: application/ssml+xml

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="zh-CN-XiaoxiaoNeural"><prosody rate="0%" pitch="0%">你好</prosody></voice></speak>
```

The third request tells the server the text to synthesize, the language of the audio, the voice, the speaking style, and the speaking rate and pitch.
All requests and responses that belong to one TTS conversion carry the same X-RequestId. The X-RequestId is a UUID written without hyphens.
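The three requests share one layout: header lines separated by `\r\n`, an empty line, then the body. A minimal sketch of building such a frame follows. `SpeechFrames` and its method names are hypothetical and not part of the repository, and the timestamp pattern is an assumption inferred from the captured samples (UTC, millisecond precision).

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.UUID;

// Hypothetical helper for illustration; not taken from the original repository.
final class SpeechFrames {
    // Assumption: layout inferred from samples such as 2023-03-05T02:44:26.068Z.
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT)
                    .withZone(ZoneOffset.UTC);

    private SpeechFrames() {}

    /** Returns a random UUID as 32 uppercase hex characters, without hyphens. */
    static String newRequestId() {
        return UUID.randomUUID().toString().replace("-", "").toUpperCase(Locale.ROOT);
    }

    /** Builds one text frame: four header lines, an empty line, then the body. */
    static String textFrame(String path, String requestId, Instant now,
                            String contentType, String body) {
        return "Path: " + path + "\r\n"
                + "X-RequestId: " + requestId + "\r\n"
                + "X-Timestamp: " + TIMESTAMP.format(now) + "\r\n"
                + "Content-Type: " + contentType + "\r\n"
                + "\r\n"
                + body;
    }
}
```

Generating the request ID once and passing it to all three `textFrame` calls keeps the frames of one conversion tied together.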
Response format

The responses for one TTS conversion consist of three text messages and a number of binary messages that carry the MP3 data.
First text response:

```
X-RequestId:81C8781545B84F5394A3949B28716251
Content-Type:application/json; charset=utf-8
Path:turn.start

{
  "context": {
    "serviceTag": "ce54b7da7fd74576a552e8632a098144"
  }
}
```

Second text response:

```
X-RequestId:81C8781545B84F5394A3949B28716251
Content-Type:application/json; charset=utf-8
Path:response

{"context":{"serviceTag":"ce54b7da7fd74576a552e8632a098144"},"audio":{"type":"inline","streamId":"69042AA81C3546EBA87B3A2ADD17C17B"}}
```

Third text response:

```
X-RequestId:81C8781545B84F5394A3949B28716251
Content-Type:application/json; charset=utf-8
Path:turn.end

{}
```
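After the first two text responses, the server starts sending the binary MP3 messages. Once all binary MP3 messages have been sent, it returns the third text response. Receiving a text message with Path:turn.end therefore means the MP3 audio transfer is complete.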
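The message order described above suggests a small per-request collector: append audio as binary messages arrive, and publish the result when turn.end is seen. The following is only a sketch of that idea. `TurnCollector` is a hypothetical class, not the listener used in the repository, and it assumes the caller has already extracted the Path value and the MP3 payload from each message.

```java
import java.io.ByteArrayOutputStream;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; the repository uses its own listener type instead.
final class TurnCollector {
    private final ByteArrayOutputStream audio = new ByteArrayOutputStream();
    private final CompletableFuture<byte[]> done = new CompletableFuture<>();

    /** Called with the MP3 payload of each binary message. */
    synchronized void onAudio(byte[] mp3Chunk) {
        audio.write(mp3Chunk, 0, mp3Chunk.length);
    }

    /** Called with the Path header value of each text message. */
    synchronized void onText(String path) {
        if ("turn.end".equals(path)) {
            // turn.end arrives after the last audio chunk, so the buffer is complete.
            done.complete(audio.toByteArray());
        }
        // turn.start and response carry no audio and are ignored here.
    }

    /** Completes with the full MP3 once turn.end has been received. */
    CompletableFuture<byte[]> result() {
        return done;
    }
}
```

A caller could keep one collector per X-RequestId and wait on `result()` with a timeout, so that a dropped connection does not block forever.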
Java implementation

Dependencies

```xml
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.13</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.java-websocket</groupId>
        <artifactId>Java-WebSocket</artifactId>
        <version>1.5.2</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.32</version>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>30.1-jre</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <scope>provided</scope>
        <version>1.18.16</version>
    </dependency>
</dependencies>
```

Java-WebSocket is used as the WebSocket client.
Creating the WebSocket connection

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.ByteBuffer;
import java.util.UUID;

import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;

public class MTTS {
    // The fields log, wsClient, isRely and lisentHashMap are not shown in this excerpt.

    private void init() throws URISyntaxException, InterruptedException {
        // The connection ID is a random UUID without hyphens.
        String uuid = UUID.randomUUID().toString().replace("-", "").toUpperCase();
        String urlStr = "wss://eastus.api.speech.microsoft.com/cognitiveservices/websocket/v1?TrafficType=AzureDemo&Authorization=bearer%20undefined&X-ConnectionId=" + uuid;
        log.info("ws url {}", urlStr);
        wsClient = new WebSocketClient(new URI(urlStr)) {
            @Override
            public void onOpen(ServerHandshake serverHandshake) {
                log.info("ws client is open");
                isRely = true;
            }

            // Binary messages carry the MP3 parts.
            @Override
            public void onMessage(ByteBuffer bytes) {
                AudioMp3Part part = new AudioMp3Part(bytes);
                String xRequestId = part.getXRequestId();
                lisentHashMap.get(xRequestId).mp3Part(part);
            }

            // Text messages carry turn.start, response and turn.end.
            @Override
            public void onMessage(String message) {
                AudioMessage audioMessage = new AudioMessage(message);
                if ("turn.end".equals(audioMessage.getPath())) {
                    lisentHashMap.get(audioMessage.getXRequestId()).end(message);
                }
            }

            @Override
            public void onClose(int i, String s, boolean b) {
                System.out.println("onclose:" + s);
                isRely = false;
            }

            @Override
            public void onError(Exception e) {
                log.warn("onError", e);
            }
        };
        wsClient.addHeader("Origin", "https://azure.microsoft.com");
        wsClient.addHeader("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0");
        wsClient.addHeader("Accept", "*/*");
        wsClient.addHeader("Accept-Language", "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2");
        wsClient.addHeader("Upgrade", "websocket");
        this.wsClient.connect();
    }
}
```

The X-ConnectionId in the request URL is a randomly generated UUID with the hyphens removed. Be sure to add the header "Origin" with the value "https://azure.microsoft.com"; otherwise Microsoft rejects the connection.

public void onMessage(ByteBuffer bytes) receives the binary MP3 messages returned by the server.

public void onMessage(String message) receives the text messages returned by the server.
Sending the three requests in Java

```java
import java.util.Date;
import java.util.UUID;

import org.java_websocket.client.WebSocketClient;

public class Sender {
    // The fields log and dateFormat are not shown in this excerpt.
    private TTSConfig ttsConfig = new TTSConfig();

    private void sendText(WebSocketClient client, String text) {
        // One request ID ties the three requests and all responses together.
        String requestID = UUID.randomUUID().toString().replace("-", "").toUpperCase();
        send1(client, requestID);
        send2(client, requestID);
        send3(client, requestID, text);
    }

    // Request 1: client SDK, browser and OS information.
    private void send1(WebSocketClient client, String requestID) {
        String timestamp = dateFormat.format(new Date());
        String message = "Path: speech.config\r\n" +
                "X-RequestId: " + requestID + "\r\n" +
                "X-Timestamp: " + timestamp + "\r\n" +
                "Content-Type: application/json\r\n" +
                "\r\n" +
                "{\"context\":{\"system\":{\"name\":\"SpeechSDK\",\"version\":\"1.19.0\",\"build\":\"JavaScript\",\"lang\":\"JavaScript\"},\"os\":{\"platform\":\"Browser/Linux x86_64\",\"name\":\"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0\",\"version\":\"5.0 (X11)\"}}}";
        log.debug("send1:{}", message);
        client.send(message);
    }

    // Request 2: output audio format and metadata options.
    private void send2(WebSocketClient client, String requestID) {
        String timestamp = dateFormat.format(new Date());
        String message = "Path: synthesis.context\r\n" +
                "X-RequestId: " + requestID + "\r\n" +
                "X-Timestamp: " + timestamp + "\r\n" +
                "Content-Type: application/json\r\n" +
                "\r\n" +
                "{\"synthesis\":{\"audio\":{\"metadataOptions\":{\"bookmarkEnabled\":false,\"sentenceBoundaryEnabled\":false,\"visemeEnabled\":false,\"wordBoundaryEnabled\":false},\"outputFormat\":\"audio-24khz-96kbitrate-mono-mp3\"},\"language\":{\"autoDetection\":false}}}";
        log.debug("send2:{}", message);
        client.send(message);
    }

    // Request 3: the SSML document with language, voice, style and text.
    private void send3(WebSocketClient client, String requestID, String text) {
        String timestamp = dateFormat.format(new Date());
        String message = "Path: ssml\r\n" +
                "X-RequestId: " + requestID + "\r\n" +
                "X-Timestamp: " + timestamp + "\r\n" +
                "Content-Type: application/ssml+xml\r\n" +
                "\r\n" +
                "<speak xmlns=\"http://www.w3.org/2001/10/synthesis\" xmlns:mstts=\"http://www.w3.org/2001/mstts\" xmlns:emo=\"http://www.w3.org/2009/10/emotionml\" version=\"1.0\" xml:lang=\"" + ttsConfig.getLanguage().getCode() + "\"><voice name=\"" + ttsConfig.getVoice().getCode() + "\"><mstts:express-as style=\"" + ttsConfig.getVoiceStyle().getCode() + "\" ><prosody rate=\"0%\" pitch=\"0%\">" + text + "</prosody></mstts:express-as></voice></speak>";
        log.debug("send3:{}", message);
        client.send(message);
    }
}
```
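Note that send3 places the text into the SSML document as-is. SSML is XML, so input containing characters such as & or < would produce a malformed document. One possible way to guard against that is to escape the text first. This is a minimal sketch; `SsmlText` is a hypothetical helper and is not part of the repository.

```java
// Hypothetical helper for illustration; not taken from the original repository.
final class SsmlText {
    private SsmlText() {}

    /** Replaces the five XML special characters with their predefined entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);        // everything else passes through unchanged
            }
        }
        return out.toString();
    }
}
```

With such a helper, send3 could concatenate `SsmlText.escape(text)` instead of `text`.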
Parsing the text response

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import lombok.Getter;

@Getter // assumed: MTTS calls getPath() and getXRequestId(), so accessors are needed
public class AudioMessage {
    private String sourceMsg;
    private String xRequestId;
    private String contentType;
    private String path;

    public AudioMessage(String msg) {
        this.sourceMsg = msg;
        initByByte(msg.getBytes(StandardCharsets.UTF_8));
    }

    private void initByByte(byte[] bytes) {
        // Decode with the same charset that was used to encode the message.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Split on the first colon only, so values that contain ':' stay intact.
                String[] split = line.split(":", 2);
                if (split.length < 2) {
                    continue; // blank line, or a body line without a colon
                }
                if ("X-RequestId".equals(split[0])) {
                    this.xRequestId = split[1].trim();
                } else if ("Content-Type".equals(split[0])) {
                    this.contentType = split[1].trim();
                } else if ("Path".equals(split[0])) {
                    this.path = split[1].trim();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
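Parsing the header of the binary MP3 response

Microsoft splits one synthesized MP3 into several parts and returns them over the WebSocket. Each binary response message carries part of the MP3 data.

```
X-RequestId:77D9CB751ADD42E095F3E36E7AA9F49B
Content-Type:audio/mpeg
X-StreamId:FF8159FEC89C455D9A6710E6A0759D5F
Path:audio
```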
The text above, counting the \r\n line breaks, is 128 bytes long. Hexadecimal 0080 is decimal 128.
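The byte count can be checked by summing the length of each header line plus two bytes for its \r\n terminator. A small sketch, with `HeaderLength` as a hypothetical name:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from the original repository.
final class HeaderLength {
    private HeaderLength() {}

    /** Sums the encoded length of each header line plus 2 bytes for "\r\n". */
    static int of(String... headerLines) {
        int total = 0;
        for (String line : headerLines) {
            total += line.getBytes(StandardCharsets.US_ASCII).length + 2;
        }
        return total;
    }
}
```

For the four header lines shown above, the line lengths are 44, 23, 43 and 10 characters, so the total is (44 + 2) + (23 + 2) + (43 + 2) + (10 + 2) = 128.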
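```
X-RequestId:77D9CB751ADD42E095F3E36E7AA9F49B
X-StreamId:FF8159FEC89C455D9A6710E6A0759D5F
Path:audio
```

The text above, counting the \r\n line breaks, is 103 bytes long. Hexadecimal 0067 is decimal 103. My guess is therefore that the binary response body consists of three parts:
A 2-byte field giving the size of the header text
The header text
The MP3 data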
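The analysis confirms this: one binary response body is 1570 bytes in total. Its first two bytes are hexadecimal 0080, which is decimal 128, so the next 128 bytes are the header text and the remaining 1440 bytes are the MP3 data of this part (2 + 128 + 1440 = 1570).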
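The three-part layout can be sketched as a small parser: read a 2-byte big-endian length, then that many bytes of header text, then treat the rest as audio. `BinaryFrame` is a hypothetical class for illustration and differs from the repository's own parsing code; the bounds checks are an addition of this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the three-part layout; names are illustrative.
final class BinaryFrame {
    final String header;
    final byte[] audio;

    private BinaryFrame(String header, byte[] audio) {
        this.header = header;
        this.audio = audio;
    }

    static BinaryFrame parse(ByteBuffer message) {
        // Work on a duplicate so the caller's position is untouched;
        // a duplicate of a ByteBuffer always starts out big-endian.
        ByteBuffer buf = message.duplicate();
        if (buf.remaining() < 2) {
            throw new IllegalArgumentException("frame shorter than 2 bytes");
        }
        int headerSize = buf.getShort() & 0xFFFF; // read as unsigned 16-bit
        if (buf.remaining() < headerSize) {
            throw new IllegalArgumentException("header size exceeds frame length");
        }
        byte[] headerBytes = new byte[headerSize];
        buf.get(headerBytes);
        byte[] audio = new byte[buf.remaining()];
        buf.get(audio);
        return new BinaryFrame(new String(headerBytes, StandardCharsets.UTF_8), audio);
    }
}
```

Reading through the buffer's own accessors avoids depending on `array()`, which is only available for buffers backed by an accessible array.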
Code for parsing an MP3 part:
```java
import java.nio.ByteBuffer;
import java.util.Arrays;

import lombok.Getter;

@Getter
public class AudioMp3Part {
    private byte[] bodyByte;
    private byte[] mp3Part;
    private short headSize;
    // AudioMp3MessageHead is not shown in this excerpt.
    private AudioMp3MessageHead head;

    public AudioMp3Part(ByteBuffer buffer) {
        // array() only works for buffers backed by an accessible byte array.
        bodyByte = buffer.array();
        if (bodyByte == null || bodyByte.length == 0) {
            throw new RuntimeException("body bytes is null");
        }
        // The first two bytes hold the size of the header text.
        headSize = buffer.getShort(0);
        byte[] headBytes = Arrays.copyOfRange(bodyByte, 2, headSize + 2);
        if (headBytes == null || headBytes.length == 0) {
            throw new RuntimeException("head bytes is null");
        }
        head = new AudioMp3MessageHead(headBytes);
        // Everything after the header text is MP3 data.
        mp3Part = Arrays.copyOfRange(bodyByte, headSize + 2, bodyByte.length);
    }

    public String getXRequestId() {
        return head.getXRequestId();
    }
}
```
Merging the MP3 parts:

```java
import java.util.ArrayList;

import com.google.common.primitives.Bytes;

public class AudioMp3 {
    private final ArrayList<AudioMp3Part> parts = new ArrayList<>();
    private String requestId;
    private String streamId;

    // Concatenates the MP3 data of all parts in arrival order.
    public byte[] getMp3Byte() {
        byte[] b = new byte[0];
        for (AudioMp3Part b1 : parts) {
            b = Bytes.concat(b, b1.getMp3Part());
        }
        return b;
    }

    public void add(AudioMp3Part part) {
        parts.add(part);
    }
}
```
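Calling Bytes.concat inside the loop copies the accumulated array again for every part. For long audio, a possible alternative is to write every part into a single pre-sized buffer. The following sketch uses only the standard library; `Mp3Joiner` is a hypothetical name and it works on plain byte arrays rather than on AudioMp3Part.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Hypothetical alternative for illustration; not taken from the original repository.
final class Mp3Joiner {
    private Mp3Joiner() {}

    /** Concatenates the chunks in list order, copying each chunk once into the buffer. */
    static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length;
        }
        // Pre-size the buffer so it never has to grow while writing.
        ByteArrayOutputStream out = new ByteArrayOutputStream(total);
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }
}
```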
That covers the free TTS implementation in Java. For the complete code, see: https://github.com/nathanhex/mstts-demo.git